Monday, August 15, 2011

Availability versus Fault Tolerance

From Quora...

What is the difference between a highly fault tolerant and a highly available system?

Edmond Lau, Quora Engineer
While availability and fault tolerance are sometimes conflated to mean the same concept, the two terms actually refer to different requirements. Designing for high availability is a stricter requirement than designing for high fault tolerance.

Availability is a measure of a system's uptime -- the percentage of time that a system is actually operational and providing its intended service. Service companies, when offering service level agreements (SLAs) to their customers, usually quantify their availability in nines of availability. Carrier-grade telecommunication networks claim "five nines" of availability [1, 2], meaning that the network should be up 99.999% of the time and experience no more than 5.26 minutes of downtime per year. Amazon's S3 covers three nines of availability (99.9% uptime) in its SLA [3] and offers a service credit if it is down for more than 43.2 minutes per month.
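
The downtime figures quoted above follow directly from the uptime percentage. Here is a minimal sketch of the arithmetic, assuming a 365-day year and a 30-day month; the function name is illustrative and does not come from any SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600, assuming a 365-day year
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200, assuming a 30-day month


def allowed_downtime_minutes(availability_percent, period_minutes):
    """Return the downtime budget, in minutes, for a given uptime percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_minutes


five_nines_per_year = allowed_downtime_minutes(99.999, MINUTES_PER_YEAR)
three_nines_per_month = allowed_downtime_minutes(99.9, MINUTES_PER_MONTH)

print(f"{five_nines_per_year:.2f} minutes per year")     # 5.26 minutes per year
print(f"{three_nines_per_month:.1f} minutes per month")  # 43.2 minutes per month
```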

Fault tolerance refers to a system's ability to continue operating, perhaps with gracefully degraded performance, when components of the system fail. RAID 1, for example, by mirroring data across multiple disks, provides fault tolerance against disk failures [4]. Running a hot MySQL slave that can be promoted to a master if the master fails, or eliminating Hadoop's NameNode as a single point of failure [5], are other examples of making a system more fault tolerant.

Making individual components more reliable and more fault tolerant are steps toward making an overall system more highly available; however, a system can be fault tolerant and not be highly available. An analytics system based on Cassandra, for example, where there are no single points of failure might be considered fault tolerant, but if application-level data migrations, software upgrades, or configuration changes take an hour or more of downtime to complete, then the system is not highly available.
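
To make the distinction concrete, here is a rough sketch of a hypothetical cluster. It assumes that redundancy absorbs every component failure, so there is no unplanned downtime, and that one 60-minute maintenance window is taken each month. It also assumes a 30-day month and that planned downtime counts against availability. All numbers are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assuming a 30-day month


def monthly_availability(unplanned_minutes, planned_minutes):
    """Availability as a percentage, counting planned and unplanned downtime alike."""
    downtime = unplanned_minutes + planned_minutes
    return 100 * (1 - downtime / MINUTES_PER_MONTH)


# Hypothetical fault-tolerant cluster: no outages caused by component failures,
# but a one-hour window each month for migrations, upgrades, or config changes.
availability = monthly_availability(unplanned_minutes=0, planned_minutes=60)

print(f"availability: {availability:.3f}%")
print("meets three nines:", availability >= 99.9)
```

Under these assumptions the system never goes down because of a fault, yet a single hour of planned downtime is enough to put it below three nines for the month.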

--------
[1] http://en.wikipedia.org/wiki/Car...
[2] http://www.windriver.com/announc...
[3] http://aws.amazon.com/s3-sla/
[4] http://en.wikipedia.org/wiki/RAID
[5] http://www.cloudera.com/blog/201...

Friday, August 12, 2011

What is a Software Product?

I have just completed another seminar at SEI and I want to return to the subject of the prior post. However I am going to recast it into an answer to the question "What is a software product?"

I think most people are likely to take the question to refer to a consumer product like Word or Windows 7. In this case, the product primarily consists of an executable module but no source code. To make it fit for its intended use, several other artifacts are assumed to be associated with or embedded in the product. One is either an installer application or instructions on how the product is to be placed into the intended environment. Another is an implied deliverable that instructs the end-user how to perform the tasks for which the product is intended. These days that is often left implied rather than made explicit, as many products are designed to be self-evident. This means that the intended end-user has sufficient experience with similar user interfaces that the use can be learned with some exploration and no explicit training. Of course this is not always the case, and products may either embed the user manual into the product or provide it as a separate artifact.

There is another sense of software product though. If you consider the automobile factory that turns out cars, the cars are products to be sold to consumers. However the factory itself is an asset that can be sold to another auto maker and retooled for their use. In a similar way, source code is a major component in a product that allows a software vendor to create software products. This "software factory" is a tangible asset and is recognized as such through intellectual property laws and common sense. (To avoid some awkward language, from here on, when I refer to a software product, I am using this second sense of the term and will ignore products that only include executables.) But is the source code the entire asset? If we were to sell our intellectual property to another party, what can reasonably be considered part of the asset?

In the rush to market, many software vendors are willing to sacrifice the quality of the allied artifacts of the complete system. While this helps to achieve the end goal of creating a working system, it creates technical debt in the form of deferred maintenance on those artifacts, with a measurable increase in the total cost of ownership. If the product will never be modified, the complete lack of supporting documentation for the product is reasonable. Even if some was produced during the initial construction of the product, there is no reason to keep it since it will never be read. But how many software products are created that are never intended to be modified? The correction of latent defects, changes in the environment, and new requirements are just a few of the many reasons why the maintenance phase of a software product life cycle is typically the costliest. To ignore the needs of this phase of the product life cycle is foolish and self-defeating. Therefore any technical debt incurred during the initial development will eventually need to be paid if the consequences of that debt are to be avoided.

The platonic ideal that is sought is some form of self-documenting code. Well-written code, with or without good local comments, can often be understood without additional commentary, but large software systems cannot be self-documenting since many issues transcend individual modules. A software engineer may be able to reconstruct the needed information from the source code alone, or at least find a way to insert the modification without it, but this is most likely a more expensive fix than it would have been had the reconstruction work not been needed. The lack of adequate support for the code base increases maintenance cost. Further, until a code base automatically includes self-documenting ways to explain design choices that transcend individual modules, this kind of documentation will continue to exist outside of the source code itself.

This highlights a weakness in the current practice of software engineering management. The true value of this artifact is discounted by management. Since the increased support cost can never be quantified, there is no self-correction of management attitudes, and the ratio of maintenance cost to total life-cycle cost will remain the same. Worse, there are two compounding effects from this. First, the best information about the design of the system often exists only in the minds of the developers. When they leave the project, it is usually lost for good. Even if they remain in the same organization, the fidelity of the information degrades over time. Second, without an adequate understanding of the design principles used in constructing the product, a maintainer is likely, out of ignorance, to undermine design clarity that depends on restraint on the part of the coder. It is as if the architect leaves no plans for a building behind and the maintenance engineer tries to knock out a wall only to discover an unexpected support post or beam. Even worse would be to not recognize the structural nature of the element during the remodeling and remove it, thus weakening the structure and increasing the risk of structural failure.

As Agile methods gain wider use in industry, some inexperienced developers are likely to believe that producing good design documentation is just a bother. A lack of professional development, the desire to just get on with their career, an ignorance of how to best document the design, or some combination of all three likely contributes to the poor quality of this product artifact. Often professionals do not even consider the design artifact to be part of the product itself but view it as some form of construction artifact that serves no purpose after delivery. Certainly many management teams do not appreciate, nor know how to evaluate, the quality of these allied artifacts.

The lack of attention given to the infrastructure and staff needs of the maintenance phase is a significant blind spot. Maintenance has always been a function that has suffered from lower status than the initial construction. As the life cycles of these software products have gotten longer, there are likely to be at least some far-sighted management teams who will eventually realize that long-term profits can be improved by reducing the maintenance-phase costs of their software assets. Once that begins to happen, the types of design documents that most efficiently support the maintenance needs will become the object of study. Until then, these support staffs will continue to be expected to do their jobs with less than ideal knowledge transfer and the need to continually read the minds of developers who are no longer around to ask.

Tuesday, August 2, 2011

Program Documentation and Software Architecture

A friend wrote me about the frustration of performing software maintenance in an environment largely devoid of good program documentation. He finds that he must spend a great deal of time just trying to understand how the various classes relate to each other before he can begin to focus on finding the source of a bug or trying to decide how to add a piece of functionality. I feel for him. I have been there, and I suspect most programmers at some point in their careers have as well. His frustration is an interesting way to peel back some truths about the software development life cycle.

The first truth is that modern languages do a poor job of capturing the larger abstractions of their design in a way that is self-maintaining. It is possible to create a code base that has good supporting documentation of the design underlying the code. However this documentation is separate from the code. Without proper management discipline, the supporting documentation will deviate from the as-built system, if it ever matched in the first place. At the higher levels of abstraction in the design, this is exactly what is described as the system architecture.

What a maintenance programmer wants in the software system is the quality of maintainability. This is greatly enhanced by the presence of good supporting design documentation. One of the key attributes desired in this documentation is the traceability of a requirement to the specification to the code which implements it. When the only artifact available is the source code, this is rarely possible since the relationship is never one-to-one and usually not even many-to-one. Instead it is a many-to-many relationship between requirement and code. The result is unintended consequences, which have cost many programmers sleepless nights as they search for ways to undo the side effects of their fix or enhancement. With current, accurate and complete documentation, the maintenance programmer has a far greater chance of quickly understanding how the code implements the function (or non-functional quality) and of making informed decisions about how best to make the change or fix the bug consistent with the original design and without introducing a new bug.
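
The many-to-many relationship between requirements and code can be pictured as a small traceability index. The sketch below is only an illustration: the requirement IDs and module names are hypothetical, and a real project would draw these links from its own design documentation.

```python
from collections import defaultdict

# Hypothetical traceability records: (requirement id, code module) pairs.
TRACE_LINKS = [
    ("REQ-1", "billing.py"),
    ("REQ-1", "tax.py"),
    ("REQ-2", "tax.py"),
    ("REQ-3", "reports.py"),
    ("REQ-3", "tax.py"),
]


def build_index(links):
    """Index the links in both directions so either side can be looked up."""
    modules_for = defaultdict(set)
    requirements_for = defaultdict(set)
    for requirement, module in links:
        modules_for[requirement].add(module)
        requirements_for[module].add(requirement)
    return modules_for, requirements_for


modules_for, requirements_for = build_index(TRACE_LINKS)

# A fix for REQ-2 touches tax.py, which also serves other requirements.
at_risk = sorted(requirements_for["tax.py"] - {"REQ-2"})
print(at_risk)  # ['REQ-1', 'REQ-3']
```

With the links recorded, the maintainer can see before making a change which other requirements share the module being edited. With only the source code, that same list has to be rediscovered by reading.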

The sad fact is that the majority of shops do not have even inaccurate design documentation that reflects their production systems. There are a few reasons for this, but in the end it is always indicative of poor management. First, there must be management discipline to enforce the delivery of design documentation with a system. Second, the professional staff must have the skills and discipline to create usable documentation. Third, the tradition of looking upon maintenance programmers as something less than developers is short-sighted when viewed from the perspective of the entire system life cycle.

If the development and maintenance staffs are part of the same organization, management has an excellent opportunity to evolve a site standard for the design documentation that will be most helpful to the maintenance organization, and doing so is a straightforward management task. Peer reviews by the maintenance staff of the documentation to be turned over, and the empowerment of the maintenance organization to resist turnovers that are incomplete or inaccurate, will allow the maintenance organization to provide the kind of efficient and quality service that is expected of them.

If the development organization is separate from the maintenance organization, many other problems occur. Often development is outsourced to a professional organization. While they will offer a high-value service, they will be constrained by the terms of their contract. Sadly, the needs of the maintenance and operational organizations are often not given proper thought, if they are considered at all. Yet the statistics show that many systems have extended production and maintenance periods and that the money spent in those periods far exceeds what is spent in the initial development. Providing artifacts that reduce the cost and increase the efficiency of these processes is enlightened self-interest. Assuming the maintenance organization has a standard way of documenting its system design, the development organization should be contracted to provide an acceptable product at delivery.

Since the beginning of system development, the emphasis has always been on the finished, working product with no regard for the artifacts associated with it. Few shops even have a filing system in place to keep the products of the system development in a way that allows their review. More often than not, those products are boxed and remain in the project manager's office until he leaves the organization and are then discarded with no review.

This emphasis on the end-product alone extends to management decisions regarding what is important when a project inevitably is pressed to deliver and the schedule has slipped. Rarely will the staff be driven to deliver documentation contemporary with the product. This separation of the product and the documentation subverts the review process (if it exists) and diminishes the quality of that product. Often errors in the documentation are only caught much later when the maintenance staff must use it to perform their work.

Since the design artifacts are never directly delivered, they often depart from the as-built system. Without the professional and management commitment to create and preserve high-quality design documentation, it will not happen. This is due as much to a lack of professionalism within the management of the maintenance staff as to anything else. A seasoned maintenance staff will push for and receive good documentation that will support their job function.