A common problem in any software project is failure and how to handle it properly. The term "failure" is ambiguous, so to be clear: failure is when something unexpectedly stops working. A software bug is not failure. (A bug is flawed logic or problem solving that produces unexpected outcomes.) Failure is not a product that doesn't sell. Failure is when something outside the program stops working. If the network fails, the hard drive fails, or the memory fails, that's failure. At the largest scale, failure is the destruction of an entire server farm.
Designing for failure requires a different mindset. If a database fails, what happens? What techniques keep the system operating despite failures? Is there a central method of detecting and reporting errors, one that produces understandable messages so users aren't left confused? If a web server suddenly can't access the hard drive, what steps does it take? All of these questions, and more, must be considered.
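As a sketch of what central error reporting can look like, here is a minimal Python example; the handler, the message table, and the save_record function are hypothetical illustrations, not from any particular project. It logs full details for operators while returning a plain-language message to the user.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")

# Map low-level failures to messages a user can actually act on.
# ConnectionError is listed first because it is a subclass of OSError.
USER_MESSAGES = {
    ConnectionError: "The service can't reach the network. Please try again shortly.",
    OSError: "A storage problem occurred. Your work was saved locally; please retry.",
}

def report_failures(func):
    """Central error handler: full detail for operators, plain language for users."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            log.exception("Failure in %s", func.__name__)  # traceback for diagnosis
            for exc_type, message in USER_MESSAGES.items():
                if isinstance(exc, exc_type):
                    return message
            return "Something went wrong. The problem has been reported."
    return wrapper

@report_failures
def save_record(record):
    raise ConnectionError("database unreachable")  # simulated failure

print(save_record({"id": 1}))
```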
Redundancy is the key to handling most failure. Never allow a single point of failure in a system that must remain running. If a service normally runs on one server, run it on two. This takes more work to design and build, but it produces reliable software. Redundancy also applies to hardware. A server can have multiple power supplies, multiple network connections, and multiple processors; if one part fails, the others keep going. Note that the two trade off: a cluster of redundant servers means each individual server needs less internal hardware redundancy, and a heavily redundant server reduces the need for duplicate servers. To handle natural disasters, use location redundancy. Host the software on multiple server farms, so if one farm is wiped out, the other continues.
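From the client's side, server redundancy can be as simple as the following sketch, assuming two interchangeable endpoints; the hostnames are placeholders. The client tries the primary server and fails over to the secondary if it's unreachable.

```python
import urllib.request
from urllib.error import URLError

# Hypothetical redundant endpoints; in a real system these come from configuration.
ENDPOINTS = [
    "https://primary.example.com/status",
    "https://secondary.example.com/status",
]

def fetch_with_failover(endpoints, timeout=2.0):
    """Return the first successful response; fail over on connection errors."""
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except URLError as err:
            last_error = err  # endpoint down: move on to the next one
    raise RuntimeError(f"all {len(endpoints)} endpoints failed") from last_error

try:
    print(fetch_with_failover(ENDPOINTS))
except RuntimeError as err:
    print(err)
```

In production the endpoint list would come from configuration or service discovery, but the shape of the logic stays the same: no single server's failure stops the request.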
Once a severe failure happens, recovery often requires new hardware and a fresh installation of software. If data is involved, it has to be copied from a backup. Or, if the system was written by good programmers, data redundancy is built in: add a new server, and it automatically copies the data over when it connects. There is no universal solution here; every project requires its own analysis and design. On occasion, third-party backup and restore tools are the best fit, but third-party tools can be another source of failure. Tape drive backups are a classic example: getting through eighteen of twenty backup tapes only for tape nineteen to fail is a common horror story. Third-party failures apply to software tools as well. If a library, framework, or toolset your software depends on suddenly loses support, your software is affected. Or a third-party update breaks your software, a common cause of all-nighters.
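One way to avoid backup horror stories is to verify backups on a schedule rather than trusting them at restore time. Here is a minimal sketch, using a hypothetical manifest format: record a checksum for each backup file when it's written, then re-check the files periodically.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large backups don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(backup_dir: Path) -> None:
    """At backup time, record a checksum for every file in the backup."""
    manifest = {
        p.name: sha256_of(p)
        for p in backup_dir.iterdir()
        if p.is_file() and p.name != "manifest.json"
    }
    (backup_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))

def verify_backup(backup_dir: Path) -> list[str]:
    """Return the names of files that are missing or corrupted."""
    manifest = json.loads((backup_dir / "manifest.json").read_text())
    bad = []
    for name, expected in manifest.items():
        path = backup_dir / name
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(name)
    return bad
```

A checksum mismatch found during a routine check is an inconvenience; the same mismatch found mid-restore is a catastrophe.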
Third-party tool failures are quite common. Suppose a product is written against a particular toolset provided by a popular web hosting company. If that hosting company goes under, or decides it will no longer host your software, a significant rewrite may be required. That's why experienced developers never use custom host-specific tools, however convenient they are. New programmers make this mistake all the time, and even when a company has policies against it, they'll often ignore those policies. The only way to catch the problem is to periodically set up the software on a fresh server and confirm it still works.
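That periodic check can be automated. Below is a minimal smoke-test sketch, assuming the software exposes a status endpoint; the URL and port are hypothetical. Deploy to a clean server and run the test: anything that quietly depends on host-specific tools tends to fail loudly here.

```python
import sys
import urllib.request

# Hypothetical address of the freshly deployed instance.
TARGET = "http://clean-test-server.example.com:8080/health"

def smoke_test(url: str) -> bool:
    """Confirm the software runs on a server with no host-specific extras."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError as err:
        print(f"smoke test failed: {err}")
        return False

if __name__ == "__main__":
    sys.exit(0 if smoke_test(TARGET) else 1)
```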
To deal with failure properly, software must be designed for it, libraries must be selected with it in mind, and hardware requires planning. Not all projects require high uptime: simple games or informational websites don't need heavy redundancy. Other projects, particularly in the aircraft, military, space, and medical fields, must be designed to minimize failure, because there failure means loss of life. Even so, this doesn't mean eliminating all possibility of failure. Typically the loss of a life is assigned a dollar value, which lets engineers use equations to determine how much redundancy is justified. It sounds inhumane, but everything has a cost, and the cost of redundancy may exceed the cost of failure.
This dollar value is nearly universal. Consider factory hardware and software. If a factory line shuts down due to equipment failure, it costs a certain amount per hour, and factories have good accountants who can calculate that figure accurately, which makes this a good example. Suppose a line failure costs $10,000 per hour. To make the line less prone to failure, more expensive machines can be used. If the cost difference is $50,000, the upgrade pays for itself once it prevents more than five hours of downtime, so it's probably a good idea. However, if it costs $50,000,000 to improve the line, it would have to prevent 5,000 hours of downtime to pay for itself, which is not cost effective. Cost compared to savings always determines the level of redundancy.
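The same arithmetic applies to any project. A minimal sketch, using the factory numbers above:

```python
def break_even_hours(upgrade_cost: float, downtime_cost_per_hour: float) -> float:
    """Hours of downtime the upgrade must prevent before it pays for itself."""
    return upgrade_cost / downtime_cost_per_hour

# Factory example from the text: downtime costs $10,000 per hour.
print(break_even_hours(50_000, 10_000))      # 5.0 hours: likely worth it
print(break_even_hours(50_000_000, 10_000))  # 5000.0 hours: almost certainly not
```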
When dealing with a software project, first calculate the downtime cost: how much will it cost if the software suddenly stops working? (Wages, lost revenue, repair costs, and so on.) Balance that against the cost of software development and hardware redundancy. The comparison will guide the project's design.
Intelligently designed software, also called elegant design, can greatly help with this. Suppose you're writing a vertical business app. These are common because every business has unique requirements, which makes for many vertical applications. Junior developers stick to what they know and build with that. Intermediate developers look at the business requirements and choose software, like databases, that meets them. Senior developers consider more factors: redundancy, backup and restore, disaster recovery, and more. Beyond that, experts approach each project from first principles. For a small business, an expert might build a peer-to-peer database that auto-synchronizes. Adding a new workstation is easy: after installation, it connects to the other systems and synchronizes automatically. If the network fails, changes are saved locally and synchronized when the network returns. The software can also be installed on a computer offsite, such as in the owner's home, for disaster recovery. If the entire office is destroyed by a tornado, they install the software on new computers and it automatically syncs the data. Nothing is lost. There is no single point of failure, and the design intrinsically handles failure and disaster recovery. That is an elegant solution.
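To make the synchronization idea concrete, here is a toy sketch assuming last-write-wins conflict resolution on timestamped records; a real peer-to-peer database also needs clock handling, deletion records, and conflict auditing. Each node accepts writes locally, so a network failure never blocks work, and nodes merge whenever they can reach each other.

```python
import time

class Node:
    """A toy offline-first store: local writes always succeed, peers merge later."""

    def __init__(self, name: str):
        self.name = name
        self.store: dict[str, tuple[float, str]] = {}  # key -> (timestamp, value)

    def write(self, key: str, value: str) -> None:
        # Writes go to local storage first, so a network failure never blocks work.
        self.store[key] = (time.time(), value)

    def merge_from(self, peer: "Node") -> None:
        # Last-write-wins: keep whichever copy of each record is newer.
        for key, (ts, value) in peer.store.items():
            if key not in self.store or ts > self.store[key][0]:
                self.store[key] = (ts, value)

    def sync(self, peer: "Node") -> None:
        # Two-way merge; afterward both nodes hold identical data.
        self.merge_from(peer)
        peer.merge_from(self)

office = Node("office")
home = Node("home")                 # offsite copy for disaster recovery
office.write("invoice:1001", "draft")
home.write("invoice:1002", "paid")  # written while the network was down
office.sync(home)                   # network returns: both records on both nodes
assert office.store.keys() == home.store.keys()
```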