Reliable system design is the design of systems with high levels of reliability and availability.
No system is perfectly reliable, and reliable systems engineering cannot engineer out failure modes that were not anticipated in modelling. For this reason, reliable systems are generally engineered to a designed failure rate rather than to a zero failure rate.
Typical design targets for reliable systems are expressed as availability figures, such as "five nines" (99.999% availability) and "six nines" (99.9999% availability). Some life-critical systems are designed to even higher levels of performance.
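To make these figures concrete, the following minimal sketch converts a number of "nines" into a yearly downtime budget. The function name allowed_downtime_seconds and the choice of a 365-day year are assumptions made for illustration only.

```python
# Illustrative sketch: downtime budget implied by an availability target.
# Assumption: a non-leap year of 365 days is used as the reference period.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int, period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime allowed in the period for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. five nines -> 0.00001
    return period_seconds * unavailability


for n in (5, 6):
    seconds = allowed_downtime_seconds(n)
    print(f"{n} nines: about {seconds:.1f} seconds of downtime per year")
```

Under these assumptions, five nines allows roughly 315 seconds (a little over five minutes) of downtime per year, and six nines allows roughly 32 seconds.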
Reliable system design attempts to create reliable systems by design, rather than by blindly over-engineering systems. The analytical tools for reliable systems design include root cause analysis. Systems commonly designed in this way include databases, communications networks and air traffic control systems.
However, massive redundancy alone does not make a system reliable as long as even one single point of failure remains in the system. For example, a network where power feeds, network connections, routers, and router interconnections have all been correctly made redundant can still have a single point of failure if both routers are housed in a single rack, allowing a single spilled cup of coffee to take out both routers at once.
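The rack example can be sketched as a simple dependency check: if every member of a redundant group depends on the same resource, that resource is a single point of failure for the group. The data model, the function name single_points_of_failure, and the component names below are hypothetical and chosen only to mirror the example.

```python
from typing import Dict, Set


def single_points_of_failure(groups: Dict[str, Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """For each redundant group, return the resources that every member depends on.

    `groups` maps a group name to its members, and each member to the set of
    resources (power feed, uplink, rack, ...) it needs in order to work.
    """
    result: Dict[str, Set[str]] = {}
    for group_name, members in groups.items():
        dependency_sets = list(members.values())
        # A resource shared by all members defeats the redundancy of the group.
        shared = set.intersection(*dependency_sets) if dependency_sets else set()
        if shared:
            result[group_name] = shared
    return result


# Hypothetical network: power and uplinks are redundant, but the rack is shared.
network = {
    "routers": {
        "router-a": {"power-feed-1", "uplink-1", "rack-7"},
        "router-b": {"power-feed-2", "uplink-2", "rack-7"},
    },
}

print(single_points_of_failure(network))  # {'routers': {'rack-7'}}
```

A check of this kind only finds shared dependencies that have been written into the model, which is consistent with the point above that unanticipated failure modes cannot be engineered out.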
Note that even eliminating every conceivable single point of failure is not by itself enough to make a system truly resilient, as the extra redundancy may make the system vulnerable to Byzantine failure modes.
See also: extreme value theory