Counting nines is an appealing way to measure raw availability, but it does not fully capture availability needs. A five-nines system allows about 5¼ minutes of downtime per year. However, a system that could be down for 5¼ consecutive minutes would be totally unacceptable as a carrier of emergency 911 phone traffic, for example, even if such an outage happened only once per year. Such systems have service level agreements (SLAs) that strictly limit the amount of continuous downtime.
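The arithmetic behind the yearly downtime budget, and the way a continuous-downtime limit can fail a system that is still within that budget, can be sketched in a few lines of Python. This is only an illustration: the function names, the choice of a 365.25-day average year, and the one-second continuous limit in the usage note are assumptions, not values taken from any particular SLA.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year including leap days


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR


def meets_sla(outage_minutes, nines, max_continuous_minutes):
    """Check a year's outages against both the total budget and a
    hypothetical limit on the length of any single outage."""
    within_budget = sum(outage_minutes) <= allowed_downtime_minutes(nines)
    within_continuous = all(o <= max_continuous_minutes for o in outage_minutes)
    return within_budget and within_continuous
```

Under these assumptions, `allowed_downtime_minutes(5)` gives a budget of a little over 5¼ minutes. A single 5-minute outage fits inside that budget, yet `meets_sla([5.0], 5, 1 / 60)` is false because the outage exceeds the assumed one-second continuous limit, whereas a hundred outages of 0.6 seconds each pass both checks.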
A telco or data network application usually requires that each outage be limited to less than a second. This is well below the time needed to reboot an application stack made up of an operating system, middleware, database managers and applications. These systems must rely on redundant, hot-standby components that can be brought into service within tens of milliseconds of the detection of a fault, with minimal or no loss of data.
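One common way to detect a fault quickly enough for this kind of switchover is a heartbeat with a timeout. The sketch below is a deliberately simplified illustration, not a description of any specific product: the class name `FailoverMonitor`, the 30 ms timeout used in the usage note, and the policy of never failing back automatically are all assumptions. Timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
class FailoverMonitor:
    """Tracks heartbeats from a primary and promotes the hot standby
    when the primary has been silent for longer than the timeout."""

    def __init__(self, timeout_ms: float):
        self.timeout_ms = timeout_ms      # hypothetical detection window
        self.last_heartbeat_ms = 0.0
        self.active = "primary"

    def heartbeat(self, now_ms: float) -> None:
        """Record that the primary reported in at time now_ms."""
        self.last_heartbeat_ms = now_ms

    def poll(self, now_ms: float) -> str:
        """Return the component that should carry traffic at time now_ms."""
        silent_for = now_ms - self.last_heartbeat_ms
        if self.active == "primary" and silent_for > self.timeout_ms:
            self.active = "standby"       # fail over; no automatic failback
        return self.active
```

With `FailoverMonitor(timeout_ms=30)` and a last heartbeat at 25 ms, polling at 50 ms still reports the primary, while polling at 60 ms switches traffic to the standby. In a real system the timeout has to be balanced against false positives, and the standby must already hold current state for the switchover to avoid data loss.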
Critical systems like these strive for "service continuity", in which downtime is imperceptible to the system user. Because these levels of availability exceed the reliability of individual hardware and software components, the systems must repair themselves on the fly, using standby components. In this scenario, failed components are replaced or repaired offline while the application continues operating on redundant hardware. The key to achieving a high score in service continuity is therefore minimizing the time needed for the system to fail over onto the surviving hardware. This failover time becomes the mean time to repair (MTTR) of the system, if we ignore the very unlikely event of multiple simultaneous failures. (Note that such multiple failures cannot be ignored in life-critical systems.)
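The effect of treating failover time as the system's MTTR can be illustrated with the standard steady-state relation, availability = MTBF / (MTBF + MTTR). The sketch below uses assumed figures chosen only for illustration: a mean time between failures of 1,000 hours, a 5-minute reboot, and a 50-millisecond failover. None of these numbers comes from a measured system.

```python
import math


def unavailability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is down: MTTR / (MTBF + MTTR)."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def nines(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability expressed as a (fractional) count of nines."""
    return -math.log10(unavailability(mtbf_hours, mttr_hours))


MTBF_HOURS = 1000.0              # assumed mean time between failures
REBOOT_HOURS = 5.0 / 60.0        # assumed 5-minute reboot of the full stack
FAILOVER_HOURS = 0.050 / 3600.0  # assumed 50 ms failover to a hot standby

reboot_nines = nines(MTBF_HOURS, REBOOT_HOURS)
failover_nines = nines(MTBF_HOURS, FAILOVER_HOURS)
```

With these assumed inputs, recovering by reboot yields roughly four nines, while recovering by failover yields more than seven, even though the underlying components fail just as often in both cases. The unavailability is computed directly rather than as one minus the availability, which avoids losing precision when the availability is very close to 1.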


