AVAILABILITY AND RELIABILITY ~ SOFTWARE ENGINEERING

AVAILABILITY AND RELIABILITY:
System reliability and availability may be defined more precisely as follows:
1. Reliability: The probability of failure-free operation over a specified time, in a given environment, for a specific purpose.
2. Availability: The probability that a system, at a point in time, will be operational and able to deliver the requested services.

Availability and reliability are obviously linked as system failures may crash the system. However, availability does not just depend on the number of system crashes, but also on the time needed to repair the faults that have caused the failure. Therefore, if system A fails once a year and system B fails once a month then A is clearly more reliable then B. However, assume that system A takes three days to restart after a failure, whereas system B takes 10 minutes to restart. The availability of system B over the year (120 minutes of down time) is much better than that of system A (4,320 minutes of down time).

The disruption caused by unavailable systems is not reflected in the simple availability metric that specifies the percentage of time that the system is available. The time when the system fails is also significant. If a system is unavailable for an hour each day between 3 am and 4 am, this may not affect many users. However, if the same system is unavailable for 10 minutes during the working day, system unavailability will probably have a much greater effect.

System reliability and availability problems are mostly caused by system failures. Some of these failures are a consequence of specification errors or failures in other related systems such as a communications system. However, many failures are a consequence of erroneous system behavior that derives from faults in the system. When discussing reliability, it is helpful to use precise terminology and distinguish between the terms ‘fault,’ ‘error,’ and ‘failure.’ When an input or a sequence of inputs causes faulty code in a system to be executed, an erroneous state is created that may lead to a software failure.

There are 3 methods used to improve the reliability of a system. Those are:
1. Fault avoidance: Development techniques are used that either minimize the possibility of human errors and/or that trap mistakes before they result in the introduction of system faults. Examples of such techniques include avoiding error-prone programming language constructs such as pointers and the use of static analysis to detect program anomalies.
2. Fault detection and removal: The use of verification and validation techniques that increase the chances that faults will be detected and removed before the system is used. Systematic testing and debugging is an example of a fault detection technique.
3. Fault tolerance: These are techniques that ensure that faults in a system do not result in system errors or that system errors do not result in system failures. The incorporation of self-checking facilities in a system and the use of redundant system modules are examples of fault tolerance techniques.