
Monday, July 28, 2003

Mythical Safety

The basic standard in the industry is that the airplane and its systems should be designed such that catastrophic failure does not occur more than once in a billion flight hours. Consider that standard for a moment: to accumulate a billion flight hours, a single airplane would have to fly continuously, 24 hours a day, every day, for more than 110,000 years. Yet no jet fleet has demonstrated such freedom from catastrophic accidents. For example, the B737 fleet has amassed nearly 95 million flights, which equate to about one quarter of a billion flight hours - with nearly 50 accidents categorized as full loss equivalents (see http://www.airsafe.com). Another database places the number of hull loss accidents for the B737 fleet at 104 (see http://aviation-safety.net/database/type/103.shtml).
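
The arithmetic behind these figures is easy to check. The short Python sketch below uses only the numbers quoted above; the variable names are illustrative, the fleet total is the approximate "one quarter of a billion" hours, and the two loss counts are those attributed to the two databases. It is a rough comparison, not a formal accident-rate analysis.

```python
# Figures quoted in the article; the fleet hours are approximate.
HOURS_PER_YEAR = 24 * 365.25
TARGET_RATE = 1e-9        # design standard: catastrophic failures per flight hour

# Years of nonstop flying needed to log one billion flight hours
years_per_billion_hours = 1e9 / HOURS_PER_YEAR

fleet_hours = 0.25e9      # "about one quarter of a billion" B737 flight hours
losses_airsafe = 50       # "nearly 50" full loss equivalents
losses_asn = 104          # hull losses in the second database

rate_airsafe = losses_airsafe / fleet_hours
rate_asn = losses_asn / fleet_hours

print(f"Years of nonstop flying per 1e9 hours: {years_per_billion_hours:,.0f}")
print(f"Observed rate (50 losses):  {rate_airsafe:.1e} per flight hour")
print(f"Observed rate (104 losses): {rate_asn:.1e} per flight hour")
print(f"Multiple of the 1e-9 standard: "
      f"{rate_airsafe / TARGET_RATE:.0f}x to {rate_asn / TARGET_RATE:.0f}x")
```

On these numbers the demonstrated loss rate is a few hundred times the one-in-a-billion figure, which is the gap the rest of this piece tries to explain.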

Lu Zuckerman, an experienced reliability and maintainability engineer, explains that the disparity between theoretical safety and the demonstrated level of safety has its roots in the way safety is assessed by system, not necessarily for the whole airplane. His discourse is the product of a recent exchange of electronic messages, in which Zuckerman was asked to expound on his thesis:

"The mythical failure rate of 10-9 (one in a billion) can be addressed two ways. The FARs (Federal Aviation Regulations) require that a single point failure that can contribute to the loss of an aircraft can occur no more frequently than 10-9 and if at all possible should be designed out. The 10-9 figure that most people quote does not apply to the aircraft level but, instead, it applies to the system failure that can cause loss of the aircraft. The FARs and JARs (Europe's Joint Aviation Regulations) will specify the acceptable frequency of a system failure with effects ranging from a minor problem to loss of the aircraft. The upper limit is usually 10- 9.

"The analysis relies on the manipulation of numbers. If the regulations specify that a flap or slat system can lock up no more frequently than 10-6 (one failure in one million hours of exposure), the reliability engineer is forced to use non-realistic failure rates for the individual components that can cause lock-up, as there may be several hundred in the respective system whose failure can result in lock-up. Where do these failure rates come from? Mostly from government developed databases that contain several hundred items that may or may not be used on aircraft. In some cases the failure rate will have an upper, a median and a lower level of confidence. The analyst is free to pick whatever confidence level best fits the calculation and ultimately arrives at the desired failure rate. If the requirement is 10-9 for runaway or non-movement when flap/slat operation is commanded, then the search for usable failure numbers becomes even more ridiculous.

"So now after making the reliability calculation using non-realistic numbers the reliability engineer passes them to the systems safety engineer. The systems safety engineer then creates a FTA (fault tree analysis) which is made up of gates, the most common of which are 'AND' gates and 'OR' gates. The diagram is from the top down, meaning that the top gate is the actual failure resulting in breaching the 10-9 requirement.

"The top gate is connected to the lower gates by connecting lines, or failure paths. The failure paths leading to the top gate come from either the 'AND' gates or 'OR' gates, and each of these gates represents a failure that can lead upward to the breach in the 10-9 requirement. There can be as many gates of both kinds to reflect the system complexity. There may be as many FTAs as required to reflect all of the services that supply the system, such as hydraulics, electrical and electronics and the hardware elements of the system.

"Imagine the gates as being locks. On an 'OR' gate there can be several failures, each of which is a key to the lock and any one of these failures can pass through that gate. On an 'AND' gate, each of the failures is a key to the lock but all must be present in order for the collective failure to pass through the gate.

"This is a simplification but [is] easy to understand. Let's assume that an 'OR' gate has five failures, each of which can open the lock. The math is (1 x 10-6) + (1 x 10-6) + (1 x 10-6) + (1 x 10-6) + (1 x 10-6), with a result of 5 x 10-6 (five failures in one million hours of exposure).

"Using the same numbers, consider an 'AND' gate. The math is (1 x 10-6) x (1 x 10-6) x (1 x 10-6) x (1 x 10-6) x (1 x 10- 6), with a result of 1 x 10-30.

"These calculations are unrealistic because they bear little, if any, relevance to the operating environment or to actual recorded failures. Let me explain. Because the mean time between failure (MTBF) is dictated by the certification authorities at the system level, the MTBFs for system elements and parts thereof are apportioned downward. This means that failure rates at the piece part level must be selected in order to attain the necessary failure rate at the component level. In this way the ultimate number will meet the MTBF requirement but the original part failure rate has nothing to do with the aircraft application.

"This is not true for electronics because of the millions of histories generated for all types of avionics circuits and components.

"Here is the kicker. The FTAs are for systems and not the aircraft. Each FTA terminates in assessing the probability of failure of the specific system. This process should be carried one step further by making a FTA with an 'OR' gate representing the aircraft, with each of the systems feeding into that gate. Having a final 'OR' gate will provide a truer picture of the catastrophic failure rate at the very top level. Because it is an 'OR' gate, one would most likely come up with a catastrophic loss rate in the area of 1 x 10-8 (one in 100 million hours of exposure) or possibly lower - not 10-9 - which more truly reflects the crash rate of commercial aircraft. People fixate on the 1 x 10-9 failure rate thinking it is at the aircraft level when it is in fact at the system level.

"However, the Federal Aviation Administration does not require this assessment at the aircraft level. So much for safety."

Support for Zuckerman's argument comes from a 2001 presentation at the National Aeronautics and Space Administration's System Safety Center irreverently titled "A Charlatan's Guide to Quickly Acquired Quackery," subtitled "The Trouble With System Safety." The author of this paper, one P. L. Clemens, said, "The hazard inventory techniques ... view risk hazard-by-hazard. If individual hazards pose acceptable risk, system risk is judged acceptable. So, a large inventory of individual hazards can be disguised as a 'safe' system - even though in reality it may portend a grim disaster!"

A timely illustration of a final 'OR' gate to assess safety at the aircraft level is contained in a March 12 joint letter from the Aerospace Industries Association (AIA) and the General Aviation Manufacturers Association (GAMA) to the Aging Transport Systems Rulemaking Advisory Committee (ATSRAC). The letter's authors argued that the ATSRAC effort to define wiring - more precisely the electrical wiring interconnection system, or EWIS - as a separate system "effectively doubles the risk to the fleet." In effect, the letter objects to the final 'OR' gate Zuckerman recommends for safety assessment at the aircraft level. >> Zuckerman, e-mail rmspdq@sprint.ca <<

Doubling the Risk

"If the term Extremely Remote [a reference to the one in a billion failure] is misinterpreted as a quantitative requirement, then by considering wiring separately under [advisory circular] 25.1705 this may effectively increase the risk to the airplane fleet.

"For example, under current [advisory circular] 25.1309(b) the wiring is considered as part of the overall system. That system, including wiring, would have to meet 1 x 10-7 (1 in 10 million) for a hazardous failure condition. However, if [the] wiring system is considered as an entity in itself separate from the other aspects of the system then it would be acceptable for the EWIS failures to meet 1 x 10-7 per 25.1705 and the other part of the system to also meet 1 x 10-7 per 25.1309. This effectively doubles the risk to the airplane fleet."

Source: AIA/GAMA letter of March 12 to ATSRAC, Enclosure 1, p. 3
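
The letter's "doubling" is the same 'OR' gate arithmetic applied to two budgets. A minimal sketch, using only the 1 x 10^-7 figure from the letter and assuming each part is allowed to sit exactly at its limit (the variable names are illustrative):

```python
HAZARDOUS_LIMIT = 1e-7   # hazardous failure condition limit from the letter

# Wiring assessed as part of the system under 25.1309: one shared budget
combined_budget = HAZARDOUS_LIMIT

# Wiring split out under 25.1705: EWIS and the rest each get a full budget
ewis_budget = HAZARDOUS_LIMIT
rest_of_system_budget = HAZARDOUS_LIMIT
split_budget = ewis_budget + rest_of_system_budget

print(f"Combined assessment: {combined_budget:.1e}")
print(f"Separate assessments: {split_budget:.1e}")
print(f"Ratio: {split_budget / combined_budget:.1f}")
```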