Safety: Safety-Proofing Software Certification

By David Evans | January 1, 2006
The nose pitch-up led to an emergency airworthiness directive (AD), but the tale is far from over and gets to the heart of software certification.

The Aug. 1 case involves a Malaysia Airlines B777-200 that experienced what is ostensibly a glitch from faulty software at 38,000 feet, or flight level (FL) 380. The word "glitch" may unnecessarily downplay the problem, which, by the account in the Australian Transport Safety Bureau (ATSB) preliminary report of investigation, released Sept. 16, was quite hairy. The airplane was on an early evening flight from Perth to Kuala Lumpur. About 30 minutes after takeoff, according to the ATSB account, "The primary flight display (PFD) speed tape…indicated that the aircraft was approaching the overspeed limit and the stall speed limit simultaneously. The aircraft pitched up and climbed to approximately FL 410 and the indicated airspeed decreased from 270 knots to 158 knots. The stall warning and stick shaker devices also activated."

Accordingly, "The pilot in command (PIC) disconnected the autopilot and lowered the nose of the aircraft."

However, according to the ATSB account, "The autothrottle commanded an increase in thrust, which the PIC countered by manually moving the thrust levers to the idle position. The aircraft pitched up again and climbed 2,000 feet. The PIC notified air traffic control (ATC) that they could not maintain altitude and requested a descent and radar assistance. The crew was able to verify with ATC the aircraft speed and altitude."

In other words, the cockpit crew could not trust their primary flight display (PFD). It was the fly-by-wire, software-generated version of the 1996 AeroPeru B757 accident, where the static ports were left taped over after an aircraft wash/wax, rendering the speed and altitude indications useless.

The flight data recorder (FDR) was pulled after landing at Perth. According to the ATSB, the FDR data showed that "unusual and instantaneous acceleration values were recorded in all three planes of movement [pitch, roll and yaw]."

"The acceleration values were provided by the aircraft’s ADIRU [air data inertial reference unit] and were used by the aircraft’s primary flight computer (PFC) during manual and automatic flight. The PFC compared the information from the ADIRU with the information from the standby air data and attitude reference unit (SAARU). During the occurrence, this comparison function reduced the severity of the initial pitching motion of the aircraft," the ATSB explained.

According to an Aug. 9 Boeing message to operators, "The flight crew should cross-check the standby instruments if there is any doubt as to the accuracy of the primary airspeed, altitude and attitude."

That’s helpful advice, but what happened here? One scenario suggests that the minimum maneuver and overspeed margins on the speed tape converged. The indicated airspeed was rock steady, but when the minimum-speed indication rose up the speed tape, the autothrottle behaved as though the aircraft were flying too slowly and repeatedly tried to increase power. The problem was with the indications, not the actual speeds: the ADIRU fault caused erroneous high- and low-speed limits to be displayed on the PFD. It appears that the software generated the erratic flight control behavior.

Until the fix is found, minimum equipment lists (MELs) have been altered. Specifically, the SAARU is no longer an "allowable" inoperative discrepancy, since it provides essential backup to the ADIRU.

And on Aug. 29 FAA issued an emergency airworthiness directive that said, in effect, pending a permanent fix, operators should revert to previously issued ADIRU software. (The flaws in the original software apparently were considered the lesser of two evils.)

Software proofing, it seems, is no more infallible a black art than the Food and Drug Administration’s testing of Vioxx, the anti-inflammatory agent, which, it now appears, can lead to heart attacks and strokes.

So, too, with the faulty ADIRU software; its use can lead to the equivalent of a heart attack in an airplane: the indication of an apparent stall.

The problem sounds like an error in a signal-select algorithm in the updated software release. A signal-select algorithm chooses which sensor outputs to use and which to avoid (as erroneous). The algorithms have become quite complicated. The complication results from two contrasting requirements: (1) not to fail a sensor for transient misbehavior, while (2) failing a sensor which has proven to be bad. In order to develop accurate algorithms, which are decision procedures, it is necessary to know in what ways sensor failures (transient and permanent) manifest themselves. Since there are many such ways, signal-select algorithms become complicated.
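
To make the tension between those two requirements concrete, here is a minimal sketch in Python. It is not the ADIRU's logic, which is not described in the ATSB account; it is a generic mid-value select with a persistence counter, and every name in it (SignalSelector, tolerance, strike_limit) is a hypothetical chosen for illustration. A sensor that disagrees with the median is ignored for that sample, but it is only declared failed after disagreeing for several consecutive samples.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Sensor:
    name: str
    strikes: int = 0       # consecutive out-of-tolerance samples so far
    failed: bool = False   # latched once strikes reaches the limit


class SignalSelector:
    """Illustrative mid-value select with a persistence check (hypothetical)."""

    def __init__(self, names, tolerance, strike_limit):
        self.sensors = [Sensor(n) for n in names]
        self.tolerance = tolerance        # allowed deviation from the median
        self.strike_limit = strike_limit  # consecutive misses before latching

    def select(self, readings):
        """Return one value from readings (one float per sensor, in order)."""
        healthy = [(s, r) for s, r in zip(self.sensors, readings) if not s.failed]
        if not healthy:
            raise RuntimeError("no healthy sensors remain")

        reference = median(r for _, r in healthy)
        agreeing = []
        for sensor, value in healthy:
            if abs(value - reference) <= self.tolerance:
                sensor.strikes = 0            # requirement 1: forgive transients
                agreeing.append(value)
            else:
                sensor.strikes += 1
                if sensor.strikes >= self.strike_limit:
                    sensor.failed = True      # requirement 2: latch a bad sensor

        # With only two healthy sensors that disagree, neither can be blamed;
        # fall back to their midpoint rather than pick one arbitrarily.
        return median(agreeing) if agreeing else reference
```

Even this toy version has to make judgment calls: how large the tolerance should be, how many samples count as "proven bad," and what to do when only two sensors remain and they disagree. A real algorithm must answer such questions for every failure mode the designers can anticipate, which is where the complication comes from.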

And that leads to a critical point. FAA has put out an emergency AD, saying that failure of the software could lead to loss of control. This is understood to mean "catastrophic failure." Software whose failure can lead to such an outcome is designated Level A under DO-178B, and it receives the most intense scrutiny, for it must be error-free. That’s the level against which the ADIRU software was certified, according to an FAA official (as opposed to, say, a cabin pressurization system, which is certified to a rigorous but less demanding standard, Level B, because its failure doesn’t necessarily prevent continued safe flight).

But each software release is essentially a new system, although it does not have to be recertified from the ground up. So how does the regulator ensure that the software "upgrade" has the same sort of quality attained with the original certification? Logically, one should go through the same rigmarole all over again. So finding out which software was at fault in this B777 incident, and what its certification level was, is key to understanding what happened. The incident is arguably the most significant software-related flight control anomaly yet experienced on a fly-by-wire aircraft in revenue service. If certified to Level A, how did this failure occur?