Commentary By Brent Sorensen, President, Universal Synaptics Corp.
"Digital averaging" is a critical yet innocuous sounding problem. It has had a deleterious effect on aircraft safety and reliability for decades under the guise of no fault found (NFF) where fully half of all pilot-reported system failures are never duplicated on the ground or repaired. It is probable that several mysterious or suspected wiring and flight control related incidents - and possibly crashes - have at least in part been due to this major testing void that hides or masks problems rather than reporting them.
Digital averaging, whether it arises inherently from digital processing or, more directly, from engineering design intended to remove unwanted system noise, also removes information concerning age-related failure mechanisms such as glitches or intermittency, which look just like noise to an electronic system. Therefore, instead of delivering prognostic information or critical data reflecting the high probability of an impending failure, digital averaging (in comparison to a raw analog signal) delivers a message that "all is well," when the opposite may be true.
The details
Some 30 to 40 years ago, practically all avionics and avionics measurement equipment used to test and diagnose aircraft systems relied on analog technology to read the various input sensors and control signals and to compute the various output functions. Analog systems delivered pretty fair results but had a high susceptibility to stray electronic noise and glitches. Since then digital technology - which works fairly well in "noisy" environments via on-off operation, periodic sampling and data averaging - has taken over progressively and has established itself in nearly every avionics and measurement function.
Part of the success of this technology is due to miniaturization and its inherent noise rejection.
The change of technology has been dramatic and impressive, except where test measurements must be taken of unexpected, randomly occurring, rapidly changing, real-world phenomena that just happen to look like noise. Such noise, as would be expected, is created by aging or dry solder joints; oxidized, corroded and loose connector pins and wiring crimps; poor grounding; or insulation breakdown that allows arc-tracking conductors to touch other conductors or aircraft structural components.
Digital measurements may be highly accurate if the source measured is stable and/or repetitive. "Aliasing" - the problem of missing data because it is linked in frequency to the sample rate - has been addressed in most newer digital test equipment. However, if the source is intermittent, noisy or unstable, it is an entirely different matter. To smooth these noisy and randomly changing signals, digital averaging is employed. The end result is that one might see a portion of the unexpected intermittent defect, or one might not see any of it, depending on the frequency and duty cycle of these glitches. A little reflection on how digital measurements are taken, as well as missing data on digital flight data recorders (DFDRs) that use "averaging," confirms this (see ASW, Aug. 5, 2002). The problem here is that "unwanted" signals that get averaged out by normal digital processing may be the critical indicators of an existing system flaw - or imminent failure. Digital processing does not have the capability to discriminate electrical (or electronic) nuance.
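The smoothing effect described above can be sketched in a few lines of Python. This is only an illustration: the 5.0-volt nominal level, the 1,000-sample averaging window and the 0.05-volt pass band are assumed values, not figures from any particular instrument.

```python
from statistics import mean

NOMINAL_V = 5.0      # assumed steady signal level (illustrative)
GLITCH_V = 0.0       # assumed reading during a momentary open circuit
WINDOW = 1000        # assumed number of samples averaged per displayed reading
TOLERANCE_V = 0.05   # assumed pass/fail band around nominal

def averaged_reading(samples):
    """Return the single value a hypothetical averaging meter would display."""
    return mean(samples)

# One sample in the window lands on a glitch; the other 999 are clean.
samples = [NOMINAL_V] * WINDOW
samples[417] = GLITCH_V

reading = averaged_reading(samples)
passes = abs(reading - NOMINAL_V) <= TOLERANCE_V
worst_sample = min(samples)

print(f"displayed reading: {reading:.3f} V, pass: {passes}")
print(f"worst raw sample:  {worst_sample:.3f} V")
```

Under these assumptions a complete momentary open circuit moves the displayed value by only 5 millivolts, well inside the pass band, even though one raw sample fell all the way to zero.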
Some in the industry have suggested that faster sampling rates can catch these minor glitches. This may be a false hope. In long runs of aircraft wire harnesses, the inherent capacitance between the suspect wires, surrounding circuits and metal aircraft surfaces will slow the meter's stimulus charge time and will limit sample rates to about 200 per second, and oftentimes much slower. Faster sampling is not a good solution when microsecond or even nanosecond disconnects are the targets of interest. These disconnects occur randomly, and one quickly arrives at a point where the digital measurements will exceed processing rates and soon one is right back to where one started, with data being discarded or averaged out for lack of enough storage capacity. It is either that or the testing process must temporarily shut down while processing of the data catches up. Of course, during that time failures could occur and go undetected.
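A rough sense of the odds can be had from simple arithmetic. The Python sketch below assumes evenly spaced, effectively instantaneous samples and a glitch whose start time is uniformly random; the function name is hypothetical and real instruments have a finite measurement aperture, so treat the numbers as order-of-magnitude only.

```python
def hit_probability(glitch_duration_s, sample_rate_hz):
    """Chance that one randomly timed glitch overlaps an instantaneous sample.

    Assumes evenly spaced, zero-width samples and a glitch whose start time
    is uniformly random; capped at 1 when the glitch outlasts the sample period.
    """
    return min(1.0, glitch_duration_s * sample_rate_hz)

SAMPLE_RATE_HZ = 200  # the approximate ceiling cited in the text

for label, duration in [("1 microsecond", 1e-6), ("1 nanosecond", 1e-9)]:
    p = hit_probability(duration, SAMPLE_RATE_HZ)
    print(f"{label} glitch: hit probability {p:.1e}, "
          f"about 1 in {round(1 / p):,}")
```

At 200 samples per second, a lone 1-microsecond disconnect would be caught roughly once in 5,000 occurrences under these assumptions, and a 1-nanosecond disconnect roughly once in 5 million.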
While the NTSB is understandably concerned about missing data as a result of "averaged" inputs to the DFDR, it also should be concerned about the source of these random glitches that likely would have been missed during any preflight, ramp, functional or depot testing as well. A simple analogy may help to illustrate the problem. Before a volcano explodes, tremors can be measured which help scientists predict when the main eruption likely will occur. The same principle applies to avionics suffering from age-related interconnectivity defects. They first become aberrant at low levels of activity and generally progress over time to higher levels of activity. However, at present digital instruments average these latent "tremors" right out of existence. The result is that important prognostic and diagnostic data is lost in the process and fault-finding then becomes the more costly exercise in post-failure diagnostics. At testing time, when one would like to know if a connector, crimp, splice, wire, solder joint, circuit breaker or other connectivity component is working properly, the unwanted spurious noise that under the right conditions could precipitate a full-blown system failure is simply filtered out by the technology and sometimes even the test programs. Especially in the early developing stages, digital measurement equipment simply cannot see all of these age-related failure modes. Technicians and test engineers, relying on the higher accuracy of digital instruments, walk away from the problem with a false belief in the safety and reliability of the systems they are testing.
These random intermittent glitches occur mostly in older aircraft as a result of the normal process of electronics aging, brought about by years of exposure to vibration, oxidation, heat cycling, spark-erosion, etc., which cause the micro-surface of the various connectivity elements to degrade gradually over time. The broadly held opinion that wiring and other electronics either work or do not, and that when they do not the result is a testable hard failure, is false. Connectivity components become intermittent, generally at a reduced level, long before they actually fail hard.
The decades-old NFF problem, where approximately 50 percent of all pilot-reported system failures are never duplicated or repaired on the ground, indicates the scope of the digital testing void. No test method can state that a unit is reliable and safe to fly without first testing for random intermittency with equipment that can actually detect it at the low levels necessary. Single channel, scanning, digital sampling equipment, the kind almost universally in use today, simply cannot.
Most automatic test equipment (ATE) used to test avionics, including wiring testers, will generally not report the defect to the operators unless failure occurs over several repetitive tests, essentially by becoming a semi-permanent hard failure. If at any time during these repetitive tests any measurement should test "good," the original failure data is discarded and the testing proceeds as if nothing had happened. The conceptual blind spot is programmed into the testing software in an effort to reduce what test engineers fear might be "false failures," which can result in additional testing and repair costs.
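The retest logic just described can be sketched as follows. This Python fragment is a hypothetical model of the decision rule only, not code from any actual ATE program; the function names and the "pass"/"fail" strings are assumptions made for illustration.

```python
def ate_style_verdict(results):
    """Hypothetical retest rule: report a fault only if every attempt fails."""
    return "FAIL" if all(r == "fail" for r in results) else "PASS"

def latching_verdict(results):
    """Alternative rule: any single failed attempt is kept and reported."""
    return "FAIL" if any(r == "fail" for r in results) else "PASS"

# An intermittent connection: fails on the first attempt, then tests good.
intermittent = ["fail", "pass", "pass"]
# A hard failure: fails on every attempt.
hard = ["fail", "fail", "fail"]

for name, results in [("intermittent", intermittent), ("hard", hard)]:
    print(name, ate_style_verdict(results), latching_verdict(results))
```

Both rules agree on the hard failure. They differ only on the intermittent unit, which the first rule passes and the second reports.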
Accident investigators as well as test technicians and engineers should never lose sight of the fact that an intermittent connection one moment, possibly seen as a small, one-time, inconsequential system glitch may, under the right stress the next moment, become a full-blown failure that could develop into an accident. This being the case, no level of intermittency should be tolerated.
With respect to the fatal Nov. 12, 2001, crash of American Airlines [AMR] Flight 587, involving an A300-600 twinjet, 12 incidents over the previous year in the airplane's rudder control system were reported, with four of those being NFF, and another NFF breaker reset just prior to takeoff. With the exclusive use of digital-based testing, any intermittent defects on the accident airplane, which would be expected after 13 years of extensive service, would have gone undetected.
Back to the future
Analog technology does have the advantage of not using sampling, but it has always had its own limitations, including susceptibility to noise - requiring filtering and signal conditioning likely to cause high-end data loss as well. In addition, frequency response limitations would make the sensing and recording of nanosecond pulses quite challenging - and perhaps unreliable in affordable analog equipment.
Nonetheless, avionics testing experts agree that the only way to find these random intermittent connectivity failures is to use massively parallel and analog-based equipment, which is sensitive over a smooth and uninterrupted continuum and capable of sensing all critical points all the time, and at a high level of sensitivity. The increased probability of being able to detect these random defects in a multiwire system is some three million times greater than is possible with single-point-at-a-time digital based testing.
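One part of the parallel-testing argument, the share of time each test point is actually being watched, can be illustrated with a short Python sketch. The harness sizes are hypothetical, and the model assumes equal dwell times with no switching overhead. It covers only this dwell-time factor and does not attempt to reproduce the three-million figure quoted above, whose derivation is not given here.

```python
def scanner_watch_fraction(n_points):
    """Share of the test time a one-point-at-a-time scanner spends on each
    point, assuming equal dwell times and no switching overhead."""
    return 1.0 / n_points

# A fully parallel monitor watches every point continuously.
PARALLEL_WATCH_FRACTION = 1.0

for n_points in (10, 100, 1000):  # hypothetical harness sizes
    scanned = scanner_watch_fraction(n_points)
    advantage = PARALLEL_WATCH_FRACTION / scanned
    print(f"{n_points:>5} points: scanner watches each {scanned:.1%} of the time; "
          f"parallel coverage advantage about {advantage:.0f}x")
```

The coverage advantage from dwell time alone grows in direct proportion to the number of points under test.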
If any of these intermittent problems were lurking in the Flight 587 accident aircraft, they likely were not being properly tested for anywhere in the aircraft or in the chain of avionics maintenance. All the relevant flight control system (FCS) and BITE functions (built-in test equipment) are digital.
Related concern
Federal Aviation Administration (FAA) directives attempt to deal with the intermittency problem by requiring "repetitive testing." It would be a great leap forward if the FAA also required a testing process capable of actually seeing these expected failures at a threshold below that which would be required to trigger an actual system failure. Generally, there are many more micro-breaks that will occur in an intermittent situation than major breaks sufficient to cause a system to fail. By vaguely calling for "repetitive testing," the FAA pretty much guarantees that repair is no more of a guarantee of reliability or safety than the original test that obviously let the latent condition slip past to begin with.
Key question
The term expected has appeared numerous times in the foregoing discussion as it relates to aging and intermittency. The NTSB expected certain data to be on the Flight 587 DFDR and it was not. Intermittency is expected when the FAA calls for repetitive testing. Test engineers are sufficiently concerned about expected false failures to program software routines that, on an initial failure, loop back and test the wiring or LRU again and again to eliminate any so-called "false failures." These "false failures" of course could as easily be real intermittent failures occurring in the products they are supposed to be testing.
Why, then, with this insidious and potentially dangerous type of defect so universally expected, does routine in situ testing continue to be done exclusively with digital equipment that cannot possibly be expected to detect these kinds of problems? After all, many government agencies, task forces, advisory committees and whatnot have spent considerable time looking into the aging wiring problem, and a great deal of money over a long period of time has been spent looking for answers to this testing problem.
There is a saying that "you can't dig a new hole by digging the same hole deeper." The phrase is directly applicable, as the "aging wiring" problem has been around for a long time, and not one testing equipment upgrade over the last 35-year period seems to have even made a dent in it. In fact, most "upgrades" striving to take advantage of the advances in accuracy that digital technology affords may actually have made the problem worse by giving false-positive indications of functionality and reliability - masking the effects of aging rather than reporting it.
Consider connector pins, which may be mechanically "rubbed" together thousands of times. The ensuing fretting buildup [of small insulating particles] will be measured periodically with precision equipment, typically until "one ohm" of resistance is measured. These connector pins will then be rated by the number of rubbing cycles necessary to reach the "one ohm" level. While the number of insertion cycles is useful for comparing the durability of various connectors, the level of "rubbing" at which they become intermittent or unreliable is unknown, but it is estimated to be considerably below the higher insertion ratings.
What's needed is a realistic testing plan that incorporates a generalized set of standards for the testing of age-related intermittency. The design engineer of a given system would not approve any components that were intermittent, and that "standard" should apply all the way through maintenance and test. Right now, with no published standard, the de facto upper limit of 1 microsecond would be a good starting point for this effort. Beware of the tendency to segregate intermittencies into categories to help in scheduling for repairs. The problem of randomness makes such an idea untenable. A brief glitch one moment may easily be the cause of a crash the next.
Cost benefit
Using massively parallel analog techniques, intermittency testing could be added to maintenance protocols for about $200 per year per aircraft, amortized over a five-year period. That's for the equipment, not the labor, but nonetheless the cost is less per year than many auto insurance rates per month. Such analog testing could be done during periodic overhauls (D checks). Portable testing equipment would facilitate trouble-shooting on an "as necessary" basis between overhauls. In any event, flight critical circuits should be tested periodically.
There is every reason to believe that this type of testing would contribute to profits rather than adding to an operator's maintenance overhead. With NFF rates of 50-90 percent the norm on aging aircraft, a great deal of money and effort presently are being wasted on testing that does not discover the intermittency.
Swapping out of components may result in a fault being cured, but might also disguise it - if the fault was in the unit's cannon-plug or in its wiring bundle. An unnecessary LRU swap-out can mean a shortage of one spare set and a bench-test overhaul cost at the very least. In the longer term, spares holdings have to be increased to cover such contingencies. The cure perhaps? Analog equipment tests directly for aging problems and finds them, making systems more reliable.
Line replaceable units (LRUs) also are ripe for more intensive testing at their repair depots. Reports from some LRU repair depots show NFF rates running more than 90 percent. When one looks at the sheer number of interconnections and sites inside these boxes that could become intermittent, it seems probable that the bulk of the overall NFF problem resides here.
The "digital averaging" technique and the NFF problem should be recognized for what they are:
1. Proof that airborne intermittency is as dangerous as wiring shorts and arcing, and can easily lead, at worst, to deadly accidents and, at best, to perplexing system aberrations.
2. Proof that testing methods employing digital averaging devices exclusively may be bogus and, like the problem of bogus parts, provide no proof of a component's reliability.
In any case, by eliminating NFF problems, the airlines could save half or more of their current avionics repair costs while enjoying the benefits of more flights departing on time and with a tremendous boost in safety, reliability, and confidence at a time when it is so badly needed.
Byline: Brent Sorensen spent 29 years working for the U.S. Air Force testing avionics and researching the causes and symptoms related to avionics aging and "No Fault Found" issues. Related to this effort he developed numerous testing and process improvements through the use of emerging computer and digital technology. Later in his career he worked on projects to reduce NFF and maintenance costs associated with fighter aircraft.
(ASW note: Sorensen's company manufactures analog testing technology, but the notoriously endemic and systemic nature of the NFF problem warrants discussion even from an advocate who has a business interest in a solution. It could be argued that intermittency and NFF are the same thing. Other views of Sorensen's thesis will be presented in next week's issue.) >> Sorensen, e-mail Brent.Sorensen@usynaptics.com <<
Syllogism of Risk
- Aging failure events manifest themselves as random intermittent failures for which there are no established testing protocols and no testing standards.
- With no testing standards, there is no testing.
- With no testing, there is no sustained reliability.
Source: Sorensen
Digital vs. Analog Analogy
"In a nutshell, the difference between digital and analog testing equipment/sensors parallels the difference between one of those three-way lamps that can be clicked to 75, 100 and 150 watts versus a rheostat where one can raise or dim the light smoothly over the full range of voltage (and read off the exact voltage with a sufficiently graduated dial). Or, why some music aficionados prefer analog vinyl records to CDs, in that the analog recording may capture the full uninterrupted range of musical notation vs. 'databursts.' "
Source: Lee Gaillard
Digital Averaging - The Smoking Gun Behind No Fault Found
In the realm of testing and diagnosing avionics equipment and failures, "higher" accuracy digital instruments are usually sought after as somehow being synonymous with higher quality. Quality in a measurement, however, may have more to do with delivering accurate and useful information than simply delivering more digits to the right of the decimal place.
A case in point may be the NTSB's (National Transportation Safety Board) investigation into the crash of Flight 587, an American Airlines [AMR] A300-600 that crashed Nov. 12, 2001, in New York, killing 265 persons.
The NTSB examined the flight data recorder (FDR) and discovered that expected information ... was missing on some channels due to what [was] described as "digital averaging." Digital averaging also is employed heavily in the testing and maintenance of avionics equipment as well as the aircraft's wiring.
The trusty old analog meter has been replaced by a highly sophisticated 8-digit measurement device of phenomenal accuracy. Like most things in life, good things - including accuracy - come with a penalty. The penalty for aging aircraft systems is that as accuracy has been increased, mostly through digital averaging, these measurement devices have lost their ability to see age related failure modes such as random intermittency or glitches. Inconsistencies due to aging (e.g., "poor contact") have simply been "averaged" out.
In nearly all cases, the more accurately a meter can split a fraction of an ohm or other quantity, the slower it is likely to operate and, therefore, the less likely it will be able to respond to intermittency or glitches in a meaningful way.
The two desired measurement goals, high speed and high accuracy, occupy opposite ends of the measurement bandwidth spectrum. Hitting either end has its own opposing penalty.
Digital instruments using averaging deliver high accuracy readings, but only when the signal being measured is itself steady. If the signal of interest is intermittent, generally due to the ravages of aging, you have no idea what the measurement result will be. The problem with digital measurements is that they are based on a fixed sampling rate, while the intermittency is occurring randomly. The "glitches" generated when the electromechanical connectivity elements break down momentarily simply cannot be guaranteed to occur in synchronization with the sampling pattern/measurement window. The end result is that you might catch the glitch if you are really lucky, you might catch part of it, or more likely you might catch none of it. Accuracy, then, as well as repeatability, in the presence of age-related intermittency, is a myth and the information delivered by the instrument is to some degree a lie.
The following example illustrates how random intermittency or glitches seen - or not seen - by one test instrument can make a huge and important live-or-die difference in the results. In the item being tested, 4.1-volt glitches were introduced randomly in time. The meter's inherent digital averaging has averaged the glitches right out of existence, as far as testing is concerned. The meter has completely missed the series of approximately 70 intermittent faults.
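A small simulation gives a feel for how such a result can arise. The Python sketch below is not a reconstruction of the actual test: only the 4.1-volt amplitude and the count of roughly 70 glitches come from the example, while the 1-microsecond glitch width, the 200-per-second sample rate, the 1-second test length and the 0-volt baseline are all assumptions.

```python
import random

BASELINE_V = 0.0        # assumed quiet level of the line under test
GLITCH_V = 4.1          # glitch amplitude from the example above
N_GLITCHES = 70         # approximate count from the example above
GLITCH_WIDTH_S = 1e-6   # assumed glitch duration
SAMPLE_RATE_HZ = 200    # assumed meter sampling rate
TEST_TIME_S = 1.0       # assumed length of the test

rng = random.Random(0)  # fixed seed so the run is repeatable
glitch_starts = sorted(rng.uniform(0.0, TEST_TIME_S - GLITCH_WIDTH_S)
                       for _ in range(N_GLITCHES))

def signal_at(t):
    """Voltage on the line at time t: baseline unless inside a glitch."""
    for start in glitch_starts:
        if start <= t < start + GLITCH_WIDTH_S:
            return GLITCH_V
    return BASELINE_V

n_samples = int(SAMPLE_RATE_HZ * TEST_TIME_S)
samples = [signal_at(k / SAMPLE_RATE_HZ) for k in range(n_samples)]

hits = sum(1 for v in samples if v == GLITCH_V)
reading = sum(samples) / n_samples   # what an averaging meter would display

print(f"glitches injected: {N_GLITCHES}")
print(f"samples landing on a glitch: {hits} of {n_samples}")
print(f"averaged reading: {reading:.4f} V")
```

With these assumed numbers the expected number of samples that land on a glitch is 70 x 200 x 0.000001, or about 0.014, so most runs catch none of them. Even a sample that did land on a glitch would shift the averaged reading by only 4.1/200, or about 0.02 volt.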
All that needs to be done to make this digital averaging problem go away is to continue to use digital-based equipment for the hard failures and add analog-based equipment for the NFF or intermittent failures.
Source: Sorensen
What Exactly Are We Testing?
There are two types of electronic failures: hard failures and intermittent failures.
Hard failures are detectable every time the unit is used or tested (e.g., blown internal fuse or failed capacitor); everything else may be considered intermittent.
Intermittent: an intermittent is any temporary deviation from nominal operating condition of a circuit or device. This definition encompasses the media-popular "short circuits" as well as the more numerous yet less understood "open circuits." There are three basic types of intermittents:
1. Engineering or design intermittents are often related to component interactions and specific timing events. These defects include switching transients, induced EMI (electro-magnetic interference), load changes, cross talk, leakage through circuit boards and conformal coatings, software bugs, or poor initial design (e.g., inadequate heat sinks).
2. Test voids involve malfunctions detectable by high-sensitivity testing but not detectable at a lower sensitivity of testing. This type of failure should be thought of more as a "hard failure" below the test equipment's threshold, appearing therefore as an "intermittent" in the testing program.
3. Connection intermittents are caused by a temporary change in a circuit's conductivity path. The root causes of these changes, or breaks, range from contact fretting to the more familiar types seen as loose (cold/dry) solder joints, oversized or worn connector pins, corroded or oxidized connections, noisy components, loose terminal screws and a host of similar causes. These defects can occur at any stage in a product's life. For example, fretting corrosion can be caused by small slippages of a connector's micro-contacting surfaces, triggered by as little as a few degrees of thermal expansion and exacerbated by the ever-present high frequency vibration in flight.
Source: Sorensen