How hardware failures and data misinterpretation can affect safety


Shortly before being gunned down by a robot played by Yul Brynner in the movie Westworld, a tourist is bitten by a mechanical snake. "That's not supposed to happen," he wails as the software failsafes built into a holiday camp meant to simulate the Wild West start to break down around him.

Later, technicians test the snake and conclude: "Logic circuits on the snake simply failed to respond." As the robots run riot, those in charge suddenly realise their risk analysis was way off the mark. The implication of Michael Crichton's film, like a number of his novels and movies, is that engineers failed to understand the complexity of the systems they built – and fell victim to the consequences.

We haven't yet seen intelligent robots go wildly out of control, but insidious and mystifying faults do plague safety critical embedded systems. These incidents occur not so much because the software goes out of control, but largely because of the unforeseen ways in which the virtual world interacts with the real world.

The natural assumption is that software will work correctly as long as it was written correctly: software does not wear out or age. In 2000, Bev Littlewood, of the Centre for Software Reliability at City University, proposed that software could be regarded as 'perfect' – that is, free of design faults. If it goes wrong, it is because the writers of the software did not account for some combination of inputs that might lead the algorithm to do something crazy. As a result, extensive simulation and, increasingly, formal verification are being applied to software in an attempt to prevent logic 'sneak paths' from making it through to production.

Microsoft now has tools, resulting from research at its labs, that perform static checks on device drivers to catch bugs. The Static Driver Verifier is a compile-time tool that uses a model of the Windows operating system and a set of interface rules. Microsoft recommends using the tool early and often: running it as soon as the basic skeleton of the code is ready, then continuing to use it to spot changes in behaviour as the device driver is fleshed out. Using a tool like this during development makes it easier to spot logic faults that might be introduced as the driver code becomes more complex, involving many more branches and loops. Experts such as John Rushby of the computer science laboratory at SRI International have recommended a move towards greater use of formal verification, allowing safety critical systems to be checked with much greater levels of automation and stopping logic circuits from doing the unpredictable.

But the assumption that software always runs the same way only holds true when the hardware that feeds it data and executes its instructions is 100% dependable. That's a reasonably safe assumption in desktop computers although, even then, there's a chance that a soft error in the memory array will bring an application – or even the operating system – to its knees. While server processors are gradually acquiring error-correcting memory circuits even for their growing on-chip caches, most desktop PCs dispensed with this kind of protection years ago in favour of cheaper memory modules.

Embedded systems, on the other hand, are far more vulnerable to hardware failures. The consequences can be far more serious than an application that needs to be restarted, and these systems are also exposed to failures in the peripherals that deal with the real world: the sensors and actuators. Carmakers and machinery designers have struggled for years with failing or dirty sensors upsetting the feedback loops used to control various systems.
For example, an optical sensor used to determine whether a protective shield is closed before a circular saw will operate can easily be fooled if it depends on ambient light to indicate the shield's state. Dirt on the sensor can make it seem as though the shield is closed when it is, in fact, not in the right place at all. An alternative sensor design is likely to give better results under real world conditions.

As a result of concerns over the interaction between software complexity and the vagaries of hardware, the number of safety related standards has gradually increased. An example is IEC 60730, which has been developed for the white goods market. It has a section devoted to software based control because of the way that mechanical systems and interlocks have gradually been moved over to systems based on microprocessors and microcontrollers (MCUs).

IEC 60730 has received a lot of attention from MCU makers because the needs of safety and the market for white goods are somewhat at odds. Ideally, embedded controllers in these machines would be redundant so that, if one fails, another is ready to take over and, at the least, put the appliance into a safe mode. The two devices do not run the same code: one acts as the supervisor for the other. Although it is possible to use a cheaper, simpler MCU as the supervisor and have it perform typical watchdog operations – to check for infinite loops, for example – this approach, naturally, adds cost and complexity to the design.

The standard allows for the case where a single controller is used and is tested extensively during manufacture, but this does not account for hardware that starts to fail in the field. So the option used most commonly is to have one MCU with either a built-in or external watchdog timer and to have it run self-checking code periodically.

IEC 60730 calls for a number of standard tests to be performed on MCU core circuits and peripherals. The registers and memory paths need to be checked for stuck-at faults, the clock needs to be tested to ensure it is running at the correct frequency, and A/D and D/A ports need to be checked for I/O faults. Although it is possible to use parity protection to test for register faults, such as stuck bits, hardly any MCUs feature built-in support for this, and adding it in software is expensive and time-consuming as it demands multiple reads for each data word. Instead, the usual technique is to run a small self-test routine – which should only take a few microseconds – periodically. Main memory checks will take longer, but running algorithms such as March C periodically will find static faults in writeable memory.

Stuck-at faults in the program counter can be tested by sampling its value on the stack during periodic interrupts, such as the usual timer interrupt. If the program counter does not change between interrupts, it indicates the code is caught in an infinite loop. A fully stuck program counter would stop the watchdog timer being refreshed, resetting the MCU. Because these functions are common to many MCUs, their vendors have developed libraries and code samples for incorporation by users into machinery expected to go through IEC 60730 certification.

Other tests are more application specific. A/D converters, for example, will tend to go through plausibility checks determined by knowledge of expected values: an A/D fed by a temperature sensor that sees a value outside a normal range of, say, -40 to 150°C can generally be regarded as faulty.
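To make the shape of these checks more concrete, the sketch below shows, in C, what a periodic self-test of this kind might look like: a simplified march-style test over a small RAM block, a liveness check on the main loop standing in for the program counter test, and a plausibility check on a temperature channel. The function names (adc_read_temperature_c, wdt_refresh, enter_safe_state) are placeholders invented for the example, not part of any particular vendor's IEC 60730 library, and a certified design would use the MCU vendor's class B library rather than code like this.

```c
/* Illustrative IEC 60730-style periodic self-checks for a single-MCU design.
 * The extern functions are hypothetical placeholders for platform code. */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define TEMP_MIN_C   (-40)   /* plausible range for the temperature sensor */
#define TEMP_MAX_C   (150)

extern int16_t adc_read_temperature_c(void); /* hypothetical ADC driver call */
extern void    wdt_refresh(void);            /* hypothetical watchdog refresh */
extern void    enter_safe_state(void);       /* hypothetical: de-energise outputs, halt */

/* The main loop increments this; the timer interrupt checks it still moves. */
static volatile uint32_t loop_heartbeat;

/* Simplified march-style test over a reserved RAM block (word granularity).
 * A real class B library marches across all of RAM in segments, preserving
 * the application's data as it goes. */
static bool ram_march_test(uint32_t *block, size_t words)
{
    size_t i;

    for (i = 0; i < words; i++) block[i] = 0x00000000u;   /* up: write 0      */
    for (i = 0; i < words; i++) {                          /* up: read 0, write 1 */
        if (block[i] != 0x00000000u) return false;
        block[i] = 0xFFFFFFFFu;
    }
    for (i = words; i-- > 0; ) {                           /* down: read 1, write 0 */
        if (block[i] != 0xFFFFFFFFu) return false;
        block[i] = 0x00000000u;
    }
    for (i = 0; i < words; i++)                            /* up: read 0       */
        if (block[i] != 0x00000000u) return false;

    return true;
}

/* Plausibility check: a reading outside the physically possible range points
 * to a faulty sensor, wiring fault or converter problem. */
static bool temperature_plausible(void)
{
    int16_t t = adc_read_temperature_c();
    return (t >= TEMP_MIN_C) && (t <= TEMP_MAX_C);
}

/* Called from the periodic timer interrupt; assumes the main loop runs at
 * least once between ticks. */
void periodic_self_test(void)
{
    static uint32_t last_heartbeat;
    static uint32_t test_area[16];   /* small reserved block for the march test */

    /* If the main loop has stopped advancing, treat the program counter as
     * stuck: go safe and stop refreshing the watchdog so the MCU resets. */
    if (loop_heartbeat == last_heartbeat) {
        enter_safe_state();
        return;
    }
    last_heartbeat = loop_heartbeat;

    if (!ram_march_test(test_area, 16) || !temperature_plausible()) {
        enter_safe_state();
        return;
    }

    wdt_refresh();   /* everything checked out: feed the watchdog */
}

/* The application's main loop signals liveness once per iteration. */
void main_loop_iteration(void)
{
    loop_heartbeat++;
    /* ... normal control code ... */
}
```

A production design would go further: the CPU registers and flags are usually exercised directly in assembly language, and the march test walks across the whole of writeable memory rather than a reserved block.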
Switching to a voltage reference periodically can show up problems with the converters.

As device geometries shrink, aging effects will become more prominent in embedded systems, causing an increasing number of digital circuits to fail over time. One such effect is negative bias temperature instability: caused by voltage stress on a transistor, it shifts the threshold voltage, potentially to the point where the device no longer operates. These kinds of effects can be detected by watching the switching performance of a transistor degrade. An aging sensor monitoring a critical path can measure how the timing of that path changes over time and provide early warning of a potential failure, rather than waiting for it to happen. There is a clear area overhead incurred by aging sensors of this kind but, in advanced systems that depend on deep submicron semiconductor processes, the approach may prove to be a useful addition to software self-tests and redundancy in safety critical systems.

Hardware failures are not the only sources of problems. Assumptions about hardware behaviour, even when most of it is working correctly, can prove to be faulty. An example that Rushby uses in his analyses is of an Airbus A330 that, on its way from Singapore to Perth in autumn 2008, suddenly started pitching violently, throwing some passengers into the ceiling. Twelve were injured seriously before the aircraft was brought back under control and landed at the closest airport.

The culprits were three angle of attack sensors and the control algorithm that used their data (see fig 1). The designers chose three for redundancy in an attempt to ensure that spurious data from a single sensor could not throw a control loop into chaos. Sampled at 20Hz, each sensor's value was compared to the median of the three previous samples. If the difference for any sensor was found to be larger than a preset threshold for more than 1s, that sensor would be flagged as faulty and ignored for the rest of the flight. If the spike was short-lived, the system would simply use the previous mean angle of attack for 1.2s before it was set to reflect the current average value. To reduce the effects of turbulence, the values were also passed through a rate limiting algorithm that smoothed sudden changes. If all three sensors were deemed to be working properly, the mean of just two sensors would be used – one on each side of the A330's nose. The third sensor was on the right-hand side.

A typical angle of attack for this sort of sensor is around 2°. On this flight, the single left-hand sensor registered spikes of anywhere between 1° and 51°, the largest appearing just before the aircraft pitched down violently. However, the other two sensors appeared to function correctly. Testing of the left-hand sensor unit after the incident did not find a permanent problem. The maker of the subsystem found that, under specific circumstances, it was possible to generate an unwanted nose-down command: there had to be at least two short but strong spikes, where the first lasted less than 1s but the second was still present at least 1.2s after the detection of the first. Under these conditions, spikes would make it through the filtering algorithms and, through the combination of two effects, tell the autopilot to force the nose down by as much as 10°.
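To illustrate the kind of logic involved, and how a short, repeated spike can defeat it, the sketch below shows a much simplified single-channel filter in C built from the ingredients described above: a comparison against the median of the previous samples, a latched fault if the deviation persists, a hold on the last accepted value during a short spike, and a rate limiter on the output. It is emphatically not the flight control software; the structure, thresholds and names such as aoa_filter_step are invented for the example and follow only the description in this article.

```c
/* Simplified, illustrative angle-of-attack spike filter (one channel). */
#include <stdbool.h>

#define SAMPLE_RATE_HZ   20
#define SPIKE_THRESHOLD  3.0f   /* degrees of deviation; illustrative value */
#define FAULT_PERSIST_S  1.0f   /* deviation longer than this flags the sensor */
#define HOLD_TIME_S      1.2f   /* how long the previous value is substituted */
#define MAX_RATE_DEG_S   5.0f   /* output rate limit; illustrative value */

typedef struct {
    float history[3];       /* the three previous samples */
    float held_value;       /* last value accepted as good */
    float output;           /* rate-limited output */
    int   deviant_samples;  /* consecutive samples beyond the threshold */
    int   hold_samples;     /* samples left in the hold period */
    bool  faulty;           /* latched for the rest of the flight */
} aoa_channel_t;

static float median3(float a, float b, float c)
{
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

/* Process one 20Hz sample and return the filtered value for this channel. */
float aoa_filter_step(aoa_channel_t *ch, float sample)
{
    const float dt = 1.0f / SAMPLE_RATE_HZ;
    const float reference = median3(ch->history[0], ch->history[1], ch->history[2]);
    float candidate;

    if (ch->faulty)
        return ch->output;   /* flagged sensors are ignored from then on */

    if (sample > reference + SPIKE_THRESHOLD || sample < reference - SPIKE_THRESHOLD) {
        /* Deviation: substitute the previously accepted value for a while. */
        ch->deviant_samples++;
        ch->hold_samples = (int)(HOLD_TIME_S * SAMPLE_RATE_HZ);
        if (ch->deviant_samples > (int)(FAULT_PERSIST_S * SAMPLE_RATE_HZ)) {
            ch->faulty = true;           /* persistent fault: flag the sensor */
            return ch->output;
        }
    } else {
        ch->deviant_samples = 0;
        ch->held_value = sample;         /* remember the last good value */
    }

    candidate = (ch->hold_samples > 0) ? ch->held_value : sample;
    if (ch->hold_samples > 0)
        ch->hold_samples--;

    /* Rate limiter: smooth sudden changes to reduce the effect of turbulence. */
    if (candidate > ch->output + MAX_RATE_DEG_S * dt)
        ch->output += MAX_RATE_DEG_S * dt;
    else if (candidate < ch->output - MAX_RATE_DEG_S * dt)
        ch->output -= MAX_RATE_DEG_S * dt;
    else
        ch->output = candidate;

    /* Shift the history window. */
    ch->history[0] = ch->history[1];
    ch->history[1] = ch->history[2];
    ch->history[2] = sample;

    return ch->output;
}
```

The weakness the investigation described maps onto a structure like this quite naturally: a spike short enough not to latch the fault, followed by a second spike that is still present once the hold period expires, can reach the output even though every individual rule behaves exactly as designed.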
Air accident investigators thought electromagnetic interference from a nearby military communications station might have triggered the sensor spikes; similar incidents had been seen in 2006 and later in 2008. But the installation thought to be the culprit was not transmitting at the time of the worst accident. Despite extensive tests, the sensors all seemed to work properly: in practice, components and algorithms were working as expected and did not fail.

The manufacturer was confident enough to claim the aircraft would not have behaved this way at low altitude, where the consequences could have been catastrophic, because the control code would not allow a nose-down command below 200m. Even so, Airbus started work to develop more robust filtering for data spikes. In this case, the sensor inputs were not themselves unexpected; it was simply the way in which the data was filtered that allowed the deleterious spikes through.

Assumptions about what a sensor's output will deliver can also be way off target. An example is the crash of the Mars Polar Lander, where spurious signals generated by mechanical shocks during the descent were misinterpreted by the system responsible for shutting down the descent engines as an indication that the probe had settled safely on the planet's surface. NASA concluded in its investigation: "It is not uncommon for sensors involved with mechanical operations, such as the lander leg deployment, to produce spurious signals. For [the Mars Polar Lander], there was no software requirement to clear spurious signals prior to using the sensor information to determine that landing had occurred. During the test of the lander system, the sensors were incorrectly wired due to a design error. As a result, the spurious signals were not identified by the systems test, and the systems test was not repeated with properly wired touchdown sensors. While the most probable direct cause of the failure is premature engine shutdown, it is important to note that the underlying cause is inadequate software design and systems test."

Understanding how a sensor will respond to spurious inputs is one of the most difficult aspects of safety critical design, but it is one that designers need to master. Researchers have been developing sensor fusion techniques to help mitigate the problems of spurious or misleading signals, but it is an area where development continues. For an algorithm without the right kind of filtering or fault tolerance, the garbage out generated in response to garbage in could be the difference between life and death.
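As a closing illustration of the kind of requirement NASA's report says was missing – clearing spurious signals before acting on them – the sketch below only accepts a touchdown indication once the sensor has read true for several consecutive samples and the descent phase permits engine shutdown at all. The names, thresholds and structure are invented for the example; they are not drawn from the actual lander software.

```c
/* Illustrative persistence check on a touchdown sensor: single-sample blips
 * (for instance from leg-deployment transients) are cleared before the signal
 * is allowed to trigger engine shutdown. Hypothetical names throughout. */
#include <stdbool.h>

#define TOUCHDOWN_CONFIRM_SAMPLES 5    /* consecutive samples required; illustrative */

extern bool touchdown_sensor_read(void);    /* hypothetical: raw leg-sensor state */
extern bool shutdown_permitted(void);       /* hypothetical: descent phase allows shutdown */
extern void shut_down_descent_engine(void); /* hypothetical actuator call */

void touchdown_monitor_step(void)
{
    static int consecutive_indications;

    /* Clear anything seen before shutdown is even permitted, such as
     * transients generated when the legs deploy. */
    if (!shutdown_permitted()) {
        consecutive_indications = 0;
        return;
    }

    if (touchdown_sensor_read())
        consecutive_indications++;
    else
        consecutive_indications = 0;    /* a spurious blip never accumulates */

    if (consecutive_indications >= TOUCHDOWN_CONFIRM_SAMPLES)
        shut_down_descent_engine();
}
```

Where such a check lives – in the sensor driver or in the landing sequencer – is a design decision; the point is simply that a single spurious sample should never be enough to reach the actuator.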