Implications for interface design when software and hardware disagree


When Toyota was forced to recall thousands of its vehicles after a spate of accidents and unexplained surges in acceleration, suspicion soon fell on the electronic controllers and the software that ran on them. US transportation officials drafted in engineers from NASA to work out why Toyota cars would suddenly hit the gas.

The explanation turned out to be the obvious one – an accelerator pedal that got stuck under a floor mat, the result of poor mechanical, rather than electronic, design. But it was hard to rule out software or other electronics-related problems as the cause without an expensive investigation. The NASA engineers reviewed close to 300,000 lines of code and hit the control systems with large doses of electromagnetic radiation to see whether interference from mobile phones or other electronics could trigger the fault. Despite the investigation, some victims of the sudden acceleration effect still believe the electronic controllers harbour latent problems, illustrating how hard it is to identify potential problems in complex circuits and software, and in the interactions between them.

Software causes a major headache in safety critical systems because of its unusual properties. Software does not err – it does exactly what the programmer wrote, not necessarily what the programmer intended. But that can be fatal for the systems that run it, because software is brittle. While everything is going to plan, its behaviour is predictable; when fed with data from outside the anticipated range, anything can happen.

Take the failure of the Mars Climate Orbiter, for example. The craft was lost when it flew far too low into the Martian atmosphere, the result of a misunderstanding during software integration. The investigators found that a key file used to calculate the craft's trajectory employed figures in imperial units, while the rest of the software assumed metric units. In principle, more extensive testing should have revealed the error.

The interactions between software and hardware can be even more subtle. Another mission to Mars – a planet that has seen more than its fair share of dead probes – ended in failure because of an interaction between the hardware and the software. The control software on the Mars Polar Lander misinterpreted noise from a Hall effect touchdown sensor as an indication that the craft had landed. Unfortunately, it was still around 40m above the Martian surface. The software on the main control board cut the rockets and the lander suffered a catastrophically hard landing.

However, some of the most insidious and difficult safety critical errors are found in the interface between the user and the system. Toyota's problem took a long time to confirm because its user interface fault was entirely mechanical, in an environment where the software's understanding of what is going on is usually the first thing to fall under suspicion.

The ultimate cause of the crash of Birgenair flight 301 into the sea off the Dominican Republic in 1996 has never been confirmed, but investigators believed the root of the problem lay with a damaged sensor and its impact on the cockpit instruments of the Boeing 757. The plane had stood for days in tropical conditions while being repaired – long enough, it is thought, for insects to build a nest in one of the pitot tubes, the simple instruments that use air pressure differences to estimate airspeed.

As soon as the aircraft was speeding down the runway at Puerto Plata, the problem surfaced. Airspeed displays gave wildly inconsistent readings, then seemed to work properly again. Believing the systems were functioning, the flight crew engaged the autopilot. But, fed with faulty data, the autopilot started to reduce thrust, slowing the 757 to the point of stalling.
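The 757 did carry logic that compared its redundant airspeed sources, as the next part of the story shows. In outline, that kind of cross-check looks something like the sketch below – a deliberately simplified, hypothetical monitor with invented names and an illustrative threshold, not real avionics code: when independent airspeed sources disagree by more than a small margin, the automation should stop trusting any single one of them and warn the crew.

```c
/* Toy cross-check of redundant airspeed sources. Illustrative only -
 * names, threshold and structure are invented, not any real avionics code. */
#include <stdio.h>
#include <stdbool.h>
#include <math.h>

#define DISAGREE_KT 10.0   /* flag sources that differ by more than this */

typedef struct {
    double captain_kt;        /* captain's pitot-static system */
    double first_officer_kt;  /* first officer's system */
    double alternate_kt;      /* standby/alternate source */
} airspeed_sources_t;

/* Returns true if any pair of sources disagrees by more than the threshold,
 * in which case the automation should not blindly follow a single reading. */
static bool airspeed_disagree(const airspeed_sources_t *s)
{
    return fabs(s->captain_kt - s->first_officer_kt)   > DISAGREE_KT ||
           fabs(s->captain_kt - s->alternate_kt)       > DISAGREE_KT ||
           fabs(s->first_officer_kt - s->alternate_kt) > DISAGREE_KT;
}

int main(void)
{
    /* A blocked pitot tube makes one source over-read while the others
     * report something closer to the truth. */
    airspeed_sources_t s = { .captain_kt = 350.0,
                             .first_officer_kt = 220.0,
                             .alternate_kt = 223.0 };

    if (airspeed_disagree(&s))
        printf("AIRSPEED DISAGREE: do not couple the autopilot to a single source\n");
    return 0;
}
```

Detecting the disagreement is the easy part; as the rest of the Birgenair story shows, conveying it to the crew in terms they can act on is where the design fell down.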
Another computer determined that the airspeed indicators were not working properly, but the warnings it flashed up seemed to have little to do with the problem. The aircraft operations manual at the time did not explain that when the warnings 'MACH/SPD TRIM' and 'RUDDER RATIO' appeared together, the aircraft's airspeed sensors were disagreeing with each other by more than 10kt. Not realising their true significance, the cockpit crew puzzled over the cryptic messages. At a point where the aircraft needed full power to climb, the autopilot inexplicably cut power. The pilot finally disengaged the autopilot, but it was too late: the plane hit the ocean 20km northeast of Puerto Plata.

Because user interface faults lay behind many of the accidents investigated by Everett Palmer at NASA Ames Research Center in the 1990s, he gave them a name: the 'kill the capture' bust. These are transitions between internal system states that are different from the ones the operator – or pilot, in these cases – expects. The name comes from the way autopilots are controlled: when pilots want to climb to a certain height, they enter it and, at that point, the altitude capture setting is armed. But if the autopilot then moves into a mode in which it does not need to look at the altitude, the capture has been 'killed' by the system. Very often, these situations are noticed by the crew because other alarms go off, so John Rushby of SRI International's Computer Science Laboratory modified the term to the 'kill the capture' surprise.

Palmer used simulator based studies to work out why aircraft were frequently moving to the wrong altitude – increasing the risk of mid air collisions. One case, which took less than 20s to play out, showed how confusion over autopilot settings could easily cause things to go wrong. The crew had just missed an approach to land and had climbed back to around 2000ft before an instruction came from air traffic control to climb to 5000ft. The pilot set the autopilot to climb to that height but, as it did so, he changed some other settings. One of them caused the autopilot to switch from climbing to a specific height to climbing at a constant speed. A light showing that the target altitude was approaching lit up – and stayed lit as the aircraft sailed past. "Five thousand ... oops ... it didn't arm," remarked the captain. Finally, the altitude alarms went off as the craft approached 5500ft.

Palmer noted that one important aspect of these aviation incidents is not so much what went wrong but how disaster was averted – largely because the flight crew is controlling and monitoring a number of different systems. "The aviation system is very much an error tolerant system," he wrote, "with almost all errors being rapidly detected and corrected."

The problem, Palmer found, was that pilots – presumably by virtue of the way in which they are trained – are adept at reading normal instruments, but do not tend to notice changes in the state of the automation, generally revealed by lit annunciators somewhere on the panel. "The crews were apparently aware of the state of the aircraft, but not aware of the state of the automation," Palmer noted.

The aircraft in the simulator study climbed past its target height because the captain inadvertently selected the wrong type of thrust – a setting that effectively confused the autopilot. He failed to notice, according to Palmer, that he had selected go around thrust – appropriate for the burst of power needed after a missed approach – rather than regular climb thrust.
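None of this is specific to one aircraft or one autopilot; the underlying pattern is a state machine whose transitions are not all visible to the operator. The fragment below is a toy model, with invented modes and names rather than Boeing's logic, of how arming an altitude capture can be silently cancelled when a later mode selection puts the automation into a state that never consults the target altitude.

```c
/* Toy model of a "kill the capture" surprise.
 * Hypothetical logic for illustration only - not real autopilot code. */
#include <stdio.h>
#include <stdbool.h>

typedef enum { PITCH_ALT_CAPTURE, PITCH_VERT_SPEED, PITCH_GO_AROUND } pitch_mode_t;

typedef struct {
    pitch_mode_t pitch_mode;
    bool         alt_capture_armed;  /* will the climb stop at target_alt_ft? */
    int          target_alt_ft;
} autopilot_t;

/* Pilot dials in a cleared altitude: the capture is armed. */
static void arm_altitude(autopilot_t *ap, int alt_ft)
{
    ap->target_alt_ft = alt_ft;
    ap->alt_capture_armed = true;
    ap->pitch_mode = PITCH_ALT_CAPTURE;
}

/* A later mode selection silently disarms the capture: the new mode never
 * looks at target_alt_ft, so the aircraft will climb straight through the
 * cleared altitude unless the crew notices. */
static void select_pitch_mode(autopilot_t *ap, pitch_mode_t mode)
{
    ap->pitch_mode = mode;
    if (mode != PITCH_ALT_CAPTURE)
        ap->alt_capture_armed = false;   /* the "kill" - no alert raised here */
}

int main(void)
{
    autopilot_t ap = {0};

    arm_altitude(&ap, 5000);                  /* cleared to 5,000ft */
    select_pitch_mode(&ap, PITCH_GO_AROUND);  /* mode change after the missed approach */

    printf("capture armed: %s\n", ap.alt_capture_armed ? "yes" : "no");
    /* Prints "no": the pilot's mental model says the climb stops at 5,000ft;
     * the machine's model says it will not. */
    return 0;
}
```

The code is trivial; the trap is that the pilot's mental model still has the capture armed while the machine's model does not – exactly the divergence discussed next.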
For the autopilot to control a climb correctly, the pilot is meant to check what the thrust reference panel says before the autopilot is engaged. The pilot should then check that the flight mode annunciators show the mode that was actually selected. The result of this design is that the button used to engage the autopilot has a 'meta meaning' – a term coined by cognitive scientist Edwin Hutchins to describe controls that change behaviour based on the state of the system. In the case of this control, that state was shown on a panel some distance from the button itself.

Professor Nancy Leveson of the Massachusetts Institute of Technology, who worked with Palmer on a later paper on automation surprises, has logged a number of software problems that have interfaces at their heart – both between the user and the computer and between software modules. In either case, a surprise happens when the automated system behaves differently from what its operator expected (see figs 2 and 3). According to Rushby, who published a paper on the problem and potential ways around it in Reliability Engineering and System Safety in 2002, the operator's mental model of a system and the actual system behaviour can both be described as finite state machines. The problems occur when those two models go out of step. The solution, for Rushby, was to use model checking to search automatically for the situations in which they diverge. Palmer argued that user interfaces need to improve: "What is needed is a 'what you see is what you will get' type of display of the aircraft's predicted vertical path", in place of the cryptic messages relayed by the flight mode annunciators.

Unfortunate interactions between the user interface and the underlying automation are far from confined to aviation; they are simply more noticeable there because of the open way in which the results of investigations into disasters and near misses are published. Prof Leveson covered the story of the Therac-25 radiation therapy machine in her 1995 book Safeware. Six people suffered massive radiation overdoses from the Therac-25 between June 1985 and January 1987. No formal investigation was ever carried out into the accidents – Prof Leveson pieced the story together from lawsuits, depositions and government records.

The Therac-25 was controlled by a PDP-11 minicomputer and incorporated pieces and routines from earlier machines – something the quality assurance manager on the project was unaware of until a bug encountered on the Therac-25 was also found on its predecessor, the Therac-20. The computer was used to position a turntable so that the powerful X-ray beam from the machine could be attenuated correctly (see fig 4). But this attenuator was in place for only one of the two modes the machine could adopt. As Prof Leveson pointed out in her account: "This is the basic hazard of dual mode machines: if the turntable is in the wrong position, the beam flattener will not be in place."

Before the Therac-25, electromechanical interlocks were used to ensure the X-ray beam could not strike the patient unless the attenuating beam flattener was in place. In the Therac-25, many of those interlocks were replaced by software checks of the kind sketched below. To start therapy, operators had to set up the machine manually in the right position and then match both patient data and system settings from a computer console. Operators complained, and the manufacturer AECL agreed to modify the software to make the process easier: a carriage return would copy the necessary data.
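Prof Leveson's account does not reproduce AECL's code, but the shape of the software interlock that replaced the electromechanical ones is easy to sketch. The fragment below is a minimal, hypothetical illustration – the names, sensor scheme and structure are all invented – of the check such an interlock has to make: before the beam is enabled, the selected mode is compared with the measured turntable position, and any mismatch blocks treatment.

```c
/* Minimal sketch of a software interlock of the kind that replaced the
 * Therac-25's electromechanical ones. Illustrative only - all names,
 * values and the position-sensing scheme are invented for this example. */
#include <stdio.h>
#include <stdbool.h>

typedef enum { MODE_ELECTRON, MODE_XRAY } beam_mode_t;

/* Hypothetical turntable positions reported by a position sensor. */
typedef enum { TABLE_ELECTRON_SCAN, TABLE_XRAY_FLATTENER, TABLE_UNKNOWN } table_pos_t;

/* The interlock: the high-current X-ray beam may only fire with the beam
 * flattener (attenuator) in the way; the electron beam only with the
 * scanning equipment in place. Anything else refuses to enable the beam. */
static bool beam_permitted(beam_mode_t mode, table_pos_t pos)
{
    if (mode == MODE_XRAY     && pos == TABLE_XRAY_FLATTENER) return true;
    if (mode == MODE_ELECTRON && pos == TABLE_ELECTRON_SCAN)  return true;
    return false;  /* wrong or unknown position: block the beam */
}

int main(void)
{
    /* Operator selects X-ray mode, but the turntable has not yet moved
     * the flattener into the path of the beam. */
    beam_mode_t mode = MODE_XRAY;
    table_pos_t pos  = TABLE_ELECTRON_SCAN;

    printf("beam enable: %s\n", beam_permitted(mode, pos) ? "PERMIT" : "BLOCK");
    return 0;
}
```

The safety of a check like this rests on the position reading being trustworthy and on the check being executed on every path that can switch the beam on – conditions the Therac-25, as it turned out, did not meet.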
When things went wrong, the machine would flag up a malfunction and a code, but no additional information. A memo from the US Food and Drug Administration claimed: "The operator's manual supplied with the machine does not explain nor even address the malfunction codes ... [they] give no indication that these malfunctions could place a patient at risk." One operator admitted that she had become insensitive to machine malfunctions: the messages were commonplace and most did not affect the patient. In one case, the therapist unwittingly delivered several bursts of radiation, believing that malfunctions had interrupted all but one of the attempts. The patient later died of a virulent cancer; an AECL technician estimated that she had received as much as 17,000rad, when a single dose should be around 200rad.

Problems with the Therac-25 seemed to revolve around race conditions in the software, according to Prof Leveson. In one accident, it was found that a flaw allowed the machine to activate even when an error had been flagged. If the operator hit a button at the precise moment a counter rolled over to zero, the machine could turn on the full 25MeV beam without any attenuator in the way – a class of flaw sketched at the end of this article. As operators ran through the setup sequence increasingly quickly, this previously hidden problem suddenly surfaced. The result was a devastating, highly concentrated electron beam that went unrecorded: there were insufficient checks in the software to log what the machine had actually done.

Prof Leveson wrote: "The Therac-25 software 'lied' to the operators and the machine itself was not capable of detecting that a massive overdose had occurred. The ion chambers on the Therac-25 could not handle the high density of ionisation from the unscanned electron beam at high current; they thus became saturated and gave an indication of a low dosage. Engineers need to design for the worst case."

And, without any backup indication of what was going on, the operators were completely in the dark as to what the Therac-25 was really doing. So, unlike the pilots, they were unable to rectify the situation. As Prof Leveson pointed out: "Safety is a quality of the system in which the software is used; it is not a quality of the software itself."
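The rollover flaw described above is easier to grasp as code. The sketch below is not the Therac-25's PDP-11 software but a hypothetical reconstruction of the class of bug: a one-byte shared variable stands for 'set-up checks still outstanding', it is incremented rather than simply set on every pass of a polling loop, and once every 256 passes it wraps back to zero – the very value that means 'checks complete'. An operator request that arrives on one of those passes slips straight past the safety check.

```c
/* Sketch of the class of shared-variable bug described above: a one-byte
 * counter whose zero value is overloaded to mean "checks complete".
 * Hypothetical reconstruction for illustration - not the original code. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

static uint8_t setup_incomplete;   /* 0 means "set-up verified", non-zero means "not yet" */

/* Called repeatedly while the set-up phase is in progress. Incrementing a
 * one-byte flag instead of setting it means it wraps to 0 every 256 passes. */
static void setup_phase_tick(void)
{
    setup_incomplete++;            /* bug: rolls over from 255 to 0 */
}

/* Called when the operator presses the button to proceed. */
static bool fire_requested(void)
{
    if (setup_incomplete == 0) {
        /* Position and attenuator checks are skipped: the software
         * believes set-up has already been verified. */
        return true;
    }
    return false;
}

int main(void)
{
    int fired_without_checks = 0;

    /* Simulate many polling passes, with the operator pressing the
     * button on every pass; roughly 1 pass in 256 is unsafe. */
    for (int tick = 1; tick <= 10000; tick++) {
        setup_phase_tick();
        if (fire_requested())
            fired_without_checks++;
    }
    printf("unsafe windows hit: %d of 10000\n", fired_without_checks);
    return 0;
}
```

Run for a few thousand simulated passes, the sketch fires without its checks roughly once every 256 attempts – rare enough to stay hidden during testing, common enough to surface once the machine was in daily use.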