Space electronics - Failure is not an option

Meeting the redundancy, flexibility and testing requirements demanded by space electronics. By Ken O'Neill.

Late last year, more than 30 years after its launch, Voyager 1 began to move into interstellar space on its journey toward the centre of the Milky Way. Originally expected to survive only long enough to capture images of the outer planets, Voyager 1 has survived longer than many space probes and satellites. Yet many such devices do not even begin their missions or fail at an early stage. A study performed by the Canadian Space Agency in 2008 concluded that better testing, redundancy and flexibility provided the keys to more reliable satellite operation.

Although space agencies place reliability at the top of their priorities – since the failure of just one component can lead to the loss of a multimillion dollar mission – a clear counter trend lies in the use of commercial off the shelf (COTS) components. While these parts are generally more advanced in terms of processor performance and logic or memory density than those designed specifically for use in military or spaceborne systems, COTS devices do not have the background of design and extensive testing that ensures reliability.

Parts designed for military and aerospace use often have a wider range of materials at their disposal than COTS components, particularly when it comes to the materials used to connect the chips to the pcb. The more restricted materials now used in COTS tend to suffer from greater reliability problems under heat stress. Military grade parts used for spaceflight will also be tested to a much greater degree than their COTS counterparts, not just over a wider temperature range, but also with a greater concentration on faults that cannot be picked up by generic wafer probe and package level tests. Yet even military grade qualification may not be enough for critical systems.
The QML-V testing flow adopted by a number of space agencies, and used by Microsemi, extends the MIL-STD-883B standard tests with static burn in of more than 100hr to catch parts that pass initial tests, but which would otherwise fail early in service, a failure mode known as infant mortality.

FPGAs provide an important advantage in terms of testing over asics. A key issue with any asic flow is that of testability. To allow for cost effective testing, dedicated chains of logic need to be inserted into the design to allow the many internal nodes to be probed for typical faults, such as gates being stuck at one value. An asic cannot be completely tested without increasing the number of test scan chains to impractical levels. Testability of less than 90% is common in all but the simplest asics. FPGAs are normally tested in their uncommitted state during manufacture, allowing access to all of their internal logic nodes. This ensures maximum testability of all physical aspects of the fpga so that only a functional in system test is needed for the final system once the fpgas have been assembled and programmed.

When a space bound component passes its test, there remains one big problem: radiation. A review conducted by NASA in 1996 of 100 failures and problems on its spacecraft found one third were caused by ionising radiation leading to single event upsets (SEUs) – state changes in logic or memory – or permanent degradation in performance.

High energy particles ionise the medium through which they pass, leaving behind a wake of electron-hole pairs. These pairs can change the state of a memory cell or a logic flip-flop. For SEU effects, neutrons have the highest probability of interfering with electronic system operation, but are greatly attenuated by the atmosphere below an altitude of 10km. The effect that charge build up has on a circuit depends on its response.
The capacitance (C) of the node determines approximately how large a voltage swing (dV) results from the increase in charge (dQ), according to dV = dQ/C. The positive feedback loops in latches and srams will cause a bit to flip once the collected charge reaches a value (Qcrit) that is large enough to drive the node's voltage past the switching voltage. Qcrit has fallen in line with Moore's Law scaling, and SEUs in static latches and srams became problematic after feature sizes dropped below 10µm and Qcrit dropped to less than 1pC.

As a result, a neutron impact may change the state of the memory, which can have consequences for the operation of the chip containing the affected memory cell. Error detection and correction circuitry can address the problem for data cells, but the configuration cells in sram based fpgas are typically not protected in this way. A strike may therefore change not just the state of a memory cell, but also the design of the circuit it controls, potentially leading to catastrophic failure.

Nonvolatile fpgas still contain flip-flops, so parts aimed at space applications, such as Microsemi's RTAX family, use triple modular redundancy to guard against changes caused by radiation. If a particle has high enough energy to flip the state of one flip-flop, the other two flip-flops in the redundant circuit will outvote it, ensuring correct operation. The probability of two strikes hitting two flops in the same redundant circuit is extremely low, even in periods of high solar activity, when large numbers of particles may be encountered.

The effect of ionising radiation on hardwired logic circuits is less pronounced. Errors in these circuits are typically transient and often nondestructive. As feature sizes scale down and operating frequencies increase, however, charge produced by ionising radiation in a combinatorial logic circuit is increasingly likely to be latched. The RTAX-DSP family uses two levels of single event transient (SET) mitigation.
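
As a rough illustration of the charge relation and the majority voting described above, the following Python sketch computes the voltage swing from dV = dQ/C, estimates Qcrit as C multiplied by the switching voltage, and votes across three redundant flip-flops. The Qcrit estimate is a first-order simplification assumed here rather than a figure from any datasheet, and the function names and the 1pF, 0.5V and 0.6pC values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def voltage_swing(d_q, capacitance):
    """Approximate voltage swing (V) on a node: dV = dQ / C."""
    return d_q / capacitance


def critical_charge(capacitance, v_switch):
    """First-order estimate of Qcrit (assumption): the charge that
    moves the node voltage by the switching voltage, Qcrit = C * V."""
    return capacitance * v_switch


def is_upset(d_q, q_crit):
    """A latch or sram cell flips once collected charge reaches Qcrit."""
    return d_q >= q_crit


def majority_vote(a, b, c):
    """Two-out-of-three voter over single-bit values (0 or 1)."""
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    # Hypothetical node: 1 pF capacitance, 0.5 V switching voltage.
    c_node = 1e-12
    q_crit = critical_charge(c_node, 0.5)

    # Hypothetical strike depositing 0.6 pC on the node.
    strike = 0.6e-12
    print("swing (V):", voltage_swing(strike, c_node))
    print("upset:", is_upset(strike, q_crit))

    # Three redundant flip-flops hold the same bit; one is upset.
    flops = [1, 1, 1]
    flops[1] ^= 1  # simulated single event upset
    print("voted value:", majority_vote(*flops))
```

In this sketch the single upset flip-flop is outvoted by the other two, which is the behaviour the article attributes to triple modular redundancy. Two simultaneous upsets in the same group would defeat the voter, which is why the low probability of a double strike matters.
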
The first level of SET mitigation is protection against transients on the output buffers of each of the three registers in a group. If an SET hits one of the buffers, the others will subdue the effect, producing the correct result.

The second level employs filtering to prevent transients from reaching the latches in the fpga's dsp blocks. The SET filter splits the signal path in two. One path has an inverter to delay the signal. Both paths are then fed to a guard gate, which functions as an AND gate when the two paths agree or as a latch when they differ. The filter will therefore only pass transients that have widths longer than the delay incurred by the second path's inverter. Furthermore, each guard gate is tripled to prevent a strike affecting its result.

COTS components do not contain such fine grained protection against SEU and SET events. Engineers only have the option of using triple modular redundancy within the subsystems they design using these parts, which will increase cost and development time and still leave gaps in the test methodology. Space oriented components can provide greater levels of protection, not only simplifying system design, but also improving overall reliability – the key criterion for space agencies and satellite operators as they seek to minimise damaging failures.

Author profile: Ken O'Neill is director, high reliability product marketing, with Microsemi's SoC product group.
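
The SET filter described in the article can be sketched as a discrete-time model in Python. This is a simplified illustration and not Microsemi's circuit: the function name set_filter, the sample-based timing and the delay value are assumptions, the delayed path is modelled as a pure delay with no inversion, and the tripling of the guard gate is left out.

```python
def set_filter(samples, delay, initial=0):
    """Discrete-time sketch of a delay-and-guard-gate SET filter.

    samples: list of 0/1 values on the direct path, one per time step.
    delay:   delay of the second path, in time steps (hypothetical).
    initial: value assumed on both paths before the first sample.
    """
    output = []
    held = initial
    for t, direct in enumerate(samples):
        # Second path: the same signal, `delay` steps earlier.
        delayed = samples[t - delay] if t >= delay else initial
        if direct == delayed:
            # Paths agree: the guard gate passes the value through.
            held = direct
        # Paths differ: the guard gate holds its previous output.
        output.append(held)
    return output


if __name__ == "__main__":
    # A two-step glitch, shorter than the three-step delay.
    short_glitch = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
    print(set_filter(short_glitch, delay=3))

    # A four-step pulse, longer than the three-step delay.
    long_pulse = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    print(set_filter(long_pulse, delay=3))
```

With a delay of three samples, the two-sample glitch never appears at the output, while the four-sample pulse passes through three samples late. In this model a pulse is filtered out whenever its width does not exceed the delay, so the delay sets the widest transient the filter can reject, at the cost of added latency on every legitimate transition.
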