Researchers show reliable chips can be created from an unreliable starting point

For the last few decades, electronic devices at the leading edge have been designed to be as complex as the manufacturing technology allows. But as device complexity increases, so does the possibility that the parts will not operate entirely as the designer intended.

There's a good reason for that: statistics. At 20nm and beyond, it's becoming possible to count the number of atoms needed to create a layer in the transistor gate stack. And as the number of atoms gets smaller, the chances increase that things aren't going to be exactly right.

It's something that researchers have been working on for a while now, with a range of solutions being put forward. Amongst the ideas being considered are redundancy, recovery, repair and mitigation. While there's a range of ideas, they all point in one direction – the ability to design and manufacture fault tolerant systems.

The level of interest in fault tolerance could be seen at last year's Design, Automation and Test in Europe – DATE – conference in Grenoble, where it was standing room only at 8.30am for a tutorial addressing fault tolerance in multicore processors.

One of the latest projects to take a look at this increasingly challenging problem is i-RISC – or Innovative Reliable Chip Designs from Low Powered Unreliable Components. While the three year project, funded through the European FP7 framework, starts work officially in February 2013, the foundations are already being put in place.

The project is being coordinated by the Grenoble based research establishment CEA-LETI and there are five other collaborators: ENSEA, from France; TU Delft, from The Netherlands; UP Timisoara, from Romania; UC Cork, from Ireland; and the University of Nis, from Serbia.

CEA-LETI's Dr Valentin Savin is the project coordinator. He said the i-RISC project was bringing a number of communities together to look for solutions. "Some are working from the IT perspective; some are working in electronic design. But the difference is that, until now, both sides have not been used to collaborating."

According to the i-RISC project, the nanoscale integration of chips built from unreliable components is one of the most critical challenges facing next generation circuit design.
It believes a solution lies in the development of efficient and fault tolerant data processing and storage elements.

What does Dr Savin consider to be an 'unreliable' component? "It's a circuit element that has some kind of probability of failure," he explained. "It's a device which doesn't operate in a deterministic fashion as it should due to miniaturisation and lower voltages. As devices get smaller, their supply voltage has to be scaled down and it's getting closer to the noise level, which reduces noise immunity and increases unreliability."

The collaborators also think they will find the answers through the application of mathematical models, algorithms and information theory techniques. "One solution to the problem," Dr Savin continued, "is to look for some new kind of process technology that would allow components to be defined in a reliable way, but this might not be possible. Our idea is to consider whether we can apply other techniques from IT theory so that, even if it is unreliable, the chip will work."

While some of the work will be focused at the device level, the collaborators are also planning to look at system level solutions. But it is likely that the project's work will build on expertise in encoder/decoder architectures that can provide reliable error protection, even if they are running on unreliable hardware.

Dr Savin compares the project's approach to similar concepts in wireless communications. "It will be like using error correction code," he claimed. However, while the comparison is drawn, the approach will be different. "Coding theory assumes the encoder and decoder operate on reliable hardware," Dr Savin pointed out, "so we will need to design completely new encoder/decoder systems which can provide reliable error protection – it's a completely new paradigm."

Current theory takes the view that, in wireless communications, randomness is introduced by the channel and not by the devices.
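Dr Savin's distinction can be made concrete with a toy model. The sketch below is purely illustrative – it is not i-RISC code, and the function names and failure rates are invented. It stores a bit redundantly in unreliable cells and recovers it by majority vote, where the voting gate itself can fail, which is the situation the project targets:

```python
import random

def unreliable(bit, p):
    """A cell or gate that flips its output with probability p."""
    return bit ^ 1 if random.random() < p else bit

def encode(bit, n=5, p=0.0):
    """Repetition code: store n copies of the bit in cells that
    each fail independently with probability p."""
    return [unreliable(bit, p) for _ in range(n)]

def majority_decode(cells, q=0.0):
    """Majority vote over the stored copies. The voter itself is
    unreliable and flips its result with probability q - the
    'faulty decoder' setting the project is concerned with."""
    vote = int(sum(cells) > len(cells) // 2)
    return unreliable(vote, q)

# Monte Carlo estimate of the residual error rate
random.seed(1)
p, q, trials = 0.05, 0.01, 20000
errors = sum(majority_decode(encode(0, 5, p), q) for _ in range(trials))
print(f"cell error rate {p}, decoded error rate {errors / trials:.4f}")
```

With a reliable voter (q = 0), the residual error rate falls well below the raw cell error rate; with a faulty voter, it is floored near q – which is exactly why classical redundancy alone is not enough once the decoder itself is built from unreliable parts.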
"Our work will accept that randomness is introduced by the device, there will be errors and the decoder will operate unreliably," Dr Savin noted. The i-RISC project says its main objective is to develop solutions that not only have significant noise immunity, but which also ensure correct logic operation at supply voltages and signal to noise ratios at which standard cmos designs cannot work. The partners believe i-RISC is a first step toward a new generation of low energy embedded chips. It notes: 'We expect the solutions developed in this project will enable a significant chip robustness gain and/or energy consumption reduction'. "We will also need to develop error models," Dr Savin accepted, "and work on energy validation techniques. We will also need to develop specific error models for voltage scaling to sub threshold regions." He also accepts the work will be more theoretical than physical. "The error correction codes developed by the project will be used for two things," he said. "Firstly, we want to use fault tolerant codecs to ensure the reliable storage and transfer of information. Our idea is to store encoded data on chip and to use this encoded data for processing. "But we also want to develop reliable computing methods. In order to do this, we will need to create a new framework for chips made from 'elaborate' components. In this area, we will be looking to include the logical functionality of circuits and error correction. At some point, we will have to merge the two approaches." What does Dr Savin anticipate the project delivering once it has completed its work? "We don't plan to have developed demonstrators," he noted, "but we hope that we will have developed a proof of concept – for example, a processor implemented on an fpga that will emulate the error models created during the project. "We will have proved that error correction is possible, even on unreliable hardware," he concluded. 
i-RISC project ambitions

The i-RISC project is likely to have five elements:

• The design of error correcting codes with fault tolerant encoder and decoder architectures. This building block will be used to address the problem of reliable memories and interconnections.
• The integration of error correcting codes. The project plans to develop a framework for the synthesis of chips made from unreliable components, through the concept of error correcting code driven graph augmentation.
• The design of fault tolerant error correcting modules. This element aims to use encoded blocks for all interconnections in the chip, both in memories and inside the buses.
• The development of error models, energy measurement and validation tools. CMOS specific error models will be developed to support aggressive voltage scaling to the near or sub threshold region. The project will also develop simulation techniques at different abstraction layers.
• Circuit design optimisation. This will take into account the design of the fault tolerant decoder architecture, its integration into the structural description of the circuit and the error models.