How the Automata architecture could boost processing efficiency

Micron Technology appeared at the 2013 Supercomputing conference, where it claimed that a massively parallel architecture built around its memory technology could provide a huge speed boost for some of the applications that today need supercomputers. At the same time, the memory maker said it is setting up a research centre based at the University of Virginia to focus on what it calls the Automata architecture.

Automata distributes tiny processing elements throughout an array, such that each is associated with a small amount of local memory. The processors work semi-independently to crunch through operations locally, the results of which can be collated and further processed by a host machine. Each processor is a programmable state machine that moves from state to state based on the data it accesses. The most familiar form of this kind of cellular automaton is a cell in the Game of Life, in which the state of neighbouring cells determines whether the cell in the centre lives or dies on that particular cycle.

Automata-based systems have been proposed before and met with limited success. Aspex Technology commercialised automata-like associative string processor technology from Brunel University, refocusing on media processing IP before the company was acquired by Ericsson in 2012.

Micron says its approach exploits the natural bit-level parallelism of DRAM and has applications in processing regular expressions – lending itself to network packet inspection. An initial application was an implementation of part of the Snort network intrusion detection software using regular expressions similar to those used in languages such as Perl (see fig 1).
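
The idea of many small state machines scanning a data stream for a regular expression can be sketched in software. The Python below is only an illustration, not Micron's programming model: the names StateElement and run_automaton, and the example pattern ab+c, are assumptions made for this sketch. Each element accepts a set of symbols and, when it fires, enables its successors for the next symbol. The loop checks the enabled elements one after another; in hardware they would all test the symbol at the same time.

```python
from dataclasses import dataclass, field


@dataclass
class StateElement:
    """One hypothetical processing element: matches a symbol, then enables successors."""
    accepts: frozenset
    successors: list = field(default_factory=list)  # indices of elements enabled next
    start: bool = False       # enabled on every symbol, so the search is unanchored
    reporting: bool = False   # tells the host about a match when it fires


def run_automaton(elements, stream):
    """Feed the stream symbol by symbol; return offsets where a reporting element fired."""
    starts = {i for i, e in enumerate(elements) if e.start}
    enabled = set(starts)
    reports = []
    for offset, symbol in enumerate(stream):
        # Hardware would test every enabled element against the symbol in parallel.
        fired = {i for i in enabled if symbol in elements[i].accepts}
        reports.extend(offset for i in sorted(fired) if elements[i].reporting)
        # Elements enabled for the next symbol: the start elements plus all successors.
        enabled = set(starts)
        for i in fired:
            enabled.update(elements[i].successors)
    return reports


# Three elements implementing the regular expression ab+c (illustrative pattern).
PATTERN = [
    StateElement(frozenset("a"), successors=[1], start=True),
    StateElement(frozenset("b"), successors=[1, 2]),
    StateElement(frozenset("c"), reporting=True),
]

if __name__ == "__main__":
    # Offsets of the final character of each match.
    print(run_automaton(PATTERN, "xxabbbc-abc-ac"))
```

The host only sees the reported offsets, which mirrors the division of labour described above: local elements do the matching and the host collates the results.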

Automata is far from being the first smart memory to appear. Two decades ago, Mitsubishi and Sun Microsystems developed what they called the 3DRAM, an early form of 3D coprocessor aimed at frame buffers. Unveiled at SIGGRAPH in 1994 as the FBRAM, or frame buffer memory, the intention was to reduce the number of writes that a 3D processor needed to make to the frame buffer: one of the main bottlenecks at the time.

In 3D graphics, updating a pixel demands attention to its position on the Z axis. The pixel should only be overwritten if the stored pixel has a Z-axis position further away than that of the proposed replacement. In traditional systems, this meant reading the stored value, comparing it in the processor and then writing the new value if necessary. This resulted in high bus traffic and did not benefit from the use of caches because, although data would often be read in a reasonably predictable way, it would most likely be displaced from the cache well before it was needed again. By moving the comparison operation into the memory array itself, the 3DRAM allowed the host processor to simply write pixels and Z-axis values and let the memory decide what needed to be kept (see fig 2).
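
The difference in bus traffic can be shown with a small model. This is a sketch under stated assumptions, not the 3DRAM's actual interface: the class and method names are hypothetical, a smaller z value is taken to mean nearer the viewer, and each read or write command counts as one bus transfer.

```python
class PlainFrameBuffer:
    """Conventional memory: the host reads, compares and conditionally writes."""

    def __init__(self, size, far=float("inf")):
        self.colour = [0] * size
        self.depth = [far] * size
        self.bus_transfers = 0

    def read_depth(self, i):
        self.bus_transfers += 1
        return self.depth[i]

    def write(self, i, colour, z):
        self.bus_transfers += 1
        self.colour[i], self.depth[i] = colour, z


def host_side_update(fb, i, colour, z):
    """Read-compare-write performed by the processor (assumes smaller z is nearer)."""
    if z < fb.read_depth(i):
        fb.write(i, colour, z)


class SmartFrameBuffer(PlainFrameBuffer):
    """3DRAM-like memory (hypothetical interface): the compare happens in the array."""

    def write_if_nearer(self, i, colour, z):
        self.bus_transfers += 1        # only the write command crosses the bus
        if z < self.depth[i]:          # comparison done next to the storage
            self.colour[i], self.depth[i] = colour, z


# Illustrative fragments for one pixel: (pixel index, colour, z).
FRAGMENTS = [(0, 1, 5.0), (0, 2, 3.0), (0, 3, 4.0)]


def replay_on_host(fragments, size=1):
    fb = PlainFrameBuffer(size)
    for i, colour, z in fragments:
        host_side_update(fb, i, colour, z)
    return fb


def replay_in_memory(fragments, size=1):
    fb = SmartFrameBuffer(size)
    for i, colour, z in fragments:
        fb.write_if_nearer(i, colour, z)
    return fb


if __name__ == "__main__":
    plain, smart = replay_on_host(FRAGMENTS), replay_in_memory(FRAGMENTS)
    print(plain.colour, plain.bus_transfers)
    print(smart.colour, smart.bus_transfers)
```

Both versions end with the same pixel, but the host-side version needs a read for every fragment plus a write for each one that passes the test, while the in-memory version needs one transfer per fragment.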

A couple of years after the development of the first 3DRAM, Professor David Patterson of the University of California at Berkeley, co-author of the seminal textbook Computer Architecture: A Quantitative Approach, put forward the idea of the intelligent memory, or IRAM, that could perform a much wider range of operations than pixel comparisons. When IRAM was first conceived, it looked as though processors would massively outpace memories in terms of performance, particularly when it came to access latency. Then, amid talk of processors with energy densities reaching those of a nuclear reactor, clock speed increases suddenly halted and talk about the processor-memory gap dried up almost overnight as the gap stopped widening. Memory is still a long way from closing the gap – leading to the use of complex cache hierarchies in most computers bigger than a tablet – but the latency gap is not increasing (see fig 3).

Because processors have started multiplying, instead of increasing in single-thread performance, a second problem with memory has emerged: its power demands. Although memories run much cooler than processors, accesses to off-chip memory account for a significant amount of the total computational energy of a program. Analysis performed at Stanford University found more than 40% of the power consumed by a processor was due to fetching instructions from main memory. That figure does not even include the energy needed to fetch the associated data, which also lives off chip most of the time. The fraction needed for computation is almost insignificant. For example, a 32bit addition in a 45nm low power CMOS technology consumes about 0.52pJ; a 16bit multiplication consumes 2.2pJ. A full addition instruction performed by a typical RISC processor – with the source and destination data all held in registers – requires 5.3pJ.
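
The figures quoted above make the overhead easy to work out. The short calculation below uses only the three energy values given in the text; the variable names are chosen for this sketch.

```python
# Energy figures quoted in the text (picojoules, 45nm low power CMOS).
ADD_32BIT_PJ = 0.52             # the 32bit addition itself
MUL_16BIT_PJ = 2.2              # a 16bit multiplication
RISC_ADD_INSTRUCTION_PJ = 5.3   # full add instruction, operands held in registers

# Energy spent on everything other than the arithmetic (fetch, decode, register access...).
overhead_pj = RISC_ADD_INSTRUCTION_PJ - ADD_32BIT_PJ

# Share of the instruction's energy that performs the addition.
useful_fraction = ADD_32BIT_PJ / RISC_ADD_INSTRUCTION_PJ

if __name__ == "__main__":
    print(f"overhead per add instruction: {overhead_pj:.2f} pJ")
    print(f"share spent on the addition:  {useful_fraction:.1%}")
```

On these numbers, roughly a tenth of the energy of an add instruction goes into the addition, even before any off-chip memory access is counted.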

Taking processors into the memory can significantly reduce the power required for computation, as it removes the need to drive highly capacitive buses, which often use power hungry termination schemes, at high speed. For similar reasons, high end computer architects are considering a move to 3D integration, as this would also reduce the size of bus drivers for accesses to memory chips within the stack.

The problem with moving computation off the processor and into memory is the software legacy built up over decades of von Neumann hegemony. However, the rise of the general purpose graphics processing unit has shown a way in which computer designs might evolve to allow processing to be distributed into memory. Although cellular automata processing lends itself to massive parallelism, developing programs to make efficient use of it can be difficult. However, environments such as OpenCL make it easier to use processing arrays by letting programmers access ready made libraries called from conventional languages, such as C and C++.

Another way to exploit smart memory that follows in the conceptual footsteps of the 3DRAM has been developed by IP supplier Memoir Systems. Like Micron's example with massively parallel packet sniffers, these memory cores are aimed at high-speed network switches and target commonly used access patterns that can lead to processing bottlenecks. The embedded processors include counters for monitoring network usage, read-modify-write processors similar to those on the 3DRAM and buffer controllers that take care of finding free memory for packets on behalf of the host processor.

Although Micron's Automata may find a place in network switches and high performance computing applications that can use state machine approaches, true in-memory computing may not become a mainstream technology without a change in the core technology of electronics to one that incorporates memory.
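
The network-oriented smart memory functions described above, counters and buffer controllers that run inside the memory, can be illustrated with a toy model. This is a sketch only and does not represent Memoir Systems' products or interfaces: SmartPacketMemory and its methods are hypothetical names. The point is that the host issues one command per operation and never has to read a counter back to update it or search for a free buffer itself.

```python
from collections import deque


class SmartPacketMemory:
    """Hypothetical memory core with a counter unit and a buffer controller built in."""

    def __init__(self, num_counters, num_buffers):
        self._counters = [0] * num_counters
        self._buffers = [None] * num_buffers
        self._free = deque(range(num_buffers))  # free list maintained inside the memory

    def add_to_counter(self, index, amount):
        """One command from the host; the read-modify-write stays inside the memory."""
        self._counters[index] += amount

    def read_counter(self, index):
        return self._counters[index]

    def store_packet(self, payload):
        """The memory chooses a free buffer itself; returns a handle, or None if full."""
        if not self._free:
            return None
        handle = self._free.popleft()
        self._buffers[handle] = payload
        return handle

    def release_packet(self, handle):
        """Return the stored payload and put the buffer back on the free list."""
        payload = self._buffers[handle]
        self._buffers[handle] = None
        self._free.append(handle)
        return payload


if __name__ == "__main__":
    mem = SmartPacketMemory(num_counters=4, num_buffers=2)
    for port, packet in [(0, b"hello"), (1, b"hi"), (0, b"again")]:
        handle = mem.store_packet(packet)
        if handle is not None:
            mem.add_to_counter(port, len(packet))  # bytes accepted per port
    print(mem.read_counter(0), mem.read_counter(1))
```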

Memristors and magnetic or 'spintronic' devices can remember their state, providing a strong incentive to develop architectures that can exploit this property. In one experiment, researchers at the New York University Polytechnic School of Engineering developed a way to perform matrix computations in a memristor array using data stored in that same array. They claimed a 100-fold energy consumption advantage over conventional techniques. Although in-memory computing remains in its infancy, some researchers believe it is inevitable, pointing not just to the energy efficiency it could offer in principle, but also to the fact that our brains operate in much the same way, and that they demonstrate much higher energy efficiency than traditional von Neumann computers.
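
One commonly described way to compute on stored data in a resistive array is the crossbar multiply, sketched below. This is an idealised textbook model, not the method used in the New York University experiment, whose details are not given here. It assumes each memristor's conductance stores one matrix entry, input voltages are applied to the rows, and the current collected on each column is the sum of conductance times voltage. Wire resistance and device non-linearity are ignored, and the function name and values are illustrative.

```python
def crossbar_multiply(conductances, voltages):
    """Column currents of an idealised crossbar: I[j] = sum over i of G[i][j] * V[i]."""
    rows = len(conductances)
    cols = len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("need one input voltage per row")
    return [
        sum(conductances[i][j] * voltages[i] for i in range(rows))
        for j in range(cols)
    ]


if __name__ == "__main__":
    # Illustrative values only: conductances in siemens, voltages in volts.
    G = [[1.0, 2.0],
         [3.0, 4.0]]
    V = [1.0, 0.5]
    print(crossbar_multiply(G, V))  # currents in amperes, one per column
```

The multiply-accumulate happens where the data is stored, so the matrix never has to be moved to a separate processor.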