13 March 2012
Architecture optimisation as a tool for low cost low power solutions
The need for low cost and low power consumption drives semiconductor companies and their customers towards devices made on advanced processes.
But it is possible to reduce cost, size and power consumption through architecture optimisation. These inexpensive optimisation methods are often overlooked, which is a missed opportunity for further cost and power reduction.
Cost and power reduction
According to IBS, the average cost of developing a new chip rose from $45m at 65nm to $150m at 22nm. This higher investment eventually pays for itself by lowering chip cost. The main reason semiconductor companies develop chips on advanced process nodes is the promise of cost and power consumption reductions.
In general, device cost decreases at more advanced, smaller process nodes, although it takes time for a process node to mature and achieve better yield than the previous node. For this reason, it takes time for the minimum cost point to move to the latest state of the art process nodes.
In fig 1, Fp is the process cost optimisation factor. In this example, it is the cost reduction between a device made on a 65nm process (point A) and the same part made on a 45nm process (point B), without architecture optimisation.
Meanwhile, Fa is the architecture cost optimisation factor. In fig 1, this is the cost reduction between a device made on a 45nm process without architecture optimisation (point B) and the same device with architecture optimisation (point D). The best way of minimising cost is to switch to a 45nm process and employ architecture optimisation.
Figure 2 shows power consumption optimisation. The minimum power consumption is achieved through moving from a 32nm process to the 22nm node (points A and B) and through architecture optimisation at 22nm (point D).
In many wireless and video algorithms, application specific characteristics can be exploited for architectural optimisation. For example, it is common to have finite impulse response (FIR) filter based algorithms and applications with symmetrical coefficients (fig 3).
Symmetrical FIR filters with even symmetry have an even number of taps, and each pair of coefficients at the same distance from the centre has the same value. It is therefore possible to add the two samples that share a coefficient and perform a single multiplication on the sum. This halves the number of multipliers and the related logic required to implement the FIR filter. FIR filters with odd symmetry have an odd number of taps. Each pair of coefficients at the same distance from the centre still has the same value; only the centre coefficient has no partner. The number of multipliers required to implement an odd symmetry FIR filter can be reduced by a factor of 2n/(n+1), where n is the number of taps. This reduction factor approaches 2 as the number of taps grows.
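The folding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names fir_direct and fir_folded are hypothetical, the coefficients are arbitrary integers, and the code models the arithmetic rather than any particular hardware implementation.

```python
def fir_direct(samples, coeffs):
    """Reference FIR filter: one multiplication per tap for every output."""
    n = len(coeffs)
    outputs = []
    for t in range(n - 1, len(samples)):
        outputs.append(sum(coeffs[k] * samples[t - k] for k in range(n)))
    return outputs, n  # n multiplications per output


def fir_folded(samples, coeffs):
    """Symmetric FIR filter: add the two samples that share a coefficient,
    then multiply the sum once."""
    n = len(coeffs)
    if list(coeffs) != list(coeffs)[::-1]:
        raise ValueError("coefficients must be symmetric")
    half = n // 2
    outputs = []
    for t in range(n - 1, len(samples)):
        acc = 0
        for k in range(half):
            # Pre-adder: the samples under taps k and n-1-k share coeffs[k].
            pre_add = samples[t - k] + samples[t - (n - 1 - k)]
            acc += coeffs[k] * pre_add
        if n % 2 == 1:
            # Odd tap count: the centre coefficient has no partner.
            acc += coeffs[half] * samples[t - half]
        outputs.append(acc)
    return outputs, half + n % 2  # multiplications per output
```

With six taps the folded version uses three multiplications per output instead of six. With five taps it uses three instead of five, which matches the 2n/(n+1) factor quoted above.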
In some applications, symmetric FIR filters have an additional characteristic that can be exploited. In half band symmetric filters, every second coefficient, except the centre coefficient, is zero. Since these filters are also odd symmetric FIR filters, it is possible to reduce the number of required multipliers by up to a factor of four.
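A minimal sketch of the half band case follows. It assumes an odd number of symmetric taps; the function names and the 7 tap integer coefficient set are illustrative only and are not taken from the article.

```python
def fir_reference(samples, coeffs):
    """Plain FIR filter used only to check the optimised version."""
    n = len(coeffs)
    return [sum(coeffs[k] * samples[t - k] for k in range(n))
            for t in range(n - 1, len(samples))]


def halfband_fir(samples, coeffs):
    """Half band FIR filter: skip zero coefficients and share one
    multiplication between each symmetric pair of taps."""
    n = len(coeffs)
    if n % 2 == 0 or list(coeffs) != list(coeffs)[::-1]:
        raise ValueError("expected an odd number of symmetric taps")
    centre = n // 2
    # Distances from the centre whose coefficient is not zero.
    active = [d for d in range(1, centre + 1) if coeffs[centre + d] != 0]
    outputs = []
    for t in range(n - 1, len(samples)):
        mid = t - centre  # sample aligned with the centre tap
        acc = coeffs[centre] * samples[mid]
        for d in active:
            acc += coeffs[centre + d] * (samples[mid - d] + samples[mid + d])
        outputs.append(acc)
    return outputs, 1 + len(active)  # multiplications per output


# Illustrative 7 tap half band coefficient set (integer scaled).
HALFBAND_7 = [-1, 0, 9, 16, 9, 0, -1]
```

For this 7 tap example the filter needs three multiplications per output rather than seven. The saving moves towards a factor of four as the number of taps grows, because roughly half the coefficients are zero and the rest come in pairs.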
Operations with 2d symmetry are very common in video applications. As shown in fig 4, the vertical and horizontal symmetry of this 5x5 2d FIR filter means that circles of the same colour represent coefficients with the same value. It is possible to exploit 2d symmetry so the same multiplier can be used for all pixels with the same coefficient value. In fig 4, there are groups of 4, 2 and 1 pixels sharing a coefficient. Therefore, the potential exists for the size of a generic matrix implementation to be reduced by up to a factor of four.
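The grouping of pixels by shared coefficient can be sketched as follows. The sketch assumes a square kernel with both vertical and horizontal symmetry, stored as its unique top left quadrant (3x3 values for a 5x5 kernel). The function names and the test values are hypothetical.

```python
def expand_kernel(quadrant, size=5):
    """Rebuild the full size x size kernel from its unique quadrant."""
    return [[quadrant[min(r, size - 1 - r)][min(c, size - 1 - c)]
             for c in range(size)]
            for r in range(size)]


def apply_symmetric_kernel(window, quadrant):
    """Sum the pixels that share a coefficient, then multiply once."""
    size = len(window)
    half = size // 2
    acc = 0
    mults = 0
    for r in range(half + 1):
        for c in range(half + 1):
            # Mirror positions; a set collapses the centre row and column,
            # giving groups of 4, 2 or 1 pixels.
            rows = {r, size - 1 - r}
            cols = {c, size - 1 - c}
            pixel_sum = sum(window[i][j] for i in rows for j in cols)
            acc += quadrant[r][c] * pixel_sum
            mults += 1
    return acc, mults
```

For a 5x5 window this uses nine multiplications instead of 25. The ratio approaches four for larger kernels, where the groups of four pixels dominate.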
Significant portions of wireless and communications applications are based on algorithms featuring complex numbers, such as I and Q channels and symbol modulation.
Complex multiplication between a complex sample (a + jb) and a complex coefficient (C + jD) can be implemented with four multiplications:
(a + jb)*(C + jD) = (a*C – b*D) + j(a*D + b*C)
With some algebraic manipulation, the same result can be obtained using three multiplications:
= a*C – b*D + (a*D – a*D) + j(a*D + b*C + (a*C – a*C))
= a*(C+D) – (a+b)*D + j(a*(C+D) + (b-a)*C)
Since the silicon area required to implement a multiplier is significantly larger than that required to implement adders and subtractors, this allows for size reduction.
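As a quick check of the algebra, the two forms can be compared in Python. The function names are hypothetical, and the variable names follow the sample (a + jb) and coefficient (C + jD) used above.

```python
def complex_mul_4(a, b, c, d):
    """(a + jb) * (c + jd) using four multiplications and two additions."""
    return a * c - b * d, a * d + b * c


def complex_mul_3(a, b, c, d):
    """Same product using three multiplications and five additions."""
    shared = a * (c + d)           # multiplication 1, used by both parts
    real = shared - (a + b) * d    # multiplication 2
    imag = shared + (b - a) * c    # multiplication 3
    return real, imag
```

For example, with a = 3, b = 4, C = 5 and D = 6, both functions return (-9, 38). The three multiplier form trades one multiplier for three extra adders or subtractors. If the coefficient is fixed, the sum C + D can be computed in advance.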
Double data rate
Double data rate throughput optimisation is a special architectural technique designed to overcome fabric throughput issues common to fpgas. While the dsp slices in an fpga are similar in size efficiency to those in an asic made on a similar process node, they have a flexibility advantage. The problem is that, in many cases where fpga utilisation is high, the fabric's relatively lower operating frequency creates a throughput bottleneck.
The LatticeECP4 fpga has innovative throughput boosting interfaces embedded into the dsp slices. This enables the part to offer double the throughput of other fpgas. Using this feature means the LatticeECP4 can implement complex dsp functions using half the number of multipliers that would be required by other fpgas. This optimisation enables a significant system cost and size reduction, as well as decreased power consumption.
In many cases, different architectural optimisation techniques can be combined to achieve higher levels of cost and power reduction. Most FIR filters implemented in wireless or video applications are symmetric.
Implementing a 64 tap symmetric FIR filter with an input data rate of 245.76Msample/s in a typical fpga would require 64 18 x 18 multipliers. The LatticeECP4 can implement the same FIR filter using 16 18 x 18 multipliers, a fourfold reduction obtained by combining coefficient symmetry with double data rate operation. This provides other benefits, including an approximate halving of power consumption and the opportunity to fit the design into a smaller fpga, which reduces cost.
Similarly, half band filters and double data rate optimisation could be implemented in other digital up and digital down converter interpolation or decimation filters. In these cases, the cost and power savings are even higher – approximately eight and four times respectively.
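The multiplier counts behind these figures can be approximated with a short calculation. This is a rough sketch, not a sizing tool. It assumes that double data rate operation lets each multiplier produce two products per input sample, and that a half band filter has an odd number of symmetric taps with every second coefficient zero. The function names and parameters are hypothetical.

```python
from math import ceil


def products_per_output(taps, symmetric=False, half_band=False):
    """Number of multiplications needed for each output sample."""
    if half_band:
        centre = taps // 2
        # Only odd distances from the centre carry non-zero coefficients,
        # and each symmetric pair shares one multiplication.
        return (centre + 1) // 2 + 1
    if symmetric:
        return (taps + 1) // 2
    return taps


def multipliers_needed(taps, symmetric=False, half_band=False,
                       products_per_multiplier=1):
    """Multipliers required when each one can be reused within a sample."""
    products = products_per_output(taps, symmetric, half_band)
    return ceil(products / products_per_multiplier)
```

Under these assumptions, multipliers_needed(64) gives 64, while multipliers_needed(64, symmetric=True, products_per_multiplier=2) gives 16, matching the 64 tap example. A 63 tap half band filter with double data rate operation comes out at nine multipliers, a saving of about seven times that moves towards eight as the filter grows.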
Architecture optimisation is a relatively inexpensive way to reduce silicon device cost and power consumption. There is no reason why semiconductor companies or their customers should not take advantage of this 'low hanging fruit' instead of investing in developing devices for manufacture on expensive advanced process nodes.
In many cases, a combination of different methods of architecture optimisation – such as data flow optimisation, algorithm characteristics optimisation and/or algebraic optimisation – can result in cost being halved and power consumption being reduced by a factor of eight.
Asher Hazanchuk is product planning manager with Lattice Semiconductor.
Lattice Semiconductor UK Ltd
This material is protected by Findlay Media copyright