24 May 2011
Chipmakers move to 3d to create successor to multichip module
Chipmakers move to the third dimension to create successor to multichip module
Moore's Law has been in danger for some years now, but claims of its demise remain exaggerated. Even with optical lithography, manufacturers reckon there is a reasonably clear path to 16nm and evidence that structures such as finFETs could be good down to 10nm.
But, after that, things get fuzzy. More than six years ago, NEC developed a transistor that was just 5nm long. Its leakage was so high that it basically switched from on to a little less on. But it had a performance curve that confirmed transistor behaviour, demonstrating that a 5nm transistor could be made, even if it might not demonstrate desirable behaviour.
Professor Mike Kelly of the Centre for Advanced Photonics and Electronics at the University of Cambridge believes such a device lies in an area of 'intrinsic unmanufacturability', something that will happen as we approach 7nm design rules. The big problem, according to Prof Kelly, is variability – already an issue in chip manufacture. Today, it is largely related to the way that dopants are arranged in a crystal.
At sub 50nm design rules, random variations in dopant concentrations can shift properties, such as threshold voltage, to the point where many transistors no longer switch. At less than 10nm, the size and shape of the core structures themselves become problematic. In a dot that nominally measures 3nm across, the variation between neighbours is more than 10%, making it very difficult to guarantee correct operation without some form of tuning – which would take up extra space and defeat the object of making the core devices smaller.
On this basis, 2D silicon scaling has around another 10 years – if you can justify the humungous cost of designing a multibillion transistor chip and its associated masks. It will take guaranteed volumes measured in tens or hundreds of millions to achieve a worthwhile payback. Variants for subtly different applications are out of the question.
In 1998, Bob Payne, then strategic technology vice president for VLSI Technology, raised the prospect of 'deconfigurable' design as a way to fix the so called design gap that appeared in that decade. In an industry that strove to find ways to squeeze more onto a chip, most competitors thought the idea outlandish. Yet platform based design, using IP blocks picked from a widely applicable superset, is now comparatively commonplace and cited by people such as Synopsys' CEO Aart de Geus as the prime solution to closing the design gap.
In platform based designs, unwanted IP cores never make it onto the final silicon. But if you have a market where silicon built on the latest processes can only be supported by massive markets, it makes sense to use the available space to build in everything you might need, along with the kitchen sink. All this assumes that circuits scale linearly with process generations.
Analogue circuitry most definitely does not: in fact, it can get bigger. Companies generally work around this by shifting more of the burden into digital [Digital Takeover, NE, 22 November 2010]. However, if the analogue section forms a significant fraction of the die, it provides an argument for not shrinking the design. The shrinkage of the digital section will still yield a smaller die, allowing more to be packed on a single, fixed cost wafer.
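A rough way to see how an analogue block dilutes the benefit of a shrink is to work out the density gain for a die in which only the digital part scales. The Python sketch below is illustrative only: the function name, the assumption that the digital area halves and the assumption that the analogue area stays the same are hypothetical choices, not figures from any particular process.

```python
def density_gain(analogue_fraction: float,
                 digital_scale: float = 0.5,
                 analogue_scale: float = 1.0) -> float:
    """Return how many times more dice fit on a wafer after a shrink.

    analogue_fraction: share of the original die area taken by analogue.
    digital_scale:     area multiplier for the digital section (assumed 0.5).
    analogue_scale:    area multiplier for the analogue section (assumed 1.0).
    """
    if not 0.0 <= analogue_fraction <= 1.0:
        raise ValueError("analogue_fraction must be between 0 and 1")
    # Area of the shrunk die relative to the original die.
    new_area = (analogue_fraction * analogue_scale
                + (1.0 - analogue_fraction) * digital_scale)
    return 1.0 / new_area


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5):
        print(f"analogue {fraction:.0%}: {density_gain(fraction):.2f}x more dice")
```

Under these assumptions, a purely digital die gets the full two fold gain, while a die that is half analogue gains only about a third more dice per wafer.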
But wafer processing costs increase with each new generation and if you are not getting a two fold density increase from a move to a more advanced node, it can be hard to justify the decision. The digital section could still benefit from the shrink, if you have good, short connections to an analogue section built on an older, cheaper process. This is where you can use the successor to the multichip module, introduced more than 20 years ago.
Now, it is known as 3d integration or, in the words of some European research programmes, 'more than Moore, rather than more Moore'. 3D integration is far from new. Demand from the phone makers for denser memories convinced packaging companies to invest in equipment that made it possible to stack chips on top of each other. A dizzying array of stacked chip techniques was developed, some even stacking packaged chips.
This package on package (PoP) construction made it easier to test devices fully before they were assembled into the stack – overcoming one of the critical problems faced by all types of 3d packaging: how do you maintain a high yield? If you assemble untested dice into a multichip module, the probability that at least one of them causes the assembly to fail rises dramatically with the number of dice. With a die yield of 90%, the probability of a critical failure within the stack rises to about one third with four devices.
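The one third figure follows from simple probability if die failures are assumed to be independent: the stack works only if every die works. A minimal Python sketch of that arithmetic is shown below; the function name is hypothetical and independence is an assumption, as real defects can be correlated across a wafer.

```python
def stack_failure_probability(die_yield: float, dice_in_stack: int) -> float:
    """Probability that at least one die in an untested stack is bad.

    Assumes each die fails independently with probability (1 - die_yield).
    """
    if not 0.0 <= die_yield <= 1.0:
        raise ValueError("die_yield must be between 0 and 1")
    if dice_in_stack < 0:
        raise ValueError("dice_in_stack must not be negative")
    # The stack is good only if all dice are good.
    return 1.0 - die_yield ** dice_in_stack


if __name__ == "__main__":
    for count in (1, 2, 4, 8):
        failure = stack_failure_probability(0.9, count)
        print(f"{count} dice at 90% yield: {failure:.1%} chance of a bad stack")
```

For four dice at 90% yield, the result is 1 - 0.9^4, or roughly 34%.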
The industry responded by selling known good die (KGD) – unpackaged chips that have passed wafer level tests and a series of screening options. These parts can be forwarded to regular single die packaging as the cost of those failing after test and burn in is much less. Memory has one further advantage in multichip modules: you can use redundant bit and word lines to recover parts that would fail if you needed the memory cells to be 100% functional.
Because effective yields are relatively high in volume memories, it makes sense to stack them. The question then becomes whether it is worth adding processors and analogue devices to the stack. Good reasons are appearing. Today's memories are not designed with many I/O pins. Instead, they use heavily pipelined, high clock rate interfaces that transfer multiple bits per cycle. This comes at a cost: power consumption.
At the 2007 Design Automation Conference, Intel's microprocessor research director Shekhar Borkar painted a stark picture of what would happen with power consumption in future massively multicore processors. Interprocessor communication would dominate the power equation but some 25W – significant, even in today's server processors with a power envelope of 150W – would come from powering the memory bus I/O pins.
Most power is consumed by driving relatively long, high capacitance PCB traces. If memory is closer to the processor, the parasitics are cut dramatically. IBM has done this in its current generation of mainframe processors, using embedded DRAM instead of SRAM. The smaller memory cell makes it possible to double the amount of cache memory local to each processor core for the same die area.
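The link between trace capacitance and I/O power can be sketched with the standard switching power relation for CMOS outputs, in which power is proportional to capacitance, the square of the voltage swing and the toggle rate. The Python below is an illustration only: the function name, the activity factor and every numerical value in the example are hypothetical and are not taken from Borkar's presentation.

```python
def io_switching_power_w(pins: int,
                         capacitance_f: float,
                         voltage_v: float,
                         toggle_rate_hz: float,
                         activity: float = 0.5) -> float:
    """Estimate dynamic power for a bank of I/O pins.

    Uses P = pins * activity * C * V^2 * f, ignoring termination and
    static currents, which real interfaces also have to pay for.
    """
    return pins * activity * capacitance_f * voltage_v ** 2 * toggle_rate_hz


if __name__ == "__main__":
    # Hypothetical comparison: the same interface driving a long board
    # trace versus a much shorter in-package connection.
    board = io_switching_power_w(64, 10e-12, 1.5, 1e9)
    in_package = io_switching_power_w(64, 1e-12, 1.5, 1e9)
    print(f"board trace: {board:.2f} W, in package: {in_package:.2f} W")
```

Because the relation is linear in capacitance, any cut in the load seen by each pin translates directly into a proportional cut in switching power.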
According to IBM fellow Subramanian Iyer, the largest computers save up to 1.5kW in power from this larger cache – as it reduces the number of accesses to main memory. The next step is to bring off chip memory closer. You could take existing memories and place them side by side on a conventional chip package substrate with a redistribution layer or stack them using today's wirebonding technology.
The reduction in wire inductance from that alone would bring substantial power savings. But there is not enough space along the chip edge to support higher bandwidth. One method might be to use the entire top surface of the memory die for processor I/O and flip it so it sits on top of the processor. This provides an entire DRAM's worth of additional tightly coupled memory that can take advantage of more I/O pins to run the memory interface more slowly.
There is a catch: processors run much hotter than memories. DRAMs do not cope that well with heat and the arrangement would massively reduce the effectiveness of the heatsink on top of the processor-memory stack. Borkar proposed an alternative: sit the memory underneath the processor and drill holes to take conductive vias through the DRAM to support connections from the system to the processor (see fig 1).
Once one set of holes has been drilled through a die, you could go further and drill them through multiple DRAMs, providing the ability to stack a large amount of memory in one package. This is where IBM is looking to move its mainframe processors; Intel's future high end x86 devices may follow. Companies such as Samsung have proposed using the same concept for mobile phone processors.
According to Samsung, a shift to stacks that use through silicon via (TSV) connections could halve power consumption. As a result, Jedec subcommittee JC-42.6 is working on standards that will allow memories made by different vendors to be assembled in stacks, without demanding custom designs for each one. The effort, part of the DDR3 and DDR4 standards initiatives, will lead to the use of much wider I/O buses – as you can, in principle, get many more TSVs onto a die than wirebonded connections.
DDR4 is likely to be the first memory interface standard that supports 3d stacks from the start. There is another catch: if you want to maximise the die-area efficiency, the last thing you want to do is drill 5µm holes through the chip at regular intervals. These place big roadblocks in the way of the metal word and bit lines that wire up the individual bits of a DRAM or flash memory.
You can work around this by splitting the memory into smaller blocks so they fit between the TSVs but, although this helps reduce word- and bit-line parasitics, it reduces memory density. Iyer claims it is possible, with some work, to reduce the die-area overhead to around 5%, but this cost needs to be factored into the overall price of the 3d stack. On top of this, you have to factor in the cost of processing wafers for TSVs so they are ready for stacking.
Estimates vary, but the most optimistic place it at around 5% of the cost of a fully processed CMOS wafer, not including the cost of testing something with a lot of hidden I/Os. That, in turn, will place the burden of test on built in self test (BIST) and more advanced scan based algorithms. Although companies such as Toshiba have successfully demonstrated the manufacturing viability of TSVs in image sensors – putting the processing electronics directly underneath a CMOS sensor – via yield is a major concern.
Reactive ion, or dry, etching has proved startlingly successful. It can bore deep holes through a silicon wafer with very little spreading at the top of the hole. Unfortunately, it is not so easy to fill the hole with a conductor. Although dry etching can produce holes with extremely high aspect ratios, filling those holes with conducting materials means the wafer has to be thinned, which makes handling more difficult.
When the 130nm node was introduced, improperly filled contact holes quickly became one of the most likely sources of chip failure. The response was to double up on vias wherever possible – using statistics to fix a manufacturing problem on the basis that two neighbouring vias were unlikely to fail. Redundancy is likely to be a major concern for TSV based stacks, which will further eat into die area efficiency.
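The statistical argument for doubling up can be sketched as follows. If each via fails independently with a small probability, a connection made from two vias fails only when both do. The Python below is a hedged illustration: the function names, the failure rate and the connection count are hypothetical, and independence may not hold for neighbouring vias that share a local defect.

```python
def connection_failure_probability(via_failure: float, vias_per_connection: int) -> float:
    """Probability that every redundant via in one connection has failed."""
    return via_failure ** vias_per_connection


def all_connections_good(via_failure: float,
                         vias_per_connection: int,
                         connections: int) -> float:
    """Probability that every connection on the die has at least one good via."""
    per_connection = connection_failure_probability(via_failure, vias_per_connection)
    return (1.0 - per_connection) ** connections


if __name__ == "__main__":
    # Hypothetical figures: 1 in 10,000 vias is bad, 1000 connections per die.
    for redundancy in (1, 2):
        good = all_connections_good(1e-4, redundancy, 1000)
        print(f"{redundancy} via(s) per connection: {good:.4%} of dice unaffected")
```

The price of the improvement is area: every doubled connection needs two holes and, for TSVs, two sets of surrounding clearance.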
How and when you create vias presents the industry with a headache. It is not just a matter of drilling holes, filling them with copper and wiring them up. The decision of when in the process the TSV is created has a major impact. There are three basic approaches: via-first; via-middle; and via-last (see fig 2). On cursory inspection, via-first looks easiest – take a virgin wafer, drill holes in it, fill them with metal, then go about making the transistors that surround them.
Unfortunately, it is not that simple; CMOS transistor fabrication involves very high temperatures that will disrupt any metal interconnect. In conventional chipmaking, interconnect fabrication uses much lower temperatures after the core transistor structures are formed. So, instead of copper, you have to use a material such as polysilicon, which has lower conductivity. The filled holes put additional stress on the wafer, making it hard to keep it flat.
This, in turn, makes it hard to image patterns on the wafer because the depth of field of an optical stepper is so small [Fractured Future, NE 10 May 2011]. The via-middle process works around this by constructing the transistors first, then forming TSVs before the metal interconnect layers are laid down. However, this means the foundry has to deal with processes that are more familiar to packaging houses.
The via-last option makes it possible to have wafers made at a front end foundry, then take them elsewhere for the TSVs to be inserted as a final step before packaging. Although it makes introducing TSVs to an existing manufacturing flow easier, the contacts can only be connected to the top layer of metal – so the via acts as a 10µm obstruction to any routing in lower, much finer metal layers.
This, again, will reduce the effective density of any device, particularly logic chips which are often limited by routing congestion. Although challenging, via-middle looks to be the best compromise. Even then, you have to take into account the effect of implementing keep-out zones around the TSVs to avoid stress changing the properties of transistors. Work by IMEC and Synopsys, amongst others, has shown these zones can be quite large – and need to be much bigger for analogue circuitry.
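As a rough sketch of how quickly keep-out zones consume silicon, the Python below treats a TSV and its keep-out zone as concentric circles and computes the blocked area. The circular geometry, the function name and the 5µm via diameter are illustrative assumptions; the keep-out distances of 5µm, 20µm and 200µm span the range reported for digital and analogue circuitry.

```python
import math


def blocked_area_um2(tsv_diameter_um: float, keep_out_um: float) -> float:
    """Area in square microns of a TSV plus its surrounding keep-out ring.

    Assumes a circular via with a keep-out zone of uniform width around it.
    """
    radius = tsv_diameter_um / 2.0 + keep_out_um
    return math.pi * radius ** 2


if __name__ == "__main__":
    for keep_out in (5.0, 20.0, 200.0):
        area = blocked_area_um2(5.0, keep_out)
        print(f"keep-out {keep_out:>5.0f} um: {area:,.0f} um^2 blocked per via")
```

Because the area grows with the square of the radius, a tenfold increase in keep-out distance costs far more than ten times the silicon.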
Digital keep-out zones can be from 5µm to 20µm around the TSV; the analogue no man's land is up to 200µm. However, if analogue circuits are on older processes, their relative die-area cost will be much lower. You then have the question of how 3d stacks are formed. For maximum throughput and, in principle, minimum cost, assembly is done on a wafer by wafer process.
There are two main objections to this. One is that you need the dice to all be the same size – this is not going to work for standard memories unless all chipmakers addressing the mobile phone market suddenly agree on a common die size for their processors. The other obstacle is not being able to screen out failed devices until the end of the process.
You are, once again, back to the yield collapse that faces any stack that does not use KGD. As a result, although throughput is lower, die to die stacking is likely to be the approach most commonly used for all but the most cost sensitive applications. When you consider the extra costs – all those extra 5% costs soon add up – 3d integration has some big hurdles to overcome.
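How the individual overheads compound can be sketched in a few lines of Python. The example takes the two 5% figures quoted earlier, for die-area overhead and TSV wafer processing, and multiplies them together. The function name is hypothetical, and the assumption that the overheads simply multiply is a simplification that ignores test, redundancy and yield loss.

```python
def relative_die_cost(area_overhead: float = 0.05,
                      tsv_processing_overhead: float = 0.05,
                      other_overheads: tuple = ()) -> float:
    """Cost of a TSV-ready die relative to the same die without TSVs.

    A larger die means fewer dice per wafer, and TSV processing raises the
    cost of each wafer; the two effects are assumed to multiply.
    """
    cost = (1.0 + area_overhead) * (1.0 + tsv_processing_overhead)
    for overhead in other_overheads:
        cost *= 1.0 + overhead
    return cost


if __name__ == "__main__":
    print(f"area and TSV processing only: {relative_die_cost():.4f}x")
    # Hypothetical extra 5% for additional test effort.
    print(f"with a further 5% overhead: {relative_die_cost(other_overheads=(0.05,)):.4f}x")
```

Two 5% overheads already add a little over 10% to the cost of each die before test and redundancy are counted.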