He only projected as far as 1980 but his estimates pointed to die area at least doubling in five years, hitting 60 square millimetres by that time. In the mid-2000s, Intel did its best to keep up with the extrapolation courtesy of its reticle-busting Itanium processors that clocked in at almost 700sq mm. As those processors attracted just a handful of customers, it was clear that most of the industry was happier with smaller devices and using more of them.
A little over a decade later, the burst in activity around artificial intelligence has put big chips back on the menu.
“Chips are getting larger and the area is mostly consumed by the AI blocks,” said Mustafa Badaroglu, principal engineer at Qualcomm, in his outline of the changes to the International Roadmap for Devices and Systems (IRDS) “More Moore” section during the organisation’s May preview seminar. Whereas SoCs for mobile devices might continue to opt for die sizes below 100 square millimetres, AI is taking them to 500 and beyond.
Cerebras has to some extent kept Moore’s extrapolation alive with a processing array that covers almost 50,000sq mm of silicon. As the reticle area of a scanner limits how much of the wafer can be exposed at one time to less than 1000sq mm, the company worked with TSMC to find a way to easily stitch the component chips together into a larger array.
Poor yield is a major problem for many superchips, a factor that is not helped by random defect counts that are going up as processes become denser and far more complicated to make.
One option, used by Cerebras, is to build in redundancy to avoid having to bin devices that suffer major failures in one part of the die. Where redundancy is hard to implement, dividing arrays into smaller chiplets greatly increases the average yield in the face of random defects.
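The effect of die area on yield can be illustrated with the simple Poisson defect model, in which the chance of a die having no random defects falls exponentially with its area. The sketch below is only an illustration: the defect density and the two die areas are assumed values, not figures from any foundry or from the companies mentioned here.

```python
import math

def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies expected to have zero random defects (Poisson model)."""
    return math.exp(-area_mm2 * defects_per_mm2)

# Assumed defect density: 0.1 defects per square centimetre (hypothetical).
D0 = 0.001  # defects per square millimetre

monolithic = poisson_yield(800.0, D0)  # one large die (assumed area)
chiplet = poisson_yield(200.0, D0)     # one of four smaller dies (assumed area)

print(f"800 sq mm die yield:     {monolithic:.1%}")
print(f"200 sq mm chiplet yield: {chiplet:.1%}")
```

Under these assumptions, roughly 45 per cent of the large dies come out defect-free against about 82 per cent of the chiplets. The chance that four particular chiplets are all good is the same as for the single large die, so the gain comes from discarding only the defective 200sq mm pieces rather than the whole 800sq mm.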
At the International Solid State Circuits Conference (ISSCC) in February, AMD product technology architect Sam Naffziger said that assembling a 32-core Epyc processor from four chiplets instead of putting everything on one die saved around 40 per cent in cost, even taking into account the redundant I/O logic each chiplet contained.
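A rough way to see how a saving of this kind can arise, even with duplicated I/O, is to compare the wafer area spent per good product under a Poisson yield model. Everything in this sketch is an assumption made for illustration (defect density, die area, I/O overhead); it is not AMD's cost model and it ignores packaging, test and assembly costs.

```python
import math

def silicon_per_good_die(area_mm2: float, defects_per_mm2: float) -> float:
    """Average wafer area consumed to obtain one defect-free die."""
    die_yield = math.exp(-area_mm2 * defects_per_mm2)
    return area_mm2 / die_yield

D0 = 0.001          # assumed defect density, defects per sq mm
MONO_AREA = 700.0   # assumed area of the single-die design, sq mm
IO_OVERHEAD = 0.10  # assumed extra area per chiplet for duplicated I/O
N_CHIPLETS = 4

# Each chiplet carries a quarter of the logic plus its own I/O overhead.
chiplet_area = MONO_AREA / N_CHIPLETS * (1 + IO_OVERHEAD)

mono_cost = silicon_per_good_die(MONO_AREA, D0)
chiplet_cost = N_CHIPLETS * silicon_per_good_die(chiplet_area, D0)
saving = 1 - chiplet_cost / mono_cost

print(f"silicon per good monolithic die: {mono_cost:.0f} sq mm")
print(f"silicon per good 4-chiplet set:  {chiplet_cost:.0f} sq mm")
print(f"saving in silicon: {saving:.0%}")
```

With these made-up inputs the chiplet version uses about a third less silicon per good product despite being 10 per cent larger in total area. The result is sensitive to the assumed defect density, so it should be read as a direction of travel rather than a reproduction of the quoted figure.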
Another factor that points to the use of more, smaller chips is that lithography is moving in the wrong direction for monolithic silicon. The next generation of EUV scanners will use optical enhancement techniques that call for the area they can cover in one exposure to be halved. Chips with die sizes similar to nVidia’s A100, which clocks in at 826sq mm, would need more than one pass for each mask.
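The arithmetic behind that last point can be sketched as follows. The field dimensions are an assumption not taken from the text above: 26mm by 33mm is a commonly quoted full-field size for current scanners, and the halved field is taken as 26mm by 16.5mm. The calculation compares areas only and ignores whether a given die's shape would actually fit the field.

```python
import math

# Assumed exposure field sizes in square millimetres (not from the article).
FULL_FIELD_MM2 = 26 * 33    # 858 sq mm
HALF_FIELD_MM2 = 26 * 16.5  # 429 sq mm, the full field halved

def exposures_per_layer(die_area_mm2: float, field_area_mm2: float) -> int:
    """Minimum number of exposures needed to cover a die, by area alone."""
    return math.ceil(die_area_mm2 / field_area_mm2)

A100_AREA_MM2 = 826  # die size quoted in the text

full = exposures_per_layer(A100_AREA_MM2, FULL_FIELD_MM2)
half = exposures_per_layer(A100_AREA_MM2, HALF_FIELD_MM2)
print(f"full field: {full} exposure(s), halved field: {half} exposure(s)")
```

On these assumptions an 826sq mm die fits a single full-field exposure but needs two once the field is halved, with the two halves then having to be stitched together.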
What type of multichip integration?
The problem that faces chipmakers is deciding on what kind of multichip integration they should choose. The catalogue of multichip integration options from foundries and packaging houses keeps expanding. It started with CoWoS, which uses a large passive silicon interposer to support more than one chip. Xilinx adopted it for larger members of the Virtex-7.
If the interconnect does not have to run underneath entire chips, another option is a silicon bridge. Altera, later acquired by Intel, used the chip giant’s EMIB technology to provide short, low-capacitance links between the core programmable-logic array and I/O chiplets for its larger FPGAs.
In the mobile market, fan-out wafer-level packaging (FOWLP) technologies, such as TSMC’s InFO and ASE’s FOCoS, are the options OEMs tend to pick. With the interconnect formed in an organic substrate, the cost is lower than for interposers, though the trade-off is fewer I/O lines.
Even within FOWLP, there are multiple choices for integration. The main option today is so-called chip-first where the silicon is embedded in the polymer and the circuitry formed around it. Chip-last is now appearing on some packaging menus to handle situations where the manufacturer wants to be sure the package works before inserting an expensive SoC into it.
Intel, TSMC and others are now looking to 3D stacking of devices. Though it’s been a popular research topic for well over a decade, stacking has largely been confined to image sensors or package-on-package techniques where ultra-thin memories are combined in a single unit and integrated with the SoC using FOWLP.
High-bandwidth memory (HBM) uses through-silicon vias (TSVs) to interconnect DRAM die in a single stack but has found it hard to gain traction because of its high cost. Again, the AI industry is turning to it, but primarily through side-by-side integration on interposers. That trend is leading to a growth in the size of the interposer.
At the company’s symposium early this month, Jerry Tzou, director of the advanced packaging business at TSMC, said interposers that cover the area of two reticles are now ready and that a three-reticle option is likely to be qualified for production in 2022.
Thermal compatibility is a problem if DRAM is stacked directly on top of high-performance silicon. The hotspots need a higher refresh rate, which increases energy consumption and the need for cooling.
Instead, Badaroglu sees a brighter future for SRAM being stacked on top of processors for low-latency caches and working memory with DRAM used as bulk data memory mounted to the side. “We are seeing a significant trend on silicon interposer. With the larger area you pay more for the interposer but save on cooling cost,” Badaroglu says.
Die-level yield and its effect on the final packaged yield remain the big issue for those looking to cut costs with disaggregation. The Microelectronics Packaging and Test Engineering Council has run a series of seminars on heterogeneous integration with one topic underlining the problem that manufacturers of multichip modules have faced for a long time. In one of those seminars last year, Jan Vardaman, founder and president of TechSearch International, asked rhetorically “after 30 years why are we still talking about known good die?” She pointed to a 1993 survey that identified the biggest barrier to multichip modules as being unable to identify defective unpackaged chips before assembly.
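The reason known good die matters so much is that uncertainty about each chip compounds across the package. A minimal sketch, assuming each die is independently good with some probability after wafer-level test and that the module works only if every die does; the probabilities used are illustrative, not survey or industry data.

```python
import math

def module_yield(die_good_probs, assembly_yield=1.0):
    """Probability that a multichip module works: every die must be good
    and the assembly step itself must succeed (independence assumed)."""
    return assembly_yield * math.prod(die_good_probs)

# Four dies, each 99 per cent likely to be good after thorough testing.
well_tested = module_yield([0.99] * 4)

# Four dies, each only 90 per cent likely to be good after limited testing.
poorly_tested = module_yield([0.90] * 4)

print(f"module yield, well-tested die:   {well_tested:.1%}")
print(f"module yield, poorly tested die: {poorly_tested:.1%}")
```

In this toy case, dropping per-die confidence from 99 to 90 per cent takes the module yield from about 96 per cent to about 66 per cent, and every failed module takes three good dies with it.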
One way to deal with the doubt is improved wafer-level testing, though this has its own problems. The pads are so small that misalignment with a probe is inevitable. One response being pursued in proposed standards such as IEEE P1838 is to put more test logic into the chiplets so that they can test most of their connections and use the probe contacts to report their success or failure rather than rely on the contact measurements directly.
Built-in self-test coupled with redundancy will deal with many failures where the target device is a processor or memory array though there will inevitably be situations where redundancy cannot help.
Process management will likely come into play with machine learning being used to track how well certain wafers or batches fare in the field and to tune fab operations to minimise problems that keep turning up. Among others, the IRDS team is developing recommendations to minimise the kinds of contamination that cause random defects. Dan Wilcox, director of process engineering at Page Southerland Page, says the yield group is using test wafers with circuits created by the More Moore group to identify which contaminants are the most problematic.
The packaging process need not wait until the final chips are in place to start testing. AMD’s recent work has involved progressively applying tests and measurements during assembly to weed out failures before they have committed too many expensive chips.
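Why testing during assembly saves money can be shown with a simple expected-loss calculation. The model here is a hypothetical one, not a description of AMD's flow: chips are attached one at a time, each attach step succeeds independently with the same probability, and a failed module is scrapped along with every chip already attached to it.

```python
def scrapped_final_test_only(n_chips: int, p_step: float) -> float:
    """Expected chips lost per module when testing happens only at the end:
    all n chips are committed, and all are lost if any step failed."""
    return n_chips * (1 - p_step ** n_chips)

def scrapped_progressive_test(n_chips: int, p_step: float) -> float:
    """Expected chips lost per module when each attach step is tested:
    a failure at step k loses only the k chips attached so far."""
    return sum(
        k * p_step ** (k - 1) * (1 - p_step)  # first failure occurs at step k
        for k in range(1, n_chips + 1)
    )

N_CHIPS = 4    # assumed number of chips per package
P_STEP = 0.95  # assumed success probability of each attach step

final_only = scrapped_final_test_only(N_CHIPS, P_STEP)
progressive = scrapped_progressive_test(N_CHIPS, P_STEP)
print(f"expected chips scrapped, final test only:  {final_only:.3f}")
print(f"expected chips scrapped, progressive test: {progressive:.3f}")
```

Under these assumptions, catching failures step by step cuts the expected number of chips scrapped per module from about 0.74 to about 0.45. The sketch leaves out the cost of the extra tests themselves, which a real comparison would have to include.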
It is a learning process that has been going on for decades but which has suddenly accelerated and which, if successful, will provide an economically viable alternative to monolithic integration for a wider array of SoC projects.