08 January 2013
SoCs with more powerful cores need a more powerful interconnect
When mobile phones were used only to make voice calls and to send texts, the communications world was a much simpler place. But things have changed. Mobile phones have got smart and many people now use them as their main communication device, not only making voice calls, but also downloading and uploading videos to and from social media sites and browsing the internet.
The result is that a vast amount of data is now passing around communications networks, which means more intelligence and more processing power need to be applied.
Troy Bailey is director of marketing with LSI. He said the company is seeing a 'data deluge'. "There is more and more data driven by video and mobile use. By most projections, the amount of data will outstrip the capacity of the infrastructure in the future, so what's needed is faster devices to handle more data."
Bailey also says there is a need for smarter devices. "We have to develop better ways to handle data; for instance, not moving data that doesn't need to be moved. One of the ways we can do that is to add intelligence and processing throughout the network, rather than at gateways."
The way to do this, in LSI's opinion, is to add more and faster general purpose cores to network processors, but also to add acceleration engines to do those tasks with which general purpose cores struggle. "For example," Bailey said, "there's a lot of activity on a per packet basis – classifying, deep packet inspection. If you do these tasks with a general purpose processor, it will be slow and expensive."
He has an analogy: "A mechanic with a basic set of tools can fix your car, but a specialist who works on one part of the car will have special tools and special knowledge."
But, as he noted, traffic management is an important element in designing the architecture of a network processor. "If you can avoid sending data over the network, you're better off and particularly so if you can cache it or put processing capacity closer to the network edge."
LSI has a range of devices either available or in the planning stage. "We have single and dual core devices that perform the same tasks," Bailey explained, "but we also have devices with dozens of cores. We see a strong opportunity to handle data in special purpose hardware, so devices will have more engines and more cores. This will need a balance between general purpose and special hardware."
And the question of which cores to use has been under discussion. Until recently, LSI has based its network processors on PowerPC cores, but an announcement early in 2012 revealed ARM cores are now on the road map. "Some of these discussions are driven by customer requirements," Bailey said. "The ARM architecture is strong and there's a good ecosystem, so the move makes a lot of sense. LSI's approach is based on hardware acceleration, which also makes sense, and we are not looking to use proprietary cores. So while a Cortex-A15 doesn't necessarily bring more performance, it is more power efficient."
Then comes the challenge of linking all these cores together. And LSI has turned again to ARM, taking a lead license for ARM's CCN-504 interconnect. "We have helped ARM to define what's required in such an interconnect. As you add more cores – particularly accelerators – you end up with a lot of compute elements and when that happens, there's opportunities for bottlenecks. You could end up adding more cores, but getting lower performance," Bailey contended.
Neil Parris is ARM's interconnect product manager. He said CCN-504 had been developed specifically to address the issue of more cores. "It's about providing coherency between the cpus and the I/O and about using data on chip."
In some respects, it's a consequence of integration. "There used to be a range of chips which needed to be connected," Parris observed. "Now, it's a single chip with multiple cores which is power critical and which needs to interface to the latest technology."
CCN-504 – CCN stands for cache coherent network – is the first in a family of interconnects being developed to support future complex devices. "It supports four cpu clusters," Parris said, "and each cluster can comprise up to four cores. It also supports ARM's 64bit architecture, which is important for those people building servers.
"Each cpu cluster has an L2 cache, which is configurable to 2Mbyte, or 4Mbyte in the case of the Cortex-A15. The interconnect's purpose is to join all the processors in a coherent manner, making sure all cores have a consistent view of memory."
But CCN-504 isn't ARM's first cache coherent network. "That was the CCI-400," Parris said. "That's aimed at mobile applications with two clusters, including big.LITTLE."
Caching is an important element and one which supports Bailey's view that you shouldn't move data if you don't have to. "Caches are important contributors to power efficiency and performance," Parris pointed out. "The more data you have on chip, the fewer the accesses needed to external memory. It helps with power consumption and performance."
CCN-504 has also been built with cores other than ARM's in mind. The network can support up to 18 AMBA interfaces, which allows designers to take advantage of such functions as 40Gbit Ethernet, USB and serial ATA links. It also features PCI-Express connectivity. "Companies will use this facility to add their own IP into an SoC," Parris explained. "For example, they may wish to add their own accelerator, and it's our aim to provide them with a scalable platform on which they can build."
All 18 AMBA interfaces are connected to the cache coherent network through an I/O virtualisation block which provides unified system memory. "AMBA defines interconnect," Parris said, "and CCN-504 builds on the AMBA interconnect. It has an integrated L3 cache, which can be configured from 8 to 16Mbyte, and a snoop filter." The snoop filter tracks which caches may hold a copy of each line, so coherency can be maintained without broadcasting every request to every cache, which reduces bus traffic.
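As a rough illustration of how a snoop filter can reduce traffic, the Python sketch below models a directory that records which caches may hold each line. It is a deliberately simplified model and not a description of how CCN-504 is implemented: the class name, the method names and the rule that only writes trigger snoops are all assumptions made for the example.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy directory: remembers which caches may hold each cache line."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.sharers = defaultdict(set)  # line address -> ids of caches holding it
        self.snoops_sent = 0

    def read(self, cache_id, addr):
        # Simplification: a read only records the reader as a sharer.
        self.sharers[addr].add(cache_id)

    def write(self, cache_id, addr):
        # Invalidate only the caches recorded as sharers, not every cache.
        targets = self.sharers[addr] - {cache_id}
        self.snoops_sent += len(targets)
        self.sharers[addr] = {cache_id}
        return sorted(targets)

    def broadcast_cost(self):
        # Snoops a write would need if every other cache had to be asked.
        return self.num_caches - 1


sf = SnoopFilter(num_caches=4)
sf.read(0, 0x1000)
sf.read(2, 0x1000)
invalidated = sf.write(0, 0x1000)
print(invalidated, sf.snoops_sent, sf.broadcast_cost())
```

In this example only cache 2 is snooped when cache 0 writes the line, where a broadcast scheme with four caches would have queried three.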
If the SoC does need to access external memory, ARM has developed the DMC-520 memory controller for 72bit wide DDR3/4. This supports a maximum bandwidth of 25.6Gbyte/s per channel and features buffering to optimise reads and writes. It's the fifth generation DMC and includes error checking and correction features.
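The 25.6Gbyte/s figure can be checked with simple arithmetic. The sketch below assumes that the 72bit interface carries 64 data bits plus 8 bits of error correction, and that the memory runs at 3200 million transfers per second; neither assumption is stated in the article, and the function name is made up for the example.

```python
def peak_bandwidth_gbyte_s(data_bits, transfers_per_s):
    """Peak bandwidth of one memory channel in Gbyte/s."""
    bytes_per_transfer = data_bits / 8
    return bytes_per_transfer * transfers_per_s / 1e9


DATA_BITS = 64             # assumption: 72bit bus = 64 data bits + 8 ECC bits
TRANSFERS_PER_S = 3200e6   # assumption: 3200 million transfers per second

bandwidth = peak_bandwidth_gbyte_s(DATA_BITS, TRANSFERS_PER_S)
print(f"{bandwidth:.1f} Gbyte/s per channel")
```

Under those assumptions the result matches the quoted 25.6Gbyte/s per channel; a slower memory speed grade would scale the figure down in proportion.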
Overall, CCN-504 supports a system bandwidth of around 1Tbit/s and operates at up to the cpu clock rate. "This network scales the performance of the CCI-400 significantly," Parris noted, "with more ports and a larger cache. At the moment, it's 128bit wide, but future devices will move up, including bandwidth," he added.
Bailey said LSI needed a strong technology partner for interconnect. "It's not our point of differentiation," he said, "so the licensing approach made sense. When you think of an SoC with 16 cores, there may be a total of 30 compute elements. It's a complex design and that's why it needs a robust networking solution."