How multiple cores are improving data processing efficiency

Putting multiple cores on a chip is no longer seen as a novel way to boost computational performance. Chip firms recognise that using multiple cpus in parallel gives a better return than complex architectural tweaks to the cpu itself. With this approach, each elemental core can be tailored in terms of chip area and power consumption, especially now that the core's clock frequency is no longer the overriding concern.

But multicore devices bring challenges – how best to program a multicore architecture, and how to avoid data pinch points between cores and between the cores and external devices.

Cavium Networks has long adopted multicore architectures for its communication processors. Its first generation Octeon Plus features a 16 cpu device. Now, with its Octeon II family, it has hiked the number of MIPS64 v2 risc cores it uses per chip to 32. According to market research firm The Linley Group, this should provide a 400% processing performance increase over the 16 core Octeon Plus.

Octeon II was first detailed a year ago with the CN63xx family. Two additional families have now been announced: the CN67xx, with 8 to 16 cores; and the CN68xx, with 16 to 32. "The CN67xx has two memory controllers, whereas the CN68xx has four," said Venkat Sundaresan, senior product line manager at Cavium Networks. "If you need 40Gbit/s [line rate], you take the CN68xx. If you need 20Gbit/s and you don't want to pay for the extra performance, you go for the CN67xx."

The CN67xx and CN68xx are pin for pin compatible families that target common applications, including cloud computing, packet networks and enterprise. Security and wide area network (WAN) optimisation are examples of cloud computing tasks. For mirroring applications, where data is stored in more than one data centre, Octeon II can run lossless compression algorithms. Such techniques reduce the amount of data sent and make better use of the WAN's capacity.

The devices' packet processing role includes uses within 3G and emerging LTE cellular networks. Here, the CN63xx is used in basestations, whereas the CN68xx would be used by platforms deeper within the network, such as the 3G Radio Network Controller gateway. Tasks include packet aggregation, applying IP Security (IPSec) encryption and protecting against malicious software.

The processor is also aimed at the 40Gbit/s and 100Gbit/s line cards used in high end switches and routers.
Here, the CN68xx would perform Layer 3 and higher packet processing tasks. "Typically, the [CN68xx] processor is tied to an fpga or an asic," said Sundaresan. "The asic does the switching, while all the exception packets, IPSec or any application level processing, is offloaded to this processor."

Besides enhanced MIPS cores, the latest Octeon II families have dedicated hardware blocks, more memory and a variety of new interfaces. The MIPS64 v2 core is clocked at up to 1.5GHz, has a 32kbyte Level 1 cache and a 4Mbyte Level 2 cache. Cache associativity – how data is stored – has also been enhanced, as has the core's micro architecture in terms of predicting which branch the program code will take and making associated decisions ahead of time.

A new functional block, dubbed the HFA, has been added for deep packet inspection (DPI). Cavium already has devices tailored for DPI and has used its know how to enhance Octeon II. With DPI, not only is packet header information examined, but also footer, source, destination and payload information, with the payload used to identify applications and protocols. Such information is useful to fixed and mobile operators, both for managing their networks and for the services that run over them.

A key functional block is the application acceleration manager. The on chip packet processing engine – packet input v2 – processes incoming packets, queues and tags them. The application acceleration manager inspects these processed packets, schedules a task to a free core and determines what happens to the result. "The manager acts as a control block, not a data block," said Sundaresan. For example, the manager could set up a direct memory access transfer, choosing the core to which the data stream is sent, before the core executes Layer 3 and higher packet processing tasks.

"We have a [MIPS] architectural license and we can add specialist instruction sets to accelerate packet processing," said Sundaresan.
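
The queue, tag and schedule flow described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Cavium's design: the names `Packet`, `flow_tag` and `Scheduler`, and the choice of a CRC over the source and destination as the tag, are all assumptions made for the example.

```python
import zlib
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    src: str
    dst: str
    payload: bytes


def flow_tag(pkt: Packet) -> int:
    # Hypothetical tag: packets with the same endpoints get the same value.
    return zlib.crc32(f"{pkt.src}->{pkt.dst}".encode())


class Scheduler:
    """Control-only block: it decides which core gets which packet,
    but never touches the payload itself."""

    def __init__(self, num_cores: int):
        self.free_cores = deque(range(num_cores))
        self.pending = deque()   # tagged packets waiting for a core
        self.assigned = []       # log of (core, tag) decisions

    def submit(self, pkt: Packet) -> None:
        # Tag and queue the incoming packet, then try to hand out work.
        self.pending.append((flow_tag(pkt), pkt))
        self._dispatch()

    def core_done(self, core: int) -> None:
        # A core reports that it has finished and is free again.
        self.free_cores.append(core)
        self._dispatch()

    def _dispatch(self) -> None:
        # Pair the oldest waiting packet with the longest-idle free core.
        while self.pending and self.free_cores:
            tag, _pkt = self.pending.popleft()
            core = self.free_cores.popleft()
            self.assigned.append((core, tag))
```

With two cores and three packets, the third packet waits in `pending` until `core_done()` is called, at which point it goes to whichever core has just been released.
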
The Octeon II family also features an 8Tbit/s crossbar switch. "When you have 32 cores, each running at 1.5GHz, there is a lot of I/O traffic. Traditional bus architectures will not handle that," said Sundaresan. "We have an [on chip] switched network that enables the traffic to run in a non blocking way."

To tackle device power consumption, Cavium switches off unused parts of the cores. It also employs dynamic power management, whereby the chip's voltage is altered based on processing load. In this way, higher voltages are used only in critical parts of the chip. "If the applications load is low, it [the chip] throttles down the speed," said Sundaresan.

The chip's interfaces have also been enhanced in terms of the I/O performance and protocols supported. The device includes the high speed PCI Express Gen 2, which can be used to link to another Octeon II processor or to an asic. It also has 10 gigabit Ethernet (GbE) ports and Interlaken interfaces that can link either two CN68xx processors, or the processor to a ternary content addressable memory. Interface options include up to eight Interlaken lanes at 6.25Gbit/s each, up to five 10GbE ports, up to two PCIe Gen 2 controllers and up to 12 GbE ports.

The resulting processing and interface architecture means the CN68xx is capable of 40Gbit/s duplex line rate processing, can compress data at 20Gbit/s and can perform DPI at 15Gbit/s.

According to The Linley Group, Cavium's chip will deliver several times the peak instructions per second of NetLogic's forthcoming XLP832, although the NetLogic device will offer similar memory and I/O bandwidth and should have better single thread performance. Cavium's Octeon designs have also proven more power efficient than those from RMI (now NetLogic), claims The Linley Group. It expects the same to be true with the CN68xx, even though, at 65nm cmos, it lags the XLP832 by one process node.

As for software, the Octeon families share a compiler.
"Anything that runs on the two core device should run on the eight core and on a 32 core," said Sundaresan. "The compiler and software development kit abstract away the number of cores."

Both Octeon II families will sample in the final quarter of 2010. Sundaresan is confident Cavium will meet the deadline, despite the six core CN63xx having slipped two quarters. "The complication going from Octeon Plus to Octeon II was primarily the new core," said Sundaresan. "We introduced [on the CN63xx] a lot of new high speed interfaces – PCI Express Gen 2, DDR3 [memory interface], compression and new blocks; these are the same as found on the CN68xx."

The practice of adding more cores to boost packet processing performance will continue. "We have ensured that the packet and control planes can scale with the number of cores," said Sundaresan. "Adding more cores is still possible."
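
As a rough illustration of what software that abstracts away the number of cores can look like, the sketch below sizes a worker pool from whatever the operating system reports and otherwise runs unchanged. It uses only Python's standard library and has nothing to do with Cavium's compiler or SDK; `inspect` and `process_all` are hypothetical names, and the per-packet work is a stand-in.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def inspect(packet: bytes) -> int:
    # Stand-in for real per-packet work: here, just sum the payload bytes.
    return sum(packet)


def process_all(packets, workers=None):
    """Apply inspect() to every packet using a pool of `workers` threads.

    If `workers` is not given, use the core count reported by the OS,
    so the calling code never hard-wires a particular number of cores.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order regardless of pool size.
        return list(pool.map(inspect, packets))
```

Calling `process_all(packets)` gives the same results whether the pool has two workers or 32; only the degree of concurrency changes. In CPython, threads show the structure rather than a real speed-up for cpu bound work.
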