Micro engine magic

40 micro engines help maintain line rate in computationally demanding packet processing applications.

Depending upon the tasks their devices are to perform, designers of network processors typically take one of two approaches – they either design processors with dedicated hardware functions or they adopt more flexible, programmable architectures. Dedicated hardware is favoured for tasks such as bit encoding and decoding, and packet routing – layer 2 and layer 3 tasks, respectively, of the seven layer OSI model. In contrast, programmability is required for more computationally demanding layer 4 to layer 7 tasks, such as IP security and deep packet inspection. Here, designers typically use multiple general purpose processor cores, but the challenge is to stop the line rate – the device's packet throughput – tailing off as the higher layer tasks become more demanding.

Netronome has designed its NFP-3200 network flow processor family to be adept at both. "It [the NFP-3200] can perform deep packet inspection at 10Gbit/s full duplex data rates," said Bill Duggan, Netronome's product manager for the network flow processor. The NFP-3200 executes lower layer packet processing tasks at line rate, although it is oriented to layer 4 to 7 packet flow applications. With packet flow, the network processor not only manipulates individual packets, but must also understand their sequence, or flow. Unlike general purpose processor designs, the NFP-3200 uses risc cores – 'micro engines' – that maintain line rate independent of processing load.

"It [NFP-3200] is a unique product, distinct from network processors and general purpose multicore devices," said Bob Wheeler, senior analyst at The Linley Group. "With network processors, there is a continuum of products spanning all the way up to Netronome's: a highly programmable, flexible device with little hardware acceleration."

Netronome was originally a software company. But, in 2007, it acquired the rights to Intel's IXP network processor architecture and associated hardware and software tools. The NFP-3200 family is thus a third generation design, following Intel's original IXP1200 and the second generation family that includes the IXP2800.

The NFP-3200 is being aimed at what Netronome calls unified computing applications: servers and networking infrastructure spanning broadband and wireless access, and network security. An important part of server design is the use of virtualisation technology, which separates a software function from specific hardware. Using virtualisation, multiple software instances can run on one server, significantly increasing the utilisation of the platform's processors. For computing, the NFP-3200 can be used as a centralised processor to balance a server's workload across its multicore cpus. "With a data centre server, the network flow processor can take the 10Gbit/s stream and parse out the workload, directing network traffic to the [server's] virtual machines," said Duggan.

But Wheeler believes servers will not be the main market for the NFP-3200, since 10Gbit/s Ethernet controllers with on chip virtualisation hardware already exist. Rather, he sees networking equipment as the device's strongest target market, pointing out that the likes of Cisco Systems already use the IXP2800 on the application control engine (ACE) card fitted to the Catalyst 6500 Ethernet switches and 7600 router family. Cisco's ACE is used for application load balancing, data encryption and decryption, and deep packet inspection to detect malicious cyber attacks.
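The idea of directing traffic by flow, rather than packet by packet, can be pictured with a simple dispatcher that maps each packet's 5-tuple to a queue, so every packet of one flow reaches the same virtual machine in order. The C fragment below is a minimal sketch of that idea only; the hash, the queue count and the function names are invented for illustration and do not describe Netronome's firmware.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_VM_QUEUES 16          /* hypothetical number of per virtual machine queues */

    /* The classic 5-tuple that identifies a flow. */
    struct five_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;
    };

    /* Mix the 5-tuple into a hash; any stable hash will do. */
    static uint32_t flow_hash(const struct five_tuple *t)
    {
        uint32_t h = t->src_ip ^ t->dst_ip;
        h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
        h ^= t->protocol;
        return h * 2654435761u;       /* multiplicative scramble */
    }

    /* Packets sharing a 5-tuple always land in the same queue, so each
     * virtual machine sees its flows arrive complete and in sequence. */
    static unsigned pick_vm_queue(const struct five_tuple *t)
    {
        return flow_hash(t) % NUM_VM_QUEUES;
    }

    int main(void)
    {
        struct five_tuple t = { 0x0a000001, 0x0a000002, 49152, 80, 6 }; /* example TCP flow */
        printf("queue %u\n", pick_vm_queue(&t));
        return 0;
    }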
"No other 10Gbit/s device includes a crypto engine accelerator for IPSec," said Wheeler. IPSec is an encryption protocol used to secure IP traffic. For fixed and wireless broadband access, the device can indentify flows and shape and rate limit traffic depending on its quality of service. The NFP-3200 can also be used to perform packet inspection to block or monitor peer to peer traffic, for example, or identifying applications in wireless for billing purposes, said Wheeler. The NFP-3200 family has two members – the 3216 and 3240 – that represent a significant upgrade to the IXP2800. The 3216 is a direct replacement for the IXP2800, providing a roadmap for the Intel design that will come to the end of its life in 2012. The NFP-3216 has 16 micro engine risc cores clocked at 1.4GHz. The 3240, in contrast, uses 40 micro engines and is available clocked at either 1GHz or 1.4GHz. Besides faster clocked cores, the NFP-3200 family also has an upgraded general purpose on-chip processor and peripherals. An ARM11 is used instead of the IXP2800's XScale core and includes two 32kbyte layer 1 data and instruction caches, and a 256kbyte layer 2 cache. Additional NFP-3200 changes include replacing rdram with cheaper DDR2/3, two programmable interfaces: one supporting SPI-4.2, 10Gbit Ethernet, four 1Gbit Ethernet MACs (or Interlaken); the second, is identical, except without the SPI-4.2 interface. In turn, the NFP-3200 supports eight lanes of second generation PCI express. Each micro engine is a 64bit instruction, 32bit data risc processor with an 8kword instruction store. The NFP-3200 thus has a restricted code store compared to multicore general purpose processor designs where instructions are held in main memory. Netronome has tackled this by doubling the store to 16k, through pairing two micro engines. While the code store is limited, having on chip code brings processing performance advantages. "The NFP-3200 is good at maintaining line rates, but so are layer 2 and 3 network processors," said Wheeler. "But for layer 4 to layer 7, more general purpose designs get bogged down in packets and run into memory stalls," he said. Each core can process eight threads, with a packet typically assigned to a thread. "Each time the core goes to dram for a read or a write, the thread is put to sleep and another thread is processed such that memory access latency can be hidden," said Duggen. The 40 core NFP-3240 can process 320 threads and clocking the cores at 1.4GHz equates to 56billion instruction/s. Netronome refers to 1800 instructions per packet at a processing rate of 30m packet/s. The network flow family has a power consumption that ranges from 15W to 35W, though Netronome claims the ARM11 and micro engines are optimised for low power operation. The simplest programming model using the network flow processor is sharing common code across the cores such that multiple packets are processed in parallel. But more sophisticated partitioning can be used where cores have dedicated roles. For example, several can be linked in a pipeline: a micro engine can be used to receive packets, two others for switching and forwarding tasks, another for load balancing and a fifth for transmitting the packets, said Duggen. Device software can be written in microcode to squeeze the most out of the processing performance but a 'quasi C' high level language is normally used. According to Wheeler, programming such a multicore device is a challenge. 
"Intel worked a lot on developing tools that model applications up front before a designer writes any code and does the partitioning," he said. There are also performance profilers that highlight how well the network flow processor's cores are being used. While the tools all help, a skilled engineer is needed to map applications onto the device. "There is a certain amount of hand tuning required," said Wheeler, explaining why designers favour common code executed by all the cores. Netronome is taping out the NFP-3200 design now and expects to provide chip samples to early adopter customers during this quarter.