AMD Opteron Embedded processors deliver performance, scalability and efficiency

The launch of the AMD Opteron 6200 and 4200 Series Embedded Processors has delivered two 'firsts' for demanding multithreaded embedded applications: the first 16 core x86 embedded processor and the lowest power server processor. The Bulldozer architecture gives embedded systems developers improved performance, while meeting tight power budgets at a competitive cost.

Imagine 16 cores optimised for use in embedded systems performing parallel processing in highly threaded embedded applications. Picture those 16 cores driving new heights of data throughput in industrial cloud computing appliances, medical imaging storage systems such as PACS (Picture Archiving Communication Systems), security appliances in networking edge nodes, and carrier grade communication solutions. All of those applications demand predictable multithreaded performance within tight power budgets and a long term availability that is cost effective as well.

To meet these ever increasing demands, AMD Embedded Solutions has recently introduced the next generation of AMD Opteron Embedded processors based on the new Bulldozer architecture, designed to deliver a performance boost for these kinds of applications. The AMD Opteron 6200 Series Embedded Processor is the first 16 core x86 embedded processor, providing a 33% increase in core count and a dramatic boost in processing efficiency. Meanwhile, the AMD Opteron 4200 Series Embedded Processor is the lowest power embedded server processor, at less than 5W per core.

What is it, exactly, that AMD Embedded Solutions put into the new processors' architecture to make them attractive for such a diverse array of embedded applications?

New design methodology

The design goal for the AMD Opteron CPUs was simple – drive greater scalability and higher density of cores through a design methodology that shares some components of the processor die to maximise use of the die, but which also keeps other parts discrete to avoid bottlenecks. Many modern processor designs include redundant components because of their multicore design; in many cases, these redundancies merely consume more die space, which can increase cost and increase power consumption – without necessarily adding incremental value to the processor.
By moving to a design in which two integer cores share a common front end (fetch and decode), a large L2 cache and a floating point unit (FPU) complex, AMD has created a modular processor design. These modules give AMD a flexible design that serves as a blueprint not only for the AMD Opteron 6200 and 4200 Series Embedded Processors, but also for the next generation of 20 core and 10 core products that are expected to reach the market by the end of 2012.

Each module features two full integer cores, each of which executes a single thread, so every module can run two threads concurrently on the cores' dedicated integer pipelines. The integer core itself has four pipelines, two arithmetic logic unit (ALU) pipelines and two address generation unit (AGU) pipelines, allowing greater processing throughput. The operating system and applications will not see these modules, but will instead see only a pool of integer cores that can be used for processing without any further requirements.

New Flex FP floating point unit

Floating point operations represent about 10% of the processing load in a typical server, with the other 90% being integer. The challenge with floating point hardware is that it takes more silicon and more power. The new Flex FP floating point unit is designed to help address this by applying next generation floating point capabilities to support 256bit AVX instructions. This is particularly relevant for applications like High Performance Embedded Computing (HPEC), which can take advantage of 256bit floating point processing with more throughput for both 128bit and 256bit applications. With up to eight 256bit Flex FP units – which can also act as 16 128bit FPUs – the AMD Opteron 6200 Series Embedded Processor can deliver up to 332.8 GFLOPS in a 2P node.
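The peak figure for a 2P node can be reproduced with some simple arithmetic. The sketch below is illustrative only: the count of eight Flex FP units per processor comes from the text, but the 2.6GHz clock and the figure of eight double precision operations per unit per cycle (four 64bit values in 256 bits, with a fused multiply-add counted as two operations) are assumptions chosen because they are consistent with the quoted number. The `Node` class and its field names are hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: a 256bit unit holds four 64bit doubles, and a fused
# multiply-add on each of them is counted as two floating point operations.
DOUBLES_PER_256BIT = 256 // 64
FLOPS_PER_FMA = 2
FLOPS_PER_UNIT_PER_CYCLE = DOUBLES_PER_256BIT * FLOPS_PER_FMA  # 8


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a multi-socket node."""
    sockets: int               # processors in the node (2P = 2)
    fp_units_per_socket: int   # 256bit Flex FP units per processor (text: up to 8)
    clock_ghz: float           # ASSUMPTION: core clock, not stated in the article

    def peak_gflops(self) -> float:
        # Theoretical peak: every FP unit retires a full-width FMA each cycle.
        return (self.sockets * self.fp_units_per_socket
                * FLOPS_PER_UNIT_PER_CYCLE * self.clock_ghz)


node = Node(sockets=2, fp_units_per_socket=8, clock_ghz=2.6)
print(f"{node.peak_gflops():.1f} GFLOPS")  # prints "332.8 GFLOPS"
```

This is a theoretical ceiling rather than a measured result; sustained throughput in a real HPEC workload depends on how much of the code is vectorised and how well the FP units are kept fed.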
In addition, the Flex FP features FMAC units that can handle a Fused Multiply Accumulate (FMA) instruction, as well as the standard multiply (FMUL) and add (FADD). An FMA operation is more powerful because it allows a calculation like A = B x C + D to be undertaken in a single operation rather than as a separate multiply and add. This helps to boost the computational horsepower of HPEC applications: FMA4 instructions can complete complex calculations in half as many cycles as competing processors.

Performance boost on demand

Another new feature of the AMD Opteron processors is AMD Turbo CORE, a technology that allows processors to run above their base clock frequency, provided there is additional power headroom available. When processors are rated for clock speed, the rating is typically based on a worst case scenario. The result is a speed rating that is very conservative for most workloads, which can leave clock speed unused on less strenuous workloads. This is one reason why processors generally consume far less power than their rated thermal design power (TDP).

AMD Turbo CORE technology captures this additional power headroom and turns it into higher clock frequency, allowing the processor to run up to 500MHz faster than its rated base frequency with all cores active. In environments where not all cores are in use, the frequency increase can be more than 1GHz with half the cores active. This is particularly relevant for scalable embedded server and carrier grade telecommunications applications with occasional peak loads, as the AMD Turbo CORE technology allows processors to independently boost their clock speeds automatically to respond to the need for more application performance.

Improved memory controller

When designing the AMD Opteron Embedded processor, AMD took the opportunity to redesign the memory controller, increasing throughput by up to 51% compared to the Opteron 6100 Series.
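As a rough cross-check of the memory figures in this section, theoretical peak bandwidth can be estimated from the channel count and transfer rate. The sketch below takes the four channel controller, DDR3 1600 memory and 70Gbyte/s (2P) throughput from this article. It assumes that '1600MHz' means 1600 million transfers per second over a standard 64bit (8 byte) DDR3 channel, and that the previous generation ran DDR3 1333; the function name and parameters are hypothetical.

```python
def peak_bandwidth_gbytes(channels: int, mega_transfers: int,
                          bus_bytes: int = 8, sockets: int = 1) -> float:
    """Theoretical peak memory bandwidth in Gbyte/s (decimal units)."""
    # channels x transfers per second x bytes per transfer, per socket
    return sockets * channels * mega_transfers * bus_bytes / 1000


per_socket = peak_bandwidth_gbytes(channels=4, mega_transfers=1600)
two_socket = peak_bandwidth_gbytes(channels=4, mega_transfers=1600, sockets=2)

quoted_2p = 70.0                 # Gbyte/s, quoted in the article for a 2P system
speed_gain = 1600 / 1333 - 1     # ASSUMPTION: previous generation used DDR3 1333

print(f"peak per socket: {per_socket:.1f} Gbyte/s")             # 51.2
print(f"peak for 2P:     {two_socket:.1f} Gbyte/s")             # 102.4
print(f"quoted 2P figure is {quoted_2p / two_socket:.0%} of peak")  # 68%
print(f"memory clock gain: {speed_gain:.0%}")                   # 20%
```

Under these assumptions, the quoted 2P figure sits well below the theoretical ceiling, which is what would be expected of a sustained throughput number, and the step from DDR3 1333 to DDR3 1600 matches the 20% memory speed increase cited in the article.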
Part of that throughput gain comes from optimisations to the memory controller – new algorithms and new ways to address reading and writing data that speed up access to information. In addition, Opteron Embedded processors support DDR3 1600MHz memory, as well as Load Reduced (LR) DIMM, which allows for more memory to be installed. Memory intensive environments, such as core network routing and storage deduplication, can take advantage of the four channel memory controller, 20% faster memory and Northbridge enhancements to reach a throughput of 70Gbyte/s (2P) and 140Gbyte/s (4P).

The enhancements in memory controller circuitry and the higher clock speeds also lead to higher throughput for virtualisation, HPEC and database applications. In these areas, platforms based on Opteron Embedded processors can handle very large memory footprints for large problems and map/reduce activities. Meanwhile, support for the emerging 1.25V low voltage DIMMs (today's LV DIMMs run at 1.35V) will lead to even more power efficiency.

Cache structure

Each processor module includes two levels of cache: an L1 cache focused on execution; and a 2Mbyte L2 cache focused on being the 'working area' for data being processed. Meanwhile, at the die level, an 8Mbyte L3 cache is provided. For the AMD Opteron 6200 Series, there are two dies inside the processor, giving a total of 16Mbyte of L3 cache. The integration of up to 16 cores, with 2Mbyte L2 cache per module (up to 16Mbyte of L2 cache per socket), and a shared 16Mbyte L3 cache per socket offers improved performance and performance/W for multithreaded embedded environments like virtualisation, database and web serving.

Virtualisation features

With more environments moving from dedicated servers to virtualisation, integrated virtualisation is becoming an essential part of any processor design.
A benefit of the Opteron Embedded processors in virtualised environments is the greater number of cores and greater memory scalability – core density and memory addressability are key performance drivers for a virtualised platform. For customers who use the '1 VM per core' methodology, the AMD Opteron 6200 Embedded Processor allows them to deploy up to twice as many VMs as on comparable platforms. Furthermore, AMD Embedded Solutions' work with virtualisation providers like VMware, Microsoft, Xen, Citrix and Parallels, as well as with the Linux KVM community, helps to ensure a highly supportable virtualisation solution.

Power efficiency features

The continued focus on power efficiency brings new enhancements to the architecture of the new processors. At the core level, the C6 power state has been designed to power down a complete module when it has been idle for a predetermined time. I/O C-states and shared logic, combined with TDP Capping and Dynamic Power Capping, help to drive down core power consumption. The enhanced memory controller now features C-states (power states) that allow its power consumption to vary, just like the P-states for the processor clocks. These power features are wrapped up in a 32nm Silicon on Insulator (SOI) process that has been tuned for power efficiency.

These features help modulate power in such a way that, even though the new design features 33% more cores, higher core frequencies, 20% higher memory clock speed and the inclusion of AMD Turbo CORE boosting technology, the processors still operate in the same TDP ranges as the AMD Opteron 4100 and 6100 Series Embedded Processors. This is particularly relevant for applications that need more performance, but which have strict Size, Weight and Power (SWaP) requirements.

Platform compatibility and longevity

As the AMD Opteron 6200 and 4200 Processors are socket compatible with the Opteron 6100 and 4100 Processors, existing applications only need a BIOS update.
Migrating software and virtual machines should be easy, as platform level components, like chipsets and peripherals, are the same. Besides upgradability, AMD Embedded Solutions has addressed the need for longevity, and selected Opteron 4200 and 6200 Series Processors will be available for seven years.

Summary

The AMD Opteron 6200 and 4200 Embedded Processors present a new way of addressing the need for high performing, highly scalable processors. The innovative modular architecture helps pack in more cores, making them a natural choice for embedded applications that demand more cores and more memory. As an added benefit, they're designed to save cost through efficient design and power efficiency – something that most products designed for the data centre don't focus on nearly enough.

Author: Cameron Swen is a strategic marketing manager with AMD Embedded Solutions.

The AMD Opteron processor in embedded

The first AMD Opteron processor, introduced in 2003, featured an innovative design with an integrated memory controller, scalable core architecture and innovative power efficiency features. In 2005, it was the first 64bit processor to enter AMD Embedded Solutions' portfolio of long term available embedded processors. AMD Embedded Solutions has since improved on this design, providing devices with up to 12 cores. High end processors feature multiple dies and performance has scaled significantly higher – yet they have essentially remained in the same power and thermal envelope, with few socket and platform changes to help maintain the long term stability of the servers.