24 May 2012

Efficient GPGPU programming with OpenCL

Today's graphics processors are highly programmable, massively parallel compute engines. Used in this role, they are commonly called general purpose graphics processing units, or GPGPUs. You can program them with the open, standards based OpenCL framework, distributing compute chores across CPUs, GPUs and DSPs to optimise a system's overall performance.

What makes GPGPU computing so enticing is the availability of extreme floating point performance in cost effective GPUs. AMD's top of the line GPU in the AMD Radeon™ HD 6970 desktop graphics card delivers 2.7 single precision TFLOPs (theoretical peak) at a retail cost of about $369. By leveraging economies of scale in the PC industry, GPUs continue to deliver higher performance per watt and per dollar every year. To benefit from these performance gains, embedded systems need to meet three further challenges: low power consumption, open standards and parallel algorithms.

Low energy consumption
In addition to excellent GFLOPs per dollar, GPGPU computing also delivers high performance per watt. Although the AMD Radeon HD 6970 card has a thermal design power (TDP) of 250W, many server based applications can accommodate that power draw to achieve more than 10GFLOPs/W. Embedded applications, on the other hand, typically face far tighter TDP thresholds, alongside size, weight and power (SWaP) constraints. Portable ultrasound machines benefit from small size, yet demand high performance compute capabilities for real time imaging. For telecom infrastructure, GPGPU offers new compute capabilities within limited power budgets. Many defence and aerospace applications (sonar, radar and video surveillance, for example) require high performance compute capabilities delivered in embedded form factors. To meet this growing demand, the AMD Radeon E6760 embedded GPU delivers 16.5GFLOPs/W at a TDP of about 35W. At such modest power consumption, the AMD Radeon E6760 GPU suits common embedded systems based on slot based rack mounts, such as PICMG 1.x, CompactPCI, VME, VPX, MicroTCA and AdvancedTCA.

Efficient algorithms
Achieving the performance gains available from embedded GPGPUs requires deploying key algorithms as kernels optimised for specific GPU architectures. Key algorithms include solving systems of linear equations, stencil operations, matrix multiplication, fast Fourier transforms, random number generation, data primitives (reduce, sum, partition and sort, for example) and domain specific algorithms in image processing, such as edge detection, erosion and Sobel operations.
The final challenge is parallel algorithm development. Some algorithms are 'embarrassingly' parallel and show almost an order of magnitude performance improvement over multicore CPU implementations. Other algorithms require a bit more effort to port to a massively parallel environment, but the results can be well worth the effort. Researchers have shown that, with careful algorithm design, execution of the single precision general matrix multiply routine (SGEMM) on an ATI Radeon HD 5870 GPU can achieve 73% of the GPU's theoretical single precision floating point performance. With its high compute density and parallel nature, SGEMM is well suited to implementation on a GPGPU, especially for large matrices.
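To make concrete what SGEMM computes, here is a minimal serial reference in C. The function name sgemm_ref and the row major, no transpose layout are assumptions for illustration, not the tuned GPU code the researchers used; the point is that each of the m×n output elements is independent, which is what makes the routine map so naturally onto thousands of GPU work items.

```c
/* Minimal serial reference for SGEMM: C = alpha*A*B + beta*C,
 * with A (m x k), B (k x n) and C (m x n) stored row major.
 * On a GPU, each of the m*n output elements can be computed by
 * an independent work item. */
void sgemm_ref(int m, int n, int k, float alpha,
               const float *A, const float *B,
               float beta, float *C)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

Optimised GPU versions keep the same arithmetic but tile the matrices so that work groups reuse sub-blocks of A and B from fast local memory instead of re-reading global memory.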

Open standard
The parallel compute potential of GPUs has long been recognised, but early pioneers in GPGPU computing had no language with which to access that compute power. Instead, they selected graphics operations in OpenGL that happened to use the maths they needed, then copied the results from the frame buffer. Proprietary GPGPU languages, such as CUDA and Brook+, were a big improvement in ease of use, but at the expense of portability. That situation led to the development of the OpenCL computing language, which is portable across GPU architectures and between CPU and GPU components, providing a heterogeneous programming environment.
OpenCL was created by an industry consortium bringing together chip vendors, software companies and research organisations. First introduced in 2008, it enjoys wide industry support today. As OpenCL has matured, it has become the API of choice for code that must be portable across different hardware and different operating systems.

In particular, OpenCL makes the development of heterogeneous computing applications more efficient: it enables the application developer to partition work effectively, executing serial tasks on a CPU and parallel tasks on a GPU. GPUs are extremely adept at parallel processing, especially at performing similar computations on large quantities of data (i.e., data parallel workloads). A very simple example illustrates the point: an element wise addition of two arrays, a and b, with the result written to a third array, c. Instead of adding one pair of elements at a time, as CPU code does, OpenCL can execute many additions in parallel on the GPU. A typical code snippet for performing the addition on a single core CPU looks very similar to the OpenCL kernel that does the same thing on a GPU.
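The original side by side listing did not survive reproduction here, so the following is a minimal reconstruction of the two versions described in the text; the function and kernel names are assumptions. The kernel is held as a C string, as a host program would pass it to clCreateProgramWithSource: the CPU loop disappears, and each work item handles the single index returned by get_global_id(0).

```c
/* Serial CPU version: one pair of elements is added per loop iteration. */
void add_arrays_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* The equivalent OpenCL kernel (illustrative name vec_add), stored as a
 * string for clCreateProgramWithSource(). There is no loop: the runtime
 * launches one work item per array element. */
static const char *vec_add_kernel =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";
```

The two bodies are nearly identical; the difference is that the CPU iterates over i while the GPU assigns each value of i to its own work item.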

The operation for each item i is called a work item. Conceptually, all work items in a kernel are executed in parallel — we cannot tell if work item i = x is physically executed before, at the same time or after another work item, i = y. (However, we do know that on GPUs, hundreds and even thousands of work items can be in the midst of execution at any time.)
OpenCL provides a way of grouping batches of work items into work groups. In general, work items cannot synchronise or exchange data with each other, but work items belonging to the same work group can, which allows kernels considerably more sophisticated than this example to be written. For completeness, note that in addition to the code executed on the GPU, a host program is needed to control and use the GPU. The host program finds and initialises the GPU(s), sends data and kernel code to the GPU, instructs the GPU to start execution, detects when the results are ready and reads them back. A number of companies offer in depth OpenCL workshops and training; developers at both entry and advanced levels can learn about OpenCL programming on AMD platforms directly from AMD and its partners.
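As a sketch of what work group cooperation looks like, the kernel below (an illustration, not from the article) computes one partial sum per work group: work items in the same group share __local memory and synchronise with barrier(), which work items in different groups cannot do. The fixed group size of 64 is an assumption; a serial C equivalent is included so the result can be checked on the CPU.

```c
/* A work group reduction kernel, stored as a string for
 * clCreateProgramWithSource(). Each group of 64 work items sums its
 * elements in __local memory, halving the number of active work items
 * each step, with barrier() keeping the group in lockstep.
 * (Illustrative sketch: the launch must use a local size of 64.) */
static const char *group_sum_kernel =
    "__kernel void group_sum(__global const float *in,\n"
    "                        __global float *out)\n"
    "{\n"
    "    __local float scratch[64];\n"
    "    int lid = get_local_id(0);\n"
    "    scratch[lid] = in[get_global_id(0)];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    for (int s = 32; s > 0; s >>= 1) {\n"
    "        if (lid < s)\n"
    "            scratch[lid] += scratch[lid + s];\n"
    "        barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    }\n"
    "    if (lid == 0)\n"
    "        out[get_group_id(0)] = scratch[0];\n"
    "}\n";

/* Serial equivalent: one partial sum per group of group_size elements. */
void group_sum_cpu(const float *in, float *out, int n, int group_size)
{
    for (int g = 0; g < n / group_size; g++) {
        float s = 0.0f;
        for (int i = 0; i < group_size; i++)
            s += in[g * group_size + i];
        out[g] = s;
    }
}
```

Patterns like this, where a group cooperates through local memory, are the building blocks of the reduce, sort and FFT primitives mentioned earlier.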
Understanding heterogeneous computing and OpenCL is the key to widespread adoption of GPGPU technologies in the embedded market. Many algorithms map well to GPGPU architectures and show sizeable performance increases compared with traditional multicore CPU solutions. GPGPU delivers real benefits to the embedded market: compelling GFLOPs per watt, plus attractive GFLOPs per unit cost, enabling new capabilities in size, weight and power constrained embedded applications.

The use of OpenCL promises many application benefits. Many existing algorithms are already a superb match for GPGPU architectures, producing substantial performance gains over traditional multicore CPU implementations. But leaps in performance alone do not justify the extra development effort and expense; the price must be right too. GPGPU technology offers attractive GFLOPs per watt and per dollar, opening the way to new embedded applications that are sensitive to size, weight and power consumption. OpenCL also contributes to enhanced performance and system efficiency, and ultimately a more competitive product. In OpenCL, engineers have an efficient, standards based, parallel development environment that lets them turn these advantages into powerful applications.

About the authors
Peter Mandl is Senior Product Marketing Manager, Embedded Client Business, Advanced Micro Devices. Udeepta Bordoloi is Senior Member of the Technical Staff at Advanced Micro Devices.

Peter Mandl and Udeepta Bordoloi

This material is protected by Findlay Media copyright

