Efficient GPGPU programming with OpenCL

Today's graphics processors are highly programmable, massively parallel compute engines. In this role, they are commonly called general-purpose graphics processing units, or GPGPUs. You can program them with the open, standards-based OpenCL framework, distributing compute chores across CPUs, GPUs, and DSPs to optimise a system's overall performance.

What makes GPGPU computing so enticing is the availability of extreme floating-point performance in cost-effective GPUs. AMD's top-of-the-line GPU in the AMD Radeon™ HD 6970 desktop graphics card delivers 2.7 single-precision TFLOPs (theoretical peak) at a retail cost of about $369. By leveraging economies of scale in the PC industry, GPUs continue to drive higher performance per watt and per dollar every year. To benefit from these performance gains, embedded systems need to meet three more challenges: lower power consumption, open standards, and parallel algorithms.

Low energy consumption

In addition to excellent GFLOPs per dollar, GPGPU computing also delivers high performance per watt. Although the AMD Radeon HD 6970 card has a thermal design power (TDP) of 250W, many server-based applications can handle that power to achieve more than 10GFLOPs/W (2.7 TFLOPs divided by 250W is about 10.8GFLOPs/W). Embedded applications, on the other hand, typically have more modest TDP thresholds because they face size, weight and power (SWaP) constraints. Portable ultrasound machines benefit from small size, yet demand high-performance compute capabilities for real-time imaging. In telecom infrastructure, GPGPU offers new compute capabilities within limited power budgets. Many defence and aerospace applications (sonar, radar and video surveillance, for example) require high-performance compute capabilities delivered in embedded form factors.

To meet the growing demand for embedded GPGPU, the AMD Radeon E6760 embedded GPU delivers 16.5GFLOPs/W at a TDP of about 35W, which works out to roughly 580 GFLOPs of peak performance. At such modest power consumption, the AMD Radeon E6760 GPU is suitable for all common slot-based rack-mount embedded systems, such as PICMG 1.x, CompactPCI, VME, VPX, MicroTCA or AdvancedTCA.

Efficient algorithms

Achieving the performance gains available to embedded GPGPU will require the deployment of key algorithms as optimised kernels for specific GPU architectures.
Key algorithms include solving systems of linear equations, stencil operations, matrix multiplication, fast Fourier transforms, random number generation, data primitives (e.g., reduce, sum, partition or sort), and domain-specific image processing algorithms such as edge detection, erosion and Sobel operations. The final challenge is parallel algorithm development. Some algorithms are 'embarrassingly' parallel and show almost an order-of-magnitude performance improvement over multicore CPU implementations. Other algorithms require a bit more effort to port to a massively parallel environment, but the results can be well worth it. Researchers have shown that, with careful algorithm design, the single-precision general matrix multiply routine (SGEMM) on an ATI Radeon HD 5870 GPU can achieve 73% of the GPU's theoretical single-precision floating-point performance. With its high compute density and parallel nature, SGEMM is well suited to implementation on a GPGPU, especially for large matrices.

Open standard

The potential of parallel programming on GPUs has long been known. But early pioneers in GPGPU computing had no language that could access that compute power. Instead, they selected graphics operations in OpenGL that used the same maths they needed and then copied the results from the frame buffer. Proprietary GPGPU languages, such as CUDA and Brook+, greatly improved ease of use, but at the expense of portability. That situation led to the development of the OpenCL computing language, which ports across GPU architectures and between CPU and GPU components to provide a heterogeneous programming environment. It was created by an industry consortium that brought together chip vendors, software companies and research organisations. First introduced in 2008, OpenCL enjoys wide industry support today.
With continued industry adoption of the open-standard, non-proprietary OpenCL parallel programming language, heterogeneous computing is gaining efficiency by letting application developers partition work effectively: serial tasks execute on the CPU and parallel tasks on the GPU. As OpenCL has matured, it has become the API of choice for code that is portable across different hardware and different operating systems.

OpenCL

In particular, the OpenCL language makes the development of heterogeneous computing applications more efficient, because the application developer can keep serial tasks on a CPU while putting parallel tasks onto a GPU, for instance. GPUs are extremely adept at parallel processing, especially at performing similar computations on large quantities of data (i.e., data-parallel work).

Let's take a very simple example to illustrate the point: an element-wise addition of two arrays, a and b, with the result written to a third array, c. Instead of adding one pair of elements at a time, as happens in CPU code, we can use OpenCL to execute many additions in parallel on the GPU. A typical code snippet for performing the addition on a single-core CPU looks very similar to the OpenCL kernel that does the same thing on a GPU: the body of the loop becomes the kernel, and the loop index becomes the index of the work item.

The operation for each item i is called a work item. Conceptually, all work items in a kernel are executed in parallel; we cannot tell whether work item i = x is physically executed before, at the same time as, or after another work item, i = y. (However, we do know that on GPUs, hundreds and even thousands of work items can be in the midst of execution at any time.) OpenCL provides a way of grouping batches of work items into work groups. In general, work items cannot synchronise or exchange data with each other, but work items belonging to the same work group can. This allows OpenCL kernels to be written that are more sophisticated than this example.
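The element-wise addition described above can be sketched as follows. This is a minimal illustration rather than the authors' original listing: the names `vec_add_cpu`, `vec_add` and `vec_add_kernel_src` are hypothetical, and the kernel is held as a C string only so that the two versions can sit side by side in one file. The kernel itself is OpenCL C and uses the standard built-in `get_global_id(0)` to find out which work item it is.

```c
#include <stddef.h>

/* Serial CPU version: a single core walks the arrays, adding one pair
 * of elements per loop iteration. */
void vec_add_cpu(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* OpenCL C kernel for the same operation (hypothetical name vec_add),
 * stored as source text. There is no loop: each work item handles the
 * single element whose index is returned by get_global_id(0), and the
 * runtime launches one work item per element. */
const char *vec_add_kernel_src =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";
```

Calling `vec_add_cpu(a, b, c, n)` fills `c` on the CPU. On the GPU, the number of work items launched takes the place of the loop bound `n`, so the kernel needs no length argument as long as exactly one work item is launched per element.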
For the sake of completeness, it should be mentioned that, in addition to the code executed on the GPU, a host program is needed to control and use the GPU. This host program finds and initialises the GPU(s), sends the data and the kernel code to the GPU, instructs the GPU to start execution, detects when the results are ready, and then reads them back from the GPU.

A number of companies offer in-depth OpenCL workshops and training. Developers at both the entry and advanced level can learn about OpenCL programming on AMD platforms directly from AMD and its partners. Understanding heterogeneous computing and OpenCL is the key to widespread adoption of GPGPU technologies in the embedded market. Many algorithms map well to GPGPU architectures and show sizeable performance increases in comparison to traditional multicore CPU solutions. GPGPU delivers real benefits to the embedded market: compelling GFLOPs per watt plus attractive GFLOPs per unit cost, enabling new capabilities in size-, weight- and power-constrained embedded applications.

Conclusion

The use of OpenCL promises many application benefits. Many existing algorithms are already a superb match for GPGPU architectures, producing a substantial gain in performance compared to traditional multicore CPU implementations. But leaps in performance alone do not justify the extra development effort and expense; the price must be right too. GPGPU technology offers attractive GFLOPs performance per watt and per dollar, showing the way to new embedded applications that are sensitive to size, weight and power consumption. OpenCL also makes a major contribution to enhanced performance and system efficiency, and ultimately to a more competitive product. In OpenCL, engineers have an efficient, standards-based, parallel development environment that enables them to turn these advantages into powerful applications.

About the authors

Peter Mandl is Senior Product Marketing Manager, Embedded Client Business, Advanced Micro Devices. Udeepta Bordoloi is Senior Member of the Technical Staff at Advanced Micro Devices.