07 February 2011

Multicore processors: How can you programme them?

  • Multicore processors are all very well but how can they be prgrammed to provide the best performance?
  • multicore processor performance
  • multicore processor performance
  • multicore processor performance

Chipmakers want you to buy multicore processors that, on paper at least, deliver much greater performance than their single core predecessors. The question is: can you make use of that performance? Parallelisation is not all that easy to say and it's usually harder to do, but things are getting better.

A good number of applications are already 'embarrassingly parallel', as computer scientists put it. These are pieces of software where you can, almost blindly, spawn thread after thread. And, if each one has a new piece of hardware to run on, it provides greater performance.

At 1980s trade shows, Inmos and the band of companies which sold Transputer based systems would show arrays of Transputers racing through forms of the Mandelbrot set and other images made from fractal equations while pcs struggled to build the image line by line. Each pixel in the Mandelbrot image can be calculated without worrying about the value of even the nearest neighbour – the only parameters that matter are the pixel coordinates. So the more processors you have, the better performance gets, with the only practical limitation being access to the image framebuffer.

Another favourite 1980s Transputer demonstration – ray tracing – has finally found a renaissance among graphics processor unit (GPU) makers as a useful technique. The technique traces the path of light from a target to a source, making it possible to calculate each path without worrying about any others until the point comes to add all the lighting components together in the framebuffer.

Many communications applications are 'embarrassingly parallel' because multiple processors can process independent packets – just as long as there is something that can separate the packets quickly enough. Tensilica systems architect Grant Martin has described this as 'single application, multiple data' in a nod to the 'single instruction, multiple data' parallelised instructions now found in many digital signal and media processors.



Unfortunately, not everything is as straightforward. Making use of multicore, multithreaded processors can take a lot more work. Even the specification of an algorithm can work against you in subtle ways. Professor Wen-mei Hwu of the University of Illinois has used the example of a video decompression algorithm to illustrate the problem. It demonstrates a simple 50:50 choice that can remove a major opportunity for exploiting parallelism.

The Mpeg4 video decoder algorithm relies on a motion estimator in part that searches for a close match between 16 x 16pixel blocks within a frame. It could be a block from a previous frame or the previous block in the same frame: it doesn't matter, they just have to be a close match.

If the algorithm chooses preferentially to analyse blocks from the current frame, then parallelism is much more limited than if blocks from the previous frame are used. This is because the previous macroblock in the chain has to finish processing before the next one can start if the the current frame is used as a source. If you use the previous frame, you know that all the processing has completed on it. Both methods are just as effective in terms of compression, but the latter can run much more efficiently on a many core processor.

Even with a well designed algorithm, a lot of performance can be wasted when individual threads have to synchronise. Data needs to be shared in an organised way or, like asynchronous hardware, risk race conditions forming. For example, x may depend on y, so y needs to be updated before x can be assigned its new value. Any code that uses x before this happens will have the wrong data.

Software in a single thread is executed so that statements follow the order defined by the programmer. Even if the processor runs some of the code out of order, it will check for dependencies to ensure that writes are always made in the order specified in the code.



With two threads – one updating x and one updating y – anything can happen. And, without protection, it will at some point. You cannot rely on threads running with high priority blocking others performing an update out of order, because a low priority thread on a different processor could conceivably perform the update before the high priority thread gets a chance to do what it has to do. So threads have to synchronise, usually through mutual exclusion semaphores or 'spinlocks', named because a thread may have to try the lock several times, 'spinning' on it before it successfully acquires the lock.

A further blow to performance is the use of coarse grained locks in multithreaded code. It is very difficult to keep track of what is happening in code with a lot of spinlocks – it is still possible to get race conditions and deadlocks if threads attempt to acquire a number of locks in series and do so in different orders. So, the safest approach is to put all of the shared items behind one lock and take the hit on performance when multiple threads then contend over the same lock.

Not only do they waste time, thread synchronisation locks are not exactly straightforward to program in widely used languages such as C or C++, prompting companies such as Intel to offer thread libraries that aim to ease some of the burden.

Intel, among others, has looked at building hardware support into processors to help synchronise threads through a scheme called transactional memory. The idea behind transactional memory, originally developed at Digital Equipment and the University of Massachusetts at Amherst, is for hardware to monitor special areas of memory that contain known shared variables. The technique adds specialised load and store instructions to the processor that ensure a group of variables that needs to be updated in one step is written atomically. If two threads attempt to write values to the same group of memory locations, one will succeed. The other will fail and will have to restart the whole operation, using the newly written values from the shared variables. Although the technique makes some elements of programming concurrency easier, it can still be hard to conceptualise how threads interact.

One option is to try to stop forcing threads to synchronise midstream. A number of companies and researchers have claimed success for stream oriented programming, in which tasks suck data from pipes and transfer the results into another one, ready for the next piece of work to be done. The worker threads are generally 'busy waiting' – each thread is a worker set up to consume data. If there is nothing in the queue, they wait. Once data appears, they can get to work. This concept underpins some multicore architectures such as the picoArray developed by Picochip and accelerators made by Mercury Computer Systems.

A related technique intended to make this kind of stream processing operate across a network is the active message, first proposed by researchers at the University of California at Berkeley in the early 1990s, at roughly the same time as transactional memory. With active messages, tasks move around a network of processors with their data. The header contains the address of a handler task that will process the data stored in the message body. Even then, the concept was not entirely new. The Norsk Data computer, on which Tim Berners-Lee built an forerunner of the World Wide Web software at CERN in the early 1980s, employed a similar technique. But it was never widely used.



With active messages, the network can be viewed as a pipeline; its maximum rate determined by the communication overhead and the latency incurred by reading and writing the messages. Conceptually, the message is similar to a remote procedure call, but it's a non blocking operation so that communication and processing can proceed more or less independently.

The bottlenecks caused by synchronisation do not go away with stream oriented processing, but many of the operations needed to coordinate the results from many threads can be absorbed into a smaller set of processes. This may be more manageable than trying to distribute the synchronisation problem across a huge number of threads.

By delaying synchronisation it is possible to use speculation to squeeze more performance out of a processor array. This is the concept behind programming systems such as the Sieve C++ toolkit, developed by Codeplay. In many applications, threads will not usually attempt to overwrite each other's data – but doing so just one time out of a million can kill an application. So, they need to be protected from the consequences of doing so. If you work on that basis, you can fire off more work to worker threads before any checks need to be made. In Sieve C++, for example, parallelisable operations are put inside specially marked blocks of code.

Code inside the blocks can be divided and distributed among parallel processors. This will read from shared memory, but not write to it directly. At the end of the block, the software will attempt to commit its results to shared memory: this is where compiler generated code checks whether there are conflicts and, if there are, force some of the work to be redone. As long as not too many blocks have to be restarted, it is possible to see performance gains that cannot be attained using conventional techniques. It is analogous to the speculative execution of high end superscalar processors, which attempts to uncover hidden parallelism in serial code.

Although parallelisation is not straightforward, techniques have been developed that will ease the job and these are gradually making their way into compilers and hardware. However, it will take many more years before they can provide the insight a development team can have into how a single threaded application can be spread across many processors.

Author
Chris Edwards

Supporting Information

Downloads
31233\P30-32.pdf

This material is protected by Findlay Media copyright
See Terms and Conditions.
One-off usage is permitted but bulk copying is not.
For multiple copies contact the sales team.

Do you have any comments about this article?

Add your comments

Name
 
Email
 
Comments
 

Your comments/feedback may be edited prior to publishing. Not all entries will be published.
Please view our Terms and Conditions before leaving a comment.

Related Articles

‘Wonder Gecko’ mcu unveiled

Energy Micro has announced the availability of the EFM32 Wonder Gecko, which it ...

8bit PIC mcus get upgrade

A host of new analogue peripherals have been added to Microchip's PIC16F178X ...

element14 unveils Embedded Pi

element 14 has launched Embedded Pi, which it says is a low cost 'triple play' ...

Smart design saves power

Designing loop powered field instruments with a 4 to 20mA analogue output and a ...

Vendors make core decisions

When ARM launched the Cortex-M0 core in February 2009, it had a particular ...

Displaying energy saving

Despite their sophistication, microcontrollers are increasingly found in ...

Capturing data in mcu apps

Developers continue to benefit from increased silicon integration, enabling ...

Migrating ARM7 code to a Cortex-M3 mcu

This white paper by Todd Hixon from Atmel covers the differences between ARM7 ...

Batteries worldwide celebrate the arrival of ...

The explosion in use of battery operated electronics is followed by the need ...

Low cost mcus from TI

Texas Instruments has expanded its MSP430 series of low cost microcontrollers ...

Microchip expands PIC24 series

Adding to its range of 16bit PIC microcontrollers, Microchip has introduced the ...

8bit usb mcus

Microchip has expanded its usb 2.0 PIC microcontroller portfolio with three new ...

Hands on with the ARM mbed

Monday 17th June 2013, ARM headquarters, Cambridge, UK

Stellaris LaunchPad overview

The Tiva C Series TM4C123GXL LaunchPad Evaluation kit is a low cost evaluation ...

InstaSPIN-FOC: Getting started

Overview of the InstaSPIN-FOC enabled motor kit contents and how to get started.

GUI Overview for InstaSPIN-FOC

Overview and demonstration of the InstaSPIN-FOC graphical user interface, that ...

Thinner, but higher profile

Freescale has launched the latest – and smallest – member of its Kinetis range ...

Renesas to build ecosystem

The news that Renesas is to outsource 40nm mcu manufacture to TSMC is not ...

Qualcomm buys Ubicom

Qualcomm has, apparently, acquired Ubicom with hardly a fanfare. It's the end ...

Gregg Lowe, Freescale

Freescale's new ceo tells Graham Pitcher that, while he's not 'dancing' yet, ...

Aurelius Wosylus, AMD

Chris Shaw discusses AMD's latest low power processors with Aurelius Wosylus.

Ian Menzies, General Dynamics

Graham Pitcher finds out how a new network will give Welsh electronics ...