multicore processor performance

A good number of applications are already 'embarrassingly parallel', as computer scientists put it. These are pieces of software where you can, almost blindly, spawn thread after thread. And, if each one has a new piece of hardware to run on, it provides greater performance. At 1980s trade shows, Inmos and the band of companies which sold Transputer based systems would show arrays of Transputers racing through forms of the Mandelbrot set and other images made from fractal equations while pcs struggled to build the image line by line. Each pixel in the Mandelbrot image can be calculated without worrying about the value of even the nearest neighbour – the only parameters that matter are the pixel coordinates. So the more processors you have, the better performance gets, with the only practical limitation being access to the image framebuffer. Another favourite 1980s Transputer demonstration – ray tracing – has finally found a renaissance among graphics processor unit (GPU) makers as a useful technique. The technique traces the path of light from a target to a source, making it possible to calculate each path without worrying about any others until the point comes to add all the lighting components together in the framebuffer. Many communications applications are 'embarrassingly parallel' because multiple processors can process independent packets – just as long as there is something that can separate the packets quickly enough. Tensilica systems architect Grant Martin has described this as 'single application, multiple data' in a nod to the 'single instruction, multiple data' parallelised instructions now found in many digital signal and media processors.

Unfortunately, not everything is as straightforward. Making use of multicore, multithreaded processors can take a lot more work. Even the specification of an algorithm can work against you in subtle ways. Professor Wen-mei Hwu of the University of Illinois has used the example of a video decompression algorithm to illustrate the problem. It demonstrates a simple 50:50 choice that can remove a major opportunity for exploiting parallelism. The Mpeg4 video decoder algorithm relies on a motion estimator in part that searches for a close match between 16 x 16pixel blocks within a frame. It could be a block from a previous frame or the previous block in the same frame: it doesn't matter, they just have to be a close match. If the algorithm chooses preferentially to analyse blocks from the current frame, then parallelism is much more limited than if blocks from the previous frame are used. This is because the previous macroblock in the chain has to finish processing before the next one can start if the the current frame is used as a source. If you use the previous frame, you know that all the processing has completed on it. Both methods are just as effective in terms of compression, but the latter can run much more efficiently on a many core processor. Even with a well designed algorithm, a lot of performance can be wasted when individual threads have to synchronise. Data needs to be shared in an organised way or, like asynchronous hardware, risk race conditions forming. For example, x may depend on y, so y needs to be updated before x can be assigned its new value. Any code that uses x before this happens will have the wrong data. Software in a single thread is executed so that statements follow the order defined by the programmer. Even if the processor runs some of the code out of order, it will check for dependencies to ensure that writes are always made in the order specified in the code.

With two threads – one updating x and one updating y – anything can happen. And, without protection, it will at some point. You cannot rely on threads running with high priority blocking others performing an update out of order, because a low priority thread on a different processor could conceivably perform the update before the high priority thread gets a chance to do what it has to do. So threads have to synchronise, usually through mutual exclusion semaphores or 'spinlocks', named because a thread may have to try the lock several times, 'spinning' on it before it successfully acquires the lock. A further blow to performance is the use of coarse grained locks in multithreaded code. It is very difficult to keep track of what is happening in code with a lot of spinlocks – it is still possible to get race conditions and deadlocks if threads attempt to acquire a number of locks in series and do so in different orders. So, the safest approach is to put all of the shared items behind one lock and take the hit on performance when multiple threads then contend over the same lock. Not only do they waste time, thread synchronisation locks are not exactly straightforward to program in widely used languages such as C or C++, prompting companies such as Intel to offer thread libraries that aim to ease some of the burden. Intel, among others, has looked at building hardware support into processors to help synchronise threads through a scheme called transactional memory. The idea behind transactional memory, originally developed at Digital Equipment and the University of Massachusetts at Amherst, is for hardware to monitor special areas of memory that contain known shared variables. The technique adds specialised load and store instructions to the processor that ensure a group of variables that needs to be updated in one step is written atomically. If two threads attempt to write values to the same group of memory locations, one will succeed. The other will fail and will have to restart the whole operation, using the newly written values from the shared variables. Although the technique makes some elements of programming concurrency easier, it can still be hard to conceptualise how threads interact. One option is to try to stop forcing threads to synchronise midstream. A number of companies and researchers have claimed success for stream oriented programming, in which tasks suck data from pipes and transfer the results into another one, ready for the next piece of work to be done. The worker threads are generally 'busy waiting' – each thread is a worker set up to consume data. If there is nothing in the queue, they wait. Once data appears, they can get to work. This concept underpins some multicore architectures such as the picoArray developed by Picochip and accelerators made by Mercury Computer Systems. A related technique intended to make this kind of stream processing operate across a network is the active message, first proposed by researchers at the University of California at Berkeley in the early 1990s, at roughly the same time as transactional memory. With active messages, tasks move around a network of processors with their data. The header contains the address of a handler task that will process the data stored in the message body. Even then, the concept was not entirely new. The Norsk Data computer, on which Tim Berners-Lee built an forerunner of the World Wide Web software at CERN in the early 1980s, employed a similar technique. But it was never widely used.

With active messages, the network can be viewed as a pipeline; its maximum rate determined by the communication overhead and the latency incurred by reading and writing the messages. Conceptually, the message is similar to a remote procedure call, but it's a non blocking operation so that communication and processing can proceed more or less independently. The bottlenecks caused by synchronisation do not go away with stream oriented processing, but many of the operations needed to coordinate the results from many threads can be absorbed into a smaller set of processes. This may be more manageable than trying to distribute the synchronisation problem across a huge number of threads. By delaying synchronisation it is possible to use speculation to squeeze more performance out of a processor array. This is the concept behind programming systems such as the Sieve C++ toolkit, developed by Codeplay. In many applications, threads will not usually attempt to overwrite each other's data – but doing so just one time out of a million can kill an application. So, they need to be protected from the consequences of doing so. If you work on that basis, you can fire off more work to worker threads before any checks need to be made. In Sieve C++, for example, parallelisable operations are put inside specially marked blocks of code. Code inside the blocks can be divided and distributed among parallel processors. This will read from shared memory, but not write to it directly. At the end of the block, the software will attempt to commit its results to shared memory: this is where compiler generated code checks whether there are conflicts and, if there are, force some of the work to be redone. As long as not too many blocks have to be restarted, it is possible to see performance gains that cannot be attained using conventional techniques. It is analogous to the speculative execution of high end superscalar processors, which attempts to uncover hidden parallelism in serial code. Although parallelisation is not straightforward, techniques have been developed that will ease the job and these are gradually making their way into compilers and hardware. However, it will take many more years before they can provide the insight a development team can have into how a single threaded application can be spread across many processors.

Engineering publications brought to you by Mark Allen

CPUs

Multicore processors: How can you programme them?

Comments

Related Articles

Engineering publications brought to you by Mark Allen

CPUs

Multicore processors: How can you programme them?

Comments

Related Articles

Multicore processors: Some of the technical issues

Concurrency checkers can improve multicore process performance

OUTLOOK 2017: The yin and yang of multicore

Multicore comms processors offer ‘unparalleled scalability’

C secure for some time

MIPS64 architecture powers Cavium’s new multicore processors