Creating the software that drives embedded system design


The term 'embedded system' dates back to a time when most products had their code buried deep inside. The user interface might be nothing more than a segmented led display, and nobody expected the code inside to be updated short of a complete board replacement. Now, even a cooker might have some form of touch based graphical user interface, even if it is not quite up to the job of running downloadable apps. And, as more systems acquire network connections, customers expect to be able to update the products they use.

The problem is keeping track of the installed systems and providing code updates that are compatible across a family of systems, especially if the design has been updated over time to keep ahead of component obsolescence. In these situations, it can begin to make sense to look at development languages and tools that produce code independent of the target.

Sacrificing performance for greater target independence may seem anathema in embedded systems development, but the concept is growing – even in performance sensitive environments. Take the OpenCL environment, developed as a way of harnessing the enormous compute power of the modern graphics processing unit (gpu).

During the past decade, shader based gpus have become the standard method for rendering 3d graphics, thanks to changes in operating system architecture made by Microsoft, among others. Early 3d graphics pipelines were more or less hard coded and followed a rigid sequence of steps that transformed a set of triangle coordinates sent by the host processor into shaded and lit triangles. Whatever lighting model the 3d accelerator offered was the lighting model you got. The changes Microsoft made with the release of DirectX 8 helped put 3d graphics on a new path and ushered in the gpu: a programmable parallel processor that, while optimised for graphics, can be pressed into service for other work.

For example, the Nebula 3 audio software written by Acustica Audio uses Volterra kernels to simulate the behaviour of pro audio mixing consoles and equalisers, and can run much of its code on a gpu to offload the floating point intensive workload from the host x86 processor. Aberdeen based ffA uses arrays of gpus, rather than conventional supercomputers, to analyse oil well data.

To encourage users to develop applications that could run on its gpus, nVidia developed the CUDA framework, a combination of compiler and runtime tools. Although CUDA support has been extended to the x86, it is unlikely to run on other makes of gpu, such as those sold by AMD. To provide a more portable gpu computing environment, Apple developed OpenCL, which it passed on to the Khronos Group, the body that already handles standardisation of OpenGL for 2d and 3d graphics processing. In just six months, Khronos prepared version 1.0 of the OpenCL standard. Naturally, Apple's operating systems support OpenCL, but a number of gpu manufacturers, such as Imagination Technologies, have also embraced it.

The reason is portability. Instead of generating machine code, OpenCL development tools produce an intermediate format that is translated into actual gpu instructions at runtime. And the code does not have to run on a gpu: if one is not present, the runtime environment will generate code that can run on the host processor or, potentially, on another form of accelerator that can execute the core OpenCL operations.

OpenCL functions (see fig 1) take the form of compute kernels – tight, often vectorised, loops that can be used to process multiple data elements in parallel. The host's OpenCL runtime environment takes care of loading the kernels and data onto the target processors, triggering execution and retrieving the results. Because gpus are generally designed around processors with a small amount of fast local memory and limited access to global memory, it generally makes sense to load the kernels in this way.
OpenCL uses the concept of command queues to load and process kernels generated by the host program in sequence, working out on the fly how many threads – matched to the number of available processing elements – to generate. For situations where dynamic compilation incurs too much of an overhead, it is possible to generate binaries in advance and call a compatible one at runtime, although this will increase overall code size if the binary needs to run on a variety of targets.

OpenCL is one example of how code that is only converted to binary just in time (JIT) is moving into the embedded space. Even though the core components of an embedded system almost certainly rely on statically compiled code, an increasing number of elements use interpreted code to get part of the job done. For example, the chances are that a smartphone with a web browser will happily run Javascript, an interpreted language designed for modifying the look and behaviour of web pages dynamically.

Although Forth was arguably the first interpreted language suitable for use in resource constrained systems, Java was the first attempt by large software vendors to bring a portable, interpreted language into mainstream embedded systems, despite the runtime environment having some severe drawbacks for real time systems. At one level, Java was tailor made for resource constrained systems. When senior Sun Microsystems engineer James Gosling first conceived of Java's basic structure, one of the main design requirements was that it should be able to run on 32bit microprocessors with just a few registers, such as the x86. That led to the use of a stack architecture in its virtual machine – you only need two or three registers to implement a stack machine, well within the capabilities of most commercial 32bit architectures.

Unfortunately, that architecture imposed a limitation on speed, and not just because each sequence of Java bytecodes needed to be translated into actual machine code at runtime. One problem with a stack based architecture is that it makes it harder for the runtime environment on a conventional computer to detect opportunities for instruction parallelism. As everything has to pass through the top of the stack, instead of being directed to different registers, a superscalar machine has to look more deeply into the instruction pipeline to tease out independent operations. When Google decided to build the Dalvik virtual machine for the Android mobile operating system, it scrapped the stack architecture and decided instead to use a register based virtual architecture, on the basis that this should improve code density and execution speed (the contrast is sketched in the example below). Although programmers writing for Android will, for the most part, use Java language constructs and classes, the resulting code is converted into Dalvik's .dex format, which contains the instructions for Google's register based machine. However, research is continuing into superscalar stack machines, and the widespread adoption of data caches helps minimise the problem of data spilling out of the stack into slow main memory.

Java's speed is not helped by its conservative approach to data management. An analysis of the architecture of the standard Java Virtual Machine (JVM) uncovers a host of runtime bounds checks that need to be performed before any data can be processed. For real time systems, there is the additional hurdle of garbage collection: the process by which the JVM recovers memory from objects that are no longer in use.
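The stack versus register contrast is easier to see with a toy example. The fragment below is a C sketch written for this discussion (the mnemonics in the comments are invented, not real JVM or Dalvik bytecode) that evaluates (a + b) * c twice: once in the style of a stack machine, where every operand passes through the top of the stack, and once in the style of a three address register machine, where each operation names the values it reads and writes and so exposes its dependencies directly.

/* A toy illustration of stack versus register encodings of r = (a + b) * c.
 * The mnemonics in the comments are invented for this sketch. */
#include <stdio.h>

int main(void)
{
    int a = 2, b = 3, c = 4;

    /* Stack machine: all traffic flows through the top of the stack. */
    int stack[8], sp = 0;
    stack[sp++] = a;                     /* PUSH a */
    stack[sp++] = b;                     /* PUSH b */
    sp--; stack[sp - 1] += stack[sp];    /* ADD    */
    stack[sp++] = c;                     /* PUSH c */
    sp--; stack[sp - 1] *= stack[sp];    /* MUL    */
    printf("stack result: %d\n", stack[sp - 1]);

    /* Register machine: three address instructions name their operands,
     * so what each one reads and writes is visible at a glance. */
    int r[4];
    r[0] = a; r[1] = b; r[2] = c;        /* LOAD r0,a  LOAD r1,b  LOAD r2,c */
    r[3] = r[0] + r[1];                  /* ADD  r3, r0, r1 */
    r[3] = r[3] * r[2];                  /* MUL  r3, r3, r2 */
    printf("register result: %d\n", r[3]);

    return 0;
}

In the register form, a dependency analyser can see immediately that the three loads are independent of one another; in the stack form, that information has to be reconstructed from the order in which values reach the top of the stack.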
One advantage to the programmer of using Java is that, unlike languages such as C or C++, it does not suffer from memory leaks: the situation where a programmer allocates memory from the shared heap, but writes the code in such a way that the program never frees it until the program itself terminates and the operating system steps in to reclaim all of the memory blocks (the C sketch at the end of this discussion shows the pattern). The problem for real time systems is that, once started, it is hard to interrupt the garbage collection process. It is possible to write a collector that can suspend its work while another thread runs, but the increased overhead means this is rarely worthwhile outside the domain of systems that require all threads to be pre emptible to meet their deadlines.

A number of companies clubbed together to develop a real time specification for Java so the language could be used in military and other embedded systems. The main change is to set aside a part of the system that cannot be pre empted by the Java garbage collector. Conventional Java memory management techniques cannot be used here; instead, threads use either 'immortal' memory – objects that are not destroyed until the program ends – or scoped memory, in which any objects created within the scope of a method or function call are automatically deleted once execution falls out of that scope. To avoid memory leaks, programmers need to make sure the program does indeed exit the scope, rather than holding it open until the program itself terminates. The latest version of the Java real time system allows threads out of the sandbox that is meant to protect programs from each other. This makes it possible to write device drivers in Java, rather than in a fully compiled language such as C.

To speed code execution, many Java environments use JIT compilation, rather than direct interpretation of each bytecode. Because it translates entire functions into machine code, JIT compilation normally results in faster code. However, because the compilation step incurs an overhead, implementations such as Oracle's HotSpot (see fig 2) will execute code using an interpreter until an invocation counter for the method or function call hits a threshold value. To avoid blocking a thread that suddenly receives the compiler treatment, the JVM can perform the translation as a background task while the thread continues to be interpreted until the compiled code is ready. This avoids forcing a thread to miss its deadline if compilation takes longer than simply running the interpreter loop. Where the developer knows that critical sections of the code will call for execution as compiled code, it is possible to mark them in the real time system as initialisation time methods – they are compiled when the classes that contain them are initialised. These classes need to be set up during the boot phase so as not to slow the system when it is fully operational. Initially, Android implementations only interpreted Dalvik instructions, but later versions have included a JIT compiler to improve execution speed.

Not long after Sun released its Java implementation as a product, Microsoft decided to develop an approach to retargetable applications that would not be limited to one language. The company had a long history of working with bytecode based interpreted languages, built on the same pseudocode (p-code) system developed at the University of California at San Diego from which Gosling took inspiration for the bytecode language of the JVM.
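The memory leak mentioned at the start of this discussion is easy to reproduce in C. The sketch below is illustrative (handle_message and its caller are invented names), but it shows the classic pattern: a heap block is allocated for every message, the only pointer to the previous block is overwritten, and nothing is freed until the process exits. The fix differs by a single line, which is exactly the kind of detail a garbage collected runtime takes out of the programmer's hands.

/* A deliberately leaky C fragment of the kind described above. */
#include <stdlib.h>
#include <string.h>

static char *latest;            /* only ever holds the most recent buffer */

void handle_message(const char *msg)
{
    char *copy = malloc(strlen(msg) + 1);
    if (copy == NULL)
        return;
    strcpy(copy, msg);
    /* Bug: the previous buffer is still allocated but now unreachable. */
    latest = copy;
}

/* The fix is one line: release the old buffer before losing the pointer. */
void handle_message_fixed(const char *msg)
{
    char *copy = malloc(strlen(msg) + 1);
    if (copy == NULL)
        return;
    strcpy(copy, msg);
    free(latest);               /* free(NULL) is defined to do nothing */
    latest = copy;
}

int main(void)
{
    for (int i = 0; i < 1000000; i++)
        handle_message("sensor reading");   /* leaks one buffer per call */
    return 0;
}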
Microsoft's work resulted in the Common Language Runtime (CLR) that underpins the .Net software framework (see fig 3). The CLR is, effectively, the software giant's answer to the JVM, while .Net encompasses similar functions to those found in the large class libraries of the various Java editions, such as Enterprise Edition and Micro Edition. The CLR takes care of running programs that have been compiled into the company's successor to p-code, the Common Intermediate Language (CIL). Like Java bytecode and p-code before it, CIL is stack based, with a number of extensions, similar to those employed by Java, that make object oriented code easier to run. For example, metadata describes the interfaces and methods used by each compiled class.

Like Java, the CLR needs to run garbage collection to clean up memory, and there is no real time version of this. Applications that need deterministic behaviour must only create objects and allocate the memory associated with them during initialisation, not at runtime – reflecting the way in which immortal memory is used in real time Java implementations (the C sketch below shows the same discipline).

In practice, portability is more restricted than with Java applications, as the official versions of the CLR run only on Microsoft operating systems. While implementations exist for other operating systems, such as Mono and Portable.Net, the last stable release of the latter was published several years ago. The main advantage of the CLR is its flexibility in terms of development languages, which started with C#, an object oriented derivative of C with some of its more error prone features removed and with additions for functional programming, such as lambda expressions. Microsoft has gone further with functional programming for .Net with the F# language. Other languages that run on the CLR include versions of Python and Ruby, as well as P#, a version of Prolog.

The runtime performance of all of these managed virtual machine environments is still an issue. However, as with the shift from assembly language to compiled code, their backers believe that improvements in runtime compilation technology, together with the willingness to use new hardware features in processors as they become available, will gradually allow JIT compiled applications to improve on the performance of programs compiled ahead of time, which are currently less able to take advantage of specific hardware accelerators that may be present in a given target machine.

By cutting down on the amount of porting work needed, managed code applications should also bring lower development and maintenance burdens. For real time systems, however, this has to be traded off against the increased complexity of dealing with features such as garbage collection. But, as the use of interpreted languages written for online use, such as Javascript, increases, many more embedded systems are likely to run not just one virtual machine, but a number of them at once.
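The 'allocate only during initialisation' rule translates naturally into C, even though C has no collector at all. The sketch below is an analogy written for this discussion, not anything specified by the CLR or the real time Java standard: a fixed pool of message structures is reserved before the system goes live, and the running code only claims and releases entries from that pool, so the timing of the critical path never depends on the heap or on a collector.

/* Illustrative C analogue of allocating only at initialisation. */
#include <stdio.h>

#define POOL_SIZE 32

struct message {
    int in_use;
    char payload[64];
};

static struct message pool[POOL_SIZE];    /* reserved before the system goes live */

/* Runtime path: claim a free slot; never calls malloc(). */
static struct message *msg_acquire(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    return NULL;                          /* pool exhausted: a sizing error, not a leak */
}

static void msg_release(struct message *m)
{
    m->in_use = 0;
}

int main(void)
{
    struct message *m = msg_acquire();
    if (m != NULL) {
        snprintf(m->payload, sizeof m->payload, "temperature=%d", 21);
        puts(m->payload);
        msg_release(m);
    }
    return 0;
}

Pool exhaustion then becomes a sizing decision made at design time rather than a memory management problem discovered at runtime.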