Loose threads


Timing can be tricky in embedded systems. Shortly after it touched down on the surface of Mars in 1997, NASA’s Pathfinder lander started to malfunction: a watchdog timer kept firing and resetting the system, leaving engineers on Earth to work out why.

There was no actual bug in the lines of code that controlled the robot. Instead, the bug was lurking in the way the different tasks interacted with each other.

To complete its work each cycle, a high-priority management task needed access to the system bus, but it was being locked out by a low-priority weather-monitoring thread that used the same bus. Because the weather thread was often pre-empted by a long-running, medium-priority communications task, it failed to finish and release the mutual-exclusion semaphore that guarded the bus. Without the semaphore, and so without the bus, the management task repeatedly missed its deadlines: a textbook case of priority inversion. A change to the way the kernel managed the lock, uploaded to the spacecraft, fixed the problem in perhaps the prime example of long-distance bug-fixing.
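That change amounted to turning on the priority-inheritance option of the semaphore. The POSIX sketch below shows the equivalent idea for a hypothetical bus_mutex; error handling is omitted and the names are illustrative.

    #include <pthread.h>

    static pthread_mutex_t bus_mutex;

    int init_bus_mutex(void)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* With priority inheritance, a low-priority thread holding the
         * mutex is temporarily boosted to the priority of the highest-
         * priority thread waiting for it, so a medium-priority task can
         * no longer starve it the way the weather thread was starved. */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);

        return pthread_mutex_init(&bus_mutex, &attr);
    }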

Now, two decades on, cloud-computing operators are struggling with odd holdups in their systems – a phenomenon known as tail latency. Most requests complete without a hitch in a matter of seconds, but a few are held up for much longer with no readily apparent cause. The software is the same and the requests made to the servers are nothing unusual; they just have to wait a very long time to be serviced.

Growing sophistication

As IoT and distributed cyber-physical systems become more sophisticated, embedded-systems designers are increasingly likely to face problems caused by scheduling issues at very different scales. At one end are the interactions between tasks caused by threads holding on too long to exclusive resources within a single device. At the other is the unpredictable tail latency that afflicts applications that do not live on a single node but are spread across many, both local and in the cloud. In the middle are the performance issues caused by interactions between threads, the operating system, hypervisors and the processors and memory on which they all depend.

A big difference between the kind of processor used in Pathfinder and those now employed in robotic controllers and cloud computers is the use of multiprocessing and caching. Both can affect the performance of the synchronisation techniques needed in multitasking systems. Such synchronisation is vital because uneven task loads can easily let one thread race ahead of another if left unchecked. Resource locking and thread synchronisation stop shared memory and I/O ports from being left holding corrupt data.
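As a toy illustration of the kind of corruption locking prevents (a made-up example, not code from any of the systems discussed here), the sketch below has two threads bump a shared counter. Remove the lock and unlock calls and the read-modify-write sequences interleave, so increments are silently lost.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);
            counter++;     /* without the lock, this read-modify-write can
                              interleave with the other thread's and some
                              increments disappear */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;

        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        printf("counter = %ld\n", counter);  /* 2000000 with the lock held */
        return 0;
    }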

If a lot of threads need to read data from the same location, a mutex of the kind used on Mars Pathfinder tends to hurt performance. So a broad range of techniques has emerged under the banner of lock-free synchronisation. In these, the application is either structured so that a lock is needed only to complete a write operation, or it relies on atomic operations that let a thread read, modify and publish a value as a single indivisible step.
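A minimal sketch of the idea in C11, with illustrative names: readers take no lock at all, while a writer retries an atomic compare-and-swap until its update lands.

    #include <stdatomic.h>

    static _Atomic unsigned long stats_counter = 0;

    /* Any number of readers can sample the value with no lock. */
    unsigned long read_stats(void)
    {
        return atomic_load_explicit(&stats_counter, memory_order_acquire);
    }

    /* The writer publishes its update with a compare-and-swap retry loop. */
    void bump_stats(unsigned long delta)
    {
        unsigned long old = atomic_load_explicit(&stats_counter,
                                                 memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   &stats_counter, &old, old + delta,
                   memory_order_release, memory_order_relaxed)) {
            /* another thread got in first; 'old' now holds the fresh
               value, so simply try again */
        }
    }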

However, interactions between applications and more advanced locking strategies can still lead to sudden, unexpected delays for unlucky threads – showing up in logs as excessive tail latency. For example, to obtain a lock a thread has to “spin” on it, testing its status repeatedly until it is free. Those memory accesses can slow down threads that have no logical connection to the lock at all, because of interactions between main memory and the various caches inside the system.
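A bare-bones spinlock, sketched below in C11, makes the mechanism concrete; the hammering on the flag inside the while loop is exactly the kind of memory traffic that can drag in unrelated threads.

    #include <stdatomic.h>

    typedef struct {
        atomic_flag locked;
    } spinlock_t;

    /* e.g. static spinlock_t bus_lock = { ATOMIC_FLAG_INIT }; */

    static void spin_lock(spinlock_t *l)
    {
        /* Each failed test-and-set is a write to the lock's cache line;
         * under contention that line ping-pongs between cores. */
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire)) {
            /* busy-wait until the current holder clears the flag */
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }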

In high-performance multicore SoCs the locks will very often be cached, with cache-coherency mechanisms used to make sure that processors which cannot see inside another core’s cache still observe updates to the lock. If locks and regular data lie on the same cache line, the threads accessing that data can have their performance degraded by the continual cache-coherency traffic. Simply reorganising memory can make the problem disappear, which is one reason why tail latency and similar problems can be so hard to track down.
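One common reorganisation is simply to pad the lock onto a cache line of its own, as in the sketch below. The 64-byte line size is an assumption and needs checking against the target part.

    #include <stdalign.h>
    #include <stdatomic.h>

    #define CACHE_LINE 64   /* assumed line size; check the SoC documentation */

    struct shared_state {
        /* The lock gets a cache line to itself... */
        alignas(CACHE_LINE) atomic_flag lock;
        /* ...so that spinning on it no longer invalidates the line
         * holding the data it guards. */
        alignas(CACHE_LINE) unsigned long samples[8];
    };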

Since the Pathfinder mission, a number of tools have appeared that are designed to help embedded-systems designers sift through the mountains of data needed to get a picture of what is happening inside a single system. As IoT and distributed cyber-physical systems take hold, similar but larger-scale tools are beginning to appear.

Green Hills Software was one of the early converts to the idea of making it easier to analyse embedded-systems performance from different perspectives. In 2006, alongside a debugger that could work backwards along a timeline from a fault, the company introduced PathAnalyzer. The tool traces the flow of execution through function calls, showing the path taken through application code, interrupt service routines and context switches.

Similar tools have been applied in the server world, focusing on block I/O, thread wakeup events or function-call time, presenting views known as “flame graphs”. Because they show time taken in each function stacked on top of the calling code, the graphs tend to resemble a burning fire. The widest flame bases are where the application is spending the bulk of its time.

The server environment tends to use open-source tools, some of which have been incorporated into development environments for embedded systems through suppliers such as MontaVista. Other tools, such as RapiTime from Rapita Systems, focus on bulk properties such as worst-case execution time to determine how much variability an application suffers from over hours or days.

Overhead imposition

One issue with determining where scheduling and memory-access bottlenecks are affecting an application’s performance is that the techniques used to obtain the information can themselves impose an overhead. The most intrusive option is instrumenting the code directly: using a debug build in which the function calls themselves update a database of entry and exit timestamps. Adding these calls increases runtime and can inadvertently hide problems because the cache-access patterns change.
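The -finstrument-functions option in GCC and Clang is one well-known way of doing this: the compiler calls a pair of hooks on every function entry and exit, which is precisely where the extra runtime and the cache disturbance come from. The logging scheme below is illustrative only.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    __attribute__((no_instrument_function))
    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
    }

    /* Keep the hooks themselves uninstrumented to avoid recursion. */
    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *fn, void *call_site)
    {
        fprintf(stderr, "E %p %p %llu\n", fn, call_site,
                (unsigned long long)now_ns());
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *fn, void *call_site)
    {
        fprintf(stderr, "X %p %p %llu\n", fn, call_site,
                (unsigned long long)now_ns());
    }

    /* Build the application with: gcc -finstrument-functions app.c hooks.c */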

Developers in the server world tend to use less intrusive sampling profilers. These hook into periodic timer interrupts and probe each processor’s call stack to work out where the application is. There is, naturally, a trade-off between accuracy and profiling overhead, which makes it easy to miss important but short-lived events that may have a knock-on effect on performance.
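In outline, such a profiler is little more than a periodic interrupt that grabs the call stack of whatever happened to be running, as in the rough Linux-flavoured sketch below. Neither fprintf nor backtrace is strictly safe inside a signal handler, so a real tool would write raw addresses into a pre-allocated buffer and symbolise them offline.

    #include <execinfo.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    static void on_sample(int sig)
    {
        void *frames[64];
        int depth = backtrace(frames, 64);  /* where is the program now? */
        /* A real profiler would record 'frames'; this just notes the depth. */
        fprintf(stderr, "sample: %d frames deep\n", depth);
        (void)sig;
    }

    int main(void)
    {
        struct itimerval every_10ms = {
            .it_interval = { .tv_sec = 0, .tv_usec = 10000 },
            .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
        };

        signal(SIGPROF, on_sample);
        setitimer(ITIMER_PROF, &every_10ms, NULL); /* fires only while on-CPU */

        for (volatile long i = 0; i < 500000000L; i++)
            ;                                      /* work being profiled */
        return 0;
    }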

Embedded developers often have access to hardware trace ports and debug support, so they can frequently avoid the overhead of instrumentation. Although some tools will insert labels into the compiled code to aid the decoding of the raw trace stream coming out of the processor, these changes have no impact on actual performance.

The instrumentation situation is being made more complicated by the use of accelerators in both embedded systems and server blades. These can interact with the caches and take over a lot of memory operations, but they often do not have their own trace mechanisms. To help deal with the problem, SoC makers are putting performance monitors into their silicon to augment the software trace available from the main processor.

IP suppliers such as UltraSoC propose SoC- or even system-wide debug and trace networks that tap into the various processors and accelerators. Visualisation tools would bring those pieces of data together into flame graphs and potentially more advanced graphical presentations. As distributed systems morph into systems of systems, such tools will need to evolve to track down the hidden defects that cause them to suddenly stop doing their job for no apparent reason.