10 July 2012

ARM's big.LITTLE systems provide more processing power for less energy

Traditionally, it has not been possible to design a processor that offers both high performance and high energy efficiency. Solutions have typically involved integrating processors with microarchitectures optimised for performance and energy efficiency respectively – for example, a high performance application processor coupled with a low power asic.

This concept, heterogeneous multiprocessing, involves putting a number of specialised cores together – such as, say, an application processor and a baseband radio processor – and running the tasks suited to each specialised core as needed.

ARM's solution is to couple a high performance processor with a highly energy efficient counterpart while retaining full instruction set architecture compatibility. This coupled design, known as big.LITTLE processing, provides new options for matching compute capacity to workloads by enabling optimal distribution of the load between the 'big' and 'LITTLE' cores.

The current generation of big.LITTLE design pairs a Cortex-A15 processor cluster with an energy efficient Cortex-A7 processor cluster. These processors are 100% architecturally compatible and have the same functionality – support for LPAE, virtualisation extensions and functional units such as NEON and VFP – allowing software applications compiled for one processor type to run on the other without modification.

Both processor clusters are fully cache coherent, enabled by ARM's CoreLink Cache Coherent Interconnect (CCI-400, see fig 1). This also enables I/O coherency with other components, such as a Mali-T604 gpu. CPUs in both clusters can signal each other via a shared interrupt controller, such as the CoreLink GIC-400.

Since the same application can run on a Cortex-A7 or a Cortex-A15 without modification, this opens up the possibility of mapping applications to the right processor on an opportunistic basis. This underpins a number of execution models, namely:

* big.LITTLE migration
* big.LITTLE multiprocessing

Migration models can be further divided into types:
* Cluster migration
* CPU migration

Migration models enable software context to be captured on one processor type and restored on the other. In cluster migration, the software stack runs on only one cluster at a time. In cpu migration, each cpu in a cluster is paired with its counterpart in the other cluster and the software context is migrated opportunistically between clusters on a per cpu basis.

The multiprocessing model enables the software stack to be distributed across processors in both clusters and all cpus may be in operation simultaneously.

Migration models
Migration models are a natural extension of power performance management techniques such as dynamic voltage and frequency scaling (dvfs). A migration action is akin to a dvfs operating point transition: operating points on a processor's dvfs curve are traversed in response to load variations. When the current processor (or cluster) has reached its highest operating point and the software stack requires more performance, a processor (or cluster) migration action is effected (see fig 2). Execution then continues on the other processor (or cluster), traversing its operating points in turn. When the extra performance is no longer needed, execution can revert.
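As an illustration of this relationship, the sketch below models a governor that steps through a dvfs curve and treats migration as the step beyond the highest operating point. The operating point values, class and method names are all invented for this example – this is not ARM's implementation.

```python
# Illustrative model only: operating points and names are invented for
# this sketch and do not come from ARM's software.

LITTLE_OPPS = [200, 400, 600, 800]    # MHz, hypothetical Cortex-A7 curve
BIG_OPPS = [800, 1200, 1500, 2000]    # MHz, hypothetical Cortex-A15 curve

class MigrationGovernor:
    def __init__(self):
        self.cluster = "LITTLE"
        self.index = 0                # position on the current dvfs curve

    def _curve(self):
        return LITTLE_OPPS if self.cluster == "LITTLE" else BIG_OPPS

    def demand_more(self):
        """Load has risen: step up the curve, or migrate at the top."""
        if self.index < len(self._curve()) - 1:
            self.index += 1
        elif self.cluster == "LITTLE":
            # Highest LITTLE operating point reached: migration action.
            self.cluster, self.index = "big", 0

    def demand_less(self):
        """Load has fallen: step down the curve, or revert to LITTLE."""
        if self.index > 0:
            self.index -= 1
        elif self.cluster == "big":
            self.cluster, self.index = "LITTLE", len(LITTLE_OPPS) - 1

    def frequency(self):
        return self._curve()[self.index]
```

Stepping up past the LITTLE cluster's top operating point hands execution to the big cluster at a comparable frequency; stepping back down reverses the migration.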

Coherency is a critical enabler in achieving fast migration time as it allows the state that has been saved on the outbound processor to be snooped and restored on the inbound processor, rather than going via main memory. Additionally, because the L2 cache of the outbound processor is coherent, it can remain powered up after a task migration to improve the cache warming time of the inbound processor through snooping of data values. However, since the L2 cache of the outbound processor cannot be allocated, it will eventually need to be cleaned and powered off to save leakage power.

Cluster migration
Only one cluster, either big or LITTLE, is active at any one time, except briefly during a cluster context switch. The aim is to stay resident on the energy efficient Cortex-A7 cluster, while using the Cortex-A15 cluster opportunistically.

If the load warrants a change from big to LITTLE, or vice versa, the system synchronises all cores and then transfers all software context to the other cluster. As part of this process, the operating system on every processor has to save its state whilst still operating on the old (outbound) cluster. When the new (inbound) cluster boots, the operating systems need to restore their state on each processor. Once the inbound cluster has restored context, the outbound cluster is switched off.
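The save/boot/restore/power-off sequence can be sketched as follows. The Cluster and Cpu classes and their method names are invented for illustration; they are not ARM's or Linux's API.

```python
# Minimal model of a cluster context switch; all names are hypothetical.

class Cpu:
    def __init__(self):
        self.context = None           # the software state of this core

    def save_context(self):
        return self.context

    def restore_context(self, context):
        self.context = context

class Cluster:
    def __init__(self, n_cpus):
        self.cpus = [Cpu() for _ in range(n_cpus)]
        self.powered = False

    def power_on(self):
        self.powered = True

    def power_off(self):
        self.powered = False

def cluster_switch(outbound, inbound):
    """Migrate all software context from the outbound to the inbound cluster."""
    # 1. Save state on every core while still running on the outbound cluster.
    saved = [cpu.save_context() for cpu in outbound.cpus]
    # 2. Boot the inbound cluster.
    inbound.power_on()
    # 3. Restore each core's context on its counterpart.
    for cpu, context in zip(inbound.cpus, saved):
        cpu.restore_context(context)
    # 4. Only once context is restored is the outbound cluster switched off.
    outbound.power_off()
```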

The model works most efficiently with a symmetric big.LITTLE system – the same number of cores in both clusters. An asymmetric system would require additional operating system involvement to scale execution down to the smaller cluster's core count before migration could take place. While this is possible, it increases migration latency.

Cluster migration can be implemented alongside existing operating system power management functionality (such as idle management) with about the same complexity.

CPU migration
In cpu migration (see fig 3), each processor on the LITTLE cluster is paired with a processor on the big cluster – CPU0 on the Cortex-A7 with CPU0 on the Cortex-A15, CPU1 with CPU1 and so on. Only one cpu in each pair can be active at any one time.

The system actively monitors the load on each processor: high load causes the execution context to be moved to the big core, while low load sees execution moved back to the LITTLE core. When the load is moved from an outbound core to an inbound core, the former is switched off. This model allows a mix of big and LITTLE cores to be active at any one time.
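The per-pair decision described above can be sketched as follows. The class name and threshold values are invented for this example; real systems would use measured load averages.

```python
# Illustrative model of cpu migration: one core per pair active at a time.
# Threshold values are invented for this sketch.

HIGH_LOAD = 0.8    # above this, move execution to the big core
LOW_LOAD = 0.2     # below this, move execution back to the LITTLE core

class CpuPair:
    """A LITTLE core paired with its big counterpart (e.g. CPU0/CPU0)."""

    def __init__(self):
        self.active = "LITTLE"

    def update(self, load):
        """Pick the active core for this pair; the other is switched off."""
        if load > HIGH_LOAD and self.active == "LITTLE":
            self.active = "big"
        elif load < LOW_LOAD and self.active == "big":
            self.active = "LITTLE"
        return self.active
```

Because each pair decides independently, a mix of big and LITTLE cores can be active at any one time, as the model describes.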

big.LITTLE multiprocessing
Since a big.LITTLE system is fully coherent through the CCI-400, another model is to allow both Cortex-A15 and Cortex-A7 processors to be powered on and executing code simultaneously. This is called big.LITTLE MP – essentially heterogeneous multiprocessing. This is the most sophisticated and flexible mode for a big.LITTLE system, involving scaling a single execution environment across both clusters. In this use model, a Cortex-A15 processor core is powered on and executing simultaneously with a Cortex-A7 processor core if there are threads that need such a level of processing performance. If not, only the Cortex-A7 processor needs to be powered on. Since processor cores are not matched explicitly, asymmetric topologies are simpler to support with MP operation.

With big.LITTLE MP, the operating system requires a higher degree of customisation to extract maximum benefit from the design. For example, the scheduler subsystem needs to be aware of the power-performance capabilities of the different processors and to map tasks to suitable processors accordingly. Here, the operating system runs across all active processors in both clusters and can migrate tasks between them. Because the scheduler is responsible for the directed placement of tasks on suitable processors, this is a complex mode: the OS attempts to map each task to the processor best suited to running it and powers off unused, or underused, processors.

Multiprocessor support in the Linux kernel assumes that all processors have identical performance capabilities and therefore a task can be allocated to any available processor. This means support for full big.LITTLE MP requires significant changes to the scheduling and power management components of Linux.
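A deliberately simplified placement policy along these lines might look as follows. This is a toy sketch, not the Linux scheduler; the task loads, core names and threshold are all invented.

```python
# Toy big.LITTLE MP placement: heavy tasks to big cores, light tasks to
# LITTLE cores, unused cores powered off. All names and values invented.

BIG_THRESHOLD = 0.6    # tasks at or above this load prefer a big core

def place_tasks(tasks, big_cores, little_cores):
    """Map (name, load) tasks to cores; return placement and idle cores."""
    placement = {}
    big_free, little_free = list(big_cores), list(little_cores)
    for name, load in sorted(tasks, key=lambda t: -t[1]):   # heaviest first
        pool = big_free if load >= BIG_THRESHOLD and big_free else little_free
        if not pool:
            pool = big_free or little_free    # fall back to any free core
        placement[name] = pool.pop(0)
    # Cores with nothing to run can be powered off to save energy.
    powered_off = set(big_cores + little_cores) - set(placement.values())
    return placement, powered_off
```

A production scheduler would also track run-queue history, migration cost and thermal limits; the point here is only the asymmetric task-to-core mapping that stock SMP scheduling lacks.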

Author profile:
Robin Randhawa is a principal engineer with ARM.


This material is protected by Findlay Media copyright