10 July 2012
ARM's big.LITTLE systems provide more processing power for less energy
Traditionally, it has not been possible to design a single processor that offers both high performance and high energy efficiency. Solutions have typically involved integrating processors with microarchitectures optimised for performance and energy efficiency respectively. An example is a high performance application processor coupled with a low power ASIC.
This concept, heterogeneous multiprocessing, involves putting a number of specialised cores together – such as an application processor and a baseband radio processor – and running tasks suited to each specialised core as needed.
ARM's solution is to couple a high performance processor with a highly energy efficient counterpart while retaining full instruction set architecture compatibility. This coupled design, known as big.LITTLE processing, provides new options for matching compute capacity to workloads by enabling optimal distribution of the load between the 'big' and 'LITTLE' cores.
The current generation of big.LITTLE design pairs a Cortex-A15 processor cluster with an energy efficient Cortex-A7 processor cluster. These processors are 100% architecturally compatible and have the same functionality – support for LPAE, virtualisation extensions and functional units such as NEON and VFP – allowing software applications compiled for one processor type to run on the other without modification.
Both processor clusters are fully cache coherent, enabled by ARM's CoreLink Cache Coherent Interconnect (CCI-400, see fig 1). This also enables I/O coherency with other components, such as a Mali-T604 GPU. CPUs in both clusters can signal each other via a shared interrupt controller, such as the CoreLink GIC-400.
Since the same application can run on a Cortex-A7 or a Cortex-A15 without modification, this opens up the possibility of mapping applications to the right processor on an opportunistic basis. This underpins a number of execution models, namely:
* big.LITTLE migration
* big.LITTLE multiprocessing
Migration models can be further divided into two types:
* Cluster migration
* CPU migration
Migration models enable the capture and restoration of software context from one processor type to another. In cluster migration, the software stack only runs on one cluster at a time. In CPU migration, each CPU in a cluster is paired with its counterpart in the other cluster and the software context is migrated opportunistically between clusters on a per-CPU basis.
The multiprocessing model enables the software stack to be distributed across processors in both clusters, so that all CPUs may be in operation simultaneously.
Migration models are a natural extension to power-performance management techniques such as dynamic voltage and frequency scaling (DVFS). A migration action is akin to a DVFS operating point transition. Operating points on a processor's DVFS curve are traversed in response to load variations. When the current processor (or cluster) has reached its highest operating point and the software stack requires more performance, a processor (or cluster) migration is effected (see fig 2). Execution then continues on the other processor (or cluster), whose operating points are traversed in turn. When the extra performance is no longer needed, execution can revert to the original processor (or cluster).
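The idea that migration is simply one more step on the DVFS curve can be sketched in a few lines of Python. This is a minimal illustration rather than ARM's implementation: the frequency tables, the function name `next_state`, the choice to enter the big cluster at its lowest operating point and the choice to revert only from the bottom of the big curve are all assumptions made for the example.

```python
# Hypothetical operating points in MHz; illustrative values, not ARM data.
LITTLE_POINTS = [200, 400, 600, 800, 1000]
BIG_POINTS = [800, 1000, 1200, 1400, 1600]


def next_state(cluster, index, want_more):
    """Return the (cluster, index) pair after one governor step.

    cluster is "LITTLE" or "big"; index selects an operating point in
    that cluster's table; want_more is True when the load calls for
    more performance and False when it calls for less.
    """
    points = LITTLE_POINTS if cluster == "LITTLE" else BIG_POINTS
    if want_more:
        if index < len(points) - 1:
            return cluster, index + 1            # ordinary DVFS step up
        if cluster == "LITTLE":
            return "big", 0                      # top of LITTLE curve: migrate
        return cluster, index                    # already at the top of big
    if index > 0:
        return cluster, index - 1                # ordinary DVFS step down
    if cluster == "big":
        return "LITTLE", len(LITTLE_POINTS) - 1  # bottom of big curve: revert
    return cluster, index                        # already at the bottom of LITTLE
```

Calling `next_state("LITTLE", 4, True)` from the top of the LITTLE table yields a migration to the big cluster, while the same call from a lower index is an ordinary frequency step.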
Coherency is a critical enabler of fast migration, as it allows the state saved on the outbound processor to be snooped and restored on the inbound processor, rather than going via main memory. Additionally, because the L2 cache of the outbound processor is coherent, it can remain powered up after a task migration so that the inbound processor can snoop data values from it, which shortens its cache warming time. However, since no new lines can be allocated in the outbound processor's L2 cache, it will eventually need to be cleaned and powered off to save leakage power.
In cluster migration, only one cluster, either big or LITTLE, is active at any one time, except very briefly during a cluster context switch. The aim is to stay resident on the energy efficient Cortex-A7 cluster, while using the Cortex-A15 cluster opportunistically.
If the load warrants a change from big to LITTLE, or vice versa, the system synchronises all cores and then transfers all software context to the other cluster. As part of this process, the operating system has to save its state on every processor whilst still operating on the old (outbound) cluster. When the new (inbound) cluster boots, the operating system has to restore its state on each processor. Once the inbound cluster has restored context, the outbound cluster is switched off.
This mode works most efficiently with a symmetric big.LITTLE system – the same number of cores in both clusters. An asymmetric system would require additional operating system involvement to scale execution down to the number of cores common to both clusters before migration could take place. While this is possible, it will increase the latency.
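The ordering of the steps above matters: state is saved before the inbound cluster takes over, and the outbound cluster is switched off only after the restore. A minimal Python sketch of that sequence follows. The dictionary layout, the function name `migrate_cluster` and the step labels are hypothetical, and the sketch assumes a symmetric system with one context per core.

```python
def migrate_cluster(outbound, inbound):
    """Move every CPU context from the outbound to the inbound cluster.

    Each cluster is a dict with "powered" (bool) and "contexts" (a list
    holding one context per core, or None for a core with no context).
    Returns the ordered list of steps taken, so the sequence can be
    inspected.
    """
    if len(outbound["contexts"]) != len(inbound["contexts"]):
        raise ValueError("this sketch assumes a symmetric system")
    steps = []

    # 1. With all cores synchronised, save state while still running
    #    on the outbound cluster.
    saved = list(outbound["contexts"])
    steps.append("save on outbound")

    # 2. Power up the inbound cluster; both clusters are briefly on.
    inbound["powered"] = True
    steps.append("boot inbound")

    # 3. Restore each saved context on the matching inbound core.
    inbound["contexts"] = saved
    steps.append("restore on inbound")

    # 4. Only now is the outbound cluster switched off.
    outbound["contexts"] = [None] * len(saved)
    outbound["powered"] = False
    steps.append("power off outbound")
    return steps
```

For example, starting with a powered LITTLE cluster holding two contexts and an unpowered big cluster, `migrate_cluster(little, big)` leaves the contexts on the big cluster and the LITTLE cluster powered off.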
Cluster migration can be implemented alongside existing operating system power management functionality (such as idle management) with about the same complexity.
In CPU migration (see fig 3), each processor on the LITTLE cluster is paired with a processor on the big cluster. CPUs are divided into pairs (CPU0 on the Cortex-A15 and Cortex-A7 processors, CPU1 on the Cortex-A15 and Cortex-A7 processors and so on). When using CPU migration, only one CPU in each pair can be active at any one time.
The system actively monitors the load on each processor. High load causes the execution context to be moved to the big core, while if the load is low, execution is moved to the LITTLE core.
When the load is moved from an outbound core to an inbound core, the former is switched off. This model allows a mix of big and LITTLE cores to be active at any one time.
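The per-pair decision can be sketched as follows. The two thresholds, their values and the function name `select_cores` are assumptions made for illustration; the source says only that high load moves execution to the big core and low load moves it to the LITTLE core. Using separate up and down thresholds is one plausible way to avoid a pair bouncing between cores at a borderline load.

```python
UP_THRESHOLD = 0.8    # hypothetical: load above this moves a pair to big
DOWN_THRESHOLD = 0.3  # hypothetical: load below this moves a pair to LITTLE


def select_cores(active, loads):
    """Return the active core type for each CPU pair after one pass.

    active[i] is "big" or "LITTLE" for pair i; loads[i] is that pair's
    load as a fraction between 0.0 and 1.0. Pairs are decided
    independently, so the result can mix big and LITTLE cores.
    """
    result = []
    for core, load in zip(active, loads):
        if core == "LITTLE" and load > UP_THRESHOLD:
            result.append("big")     # migrate up; the LITTLE core is switched off
        elif core == "big" and load < DOWN_THRESHOLD:
            result.append("LITTLE")  # migrate down; the big core is switched off
        else:
            result.append(core)      # stay on the current core
    return result
```

Because each pair is evaluated on its own, a system with four pairs could end a pass with, say, one pair on a big core and three on LITTLE cores.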
Since a big.LITTLE system is fully coherent through the CCI-400, another model is to allow both Cortex-A15 and Cortex-A7 processors to be powered on and executing code simultaneously. This is called big.LITTLE MP – essentially heterogeneous multiprocessing. This is the most sophisticated and flexible mode for a big.LITTLE system, involving scaling a single execution environment across both clusters. In this use model, a Cortex-A15 processor core is powered on and executing simultaneously with a Cortex-A7 processor core if there are threads that need such a level of processing performance. If not, only the Cortex-A7 processor needs to be powered on. Since processor cores are not matched explicitly, asymmetric topologies are simpler to support with MP operation.
With big.LITTLE MP, the operating system requires a higher degree of customisation to extract maximum benefit from the design. For example, the scheduler subsystem will need to be aware of the power-performance capabilities of the different processors and will need to map tasks to suitable processors. Here, the operating system runs on all processors in both clusters, which may be operating simultaneously, and migrates tasks between processors in the two clusters. Since the scheduler is involved in the directed placement of suitable tasks on suitable processors, this is a complex mode. The OS attempts to map tasks to processors that are best suited to running those tasks and will power off unused, or underused, processors.
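A toy placement policy shows the kind of decision such a scheduler has to make. This is a deliberately simplified sketch, not how any real kernel scheduler works: the demand threshold `HEAVY`, the function name `place_tasks`, the core naming and the round-robin choice within a cluster are all assumptions.

```python
HEAVY = 0.6  # hypothetical demand above which a task is sent to a big core


def place_tasks(tasks, n_big, n_little):
    """Assign tasks to cores and report which cores need to be powered.

    tasks maps a task name to its demand (0.0 to 1.0). Returns
    (placement, powered): placement maps each task name to a core name
    such as "big0" or "LITTLE1"; powered is the set of cores in use.
    Cores outside that set can be switched off.
    """
    if n_big < 1 or n_little < 1:
        raise ValueError("this sketch needs at least one core per cluster")
    pools = {
        "big": [f"big{i}" for i in range(n_big)],
        "LITTLE": [f"LITTLE{i}" for i in range(n_little)],
    }
    used = {"big": 0, "LITTLE": 0}
    placement = {}
    # Place the most demanding tasks first.
    for name, demand in sorted(tasks.items(), key=lambda kv: -kv[1]):
        kind = "big" if demand > HEAVY else "LITTLE"
        pool = pools[kind]
        placement[name] = pool[used[kind] % len(pool)]  # round-robin in cluster
        used[kind] += 1
    return placement, set(placement.values())
```

With only light tasks in the system, no big core appears in the powered set, which mirrors the case where only the Cortex-A7 cluster needs to be on.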
Multiprocessor support in the Linux kernel assumes that all processors have identical performance capabilities and therefore a task can be allocated to any available processor. This means support for full big.LITTLE MP requires significant changes to the scheduling and power management components of Linux.
Robin Randhawa is a principal engineer with ARM.