How to achieve ultra-efficiency in 64-bit compute

This article describes how the latest ARM Cortex-A35 processor combines dual 64-bit/32-bit computing with targeted enhancements to memory, CPU and power-management features to deliver the efficiency needed to satisfy the future demands of fast-growing markets for entry-level mobile devices.

Opportunities in entry-level mobile markets

The entry-level mobile segment (entry-level smartphones and tablets) is the most rapidly expanding part of the market, growing at a CAGR of 8%. Shipments are projected to exceed one billion by 2020, which represents a significant revenue opportunity for all players in this segment. This growth has been achieved by providing a rich feature set, similar to that of more expensive models, at a price that is accessible to first-time and budget-conscious buyers. Products are usually priced between $50 and $200.

For the underlying computing platform, efficiency is the key to meeting performance and price targets. Historically, the ARM Cortex-A5 and Cortex-A7 processors have satisfied the needs of the entry level, shipping in over two billion entry-level smartphones to date.

Burgeoning entry-level mobile markets are now demanding more capabilities and even better user experiences. Improved performance often comes at increased cost, but the next generation of entry-level handsets must meet these demands while retaining an affordable price tag. The rapid proliferation of 64-bit computing in premium smartphones is also creating opportunities to provide 64-bit compute in the affordable smartphone segments. To power the next generation of entry-level smartphones, ARM has designed a small, ultra-efficient 64-bit processor that improves the performance of demanding mobile workloads within the entry-level mobile power budget.

While some improvements have been gained from process and technology advances made since the introduction of the Cortex-A5 and Cortex-A7, the new ARM Cortex-A35 processor also boosts efficiency through enhancements to the processor microarchitecture that target the most popular activities of entry-level device users. These tend to be web browsing, gaming, audio, sharing images and video, and mobile payments.

The move to 64-bit

The Cortex-A35 processor is based on the latest ARMv8-A architecture, and supports both 32-bit and 64-bit computing. Since software development activity in the 32-bit domain remains strong, legacy support is vital. However, the superior memory- and data-handling capabilities of 64-bit compute deliver clear advantages when challenged by the increasing sophistication of applications. Efficiency when processing large files is also improved. This allows for faster data manipulation for modern mobile workloads and compute-intensive applications. It also opens the opportunity for applications that address more than 4 GB of RAM.
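The 4 GB figure follows directly from pointer width. As a rough illustration, the arithmetic below shows how much memory a flat, byte-addressed pointer of a given width can reach. The function name is hypothetical, and the 48-bit case is an assumption added for context (a virtual address width commonly used with AArch64), not a figure taken from this article.

```python
def addressable_bytes(pointer_bits: int) -> int:
    """Bytes reachable with a flat, byte-addressed pointer of the given width."""
    return 2 ** pointer_bits

GIB = 2 ** 30
TIB = 2 ** 40

limit_32 = addressable_bytes(32)  # 32-bit pointers
limit_48 = addressable_bytes(48)  # assumed AArch64 virtual address width

print(f"32-bit: {limit_32 // GIB} GiB")  # prints "32-bit: 4 GiB"
print(f"48-bit: {limit_48 // TIB} TiB")  # prints "48-bit: 256 TiB"
```

An application that needs to map more than 4 GiB at once therefore cannot do so with 32-bit pointers alone, which is the opportunity the 64-bit state opens up.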

The ARMv8-A architecture supports distinct 32-bit and 64-bit processor execution states. The 32-bit state, known as AArch32, delivers improved 32-bit performance compared to previous generations by incorporating enhancements such as new cryptographic and floating-point instructions. The currently published benchmarking analysis for the Cortex-A35 processor was performed in the AArch32 execution state.

The AArch64 state supports a new instruction set, A64, that allows the processor to address larger chunks of data, and introduces further improvements such as IEEE-compliant double-precision floating-point vector operations. The coexistence of the AArch32 and AArch64 states allows any device with an ARMv8 architecture processor to run 32-bit apps on a 64-bit operating system (OS). It is worth noting that the Cortex-A35 processor delivers this dual 64-bit and 32-bit capability in 25% less silicon area than the Cortex-A53, the first power-optimised ARMv8-A processor.

Several mobile workloads, such as web browsing and multimedia, are very memory intensive and move large amounts of data between memory and the processor. The Cortex-A35 processor is architected to deliver significant improvements in memory performance compared with the Cortex-A7 processor. The major changes include an improved prefetching mechanism that can automatically prefetch multiple streams of data. This compares with the single-stream capability of the ARMv7-A based Cortex-A7 processor. Other microarchitecture improvements, such as automatic write stream detection, the ability to support more outstanding data cache misses, and a doubling of the translation lookaside buffer (TLB) depth, improve the handling of large sets of data and enhance address translation performance. In addition, further changes to the memory subsystem improve utilisation of the L1 and L2 memories and the main memory bus, and also reduce contention. Figure 1 illustrates the memory-streaming performance advantage of the Cortex-A35 processor over the Cortex-A7 processor.

Figure 1. Relative memory streaming performance. Improving memory streaming helps accelerate key mobile workloads.
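To see why tracking several streams matters, consider a copy loop that reads from two buffers alternately. The sketch below is a deliberately simplified next-line stream prefetcher, not a model of the actual Cortex-A35 or Cortex-A7 hardware. The function, its eviction policy and the access pattern are all assumptions chosen for illustration.

```python
def prefetch_hits(line_numbers, max_streams):
    """Count accesses that land on a cache line fetched ahead of demand.

    Toy model: a stream is confirmed when an access falls on the line
    directly after the stream's last line; each confirmed access prefetches
    the following line. At most `max_streams` streams are tracked and the
    least recently used one is evicted.
    """
    streams = []        # last line seen per tracked stream, oldest first
    prefetched = set()  # lines brought in ahead of demand
    hits = 0
    for line in line_numbers:
        if line in prefetched:
            hits += 1
        for i, last in enumerate(streams):
            if line == last + 1:
                # Sequential access: refresh the stream and run one line ahead.
                streams.pop(i)
                streams.append(line)
                prefetched.add(line + 1)
                break
        else:
            # No stream matched: start tracking a new one, evicting the oldest.
            streams.append(line)
            if len(streams) > max_streams:
                streams.pop(0)
    return hits

# Two sequential buffers accessed alternately (line numbers are arbitrary).
interleaved = []
for k in range(8):
    interleaved += [k, 1000 + k]

single = prefetch_hits(interleaved, max_streams=1)
multi = prefetch_hits(interleaved, max_streams=2)
print(single, multi)  # prints "0 12"
```

In this toy model the single-stream tracker is evicted on every access and never gets ahead of demand, while the two-stream tracker covers 12 of the 16 accesses. Real prefetchers are more sophisticated, but the same effect explains why multi-stream detection helps interleaved workloads.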

Connecting users to the web

For some entry-level device users, particularly those in developing nations that have little wired infrastructure, a mobile device is the main tool used for accessing the web. Hence a good browsing experience is essential. Figure 2 shows how browsing performance is significantly improved over the Cortex-A7 processor. A 16% boost is achieved when testing like-for-like processor configurations clocked at the same frequency, whereas a performance-optimised implementation of the Cortex-A35 processor running at 2.0GHz delivers 84% better performance than the Cortex-A7 processor running at 1.2GHz.

Figure 2. Architectural enhancements boost browsing performance.
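The two browsing figures can be related with a quick back-of-the-envelope check. The sketch below assumes, purely for illustration, that performance would scale linearly with clock frequency; the function name is hypothetical and only the 16%, 84%, 1.2GHz and 2.0GHz values come from the text.

```python
def ideal_relative_performance(same_clock_gain, new_ghz, old_ghz):
    """Relative performance if throughput scaled linearly with clock frequency."""
    return (1.0 + same_clock_gain) * (new_ghz / old_ghz)

ideal = ideal_relative_performance(0.16, new_ghz=2.0, old_ghz=1.2)
quoted = 1.84  # "84% better performance", from the text

print(f"ideal linear scaling: {ideal:.2f}x")                      # 1.93x
print(f"quoted: {quoted:.2f}x ({quoted / ideal:.0%} of ideal)")   # 1.84x (95% of ideal)
```

Under that assumption the quoted 84% is slightly below the ideal of roughly 93%. One plausible reading is that memory-bound portions of the browsing workload do not speed up with the core clock, although the article does not break this down.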

Video and gaming on the move

Other important mobile workloads, such as gaming and video or audio playback, not only depend on moving large quantities of data quickly and efficiently, but also demand high compute performance. Gaming, in particular, places heavy demand on floating-point operations to calculate movements or trajectories.

The ARMv8-A architecture features improvements in the NEON media-processing engine that improve both single-precision and double-precision floating-point performance. The NEON and floating-point pipelines are also extremely area-efficient. Figure 3 shows the improvements, relative to the Cortex-A7 processor, in the integer, floating-point and video performance that are critical for great gaming experiences. The video comparison uses the NEON engine to decode some popular video formats such as MP4. The Geekbench single-core benchmark, also shown, includes the integer, floating-point and memory-streaming tests, and confirms an overall 40% improvement for the Cortex-A35 processor compared with the Cortex-A7 processor.

Figure 3. Comparison of compute performance for video and gaming workloads.

Securing the new-user experience

Security for online financial transactions is mostly transparent to the user but essential for a positive mobile experience, and it requires constant attention and improvement. In developing countries, where large numbers of consumers do not have bank accounts and there is little wired infrastructure to support electronic payments, rapidly increasing mobile phone ownership provides the opportunity for mobile money systems to simplify access and extend the reach of banking services. In developed countries, where access to electronic payment systems is already well established, mobile payments can offer greater convenience. In both cases it is important for entry-level devices to deliver strong and secure user experiences.

The ARMv8-A architecture introduces cryptography extensions that are available in both the AArch32 and AArch64 states. These significantly accelerate the execution of cryptography algorithms. With these extensions, the Cortex-A35 processor achieves 350% faster performance when executing the SHA-1 algorithm and an 1100% boost for AES algorithm execution, as illustrated in Figure 4.

Figure 4. Improvements in cryptography algorithm execution, compared at the same clock frequency.
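Percentage gains of this size are easier to compare as speedup factors. The conversion below assumes that "N% faster" means a throughput of (1 + N/100) times the baseline; the article does not spell out its convention, so treat the results as one reading of the quoted figures. The function names are hypothetical.

```python
def speedup_factor(percent_faster):
    """Throughput multiple implied by an 'N% faster' claim (assumed convention)."""
    return 1.0 + percent_faster / 100.0

def time_saved(percent_faster):
    """Fraction of execution time removed at that speedup."""
    return 1.0 - 1.0 / speedup_factor(percent_faster)

for name, pct in (("SHA-1", 350), ("AES", 1100)):
    print(f"{name}: {speedup_factor(pct):.1f}x throughput, "
          f"{time_saved(pct):.0%} less time for the same work")
# SHA-1: 4.5x throughput, 78% less time for the same work
# AES: 12.0x throughput, 92% less time for the same work
```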

Performance boost, power savings

While increased performance is essential to deliver the user experiences expected from next-generation mobiles, designers remain under pressure to stay within the tight power budget allocated to the processor in the SoC platform. Size and cost constraints typically limit power in entry-level smartphones to under 100 to 150 mW per processor core, and careful power management is needed to maximise battery life. Today’s users do not expect to have to curb their appetites for web browsing, gaming, video snacking or other connected activities in order to manage the device’s recharge interval.

The design of the Cortex-A35 processor tackles both dynamic power consumption and idle power management. Changes to the processor’s microarchitecture, such as the enhanced pipeline, have yielded significant reductions in dynamic power. These improvements alone ensure that the Cortex-A35 processor consumes 10% less power than the Cortex-A7 processor, even after the Cortex-A7 processor’s implementation is re-baselined using the latest EDA tools and implementation methodologies. Figure 5 illustrates the combined effects of the design-flow improvements and microarchitecture enhancements, resulting in around 20% lower dynamic power compared with a historical implementation of the Cortex-A7 processor. The comparisons shown are based on Dhrystone power per core for the same target frequency implementations on 28nm process technology.

Figure 5. Power savings are achieved at the same time as performance improvements.
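The two power figures quoted above can be separated into their contributions. The sketch below treats the savings as multiplicative factors, which is how the comparison chain (historical Cortex-A7, then re-baselined Cortex-A7, then Cortex-A35) reads. The variable names are hypothetical, and the 20% figure is only approximate in the text, so the result is indicative.

```python
# Dynamic power relative to the baseline named in each comment.
microarch_factor = 0.90  # Cortex-A35 vs re-baselined Cortex-A7 (10% lower)
combined_factor = 0.80   # Cortex-A35 vs historical Cortex-A7 (around 20% lower)

# Implied effect of re-baselining the Cortex-A7 with newer tools and flows.
flow_factor = combined_factor / microarch_factor

print(f"design-flow saving: {1 - flow_factor:.0%}")  # prints "design-flow saving: 11%"
```

On this reading, roughly half of the total saving comes from implementation advances and half from the microarchitecture itself.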

In addition to providing lower dynamic power consumption, the Cortex-A35 processor incorporates three important new power-management features to maximise opportunities for saving idle power. For instance, there are now four power domains that allow independent control of power to the CPU, NEON media engine, L2 caches, and remaining top-level logic. The Cortex-A35 processor also incorporates a new standardised set of four signals called the Q-channel, which connects to an on-chip power controller, as Figure 6 shows. This simplifies the software needed to coordinate power-management modes and also enables easier control of the CPU idle states. The Cortex-A35 processor supports one Q-channel per power domain.

Figure 6. Software-independent control of retention states.

The governor block also incorporates hardware features that allow the processor to enter and exit power-saving data-retention modes with minimal latency and no software intervention. If, for example, no NEON instruction enters the pipeline for a programmer-specified time period, the NEON block enters retention by interacting only with the power controller via the Q-channel. If a NEON instruction then enters the pipeline, the NEON block quickly exits retention mode and executes the instruction.
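The retention behaviour described above amounts to an idle-timeout policy. The following sketch models it at a very coarse level, one step per cycle, purely to make the sequence concrete. The class, its fields and the cycle-based timeout are hypothetical, and the model ignores the Q-channel handshake and any wake-up latency.

```python
from dataclasses import dataclass

@dataclass
class NeonRetentionModel:
    """Toy idle-timeout model of a block that drops into retention when unused."""
    timeout_cycles: int          # stands in for the programmer-specified period
    idle_cycles: int = 0
    in_retention: bool = False
    retention_entries: int = 0

    def tick(self, neon_instruction: bool) -> bool:
        """Advance one cycle; return whether the block is in retention afterwards."""
        if neon_instruction:
            # Any NEON instruction wakes the block and restarts the idle timer.
            self.in_retention = False
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
            if not self.in_retention and self.idle_cycles >= self.timeout_cycles:
                self.in_retention = True
                self.retention_entries += 1
        return self.in_retention

model = NeonRetentionModel(timeout_cycles=3)
trace = [True, False, False, False, False, True, False]
states = [model.tick(t) for t in trace]
print(states)  # [False, False, False, True, True, False, False]
```

The point of the hardware mechanism is that software never appears in this loop: the timer, the entry into retention and the wake-up all happen without an OS call.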

Overall, the combination of targeted increases in compute performance with reduced dynamic power and enhanced power management makes the Cortex-A35 ARM’s lowest-power 64-bit CPU, capable of delivering greater performance than previous entry-targeted cores within the same power budget.

Conclusion

Delivering a processor capable of satisfying the demands of fast-growing entry-level mobile markets calls for a careful balance between performance, power and cost.

The ARM Cortex-A35 core, with dual 64-bit and 32-bit processor states, responds to these challenges by concentrating on the workloads that matter most, delivering improved mobile experiences within the tightly constrained power and price budgets of typical entry-level platforms.

Author profile:
Kinjal Dave is senior product manager at ARM