Outlook 2023: The Age of Inference


Future pervasive AI architectures for Edge and Cloud look unified and scalable.

The world of artificial intelligence and machine learning (AI/ML) is fragmented into different domains. Two of the most important splits are training versus inference and cloud versus edge. There are myriad other ways to differentiate AI/ML tasks, but these two splits are the main topics of discussion here.

AI/ML training develops models that inference uses to recognise whatever needs identifying, whether it's light versus heavy traffic on a smart city's street, the clearance level for an ID badge and matching face used for secure access control, words spoken by a telephone caller to a customer service call centre, or a handwritten address on an envelope at a postal sorting centre.

Training normally takes place in enterprise data centres or in the cloud where many high-powered servers, plenty of memory, hardware accelerators, and high-speed networking can be thrown at the workload. In this environment, tremendous amounts of electrical power for computing, networking, and cooling are used for training with the aim of finishing quickly. Inference workloads can also be performed in a data centre or the cloud, but increasingly, inference tasks are migrating to the edge, for several reasons.

First, there’s the issue of latency. It takes time to ship raw data back to the cloud or data centre. It takes more time to perform the inference, and it takes yet more time to ship the desired answer or decision back to the edge. For some real-time tasks – including factory automation, radar, and electronic warfare – decisions that take too long can be costly.

Two more reasons that inference workloads are migrating to the edge involve power: computing power and electrical power. As AI/ML inference workloads migrate to large numbers of edge devices, the aggregate computing power of millions of inference engines in those devices exceeds the computing power of a data centre's servers. In addition, each individual edge inference engine consumes only a small amount of electrical power.

Many interesting chips with new computing architectures have been announced recently to handle the unique needs of edge inference. Makers highlight the big teraFLOPS and teraOPS (TFLOPS and TOPS) figures their devices can attain at lower power consumption. While it's true that inference workloads require plenty of TFLOPS and TOPS, these specialised edge inference chips represent a one-way architectural street, which may prove to be an undesirable route when considering combined training and inference workloads.

Today, AI/ML model training workloads largely run on high-powered CPUs and GPUs in data centres, where they draw large amounts of power and rely on advanced cooling to perform the many trillions of calculations needed to train AI/ML models. Such training almost universally employs floating-point data formats with high dynamic range to maximise model accuracy by allowing tiny incremental adjustments to model weights. Floating-point computations consume more power and therefore require additional cooling. In addition, CPUs and GPUs expend considerable amounts of power moving large training data sets between memory and their internal computing elements.

Most edge inference chips cannot afford the silicon or the power consumption to perform all calculations using full-precision, floating-point data formats. Many make compromises to attain high peak TFLOPS and TOPS metrics, often by employing data types with less precision to represent AI/ML weights, activations, and data. Vendors of edge AI/ML chips provide software tools that reduce the precision of the trained model weights, converting models to smaller number formats such as FP8, scaled integers, or even binary data formats. Each of these smaller data formats delivers advantages for edge inference workloads, but all of them sacrifice some model accuracy. Retraining AI/ML models at the reduced precision can often reclaim some of that accuracy.
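As a rough illustration of what such conversion tools do, the sketch below applies post-training quantisation of FP32 weights to INT8 in PyTorch, assuming simple symmetric per-tensor scaling; the helper names and the random stand-in weights are purely illustrative, not any particular vendor's toolchain.

```python
import torch

def quantise_weights_int8(w: torch.Tensor):
    """Symmetric, per-tensor post-training quantisation of FP32 weights to INT8.

    Returns the INT8 tensor plus the scale needed to dequantise. A minimal
    sketch: production tools add per-channel scales, activation calibration,
    and optional quantisation-aware retraining.
    """
    scale = w.abs().max() / 127.0                      # map the largest magnitude to 127
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP32 approximation of the original weights."""
    return q.to(torch.float32) * scale

# Quantise one layer's weights and measure the rounding error it introduces.
w = torch.randn(256, 256)                              # stand-in for trained FP32 weights
q, scale = quantise_weights_int8(w)
print("mean absolute quantisation error:", (w - dequantise(q, scale)).abs().mean().item())
```

The rounding error printed here is the accuracy these formats trade away; quantisation-aware retraining nudges the weights so the model tolerates that error better.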

Now imagine a scalable device architecture that can be deployed both in small, embedded edge devices and in larger devices in the data centre that aggregate many workloads. The same optimisations that improve power consumption and cost efficiency at the edge also make compute in the data centre denser and more cost-efficient, which lowers the facility's capital and operating expenses for both inference and training.

AI/ML accelerator scalable architectures that support both full- and reduced-precision floating point formats break down the artificial boundary between training and inference and enable the deployment of the same standard and familiar software tools for a unified architecture. These efficient edge AI accelerators employ architectural innovations such as dataflow and on-chip broadcast networks that permit data fetched from external memory to be reused many times once brought on-chip.
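One practical consequence of hardware that handles both full- and reduced-precision floating point is that the same framework code can serve both regimes. Below is a minimal PyTorch sketch of that idea, using bfloat16 autocast purely as a stand-in for whatever reduced-precision format a given accelerator supports; the toy model and data are illustrative.

```python
import torch
import torch.nn as nn

# Toy model and data; the point is that one code path covers both
# full-precision and reduced-precision execution.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))

# Training step in reduced precision (bfloat16 here as an example format).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()

# Inference with the very same weights and tools, in full precision this time.
with torch.no_grad():
    preds = model(x).argmax(dim=-1)
```

The point is not the toy model but the absence of a separate toolchain: switching precision is a context-manager change, not a different software stack.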

There are real applications where a unified, scalable dataflow architecture for machine learning breaks down the wall between the distinct phases of training and inference. Federated learning is one such example, and it unlocks new types of AI/ML workloads. For many connected applications, federated learning can supplant the one-way-street approach of reduced-precision inference models derived from one-time offline training, and it can unlock performance that would otherwise be difficult to achieve because representative, centralised offline training sets are unavailable.

Federated learning exploits an important characteristic of inference at the edge, where devices are exposed to many diverse inputs that range far beyond the original model training sets. If properly designed, these edge devices can learn from these additional inputs and further improve their model accuracy while deployed. Hundreds, thousands, or millions of edge devices can all be improving the same AI/ML models to provide better local answers or decisions.
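A minimal sketch of federated averaging, one common federated-learning recipe, shows the mechanics: each edge device fine-tunes a copy of the shared model on its own local data, and only the resulting weights, never the raw data, are returned and averaged. The model, data, and single round below are illustrative stand-ins.

```python
import copy
import torch
import torch.nn as nn

def local_update(global_model: nn.Module, local_x, local_y, lr=1e-2, steps=5):
    """One edge device: fine-tune a copy of the global model on local data only."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(local_x), local_y).backward()
        opt.step()
    return model.state_dict()                      # only weights leave the device

def federated_average(global_model: nn.Module, device_states):
    """Server: average the returned weights; the raw edge data is never seen."""
    avg = copy.deepcopy(device_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in device_states]).mean(dim=0)
    global_model.load_state_dict(avg)

# One federated round with three simulated edge devices.
global_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
states = []
for _ in range(3):
    x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))   # each device's private data
    states.append(local_update(global_model, x, y))
federated_average(global_model, states)
```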

For example, consider CT or MRI scanners made by the same vendor, distributed in hospitals around the world. These imaging devices are often tasked with finding cancer tumours and other problems and can increasingly use AI/ML models to help radiologists identify suspect tissues. As each machine in the field improves its model, the original trained model that’s being used to initialise new imaging equipment can benefit from the same improvements if federated learning is employed to update and improve the original model.

Such updates can be performed in a way that ensures that only the insights gained through the additional edge-based training are shared, and not an individual’s private data. All fielded machines can benefit from this additional training without compromising privacy. Federated learning has wide applicability in privacy-preserving device personalisation, where the performance of vision and speech algorithms can be tailored to specific users. It also has applications in network security, where the collective learning of network ingress nodes can be used to discover proactive security rules without sharing sensitive private network traffic.

The benefit of a unified cloud and edge compute architecture is that a model can be logically split to run partly in the cloud and partly on the edge using identical software binaries. The unified architecture ensures that compatible data formats are used and that optimisations for data formats, such as sparsity representations, don't break between cloud and edge. A scalable, unified architecture with continual learning throughout the lifetime of a deployed application departs from today's conventional training and inference practice, which relies on CPUs and GPUs in the data centre and specialised devices at the edge. Yet this unified approach seems the most logical path if the industry wants to make large gains in performance, accuracy, and power efficiency as AI/ML becomes pervasive.
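As an illustration of the kind of logical split described above, the sketch below cuts a simple network at a layer boundary so that the first half could run on an edge device and the second half in the cloud, with only the intermediate activations crossing the boundary. The model and the split point are illustrative, not a prescription for any particular product.

```python
import torch
import torch.nn as nn

# The full model, trained once with a single toolchain.
full_model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),                       # intended for the edge device
    nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10),  # intended for the cloud
)

# Logical split at a layer boundary; both halves keep the same weights,
# data formats, and framework code as the unsplit model.
edge_part = full_model[:2]       # layers 0-1
cloud_part = full_model[2:]      # layers 2-4

x = torch.randn(1, 64)
activation = edge_part(x)        # computed locally, shipped over the network
logits = cloud_part(activation)  # completed remotely

# Sanity check: the split pipeline matches the unsplit model exactly.
assert torch.allclose(logits, full_model(x))
```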

Author details: Ivo Bolsens, Senior VP, AMD