Meeting the AI/ML design challenge


Kedar Patankar examines the physical design considerations that need to be taken into account when designing AI/ML applications.

Symmetric and asymmetric multiprocessor platforms continue to proliferate, driven by the rapid growth of AI/ML applications for voice, image, text and object recognition. A revolution is underway in software development and integration methodologies, IDEs and the specialised IP that supports them, resulting in an explosion of innovation across the technology spectrum.

But one cascading effect has received scant attention to date: how does all of this affect physical design at the silicon level?

The greatest influence on the physical design portion of a heterogeneous multiprocessing (HMP) chip development is deciding what should be done in hardware versus what should be abstracted to software. This is not exclusively a power issue, even in edge applications – for instance, there are circumstances where cycle budget limitations favour the use of hardware accelerators.

Even so, the very nature of AI/ML applications generally precludes using hardware acceleration to any great extent. AI apps are very dynamic. An edge application’s data inputs are gathered for the most part from sensors in the audio, voice, image or video realm, and multimedia processing largely favours the use of programmable architectures.

It can be tempting to implement the AI processor in fixed-function hardware once training is completed. The sequentially layered computations of a neural network lend themselves to pipelining and parallelising, which will maximise both speed and performance/watt, especially if implemented as a hardened core. However, AI models need periodic retraining, which pushes a design team towards a programmable implementation.
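
To make the pipelining point concrete, here is a minimal sketch (Python/NumPy, with invented layer sizes) of the strictly sequential layer-by-layer computation described above; in a hardened core, each layer would become a pipeline stage working on a new sample while the following stage finishes the previous one.

```python
import numpy as np

# Toy three-layer network: each layer is a matrix multiply plus a
# nonlinearity. The layer order is strictly sequential per sample, but
# in a hardened pipeline each stage is independent hardware, so stage k
# can process sample n while stage k+1 processes sample n-1.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) for _ in range(3)]

def forward(x):
    for w in weights:               # each loop body is a candidate pipeline stage
        x = np.maximum(w @ x, 0.0)  # ReLU keeps the example simple
    return x

print(forward(rng.standard_normal(64))[:4])
```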

As soon as an AI/ML application exceeds even a modest level of complexity, the need for processors with specialised architectures becomes inescapable. DSPs are frequently employed for low or midrange video, voice and audio apps and are commonly seen in voice and object recognition applications. Many designs use embedded GPUs for HD video, graphics and the 3D vector processing/tensor calculations that are integral to neural network operations. Facial recognition, image recognition and automotive collision avoidance applications also commonly utilise GPUs.

The rapid growth of the AI/ML market has provoked an aggressive repositioning of existing processor architectures. It has also stimulated innovations in processing hardware, as a variety of chip vendors and systems houses, recognising the limitations of GPUs in AI applications, have developed ASICs and programmable devices tuned to tensor processing.

By reducing arithmetic resolution to 8-bit and 16-bit, these tensor processors are more efficient in power consumption and memory resource utilisation at the same or higher level of performance than their GPU counterparts.
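
As a rough illustration of where those savings come from, the sketch below (Python/NumPy; the symmetric scaling scheme is just one common choice) quantises FP32 weights to int8, cutting storage fourfold while keeping values approximately intact. Fixed-point multiply-accumulates are also markedly cheaper in power and area than FP32 ones.

```python
import numpy as np

# Symmetric linear quantisation of FP32 weights to int8 (illustrative).
w = np.random.default_rng(1).standard_normal(1000).astype(np.float32)
scale = np.abs(w).max() / 127.0          # map the largest magnitude to 127
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale     # dequantise to measure the error

print(f"bytes: {w.nbytes} -> {q.nbytes}, "
      f"worst-case error: {np.abs(w - w_hat).max():.4f}")
```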

However, when one examines AI/ML applications more broadly, employing more than one processor core is necessary for all but the most rudimentary of designs. Viewed in basic terms, an AI device receives input data of high complexity, high volume or both, processes that data through a neural network, and finally derives from that numerical analysis a conclusion or course of action. But if we then delve into the details of this operating routine, the need for more than one programmable processor quickly becomes apparent.

This is especially true of Cloud applications but is also becoming increasingly necessary for edge devices as data processing capabilities are steadily being decentralised and forced closer to the data collection point.

Symmetric multiprocessing systems are sometimes employed in AI/ML designs, using feature-rich CPUs that have been enhanced with arithmetic blocks. There are some important advantages to designing a multiprocessor chip this way. The toolset will be identical for each core, there will be only one processor architecture license, and your software engineers will be able to focus their efforts on writing code and managing threads in a more uniform processing environment.

But often enough symmetric multiprocessing falls short of an AI/ML chip’s needs. There are many applications where a CPU architecture – primarily suited for control functions and complex algorithms – is a poor fit, even with applicable extensions. This inefficiency negatively impacts performance requirements and power budgets.

Implications

An asymmetric multiprocessor design can present some daunting challenges to a physical design team. Placement of processor blocks is a primary consideration. Data is often received in continuous high-volume streams. A system’s reaction to processing that data must often be deterministic and in real time, so block placement must support the nature of the data as well as the design’s processing objectives. Physical designers are also confronted with thorny problems in regard to latencies, timing delays, routing congestion and parasitics from the speed, volume and bandwidth of data.

Of course, each processor block will have its own supporting memories (both for operations and configuration firmware) as well as requiring system memory access. This complicates placement as well as consuming silicon real estate – an especially important consideration for edge devices, which are typically cost-sensitive.

Memories providing cache support can be a source of other general problems. Communication between caches will consume bus resources and can cause routing, timing and even crosstalk issues. To complicate matters, physical designers cannot freely move cache memories around the die, as these caches have communication responsibilities both with the processors they support and with other memory elements with which they are involved in data transactions and coherency obligations.

There are physical design issues unique to every type of processor. For example, GPUs and DSPs are vector processors that are good at parallelising and are frequently employed in asymmetric multiprocessor chips for AI/ML applications. Nonetheless, they have their own specific limitations, particularly in terms of memory and bandwidth, where inefficiencies and latencies manifest themselves. Their memory hierarchies in particular can be inflexible – a negative outgrowth of their VLIW and/or SIMD architectures.

There will be communication blocks as well – Ethernet, PCIe, USB, DDR4 and various wireless circuits for edge apps. Along with the potpourri of I/O typical of AI/ML chips, the analogue signalling portions of the silicon will create problems at chip, package and board level for the PD team. Because of the nature of the data and how it's collected, edge AI/ML chips normally sit right at the analogue/digital boundary.

And finally, power consumption and management will create dilemmas for physical designers, edge applications again being the most sensitive. The asymmetric nature of the processors in an AI/ML chip also poses difficulties in both power routing and distribution.

There are, of course, some basic advantages to programmable approaches. You can transfer multiple hardwired accelerators to a software abstraction. This can result in an overall smaller die, lower power and higher performance, though results vary case by case. Another potential advantage is that by abstracting to software, your design effort can get an early start. Software would then drive hardware definition and design, with consequent advantages to performance. Such a design process will require close coordination between software and hardware development teams, with a significant amount of iteration.

Consequences

What is required in the overall development plan is a thoughtful approach that considers the physical design phase of the chip’s development in advance. The basic elements of such a methodology are as follows:

  • Partition standard algorithmic functions from variable ones. There are algorithms even in multimedia software that can be hardwired, so this first step requires comprehensive deliberation.
  • Determine whether the per-algorithm variability is limited enough that it can be covered dynamically through a limited set of fixed logic additions or is unpredictable enough to require full abstraction to a congruent programmable processor architecture.
  • Categorise the digital nature of the remaining variable algorithms (bit vs word, data type, combinatorial, sequential, arithmetic, scalar, vector or tensor).
  • Filter the above through logic design requirements (cycle budget, timing, performance) and physical design limitations (area, power, memory).
  • Select the programmable architectures that are the best fit for the above (a toy sketch of this screening process follows the list).
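
As a purely illustrative sketch of the screening steps above (Python; every descriptor field, threshold and target in it is invented for the example), the partitioning decision can be pictured as a table of algorithm descriptors filtered against design limits:

```python
# Toy descriptors for candidate functions. In practice these come from
# profiling and architectural analysis, and thresholds are design-specific.
candidates = [
    {"name": "fir_filter",   "variable": False, "data": "fixed-point vector", "cycles": 2e5},
    {"name": "cnn_backbone", "variable": True,  "data": "int8 tensor",        "cycles": 5e8},
    {"name": "beamformer",   "variable": True,  "data": "fixed-point vector", "cycles": 8e6},
]

CYCLE_BUDGET = 1e7  # invented per-frame cycle budget

for c in candidates:
    if not c["variable"]:
        target = "fixed-function logic"        # step 1: hardwire stable algorithms
    elif "tensor" in c["data"] or c["cycles"] > CYCLE_BUDGET:
        target = "tensor/vector processor"     # steps 3-5: match data type and budget
    else:
        target = "DSP"
    print(f"{c['name']:13s} -> {target}")
```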

Naturally, the above methodology will be iterative, as it will demand multiple cycles of planning and reflection before the design will finally settle into an acceptable form.

Since HMP Edge AI chip designs are, arguably, the most complex design projects in microelectronics today, this level of effort is essential to ensure success in the development project.

Once the processing contents of the chip are determined, however, the hardest portion of physical design work begins. Block placement makes the most sense when it reflects the actual flow of data. Routing then becomes the primary problem.

A hierarchical topology with local and global routing layers addresses issues of congestion while providing fixed dimensions where RCL parasitics become more uniform and can be extracted more accurately. Such a topology also lends itself to implementation of a mesh network managed by an appropriately tailored OS, allowing timely external memory access for circuit blocks more distant from I/O. Though this does impose an extra workload on the software developers to customise OS drivers and abstraction layers, it also reinforces the need for a comprehensive development methodology which feeds physical design requirements into the development flow of the chip hardware and software design teams.

For AI applications on the edge, power understandably receives singular attention. The potential problems power can present in Edge AI applications are both budgetary and thermal.

Edge devices are limited in power consumption by their environment, especially if that power depends on batteries. However, thermal dissipation is also an important factor, as the cost-effective packaging must be able to dissipate heat sufficiently to protect device reliability and functionality.

Designing for power has been a principal issue at both the chip and system level since battery-powered laptops began proliferating in the 1980s. Different capabilities and methodologies have been implemented, refined and augmented over the decades. Process engineering has played its part with the deployment of finFET and FD-SOI nodes, which have dramatically reduced the contribution of static leakage current to overall power. These features have also changed physical design rules in favour of packing circuitry more densely, driving improvements in performance through decreased capacitance parasitics and antenna effects. The 'downstream' effects of this are evident in the surge in growth of IoT and Edge AI/ML applications, greatly facilitated by these process extensions of Moore's Law.

But there are also various practices and their associated mechanisms in widespread use for minimising power draw. At the RTL level, there are standard cell libraries offering cells with different threshold voltages, allowing designers to select high-speed cells for critical paths while using lower-power cells for the roughly 70-90% of the design that is less performance-driven. There are also methods for dynamically managing power at the block or gate level, including ways to power down unused or lightly used memory partitions. The power consumption of system clocks can be managed through clock gating, which dynamically enables or disables the clock to registers in the clock tree based on activity. This can be done through manual insertion of specialised library elements as control logic in a clock tree, or through specialised tools which partially automate the effort.
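
The lever behind both multi-Vt cell selection and clock gating is the first-order CMOS dynamic power relation (a standard approximation; the numbers below are purely illustrative):

$$P_{\mathrm{dyn}} = \alpha \, C \, V_{DD}^{2} \, f$$

where $\alpha$ is the switching activity factor, $C$ the switched capacitance, $V_{DD}$ the supply voltage and $f$ the clock frequency. Clock gating attacks $\alpha$ directly by stopping the clock toggling into idle registers: if gating cuts a block's effective $\alpha$ from 0.15 to 0.05, that block's dynamic power falls by roughly two-thirds with $C$, $V_{DD}$ and $f$ unchanged.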

The EDA sector has been active as well, implementing options in its tools to reduce power at the synthesis and P&R development stages. P&R tools also allow high-frequency signals to be routed higher up the metal stack to reduce capacitance, maintaining performance at lower power expenditure while minimising a source of signal integrity problems.

Furthermore, vendor tools can analyse signal activity in a block and optimise power on that basis.

Techniques for optimising a neural network itself should not be overlooked, as they offer essential contributions to improving power, performance and area/cost. Even simple neural networks have multiple computational layers. Their computations tend to be lower resolution and can be reduced from floating point to fixed point. Pooling layers of various kinds are frequently used after convolutional layers to condense the network's feature maps and reduce the downstream calculation load.
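
A minimal sketch (Python/NumPy, invented dimensions) of what a single 2×2, stride-2 max-pooling stage does to the downstream calculation load:

```python
import numpy as np

# A 2x2, stride-2 max pool quarters the feature map, so every layer
# downstream of it sees (and computes on) 4x less data.
fmap = np.random.default_rng(2).standard_normal((32, 32))
pooled = fmap.reshape(16, 2, 16, 2).max(axis=(1, 3))  # 2x2 max pooling

print(fmap.shape, "->", pooled.shape)                 # (32, 32) -> (16, 16)
print(f"elements per map: {fmap.size} -> {pooled.size} (4x reduction)")
```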

A notable open-source contribution to physical design efforts to manage power, performance and size trade-offs in AI applications is the Nvidia Deep Learning Accelerator (NVDLA) architecture. Dedicated primarily to the implementation of networks for inferencing, the NVDLA offering recognises the primary functions of DL and breaks them down into convolutions, activations, pooling and normalisation. Each of these functions is supported by specialised modules, with the addition of a memory module for tensor reshaping. Using these building blocks, designers can either create custom neural network models or convert existing ones so that they work with the processor's dedicated embedded software stack. This building-block approach also lends itself readily to exploiting the predictable memory access patterns and parallelisation of computations inherent in DL networks, and does so while maintaining efficiency and scalability.
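
To illustrate the building-block idea (a schematic sketch only; it does not reflect NVDLA's actual software stack or interfaces), a network can be expressed as a sequence of the primitive operations listed above, each dispatched to its dedicated module:

```python
# Hypothetical dispatch of layers onto NVDLA-style functional units.
# All names here are invented for illustration; the real NVDLA project
# defines its own hardware interfaces and software stack.
PRIMITIVES = {"conv", "activation", "pooling", "normalisation", "reshape"}

network = [
    ("conv",          {"kernel": 3, "channels": 32}),
    ("activation",    {"fn": "relu"}),
    ("pooling",       {"kind": "max", "size": 2}),
    ("normalisation", {"kind": "lrn"}),
    ("reshape",       {"to": (1, -1)}),  # handled by the memory module
]

for op, params in network:
    assert op in PRIMITIVES, f"{op} must be decomposed into primitives"
    print(f"dispatch {op:13s} -> dedicated {op} module, params={params}")
```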

The asymmetric multiprocessing nature of Edge AI chips, however, demands further advances in power management. Power on an HMP AI chip not only needs to be distributed, but also sequenced and managed, based on which processors and their associated memories are operating at any given instant. It logically follows that the concept of power islands at the system level is being employed at the chip level, allowing regional supervision of power sequencing, switching, monitoring, clocking and resetting. By using power islands, regional power demands can be controlled on an activity basis, reducing transient effects on factors such as slew rates and parasitics. IR drops are also minimised by regional power distribution through reducing trace lengths to the point of load. Moreover, the safety margin reductions made inevitable by today's low on-chip voltages are more capably managed at a regional level.
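
The regional supervision described above tends to follow a well-known bring-up order: isolate outputs, enable the power switches, wait for power-good, then release isolation, reset and clock gating. A toy sequencing sketch follows (Python; the step names are generic, and a real controller is an RTL state machine waiting on sense signals, not software):

```python
# Toy model of per-island power-up sequencing; power-down runs roughly
# the inverse order. Real controllers are hardware state machines.
BRING_UP = [
    "assert isolation clamps",   # neighbours see stable levels during ramp
    "enable power switches",
    "wait for power-good",
    "release isolation clamps",
    "release reset",
    "ungate clock",
]

def power_up(island: str):
    for step in BRING_UP:
        print(f"[{island}] {step}")

power_up("npu_island")
```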

Theoretically, if software developers had power information (models or other factors) supported in their processor toolsets, they could code not just with performance in mind, but power as well. This would involve a 'cross-pollination' of sorts between IDEs and EDA tools which, to the author's knowledge, has not yet been seriously explored.

Assurances

Once a high performance, low power and low cost HMP Edge AI design is architected and built, a development team needs to verify that the design is truly doing what it is intended to do. Verification of such complex designs presents its own special dilemmas, however.

The central component of such chips is the AI block. It is simply impossible to verify these engines exhaustively by conventional means: despite the optimisations mentioned in the previous section, which simplify and condense the computational load, networks very quickly reach a level of complexity that precludes complete verification in a reasonable time period.

Model checking has been attempted but has also proven to be inadequate as an approach. Unless an AI network is extremely simple, the number of possible computational comparisons that have to be made between the model and the test object soon overwhelms even the most dedicated test and verification engineering teams.

Equivalence checking, by contrast, has been demonstrated to be a workable alternative. Verification engineers can compare a gate level netlist against an RTL abstraction or RTL against a higher-level abstraction (C, C++, Python or, depending on the network and how it was developed, other possible languages), and there are EDA tools that support this alternative.
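
Formal equivalence checking is the province of dedicated EDA tools, but the underlying comparison can be sketched with a toy dynamic cross-check between a floating-point reference model and a fixed-point 'implementation' model (Python/NumPy, illustrative only: random vectors demonstrate the idea but prove nothing formally):

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal((8, 8)).astype(np.float32)

def reference(x):
    # High-level model (the C/C++/Python abstraction in the flow above).
    return np.maximum(w @ x, 0.0)

SCALE = 256.0                  # Q8 fixed point, chosen for the example
wq = np.round(w * SCALE)

def implementation(x):
    # Stand-in for the RTL-level fixed-point datapath.
    acc = wq @ np.round(x * SCALE)          # integer multiply-accumulates
    return np.maximum(acc / (SCALE * SCALE), 0.0)

worst = max(np.abs(reference(x) - implementation(x)).max()
            for x in rng.standard_normal((1000, 8)))
print(f"worst-case mismatch over 1000 vectors: {worst:.5f}")
```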

Since HMP AI Edge chips are rigorously developed to meet stringent power specifications, power verification is a crucial step in the design verification process. EDA tools for power verification check that power control is functioning correctly and that blocks are isolated as intended (using specialised cells such as power-gating cells, level shifters and so forth). Since these additional power control/management cells are included in the design, a logic equivalence check is also required to ensure that the cells have been placed properly. These checks are run on the appropriate netlists after synthesis and layout, as well as after test insertion.

Conclusions

As we have seen, the complexity of the design is amplified by the necessity for significant programmability, as AI/ML devices must be amenable to supporting not just inferencing, but training and periodic re-training. Supporting circuitry such as memories and communication interfaces compound the intricacy of the project, making even basic AI/ML chip design efforts analogous to an SoC development.

Combined with parasitic effects at the chip, package and PCB level that arise from the necessary analogue components and high-bandwidth signalling, along with verification problems unique to AI designs and the criticality of power management for Edge applications, even the most skilled physical design teams find themselves heavily challenged by AI/ML designs.

Author details: Kedar Patankar is CTO, P2F Semi