Microcontrollers learn to embrace neural networks


A decade on from its debut running on high-end servers, deep learning is making its way to far more constrained systems out on the edge, though often with the help of significant amounts of pruning and tuning.

A big motivation for bringing machine learning into microcontrollers and the processors in edge devices lies in concerns about offloading so much processing to the cloud.

“There are some issues with having all the ML processing primarily happening in the cloud,” says Dhiraj Sogani, senior director of wireless product marketing at Silicon Labs. “One is latency: it takes longer to get a response from a remote server. And with a wireless connection, if the network itself is down you don’t want to have the system increasing the transmit power just trying to find a router that isn’t there.”

Even with a good network connection, the amount of bandwidth needed to pass raw data to a remote server may be too expensive in terms of access charges as well as transmission energy.

“There may be no other way to process data than on the edge,” said Silicon Labs product manager Tamas Daranyi at the company’s recent Works With conference.

A further reason for local AI is the perception of privacy: people balk at the idea that their home devices are only too willing to upload even fragments of their conversations to a cloud server. Corporate managers also worry that data uploaded to the cloud may reveal patterns in their operations they would prefer to keep secret.

“So, a lot more people want to do edge processing now,” Sogani says.

Ali Osman Örs, director of edge AI and machine learning strategy and technologies at NXP Semiconductors, sees a number of applications for machine learning in microcontroller-grade hardware, including speech recognition and text to speech, as well as other time-series inputs “such as vibration, temperature or pressure for machine health and predictive maintenance use cases”.

Microcontroller-based AI can extend to vision-based services, ranging from presence detection to face recognition.

“Of course, these vision-based applications are lower resolution or lower frame rate compared to similar applications run on higher compute-capable processors,” Örs adds.

A game changer

Yann Le Faou, director for machine learning at the edge for Microchip, claims, “Machine learning is going to change the game, I'm convinced of that. The main thing that people are looking for with machine learning is to lower their development time and develop applications faster. Say you have an application that focuses on recognising a sound such as glass breaking but marketing comes and says, ‘I want it to detect a baby crying’. With machine learning you can do that by just adding a new classification. That's where people are seeing the benefit.”
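Le Faou’s point can be sketched in a few lines: if the DSP front end stays fixed, supporting a new sound class is largely a matter of widening and retraining the small classifier head. The feature extractor, training loop and class labels below are hypothetical stand-ins, not Microchip’s tooling:

```python
import numpy as np

# Illustrative sketch only: a fixed DSP front end feeding a small
# softmax classifier head. Adding a new sound class means widening
# and retraining the head, not redesigning the signal chain.

rng = np.random.default_rng(0)

def extract_features(audio_window):
    """Stand-in for a fixed front end: 8 coarse log-energy bands."""
    spectrum = np.abs(np.fft.rfft(audio_window))[1:129]  # drop DC bin
    bands = spectrum.reshape(8, -1).mean(axis=1)
    return np.log(bands + 1e-6)

def train_head(features, labels, n_classes, lr=0.1, epochs=200):
    """Retrain the softmax classification head from scratch."""
    weights = np.zeros((features.shape[1], n_classes))
    for _ in range(epochs):
        logits = features @ weights
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        grad = features.T @ (p - np.eye(n_classes)[labels]) / len(labels)
        weights -= lr * grad
    return weights

# Today: two classes, background noise vs. glass breaking.
X = np.stack([extract_features(rng.standard_normal(256)) for _ in range(100)])
head = train_head(X, rng.integers(0, 2, 100), n_classes=2)

# Marketing asks for "baby crying": same features, same training code,
# the head simply grows by one column for the third label.
head = train_head(X, rng.integers(0, 3, 100), n_classes=3)
print(head.shape)  # (8, 3)
```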

As machine learning becomes more important, some chipmakers have begun to incorporate accelerators directly into their 32-bit microcontrollers. Earlier this autumn, NXP put an accelerator into the top end of its MCX family: a range of devices based on the Cortex-M architecture intended as a unified follow-on to its longstanding LPC family and the Kinetis range that arrived with the acquisition of Freescale.

In the spring, Silicon Labs added machine learning acceleration to its EFR32, an SoC that combines a Cortex-M33 with interfaces for a variety of wireless protocols aimed at home automation.

“Moving forward, we think machine learning will be a key contributor. So, for most of our chips we will be looking at machine-learning integration,” says Sogani.

On-chip acceleration fills a gap below the dedicated accelerators that tend to provide more than 2 teraoperations per second (TOPS), according to Örs. Typically, he says, the acceleration needed in the MCU applications NXP is targeting falls in the gigaoperations per second (GOPS) range. That can be supported either by adding general-purpose cores or by a dedicated accelerator. In NXP’s case, throughput on the matrix operations used in neural networks is 30 times that of a Cortex-M core. “Where there is a clear machine learning application, it is more efficient to leverage the NPU than to select an MCU with increased CPU capabilities.”
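A quick back-of-envelope calculation shows why these workloads land in the GOPS rather than TOPS range. The layer shapes and frame rate below are hypothetical illustrations, not figures from NXP:

```python
# Rough arithmetic for a small keyword-spotting-style CNN running
# on a 49x10 audio spectrogram. All shapes and rates are invented
# for illustration.

def conv2d_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for one stride-1 convolutional layer."""
    return h * w * c_in * c_out * k * k

layers = [
    conv2d_macs(49, 10, 1, 64, 3),
    conv2d_macs(49, 10, 64, 64, 3),
    conv2d_macs(49, 10, 64, 64, 3),
]
macs_per_inference = sum(layers)

inferences_per_second = 25   # audio frames scored per second
ops = 2 * macs_per_inference * inferences_per_second  # 1 MAC = 2 ops

print(f"{macs_per_inference / 1e6:.1f} M MACs per inference")
print(f"{ops / 1e9:.2f} GOPS sustained")  # well under 1 TOPS
```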

So far, Microchip Technology is one of the large microcontroller suppliers that has held back from adding on-chip machine-learning acceleration, though the company is looking at options for future acceleration, which may include an internally developed machine-learning architecture as well as Arm’s Ethos coprocessor design for the newer Cortex-M processors.

Le Faou says there is plenty of scope for machine learning on conventional processors, particularly with the arrival of frameworks from software companies such as Edge Impulse and SensiML that focus on low resource usage, and with the use of feature engineering to reduce the amount of data the neural network has to handle.
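As a sketch of what that feature-engineering step looks like in practice (the specific features chosen here are illustrative, not taken from any vendor’s toolchain), a raw vibration window can be collapsed into a handful of statistics before anything reaches the network:

```python
import numpy as np

# Illustrative feature engineering: the network sees 6 descriptive
# values per window instead of 1,024 raw samples.

def vibration_features(window, fs=8000.0):
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    zero_crossings = np.diff(np.signbit(window).astype(np.int8)) != 0
    return np.array([
        np.sqrt(np.mean(window ** 2)),                      # RMS level
        np.max(np.abs(window)),                             # peak amplitude
        np.mean(window ** 4) / np.mean(window ** 2) ** 2,   # kurtosis-like
        freqs[np.argmax(spectrum[1:]) + 1],                 # dominant frequency
        spectrum[1:].sum(),                                 # broadband energy
        zero_crossings.mean(),                              # zero-crossing rate
    ])

raw = np.random.default_rng(1).standard_normal(1024)  # one sensor window
print(vibration_features(raw))  # 1,024 samples -> 6 features
```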

He notes that there are many applications around predictive maintenance where the processing may well be tightly integrated with the control loops because they need to handle the same data rather than using different sensors. “For predictive maintenance with motor control, there is often important information in the voltage, torque and other signals that are already being used by the control loop. AI is one piece of the application and not the whole thing, so you will have motor control and anomaly detection running on the same chip.”
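A minimal sketch of that pattern is below: the anomaly detector consumes a signal the control loop already computes, rather than reading a separate sensor. The detector, threshold and simulated fault are all illustrative:

```python
import numpy as np

class RunningAnomalyDetector:
    """Flag samples far from the running mean, via Welford's algorithm."""
    def __init__(self, threshold=4.0, warmup=30):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold, self.warmup = threshold, warmup

    def update(self, x):
        # Fold the new sample into the running mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if self.n < self.warmup:
            return False
        std = (self.m2 / (self.n - 1)) ** 0.5
        return abs(x - self.mean) > self.threshold * std

rng = np.random.default_rng(1)
detector = RunningAnomalyDetector()
for step in range(1000):
    # Stand-in for a torque estimate the control loop computes anyway.
    torque = 1.0 + 0.01 * rng.standard_normal()
    if step > 800:
        torque += 0.2   # injected fault: sustained torque offset
    if detector.update(torque):
        print("anomaly flagged at step", step)
        break
```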

Transformer focus

The nature of machine-learning algorithms and the acceleration at the edge to support them may change over time as they follow in the footsteps of much more compute-intensive systems. Take the Transformer, which has become the focus for high-performance AI in data centres. Given that it lies at the heart of models that store billions of parameters, this may seem an unusual choice for memory-constrained MCUs and processors. But the Transformer has turned up in several experiments that indicate it can perform well even with a small set of parameters and, crucially, better than convolutional neural networks (CNNs) and the recurrent neural networks (RNNs) used for audio-recognition systems.

If machine learning evolves to use these more complex architectures, that may put more emphasis on the use of more specialised or even custom operators. Though the Transformer still leans heavily on the matrix-matrix operations that lie at the heart of most accelerators optimised for CNN and RNN work, it also involves more complex memory accesses and data manipulations, such as the soft-max operation, which requires a relatively expensive exponential for every element it normalises.

Accelerators that offer fine-grained control over their processing, or that integrate tightly with the host microcontroller’s pipeline so that host code can run an approximated form of soft-max, would have the benefit of supporting these new operators more easily.
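As a sketch of what such an approximation might look like (the base-2 lookup scheme and table size here are illustrative, not drawn from any shipping accelerator), exp(x) can be rewritten as a power of two, with the fractional part served from a small table instead of a full exponential:

```python
import numpy as np

# Illustrative approximated soft-max: replace exp(x) with
# 2**(x * log2(e)) and look up the fractional power of two in a
# 32-entry table, avoiding a true exponential per element.

LOG2E = 1.4426950408889634
TABLE = 2.0 ** (np.arange(32) / 32.0)   # 2**(f/32) for f = 0..31

def approx_exp(x):
    t = x * LOG2E
    i = np.floor(t).astype(int)            # integer part -> binary exponent
    frac_idx = ((t - i) * 32).astype(int)  # fractional part -> table index
    return np.ldexp(TABLE[frac_idx], i)    # TABLE[f] * 2**i

def approx_softmax(logits):
    z = logits - logits.max()              # the usual stability shift
    e = approx_exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.3])
exact = np.exp(logits - logits.max())
exact /= exact.sum()
# Error is bounded by the table quantisation, roughly 2% worst case.
print(np.max(np.abs(approx_softmax(logits) - exact)))
```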

Work published by ETH Zurich last year suggested that vector accelerators added to existing Arm and RISC-V processor pipelines would work reasonably well for this kind of processing, though new instructions tuned for some of the data manipulations needed may help boost performance.

“Transformers are definitely an area of research that is showing a lot of promise. In the near-to-mid future, we don’t currently expect standard CNNs and RNNs to be displaced. Already the trend on best in class performing models is combining transformers and CNNs,” Örs explains, but adds, “It is important to distinguish state of the art models and techniques in research from what is practical and stable for product deployment. Products require a level of maturity, and currently we feel that making significant hardware architectural changes to align more with Transformers would be too early.”

Assuming machine learning continues on its current trajectory, you can expect a lot more evolution in both hardware and software as designers working with microcontrollers look to incorporate increasingly sophisticated models.