
Voice interfaces are becoming more popular, but what are the analogue signal chain requirements?

While voice has always been the natural means of communication between humans, it hasn’t been quite so important when it comes to electronic devices. With the exception of the various telephony devices developed over the years, voice has only been used for one-way communication – the television and the radio being the obvious examples.

The emergence of cloud computing has changed all that. Today, a range of companies are producing what could be termed audio assistants. Leading examples include Amazon’s Alexa and Google Home, but there are many other applications, including smart TVs. The power of the cloud allows such devices to hear commands, interpret them and deliver the expected result – whether that’s streaming music, ordering something for the house or telling the vacuum cleaner to start work.

Just as the iPhone exposed users to the benefits of touch technology, creating the expectation that all devices would offer similar functionality, it’s entirely likely that designers will now be looking to integrate voice technology into their next products.

We all know it’s a ‘wiggly world’ – voice is analogue – but processing voice is best handled by a digital component, such as a DSP. So how much of the signal chain between microphone and speaker should be analogue, and what should you bear in mind if you’re looking to develop a voice interface?

Jim Jacot, director of marketing with Cirrus Logic’s Smart Home business, said: “If you look at a voice system, there are a number of elements: there’s a microphone at the front end and a module which performs some kind of digital voice processing. There will also be some kind of audio clean up, speech recognition and probably an interface to the cloud. At the ‘far end’, there will be some kind of playback system.”
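To make those stages concrete, here is a minimal sketch of the chain Jacot outlines, from front end to playback. Every function name is an illustrative placeholder rather than any vendor’s API, and the ‘recognition’ step is a stub.

```python
# A minimal sketch of the voice signal chain: mic capture, audio
# clean-up, speech recognition, then playback at the 'far end'.
# All names are placeholders, not a real product interface.

def capture_microphones(frame):
    # Front end: one or more microphones delivering raw PCM samples.
    return frame

def clean_up_audio(frame):
    # Audio clean-up: beamforming, echo cancellation, noise reduction.
    return frame

def recognise_speech(frame):
    # Local or cloud-hosted speech recognition (stubbed here).
    return "play music" if frame else ""

def play_back(response):
    # 'Far end': render the reply through a D/A converter and speaker.
    print(f"playing: {response}")

def voice_pipeline(frame):
    command = recognise_speech(clean_up_audio(capture_microphones(frame)))
    if command:
        play_back(command)

voice_pipeline([0.1, -0.2, 0.05])   # one dummy audio frame
```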

In Jacot’s opinion, the first things an aspiring voice interface designer should consider are price, performance and power. “Is it going to be used in a battery powered product? How many microphones do you need? What performance are you hoping to achieve?”

The first two elements – price and performance – are closely linked. Part of the decision relates to whether you select analogue or digital microphones and, as Jacot noted, how many. “At the front end, you need to achieve a certain level of performance,” he said. “For example, you might want to capture far field voice, in which case you’ll need multiple microphones – maybe up to eight. While digital and analogue mikes are pretty similar in performance, cost is an issue.

“With higher end systems,” he continued, “cost may not be an issue. But if you’re building a low end system, then it becomes a big deal. Analogue brings lower cost – a mike may cost about 15 cents – so if you’re using a large number, the savings add up.”
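The arithmetic behind that claim is easy to check. The sketch below uses the article’s 15 cent analogue figure and eight-mic count; the digital mic price is purely an assumed placeholder for comparison.

```python
# Rough BOM arithmetic for the analogue-versus-digital mic decision.
ANALOGUE_MIC_COST = 0.15   # dollars, the figure quoted in the article
DIGITAL_MIC_COST = 0.40    # dollars, an assumed value for illustration
MIC_COUNT = 8              # far-field capture may need up to eight mics

saving_per_unit = MIC_COUNT * (DIGITAL_MIC_COST - ANALOGUE_MIC_COST)
print(f"saving per unit:     ${saving_per_unit:.2f}")
print(f"saving per 1M units: ${saving_per_unit * 1_000_000:,.0f}")
```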

Mark Melvin, product manager, hearing solutions, with ON Semiconductor, disagreed. “Digital mikes consume more power, even supposedly low power parts. Analogue parts have the lowest power consumption.”

If you decide to use analogue mikes, the next block in the signal chain is an A/D converter. “The performance of this device will be important,” Jacot underlined. “If your A/D converter is of poor quality, it will affect the performance of the speech recognition engine, so you should provide the best signal possible.”

Jacot believes an A/D converter with an SNR of at least 100dB should be selected. “You might be able to use a lower SNR to save power in battery powered systems, but it’s a trade-off.”
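As a back-of-envelope check on that figure, the standard ideal-quantiser formula relates SNR to word length. This is textbook arithmetic rather than a Cirrus specification; real sigma-delta audio converters are specified differently.

```python
import math

# Ideal quantisation SNR of an N-bit converter: SNR ~= 6.02*N + 1.76 dB.
# Solving for N shows why a 100dB target implies a high-resolution ADC.
def bits_for_snr(snr_db):
    return math.ceil((snr_db - 1.76) / 6.02)

print(bits_for_snr(100))   # -> 17 bits minimum for an ideal converter
```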

Melvin suggested that A/D converters targeted towards hearing aid applications could be used. “These have a sampling rate of about 16kHz and are good for voice recognition. They provide a good trade-off of power against quality. But it comes down to running at a particular sampling rate; those devices which run at Msample/s are not appropriate because they produce a lot of noise. However, we are pushing towards 32kHz parts with a 16bit word length with hearable applications in mind.

“Being able to bias the microphone and capture audio all the time for no current consumption is also important, and that’s one of ON Semi’s core competences,” he added.
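The sample rates Melvin quotes follow directly from Nyquist. A quick calculation – standard theory, not an ON Semi datasheet figure – shows the audio bandwidth and raw data rate each rate implies.

```python
# Nyquist arithmetic: usable audio bandwidth is half the sample rate.
for fs in (16_000, 32_000):
    bandwidth_khz = fs / 2 / 1000
    kbytes_per_s = fs * 16 // 8 // 1000   # mono, 16bit word length
    print(f"{fs // 1000}kHz sampling: audio to {bandwidth_khz:g}kHz, "
          f"{kbytes_per_s}kbyte/s per channel")
```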


Jacot contended that word length and sample rate are not so important for the A/D converters. “But they are important when it comes to the D/A converter. However, I would always point designers towards a better performing A/D converter because the speech recognition engine needs a full scale signal to work correctly. If you don’t have enough bits, it won’t work as well, especially if you think about environments where the input signal may be low – far field, for example.”
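Jacot’s point about low-level far-field signals can be put in numbers. With an ideal converter, every 6.02dB of level lost below full scale costs roughly one effective bit on the signal; the 40dB figure below is an illustrative assumption, not a quoted measurement.

```python
# Effective resolution left on a signal sitting below full scale,
# assuming an ideal N-bit converter (one bit per ~6.02dB of level).
def effective_bits(total_bits, db_below_full_scale):
    return total_bits - db_below_full_scale / 6.02

# e.g. a 16-bit conversion of a far-field talker 40dB below full scale:
print(f"{effective_bits(16, 40):.1f} effective bits")   # -> ~9.4
```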

The ‘box’ in the middle has variable boundaries, and where those boundaries are set depends upon how ambitious the designer might be. ON Semi’s ‘box’ is the BelaSigna R281, an ultra-low-power voice trigger solution for a range of applications. Typically, the R281 is ‘always listening’ and will detect a single, user-trained trigger phrase, asserting a wake-up signal when this phrase is detected. “The simplest solution is to recognise a single phrase,” Melvin noted. “The device needs to be trained, but will then assert when it recognises that phrase, after which a higher power system might take over.”
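The flow Melvin describes can be sketched generically: a low-power detector watches every frame and asserts a wake-up signal when the trained phrase is matched. The detector below is a random stand-in, not the BelaSigna R281’s actual firmware interface.

```python
import random

def detect_trigger(frame):
    # Placeholder: a real device matches a single user-trained phrase.
    return random.random() < 0.01

def assert_wakeup():
    # In hardware this would be a GPIO pulse to the main processor.
    print("wake-up asserted: handing over to the high-power system")

def always_listening(frames):
    for frame in frames:
        if detect_trigger(frame):
            assert_wakeup()
            return

always_listening(iter(lambda: [0.0] * 160, None))   # endless dummy frames
```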

Below: Cirrus’ Voice Capture Development Kit for Amazon AVS helps manufacturers get to market with Alexa-enabled products that feature high-accuracy wake-word triggering and command interpretation

Jacot pointed out that trigger word – or wake word – engines were available from a number of sources. “What’s important is that they should be low power. Some could need 1Mbyte of memory and hundreds of MIPS. At the low end, however, they might only need 20kbyte and less than 5MIPS. You have to remember that voice recognition needs a lot of processing and you can’t do it with discrete components.”

Cirrus Logic has an integrated solution with a built-in front end that runs signal processing. “There are also integrated products on the output side with built in amps,” Jacot added.

Once the speech processing has been completed, the ‘box’ needs to provide an output to a D/A converter, which then feeds a speaker. Melvin contended this device should have the same sample rate as the A/D converter on the input side. “Typically, the two converters would be locked together,” he explained. “We usually specify a sigma-delta converter with a Class D output.”
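Locking the two converters usually means deriving both from one master clock. Here is a quick sanity check, using a common audio master clock frequency as an assumption rather than a quoted ON Semi figure.

```python
# Check that a shared master clock divides cleanly to the sample rate
# used by both the A/D and D/A converters.
MASTER_CLOCK_HZ = 12_288_000   # 12.288MHz, a common audio master clock
SAMPLE_RATE_HZ = 16_000        # shared by input and output converters

assert MASTER_CLOCK_HZ % SAMPLE_RATE_HZ == 0, "converters cannot lock"
print(f"clock divider: {MASTER_CLOCK_HZ // SAMPLE_RATE_HZ}")   # -> 768
```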

Melvin also pointed out that a trigger word can be used to assert an action. “If you’re waking something up using a trigger word,” he continued, “you will need an MCU in the system.”
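On the MCU side, that hand-off amounts to sleeping until the wake line asserts. Polling a flag stands in for a real interrupt in this sketch, and all the names are illustrative.

```python
import time

wake_line_asserted = False

def mcu_sleep_until_wake():
    # Stand-in for a low-power sleep state ended by an interrupt.
    while not wake_line_asserted:
        time.sleep(0.001)
    print("MCU awake: starting the voice session")

wake_line_asserted = True   # simulate the trigger device firing
mcu_sleep_until_wake()
```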

How integrated your system will be is likely to be closely related to the projected production volume. “As you expand into the mainstream,” Jacot observed, “it’s likely you will need an integrated solution that comes with hardware, software and tools.

“Not everyone is an audio expert, and selecting the right amp, doing the tuning, verifying the results and trigger word recognition takes a certain level of expertise. Not everyone can do it.

“Cirrus is trying to enable the mainstream market,” he noted. “We have created a development kit with Amazon, which has two mikes and everything integrated. You can drop it in and take a solution to market quickly. The goal is to let people create their own systems,” he concluded.

Below: On Semi’s voice recognition system block diagram

Author
Graham Pitcher

