Speech recognition technology enters next phase of development

We are spoiled by technology. We are so used to its rapid progress that it comes as a surprise that it has taken more than half a century to solve a 'problem' that we ourselves handle almost as naturally as breathing: speech recognition (SR).

But that is how long it has taken to crack this particular nut, and speech is finally becoming the kind of interface its developers claimed it would be. Examples abound, from PCs featuring SR in their operating systems like Windows 7 and 8, to many of the major automotive brands, to mobile devices – think Siri, the groundbreaking SR system used by Apple, and Google Now – to hosts of others. To cite just a selection: healthcare – for medical documentation; military – high performance aircraft like the Eurofighter Typhoon; training air traffic controllers; automatic translation; robotics; and aerospace – NASA's Mars Polar Lander used SR.

SR's success is down to one simple reason: it now works. Previously, lack of accuracy, sensitivity to noise, overdependence on training to a particular voice and similar problems meant it worked in principle, but not in practice. It frequently ended up being more annoying than useful.

The journey has taken decades. SR first emerged in the 1950s – with examples including the 'Audrey' system developed by Bell Laboratories, which recognised digits spoken by a single voice. Kurzweil Applied Intelligence launched a commercial SR system in 1982, but what was regarded as the first consumer SR product – Dragon Dictate, which still cost thousands of pounds – didn't appear until 1990. Today, it works far better, is easier to operate, costs less than £100 and is used widely on PCs and Macs alike.

"A further development is the addition of a deeper level of understanding," says John West, principal solutions architect for Nuance, a speech recognition specialist which is the result of several mergers and acquisitions over the past 20 years. "Here, the aim is to not only recognise speech, but also to extract the meaning and intent of what has been said, enabling voice driven systems as a whole to react in an intelligent way, appropriate to the user's needs."

Such intelligence can be added in various ways, for example with sensors on mobile devices or cars that supply data such as temperature, location, fuel level and direction of travel.

"If you are in your car, a voice driven personal assistant application might tell you fuel was running low," West says. "You could ask it where the nearest fuel station was and it could know you have a preference for a particular brand, that you are heading north on a specific road and work out the closest station for you. Or perhaps it could warn you that it is too far to reach with the fuel remaining. When you tell the system your decision, it will ask if you require directions and provide them by interfacing with the car's GPS facility."

Eventually, most machines we interact with – from cars to phones, TVs to watches – will feature such capabilities. Then the aim will be to merge them together into a seamless, apparently single intelligent assistant that you talk to in a natural, conversational way. So, if you are in your home and say to your PC or phone that you need directions to somewhere, this request will be fed automatically via the cloud to your car.

Individual devices are already featuring assistants, such as Samsung's S Voice, and West sees the next few years as being principally about joining them all up.

One of the fundamental drawbacks to SR was simply its poor accuracy. This has been hugely improved – often reaching the high nineties in percentage terms – for several reasons: the general increase in the availability of affordable computing power, the advent of the cloud and the vast numbers of people now using it.

"In terms of accuracy, we are collecting more and more data because more people are using speech based services," West says. "I can't think of a smartphone that doesn't have a voice activated app. In 2009, Nuance put its first app on the iPhone, Dragon Dictation, which is now available in more than 40 languages worldwide.

"That was our first cloud service. Last year, 8billion cloud transactions were voice facilitated – in May this year, we did 1bn alone. These are voice requests from users to access a speech driven device like a TV, mobile phone, even wearables and robots. Around 28million voice enabled cars are sold annually, most with Nuance systems. To encourage greater development, Nuance has a mobile developer network that comprises just under 20,000 application developers using speech."

Another advance for SR is the development of more conversational interfaces, allowing a less structured dialogue than previous systems. This is happening not just for personal assistant applications on smartphones but also in call centres, once notorious for the rigidity of their voice interfaces. One large customer is HMRC, which asks a caller to tell it, in a few words, the reason for their call. The aim is to route the call to the appropriate department.
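
To make the idea of routing a call from 'a few words' concrete, here is a minimal sketch in Python. It is not how HMRC's or Nuance's system works: the departments, the keyword lists and the function name route_call are all invented for illustration, and a production system would use a trained language model rather than keyword counting.

```python
import re
from typing import Dict, Set

# Hypothetical departments and trigger words, invented for this sketch.
ROUTES: Dict[str, Set[str]] = {
    "billing": {"bill", "invoice", "payment", "refund"},
    "repairs": {"broken", "fault", "repair", "engineer"},
    "accounts": {"password", "login", "account", "address"},
}


def route_call(transcript: str, default: str = "operator") -> str:
    """Pick the department whose keywords overlap most with the transcript."""
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    scores = {dept: len(words & keywords) for dept, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    # Fall back to a human operator when nothing in the transcript matches.
    return best if scores[best] > 0 else default
```

With these made-up lists, a caller who says 'I want a refund on my last bill' would be sent to billing, while an unrecognised request falls through to an operator.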

Using SR for some kinds of dictation, such as for documents that have to be well structured and grammatically accurate, has proved to be relatively limited, but this probably tells us more about people's abilities and preferences than the technology's capability. Creating more casual documents like emails and texts by speaking is simply easier – not many people are sufficiently articulate to speak a perfectly coherent, complex article directly into their computer.

So, is SR now a mature technology?

"Yes, it is," said West. "But what our customers want – and expect –actually goes beyond the pure recognition process. We are seeing that in the automotive world with connected cars, like the latest BMWs and Audis, which are connected to the cloud all the time."

A car is an ideal environment for voice driven services, because drivers should not be using their hands, but it is also a technically challenging one in which to achieve high levels of accuracy, because of noise and so on. Despite this, so-called 'one shot destination entry' is being made possible, where the driver says something like 'Go to 5 Kings Parade, Cambridge, avoiding toll roads'. The system will find the optimal route, check for traffic jams and other potential problems, the fuel situation and so on.
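
As a rough illustration of what happens after recognition in a 'one shot' command, the Python sketch below pulls a destination and a list of things to avoid out of a transcript that has already been recognised. The RouteRequest type, the parse_destination function and the accepted phrasings are assumptions made for this example, not any vendor's grammar; real systems handle far more variation.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RouteRequest:
    destination: str
    avoid: List[str] = field(default_factory=list)


# Assumed phrasing: "<go|navigate|drive> to <place>[, avoiding <things>]".
_PATTERN = re.compile(
    r"^(?:go|navigate|drive) to (?P<dest>.+?)"
    r"(?:,? avoiding (?P<avoid>.+))?$",
    re.IGNORECASE,
)


def parse_destination(transcript: str) -> Optional[RouteRequest]:
    """Split a recognised command into a destination and optional constraints."""
    match = _PATTERN.match(transcript.strip().rstrip("."))
    if match is None:
        return None  # not a navigation command this sketch understands
    avoid_text = match.group("avoid") or ""
    avoid = [part.strip() for part in re.split(r",| and ", avoid_text) if part.strip()]
    return RouteRequest(match.group("dest").strip(), avoid)
```

Given the example in the text, this yields the destination '5 Kings Parade, Cambridge' with 'toll roads' as the single constraint, which a route planner could then act on.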

Another new application area seen as having great potential for SR is the smart TV, which is still suffering from the cumbersome interface of the conventional remote control. Speech is seen as a potentially more natural, powerful way of controlling the next generation of TVs.

There are at least two approaches, one involving speaking to the TV, the other to a remote control. The first has to solve the problem of 'beam forming', focusing the TV's microphones on the person who is issuing the commands, in a room where there may be several people talking. A gesture or a wake up word can be used to initiate recognition and the system can work out where the speaker of the word is located.
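
One simple way to picture how a system might work out where a speaker is, and then focus on them, is with just two microphones: the wake up word arrives at one slightly before the other, and that delay gives a bearing. The Python sketch below is a textbook-style simplification (cross-correlation followed by delay-and-sum), not the method any particular TV uses; the function names, the two-microphone layout and the sign convention are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, approximate for room-temperature air


def estimate_delay(left, right, max_lag):
    """Return the lag in samples by which `right` trails `left`,
    chosen as the shift that maximises their cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += sample * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle(lag, sample_rate, mic_spacing):
    """Convert a sample lag into a bearing in degrees (0 = straight ahead,
    positive when the sound reaches the left microphone first)."""
    ratio = SPEED_OF_SOUND * (lag / sample_rate) / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # clamp against rounding and noise
    return math.degrees(math.asin(ratio))


def delay_and_sum(left, right, lag):
    """Shift `right` back into alignment with `left` and average the pair,
    reinforcing sound from the estimated direction."""
    out = []
    for i, sample in enumerate(left):
        j = i + lag
        other = right[j] if 0 <= j < len(right) else 0.0
        out.append(0.5 * (sample + other))
    return out
```

For instance, a lag of three samples at a 16kHz sampling rate, with microphones 10cm apart, corresponds to a bearing of roughly 40 degrees. Real arrays use more microphones and have to cope with echoes and competing voices.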

"Most of the people we are talking to about SR for TVs are implementing the alternative approach, putting a SR capability into a remote control handset, with the recognition capability activated at the push of a button," West says. "This is already happening in the US on services like Comcast and Verizon.

"There are two major benefits. One obvious benefit is the control of the device, but a more significant benefit is searching content that exists in the cloud, which is difficult to do using a remote control via a keypad. Issuing spoken commands will be a far more natural and easy way to do that."

To support the core technology of SR, various elements of artificial intelligence (AI) have been incorporated, one being natural language understanding (NLU). A pioneer of this approach is US company VoiceBox Technologies, founded a decade ago with the aim of creating a natural, conversational interface by overlaying NLU algorithms on top of the SR engine.

"Our algorithms are focused on understanding contexts to facilitate the recognition, personalising the response and creating a more natural flow," explains spokesman Victor Melfi. "So you can track context across a conversation. For example, having asked about weather in Paris, you could say 'how about Boston?' and it would know you were still referring to weather."

VoiceBox launched the first such speech system using embedded NLU in 2008, in the Toyota Lexus, which it says was the first time SR was supplemented with NLU in a commercial application. Then, it realised that the concept of the 'connected car', equipped with full access to the cloud, could accelerate significantly the development of broader conversational abilities because of the greater computing power it made available.

The company now ships millions of its SR/NLU systems worldwide in 23 languages, mostly to automotive customers including Fiat, Dodge, Chrysler and TomTom. But its technology has also been used in devices like mobile phones, PCs and tablets, smart TVs and wearables. Recently, it was announced that VoiceBox's speech applications will be used by AT&T in its Drive Studio product, a dedicated facility for connected car research.

In Melfi's view, the core science of SR has not developed that much for many years. VoiceBox takes the view that one way to overcome the limitations this leaves is to better manage the dialogue as it occurs.

Another element of AI being used for speech interfaces is machine learning, a route being taken by VocaliQ, a spin out from the University of Cambridge's Dialogue Systems Group. VocaliQ has developed proprietary software which uses machine learning to improve dialogue interactions automatically in voice activated systems.

"Current technologies are expensive to produce and do not always offer a smooth user experience, with irrelevant or inappropriate automated responses to questions being one source of frustration – problems that VocaliQ's new method of dialogue management addresses," the company says.

VocaliQ will develop its first products for automotive, education and retail businesses and is already working on a prototype application for one of the world's largest car manufacturers.

Ironically, what may turn out to be one of the toughest challenges facing speech interfaces in the future relates to the other side of the coin from recognition – the machine's vocal response, or text to speech (TTS). Creating machine speech that is even vaguely close to that of people, with natural inflection that takes into account meaning, is extremely difficult – something the world's best known TTS user, Stephen Hawking, demonstrates.

Eventually we may reach the ultimate goal of SR, being able to 'chat' with machines in virtually the same way as we do with each other. A lifelike robot with such a speech capability may be many years away – but it is something on which Nuance is currently working, along with Aldebaran Robotics.

At the very least, it will give a whole new meaning to the phrase man-machine interface.