Speech Recognition: Then and Now

This article was originally published on LinkedIn in January 2018.

When I began using speech recognition software in 1993, DragonDictate for DOS cost $1700 (plus a $2000 PC). It was designed to respond to individual words (you... had... to... pause... after... every... word), and the vocabulary was limited to 30,000 words. Dragon Systems trumpeted input rates of 40 words per minute with 95% accuracy. My experience was different. On a good day, I dictated 20 to 35 words per minute with 80% to 90% accuracy, but after taking time to correct misrecognized words, my actual input rate was closer to eight or ten words per minute. Given this mediocre performance, I found DragonDictate useful only for dictating short emails and letters to friends and, occasionally, for note-taking, provided the vocabulary was not too arcane. I gave up on using DragonDictate for “serious” writing projects, like my Master's thesis, formal correspondence, and reports.
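The gap between raw dictation speed and net input rate comes down to simple arithmetic. As a rough sketch, assume every misrecognized word costs a fixed number of seconds to correct. That correction time and the function name `effective_wpm` are illustrative assumptions, not measurements from the period.

```python
def effective_wpm(raw_wpm: float, accuracy: float, seconds_per_correction: float) -> float:
    """Net words per minute once time spent fixing errors is included.

    raw_wpm: dictation speed before corrections (words per minute)
    accuracy: fraction of words recognized correctly, between 0 and 1
    seconds_per_correction: assumed average time to fix one misrecognized word
    """
    if raw_wpm <= 0:
        raise ValueError("raw_wpm must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if seconds_per_correction < 0:
        raise ValueError("seconds_per_correction must not be negative")

    seconds_dictating = 60.0 / raw_wpm                           # time to speak one word
    seconds_correcting = (1.0 - accuracy) * seconds_per_correction  # average fix time per word
    return 60.0 / (seconds_dictating + seconds_correcting)


# Hypothetical mid-range day: 25 wpm raw, 85% accuracy, 28 s per correction.
print(round(effective_wpm(25, 0.85, 28), 1))  # 9.1
```

Under these assumptions, a correction time of roughly half a minute per error is enough to pull 25 words per minute down to about nine, which is in line with the eight to ten words per minute described above.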

Despite DragonDictate's limitations, I was fascinated by its potential. While writing my thesis (on the history and philosophy of the human body), I managed to weave in a footnote on DragonDictate, describing the difference between negative and positive feedback in light of state-of-the-art 1995 “voice recognition” systems:

A “system” consisting of a human operating a computer furnishes many examples of feedback, one of the more striking of which is voice recognition technology. A speech recognition system “adapts” to each individual's vocal mannerisms; the performance of the system improves, to a point, with use. When a user utters a word, the system either recognizes or misrecognizes it. If the former, the user continues dictating. If the latter, the user corrects the error by typing or spelling the word. The user's negative feedback causes the program to modify itself. As the gap between error (the guess) and goal (the word) narrows, the ability of the program to recognize words increases. If errors are not corrected the performance of the system degrades (positive feedback).

Fast forward to the present. My fascination with speech recognition technology remains undiminished. Most days, I rely on Dragon Professional, an unnervingly accurate dictation and command-and-control system for Windows PCs.

But speech recognition is no longer restricted to desktop and laptop computers. Baked into every modern smartphone is a digital assistant that responds to continuous speech: “Siri” (on Apple's iOS) and “OK Google” (on Android devices).

More sophisticated digital assistants like “Alexa” and “Google Assistant” show signs of becoming ubiquitous, or nearly ubiquitous, due to their modest cost and extraordinary capabilities. Digital assistants will, I predict, be game-changers.

Digital assistants are already proving to be game-changers for people with disabilities. See these three articles on the impact of Alexa and Google Assistant on the lives of people who are blind:

Why Amazon's Alexa is Life Changing for the Blind
Blind Dad Tried Amazon Echo Now Loves Alexa. Originally published at http://www.thememo.com on 2017/08/01; it no longer appears to be online.
There's No Place like Google Home: A Review of Google's Voice Assistant

Digital assistants are supplanting personal computers for certain tasks. For someone wanting to look up a quick fact, the choice is clear: either fire up a computer, launch a browser, type in keywords, choose a link, and read a web page, or, if you are within earshot of a digital assistant, simply ask the question.

Most surprising to me is that speech recognition has moved out of our computers and into our social ecosystems. It is no longer unusual to see someone pose a question to their phone, and receive an immediate and accurate response. For tasks such as getting directions, fact-checking, or composing text messages, a portable voice-enabled device is the best tool for the job.

Had you asked me in 1993 to predict what speech recognition technology would be like 25 years hence, my guesses would have been wildly wrong. For example, I did not expect continuous speech input during my lifetime. Yet it became available only four years later, in 1997. Even after I began using continuous speech systems, I did not anticipate such rapid progress.

There is no such thing as a crystal ball, and I think prognosticators tend to get things wrong. With that in mind, I have decided not to try to guess what future generations of speech recognition will bring. Regardless of the specifics, I am hopeful that the technology will accelerate the dismantling of barriers experienced by folks with disabilities: we could be witnessing more changes of the sort being wrought by the advent of digital assistants.