Voice recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format.
Designing a machine that mimics human behavior, especially the ability to speak and to respond to speech, has intrigued engineers and scientists for centuries. Speech technologies have undergone a dramatic transformation: from early speech machines built with resonance tubes, to Alexander Graham Bell’s first recording device, the Dictaphone and the first voice synthesizer, the Voice Operating Demonstrator (VODER), to today’s smart virtual assistants like Apple’s Siri or Amazon’s Alexa. Thanks to advances in AI, voice recognition technology is gaining popularity. According to a recent U.S. Cellular survey, 36% of smartphone owners use a virtual assistant daily and 30% use smart home technology daily. This connectivity is expected to increase, with the number of devices and sensors predicted to rise 200% to 46 billion by 2021.
The idea is to transform recorded audio into a sequence of words, as an alternative to typing on a keyboard. Speech recognition finds use in a number of applications, from helping people with physical disabilities and transcribing interviews to learning a new language or opening a file via voice commands. Voice recognition systems make it easier to interact with technology by enabling hands-free requests.
From 1952 to today.
The earliest voice recognition technologies could only comprehend digits. The Audrey system, built by Bell Labs in 1952 and considered to be the first speech recognition device, recognised only ten digits spoken by a single voice. This was followed by the Shoebox machine, developed by IBM in 1962, which could recognise 16 English words: 10 digits and 6 arithmetic commands.
The U.S. Department of Defense made great contributions towards the development of speech recognition systems. From 1971 to 1976, it funded the DARPA SUR (Speech Understanding Research) program, which led to the development of Harpy by Carnegie Mellon, a system that could comprehend 1011 words. At around the same time, the first commercial speech recognition company, Threshold Technology, was founded, and Bell Labs introduced a system that could interpret multiple people’s voices. In 1978, Texas Instruments introduced Speak & Spell, a milestone in speech development because its use of a speech chip led to more human-like digital synthesis of sound. The development of the hidden Markov model, which used statistics to estimate the probability that an unknown sound was a given word, proved to be a major breakthrough. Speech recognition even entered the home, in the form of Worlds of Wonder’s Julie doll.
Thanks to the introduction of faster microprocessors, the world’s first speech recognition software for consumers was developed in 1990. It was the first continuous dictation software, meaning one did not have to pause between words. In 1992, Apple also produced its own real-time continuous speech recognition system, which could recognise as many as 20,000 words.
By 2001, speech recognition development had hit a plateau, which lasted until 2008, when Google emerged with its Google Voice Search application for iPhones. In 2010, Google introduced personalized recognition on Android devices, which recorded different users’ voice queries to develop an enhanced speech model. That model consists of 230 billion English words. Apple’s Siri, which also relied on cloud computing, followed on the iPhone 4S in 2011.
A Stanford study revealed that speech recognition is now about three times as fast as typing on a cell phone. The error rate, once 8.5%, has now dropped to 4.9%. These technological advances have given rise to multiple applications, including transcription assistant tools such as Happy Scribe.
Little Known Facts About Speech Recognition Technology
- Technically speaking, speech recognition goes way back to 1877 when Thomas Edison invented the phonograph, the first device to record and reproduce sound.
- When it comes to speech recognition, accuracy is measured by a Word Error Rate calculation, which tracks how often a word is transcribed incorrectly.
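
To make the Word Error Rate idea concrete, here is a minimal Python sketch. It assumes the common definition of WER: the number of word substitutions, deletions and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The function name `word_error_rate` and the simple lowercase, whitespace-based tokenisation are illustrative assumptions, not part of any particular speech recognition product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are compared case-insensitively after splitting on
    whitespace; real evaluations often normalise punctuation as well.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    transcript = "the cat sat on a mat"
    print(f"WER: {word_error_rate(reference, transcript):.3f}")
```

In this example one word out of six is wrong, so the score is one sixth. Because inserted words also count as errors, WER can exceed 100% when a transcript contains many extra words.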