How Does It Work?
Speech recognition software works by breaking the audio of a speech recording into individual sounds, analyzing each sound, using algorithms to find the most probable word fit in that language, and transcribing those sounds into text. To do this, it relies on natural language processing (NLP) and deep learning neural networks. “NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way,” as the Algorithmia blog puts it.
In practice, the software breaks speech down into pieces it can interpret, converts those pieces into a digital format, and analyzes their content. From there, it makes determinations based on its programming and on learned speech patterns, forming hypotheses about what the user is actually saying. Once it has settled on what the user most likely said, the software transcribes the conversation into text. This all sounds simple enough, but these multiple, intricate processes happen at lightning speed, and machines can now transcribe human speech more quickly and accurately than humans can.
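To make that core decision concrete, here is a toy Python sketch of the decoding step described above: weighing how well a sound matches each candidate word (an acoustic score) against how probable that word is in context (a language-model score). Every word and probability below is an invented, illustrative value, not any vendor’s actual model.

```python
import math

# P(sound | word): how closely the audio matches each candidate word.
# The two homophones score identically on acoustics alone.
ACOUSTIC = {"write": 0.5, "right": 0.5}

# P(word | previous word): a tiny bigram language model.
LANGUAGE = {
    ("please", "write"): 0.8,
    ("please", "right"): 0.2,
}

def most_probable_word(candidates, previous_word):
    """Pick the word maximizing log P(sound|word) + log P(word|previous)."""
    return max(
        candidates,
        key=lambda w: math.log(ACOUSTIC[w]) + math.log(LANGUAGE[(previous_word, w)]),
    )

# The acoustics can't separate the homophones, but context can:
print(most_probable_word(["write", "right"], previous_word="please"))  # write
```

The design point: neither score alone is enough. The acoustic model narrows the field to plausible sounds, and the language model breaks ties using what people are likely to say.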
Speech Recognition & AI Software
Voice recognition and transcription technology have come a long way since their inception. We now use voice recognition in our everyday lives: with voice search on the rise, more people than ever are relying on assistants like Google Home, Siri, and Amazon Alexa.
History of Voice Recognition Technology
Programmers and engineers have made great leaps in the science of voice recognition over the past decade, so you’d be forgiven for thinking that this technology is a relatively new development. Much of the reporting and scholarship around voice recognition tech only focuses on the post-2011 Age of Siri, following the release of Apple’s now-ubiquitous personal assistant.
But there’s a rich secret history to voice recognition tech that stretches back to the mid-20th century, to those early days when rudimentary computers needed to fill an entire warehouse with vacuum tubes and diodes just to crunch a simple equation. And this history not only reveals some interesting trivia about the technology we know and love today, but it also points the way toward potential future breakthroughs in the field. Let’s explore the untold story of voice recognition technology, and see how much progress has been made over the years (and how much has stayed the same).
AUDREY and the Shoebox
Over the first half of the 20th century, the U.S. research firm Bell Laboratories (named after Alexander Graham Bell, the inventor of the telephone) racked up a string of impressive technological advances: the birth of radio astronomy (1931), the transistor (1947), and the solar battery (1954). Then in 1952, Bell Labs marked another groundbreaking advancement: the AUDREY system, a set of vacuum-tube circuitry housed in a six-foot-high relay rack that could understand numerical digits spoken into its speaker box. When adapted to a specific speaking voice, AUDREY could accurately interpret more than 97% of the digits spoken to it. AUDREY is no doubt primitive by today’s standards, but it laid the groundwork for voice dialing, a technology that was widely used among toll-line operators. (Remember those?)
Ten years later, IBM unveiled its Shoebox machine at the 1962 World’s Fair in Seattle. Like AUDREY, Shoebox recognized spoken numbers, and it could understand up to 16 words in all, including the digits 0 through 9. When Shoebox heard a number combined with a command word (like “plus” or “total”), it would instruct a linked adding machine to calculate and print the answer to simple arithmetic problems. Just like that, the world’s first calculator powered by voice recognition was born!
HARPY takes wing
Voice recognition began to take off as a field in the 1970s, thanks in large part to interest and funding from the U.S. Department of Defense and DARPA. Running from 1971 to 1976, DARPA’s Speech Understanding Research (SUR) program was one of the largest research initiatives ever undertaken in the field of voice recognition.
SUR ultimately helped create Carnegie Mellon’s HARPY voice recognition system, which was capable of processing and understanding more than 1,000 words. HARPY was particularly significant for its use of “beam search,” a far more efficient method for machines to retrieve the meaning of words from a database and determine the structure of a spoken sentence. Indeed, advances in voice recognition have always been closely tied to similar strides in search engine tech; look no further than Google’s current dominance in both fields for proof positive of this fact.
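For illustration, here is a minimal beam search sketch in Python. The candidate words and probabilities are invented toy values, not HARPY’s actual model; the point is the pruning step, which keeps only the top few hypotheses at each stage instead of exploring every possible word path.

```python
import math

# Toy per-step candidate words with illustrative log-probabilities.
STEP_CANDIDATES = [
    {"recognize": math.log(0.6), "wreck": math.log(0.4)},
    {"speech": math.log(0.7), "a": math.log(0.3)},
    {"today": math.log(0.5), "nice": math.log(0.5)},
]

def beam_search(steps, beam_width=2):
    """Return the best-scoring word sequence found under a fixed beam width."""
    beams = [([], 0.0)]  # (partial word sequence, cumulative log-probability)
    for candidates in steps:
        # Extend every surviving hypothesis with every candidate word.
        expanded = [
            (seq + [word], score + logp)
            for seq, score in beams
            for word, logp in candidates.items()
        ]
        # Prune: keep only the `beam_width` highest-scoring hypotheses.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return max(beams, key=lambda b: b[1])

best_sequence, best_score = beam_search(STEP_CANDIDATES)
print(" ".join(best_sequence))  # recognize speech today
```

With three steps and two candidates each there are eight full paths, but the beam only ever scores four extensions per step; that efficiency gap is what made beam search practical on 1970s hardware.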
From recognition to prediction
By the 1980s, voice recognition tech had begun to advance at an exponential rate, going from simple machines that could understand only dozens or hundreds of spoken words to complex networked machines that could comprehend tens of thousands. These advances were largely powered by the development of the hidden Markov model (HMM), a statistical method that allowed computers to estimate the probability that a sound corresponds to a word, rather than trying to match the sound’s pattern against a rigid template. In this way, HMMs enabled voice recognition machines to greatly expand their vocabularies while also comprehending more conversational speech patterns. Armed with this technology, voice recognition began to be adopted for commercial use and became increasingly common in several specialized industries. The 1980s is also when voice recognition began to make its way into home consumer electronics, such as Worlds of Wonder’s 1987 “Julie” doll, which could understand basic phrases and respond.
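To see what “predicting rather than template-matching” means, here is a tiny forward-algorithm sketch for a single hypothetical word model in Python. The hidden states, acoustic symbols, and probabilities are all made-up toy values; a real recognizer would compute a score like this for many competing word models and pick the most probable one.

```python
# Toy HMM for one word: two hidden states, each with transition
# probabilities and emission probabilities over coarse acoustic symbols.
STATES = ["s0", "s1"]
START = {"s0": 0.8, "s1": 0.2}
TRANS = {"s0": {"s0": 0.6, "s1": 0.4}, "s1": {"s0": 0.1, "s1": 0.9}}
EMIT = {"s0": {"ah": 0.7, "oo": 0.3}, "s1": {"ah": 0.2, "oo": 0.8}}

def forward_probability(observations):
    """P(observed sounds | word model), via the forward algorithm."""
    # Initialize: start probability times the first emission.
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        # Sum over every way of arriving in each state, then emit.
        alpha = {
            s: sum(alpha[prev] * TRANS[prev][s] for prev in STATES) * EMIT[s][obs]
            for s in STATES
        }
    return sum(alpha.values())

# A higher probability means the audio fits this word's model better.
print(forward_probability(["ah", "oo", "oo"]))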
Voice recognition goes mainstream
In 1990 came the release of the very first consumer-grade voice recognition product: Dragon Dictate, priced at $9,000 (that’s $17,000 in 2017 dollars). Its 1997 successor, Dragon NaturallySpeaking, was the first commercial voice recognition program that could understand natural, continuous speech at up to 100 words per minute.
1997 also saw the release of BellSouth’s VAL, the very first “voice portal.” VAL was an interactive system that could respond to questions over the phone, laying the groundwork for the technology powering the voice-activated menus you hear today when calling your bank or ISP.
But after more than 40 years of advancement after advancement, developments in the field stalled from the mid-1990s through the late 2000s. At the time, voice recognition programs had hit a ceiling of about 80% accuracy in recognizing spoken words, a limit of the HMM approach underpinning the technology.
Google, Siri, and the voice recognition revolution
Apple’s iPhone had already made waves when it came out in 2007, as tech began to re-orient itself toward an increasingly smartphone-centric, mobile-first future. But with the release of the Google Voice Search app for the iPhone in 2008, voice recognition technology once again began to make major strides. In many ways, smartphones proved to be the ideal proving ground for the new wave of voice recognition: voice was simply an easier and more efficient input method on devices with such small screens and keyboards, which incentivized the development of hands-free technology.
But even more significantly, the design principle Google laid down with Voice Search in 2008 continues to define voice recognition technology to this day: the processing power necessary for recognition could be offloaded to Google’s cloud data centers, enabling the kind of high-volume data analysis capable of storing human speech patterns and accurately matching words against them. Google’s approach was then perfected by Apple in 2011 with the release of Siri, an AI-driven personal assistant that likewise relies on cloud computing to predict what you’re saying. In many ways, Siri is a prime example of Apple doing what it does best: taking existing technology and applying a mirror-sheen of polish to it. Siri’s easy-to-use interface, combined with her sparkling ‘personality’ and Apple’s expert marketing of the iPhone, helped make the program nearly ubiquitous.
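As a rough illustration of that client/cloud split, here is a hypothetical Python client: the device only captures audio and uploads it, leaving the heavy statistical work to the server. The endpoint URL and response format below are invented for illustration; they are not a real Google or Apple API.

```python
import requests

CLOUD_ENDPOINT = "https://speech.example.com/v1/transcribe"  # hypothetical URL

def transcribe_in_cloud(audio_path: str) -> str:
    """Upload raw audio and let the cloud service do the recognition."""
    with open(audio_path, "rb") as f:
        response = requests.post(
            CLOUD_ENDPOINT,
            files={"audio": f},  # the device ships sound, not text
            timeout=30,
        )
    response.raise_for_status()
    # Assumed response shape: {"transcript": "..."} returned by the server.
    return response.json()["transcript"]

if __name__ == "__main__":
    print(transcribe_in_cloud("utterance.wav"))
```

The design consequence is the one described above: since the phone is just a microphone and a network client, the recognizer’s vocabulary and accuracy can grow in the data center without any change to the device.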
The Potential Variables in Speech Recognition Software
“Correctness and accuracy are two different things,” says CallRail Product Manager Adam Hofman. The difference lies in that “correct” means completely “free from error,” while “accurate” means “correct in all details” and “capable of or successful in reaching the intended target.” With speech recognition, this means that while a transcription may not be 100% correct (some words, names, or details might be mistranscribed), the user still understands the overall idea of the chunk of speech that’s been transcribed. That is to say, the output isn’t just a jumble of random words; a cohesive concept can generally be interpreted from the text.
However, no two people are alike, so speech patterns and other deviations must be taken into account. Anomalies like accents (even among native English speakers) can cause speech recognition software to miss certain aspects of a conversation. The ways in which speakers enunciate versus mumble, the speeds at which they speak, and even fluctuations in voice volume can throw speech recognition technology for a loop.
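One common way to put a number on that gap between correct and useful is a word error rate. The sketch below uses invented example transcripts and a simplified error count (an approximation of the standard metric, not CallRail’s actual scoring): a transcript can score imperfectly here while still conveying the speaker’s meaning.

```python
import difflib

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Approximate WER: unmatched words divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    errors = max(len(ref), len(hyp)) - matched
    return errors / len(ref)

reference = "please send the invoice to our accounts team today"
hypothesis = "please send the invoice to her accounts team today"  # one miss
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # WER: 11%
# Not fully correct, but the overall idea of the sentence is still clear.
```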
Regardless, most modern speech recognition technologies are paired with machine learning platforms. As a user continues to use the technology, the software learns that particular person’s speech patterns and variances and adjusts accordingly; in essence, it learns from the user. CallRail’s voice recognition technology is used in conversation intelligence features like CallScore, Automation Rules, and Transcriptions.
The Benefits of Using Speech Recognition Software
Though speech recognition technology falls short of complete human intelligence, there are many benefits to using it, especially in business applications. In short, speech recognition software helps companies save time and money by automating business processes and providing instant insight into what’s happening on their phone calls. Because software performs speech recognition and transcription faster and more accurately than a human can, it’s more cost-effective than having a person do the same job. Transcription is also tedious work for a person at the rate many businesses need it performed. Speech recognition and transcription software costs less per minute than a human performing at the same rate, and it never gets bored with the job.