
ASR 101: Introduction to Speech Recognition


Imagine that you are an alien visiting Earth for the first time. As you circle the planet in high orbit, you intercept a random transmission in English, which reads:

ANODORASMILE

You quickly realize that, due to some technical glitch, all spaces and punctuation are missing from the text. Being the curious and resourceful alien that you are, you hack the Library of Congress, and download a comprehensive dictionary of all the words in the English language, as well as thousands of other books to aid in your research. Wrapping your slimy head in your alien tentacles, you sit down to figure out what the original message was – a message written in a language you have never seen before, written by creatures you’ve never encountered, discussing a world you’re totally unfamiliar with.


How Would You Accomplish This Task?

Perhaps this seems like a contrived scenario to you, but if you happen to be interested in deciphering ancient texts of lost civilizations, it might feel less so. Classical Greek, for example, used to be written in Scriptio continua, or “Continuous Script”, a writing style where there are no spaces between words. 

This is also, incidentally, analogous to what a computer program is expected to do when asked to automatically transcribe an audio recording of someone speaking. This technical challenge is a central part of what is now known as Automatic Speech Recognition or ASR for short.


The Language Model: Context is Key 

To get a sense of how a computer might tackle this problem, let us return to our alien persona and try to decipher the intercepted message: ANODORASMILE.


Since you have a dictionary, your first step is to search for words that start with “A”. You quickly determine that, for this sequence, the only candidates are “A” or “An”. Encouraged by your progress, you try to decipher the message by reading on, looking for possible candidates for the next set of letters. (Take a stab at it yourself – it’s fun!)

After some time, you conclude there are three ways to break down the sequence into legitimate English words:

  1. An odor as mile
  2. An odor, a smile
  3. A nod or a smile
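
As an aside, the alien's dictionary technique can be sketched in a few lines of code. The snippet below is a toy illustration: the tiny dictionary and the `segmentations` helper are invented for this example. It recursively tries every dictionary word that could start the text and collects all valid breakdowns:

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into a sequence of dictionary words."""
    if not text:
        return [[]]  # one way to segment the empty string: the empty sequence
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # `word` is a valid prefix; segment the remainder recursively
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results

# A toy dictionary covering just the words relevant to the intercepted message.
TOY_DICT = {"a", "an", "nod", "odor", "or", "as", "mile", "smile"}

for seg in segmentations("anodorasmile", TOY_DICT):
    print(" ".join(seg))
```

Depending on exactly which words your dictionary contains, you may even turn up a breakdown or two beyond the three above (such as "a nod or as mile") — which is precisely why a dictionary alone is not enough.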

Which is the correct one? You realize now that your “dictionary technique”, which considers words in isolation, does not take context into account: the grammatical structure of the language, the meaning of the words, and their relationships to one another. Capturing this context is the job of what is known as the Language Model of the language.

Unperturbed, you turn to your vast library of books to derive this model. You start with a naïve approach, searching the library for instances where the candidate words (“odor”, “nod”…) appear. When you find them, you look at the surrounding words, to see if you can find patterns like 1, 2 and 3 above. For each time the pattern appears, you mark down a point in favor of that specific sequence. You will quickly find that “an odor as mile” is not a valid sentence, and you can drop it from the list. However, you’re still left with the choice between 2 and 3.

After completing this extensive literature survey, you sum up the points you’ve given for each sequence and note that “A nod or a smile” got the most points – this is the more likely candidate. Huzzah!
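
To see how this point-counting might look in code, here is a minimal sketch. The three-sentence corpus is a stand-in for the alien's library, and the `bigram_counts` and `score` helpers are invented for illustration: a candidate earns a point each time one of its adjacent word pairs appears in the corpus.

```python
from collections import Counter

def bigram_counts(corpus):
    """Count how often each adjacent word pair appears across the corpus."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def score(candidate, counts):
    """Award one point per corpus occurrence of each adjacent pair in the candidate."""
    words = candidate.lower().split()
    return sum(counts[pair] for pair in zip(words, words[1:]))

# A tiny stand-in for the alien's library of books.
corpus = [
    "she greeted him with a nod and a smile",
    "a nod or a wave will do",
    "he caught an odor of smoke",
]
counts = bigram_counts(corpus)
for candidate in ["an odor as mile", "an odor a smile", "a nod or a smile"]:
    print(candidate, score(candidate, counts))
```

With this toy corpus, "a nod or a smile" collects the most points, matching the alien's conclusion.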

Of course, things can get more complex very easily. Everything changes if there happens to be one smudged letter in the original text, a letter you cannot recover.


After a brief pang of annoyance, you realize there are only 26 options for the missing letter, and you decide to check each one. But now, following the process outlined above, it becomes more difficult to determine which of the possible readings is most likely. The answer might come, of course, from the words that follow: each candidate reading fits into a different sentence structure. But as you can see, things can get complicated really fast.


From Audio to Text: How Does it Really Work?

So is this really what ASR is? In a nutshell, yes. But of course, the devil (or rather, the art and the science) is in the details.

When we human beings talk, we do so without pausing between words. Go ahead and try it. If you actually make an effort to pause between words as you speak, it will sound strange, stilted, and forced. It’s no surprise that ancient texts did not see a need to separate words with spaces: people don’t do that when they speak.

The first step when conducting ASR is to use an Acoustic Model (AM) to turn tiny slivers of audio into a sequence of Phonemes. You can think of phonemes as the building blocks or “letters” of spoken language, as opposed to the familiar letters of written language. While there is some relationship between phonemes and actual letters, the two are very different. On the one hand, some letters (like “c”) can sound different depending on the context, and on the other hand, some atomically spoken sounds (like the English sound “ch”) do not have a single-letter representation. You can find more discussion of phonemes in English in this Wikipedia article, and elsewhere on the web.

Each sequence of phonemes is then converted to a set of “humanly spelled” words, using a Pronunciation Dictionary (PD). Most times, the AM and PD together will produce several possible word sequences to choose from, as in our example above. This is where the Language Model (LM) comes in, to help determine which word sequence is the most likely. 
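
As a rough sketch of that lookup step, here is a toy pronunciation dictionary with ARPAbet-style symbols (the entries and the `decode` helper are invented for illustration; real PDs contain pronunciations for tens of thousands of words). Note how a single phoneme sequence can map to more than one word sequence:

```python
def decode(phonemes, pron_dict):
    """Find every word sequence whose pronunciations concatenate to `phonemes`."""
    if not phonemes:
        return [[]]
    results = []
    for word, pron in pron_dict.items():
        n = len(pron)
        if tuple(phonemes[:n]) == pron:
            # `word` matches the start; decode the remaining phonemes recursively
            for rest in decode(phonemes[n:], pron_dict):
                results.append([word] + rest)
    return results

# Illustrative ARPAbet-style pronunciations for a handful of words.
PRON_DICT = {
    "a":   ("AH",),
    "an":  ("AH", "N"),
    "nod": ("N", "AA", "D"),
    "odd": ("AA", "D"),
}

# The phoneme sequence AH N AA D is genuinely ambiguous:
print(decode(["AH", "N", "AA", "D"], PRON_DICT))
```

Here the same four phonemes decode to both "a nod" and "an odd" — exactly the kind of tie the Language Model is brought in to break.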

How do we create such an LM, though? Since the computer does not “know” any human language, we provide it with a bunch of texts and let it learn the language from them. By “learn” we don’t mean “comprehend”, mind you – a computer cannot understand things in the way humans do since it has no real reference point – no “world” to associate words with and give them meaning. Instead, all it does is extract word patterns from those texts, and give them a “weight” that corresponds to their prevalence in the texts. With the LM at hand, we can sift through possible ways to string those phonemes together and make an educated guess of what the most likely word sequence is.
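
To make “weights” concrete, here is a minimal sketch (the three-sentence corpus and the `train_bigram_lm` helper are invented for illustration) that turns raw word-pair counts into conditional probabilities — the simplest kind of language-model weight:

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate P(next word | current word): pattern counts turned into weights."""
    pair_counts = Counter()
    word_counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        word_counts.update(words[:-1])          # every word that has a successor
        pair_counts.update(zip(words, words[1:]))  # every adjacent word pair
    # Weight of a pair = how often it occurs, relative to its first word
    return {pair: count / word_counts[pair[0]] for pair, count in pair_counts.items()}

lm = train_bigram_lm(["the cat sat", "the cat ran", "a dog sat"])
print(lm[("the", "cat")])  # "the" is always followed by "cat" in this corpus
print(lm[("cat", "sat")])  # "cat" is followed by "sat" half the time
```

A production LM uses far larger corpora, longer patterns, and smoothing for unseen pairs, but the core idea — prevalence turned into weight — is the same.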


Wait, is that all there is?

Challenges and new advances in this field abound. We’re constantly figuring out new ways to represent words so computers can process them in a meaningful way, and creative ways to define what a “word pattern” means when building our different models. Then there are the challenges of tackling missing or unclear audio (the smudged letters in our example above). All of these factors and more make perfect ASR something that is still beyond the horizon. There are still things that we humans do better than the machines we train. This is why Verbit fuses AI with Human Intelligence (HI) to provide high-quality transcription: because understanding the broad meaning of the spoken word is something humans do instinctively better than machines.

Till next time, when we will talk about the significance and power of content adaptation in our quest for high-quality ASR!


About the Author

Dr. Elisha Rosensweig received his Ph.D. in Computer Science from UMass Amherst in 2012. Since then, he’s worked as a software engineer and researcher and has held various management positions in several hi-tech companies. He is currently the Head of Data Science at Verbit, where he gets to work on fascinating projects with people who are equally as fascinating.   


Virginia Tech and Verbit – National Disability Employment Awareness Summit

“In 1945, Congress declared the first week in October ‘National Employ the Physically Handicapped Week.’ That week is now a month, so there’s more time to celebrate,” reported the National Disability Institute, explaining that the entire month of October is now dedicated to highlighting barriers to employment that people with disabilities face, and the significance of including them more efficiently in the workforce.  

Of course, for people with disabilities to be prepared to thrive in the workforce, the work needs to start earlier. Specifically, it needs to start by creating more inclusive and accessible classrooms that provide equal education opportunities to people who were previously left behind.

That’s why Virginia Tech is hosting a special event on October 4, 2019, and we’re excited to have our own Scott Ready, senior customer success and accessibility strategist at Verbit, delivering the keynote there.

Here’s everything you need to know about this event.

Discover the Best Tools and Practices for Accessible University Communication at a Free Event

Virginia Tech’s National Disability Employment Awareness event is all about exploring tools and best practices for accessible communications employed around the university. The event will include five talks. The first four will cover proactive accessible design, accessible Web communications, leveraging audio and video technology in the classroom, and how to keep C.A.L.M (Choose Accessible Learning Materials) and caption on. The fifth talk will be Scott’s keynote, which will cover the path to inclusion, and specifically, how to move from being reactive to taking a proactive approach toward accessibility for all.


Virginia Tech, the University Behind the Event: Deep Commitment to Equal Access and Opportunity for Students, Employees and Campus Visitors

One of the main reasons we’re excited to collaborate with Virginia Tech on this event is that this university is very committed to equal access and opportunity for all students. It is also committed to providing the same equality to its employees and campus visitors. To follow through on this commitment, the university provides a wealth of physical and digital resources. In the Accessible Learning Materials hub on its website, for example, the university talks about one of the event’s topics, keeping C.A.L.M and captioning on.


According to this hub, choosing accessible learning materials supports the development of Universal Design for Learning (UDL) and, therefore, supports making “learning available to the broadest possible audience.” But just as we emphasize consistently here at Verbit, Virginia Tech explains that adding captions to support hard-of-hearing students ends up benefiting “a myriad of other students, such as visual learners and multilingual learners.” If you’ve been around Verbit for a while, you know there is a very deep alignment in both values and passion here.


An Opportunity to Meet Our Senior Customer Success and Accessibility Strategist

As we shared, we’re very excited that Scott Ready, senior customer success and accessibility strategist here at Verbit, is keynoting the event. Scott’s parents, both deaf, were teachers at the Missouri School for the Deaf, so he grew up on campus and gained a unique perspective that has propelled over 20 years of his own professional experience in global higher education and K-12. He started his professional path as an assistant director of the Kansas Commission for the Deaf and as the director of community relations at Southwestern Bell Telephone Company’s Relay Center. He continued it as the community liaison for the South Carolina School for the Deaf and the Blind, enhancing the educational experience for students with varying abilities. He then served as the director of online education, as well as faculty and department chair, of the first online Interpreter Training Program (ITP) in the country. More recently, Scott spent 14 years at Blackboard, including as director of accessibility strategy, before joining Verbit.


Here’s how you can meet Scott and attend Virginia Tech’s special event:

Event Details

When: October 4, 2019, at 10:00 AM

Where: New Classroom Building, Room 230, at Virginia Tech

Contact person: Gloria Hartley, ghartely@vt.edu


