Imagine that you are an alien visiting Earth for the first time. As you circle the planet in high orbit, you intercept a random transmission in English, which reads:

ANODORASMILE…
You quickly realize that, due to some technical glitch, all spaces and punctuation are missing from the text. Being the curious and resourceful alien that you are, you hack the Library of Congress, and download a comprehensive dictionary of all the words in the English language, as well as thousands of other books to aid in your research. Wrapping your slimy head in your alien tentacles, you sit down to figure out what the original message was – a message written in a language you have never seen before, written by creatures you’ve never encountered, discussing a world you’re totally unfamiliar with.
How Would You Accomplish This Task?
Perhaps this seems like a contrived scenario to you, but if you happen to be interested in deciphering ancient texts of lost civilizations, it might feel less so. Classical Greek, for example, was written in scriptio continua, or “continuous script”, a writing style with no spaces between words.
This is also, incidentally, analogous to what a computer program is expected to do when asked to automatically transcribe an audio recording of someone speaking. This technical challenge is a central part of what is now known as Automatic Speech Recognition or ASR for short.
The Language Model: Context is Key
To get a sense of how a computer might tackle this problem, let us return to our alien persona and try to decipher the intercepted message: ANODORASMILE…
Since you have a dictionary, your first step is to search for words that start with “A”. You quickly determine that, for this sequence, the only candidates are “A” or “An”. Encouraged by your progress, you try to decipher the message by reading on, looking for possible candidates for the next set of letters. (Take a stab at it yourself – it’s fun!)
After some time, you conclude there are three ways to break down the sequence into legitimate English words:
- An odor as mile
- An odor, a smile
- A nod or a smile
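The “dictionary technique” is easy to sketch in code. Below is a minimal, illustrative Python version: the tiny DICTIONARY stands in for the alien’s full download, and the recursive search enumerates every way to carve the letter sequence into known words.

```python
# A minimal sketch of the "dictionary technique": recursively split a
# letter sequence into every possible run of dictionary words.
# This toy DICTIONARY is an illustrative stand-in for a full dictionary.
DICTIONARY = {"a", "an", "as", "nod", "odor", "or", "mile", "smile"}

def segmentations(text):
    """Return every way to break `text` into dictionary words."""
    if not text:
        return [[]]  # one valid split of the empty string: no words at all
    results = []
    for i in range(1, len(text) + 1):
        word = text[:i]
        if word in DICTIONARY:
            # keep this word and recurse on whatever letters remain
            for rest in segmentations(text[i:]):
                results.append([word] + rest)
    return results

for words in segmentations("anodorasmile"):
    print(" ".join(words))
```

With a real dictionary the candidate list would be far longer, which is exactly why a way to rank candidates, the Language Model, is needed next.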
Which is the correct one? You realize now that your “dictionary technique”, which looks at words in isolation, takes no account of context: the grammatical structure of the language, the meaning of the words, and their relationships to one another. This contextual knowledge is what is known as the Language Model of the language.
Unperturbed, you turn to your vast library of books to derive this model. You start with a naïve approach, searching the library for instances where the candidate words (“odor”, “nod”…) appear. When you find them, you look at the surrounding words to see if you can find patterns like the three candidate sequences above. Each time a pattern appears, you mark down a point in favor of that specific sequence. You will quickly find that “an odor as mile” never occurs as a valid sentence, and you can drop it from the list. However, you’re still left with the choice between the other two.
After completing this extensive literature survey, you sum up the points you’ve given for each sequence and note that “A nod or a smile” got the most points – this is the more likely candidate. Huzzah!
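The point-counting survey above can be sketched as a naive adjacency count: for each candidate, tally how often its adjacent word pairs appear in the corpus. The tiny corpus below is a made-up stand-in for the downloaded library, purely for illustration.

```python
# A sketch of the naive scoring pass: for each candidate split, count how
# often each adjacent word pair appears in the corpus, and sum the counts.
# The toy `corpus` below is a made-up stand-in for a real library of books.
corpus = (
    "she gave him a nod or a smile whenever he passed "
    "an odor filled the room a smile crossed her face"
).split()

def pair_count(w1, w2):
    """How many times does `w1` immediately precede `w2` in the corpus?"""
    return sum(1 for a, b in zip(corpus, corpus[1:]) if (a, b) == (w1, w2))

def score(candidate):
    words = candidate.split()
    return sum(pair_count(a, b) for a, b in zip(words, words[1:]))

candidates = ["an odor as mile", "an odor a smile", "a nod or a smile"]
best = max(candidates, key=score)
print(best)  # with this toy corpus: "a nod or a smile"
```

Real systems count far more than adjacent pairs, but the principle of awarding points for patterns observed in text is the same.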
Of course, things can get more complex very easily. Everything changes if there happened to be one smudged letter in the original text, which you cannot recover. It then reads:

ANODORAS*ILE…

(with the asterisk marking the smudged letter)
After a brief pang of annoyance, you realize there are only 26 options for the missing letter, and you decide to check each one. But now, following the process outlined above, it becomes harder to determine whether the sentence more likely reads
A NOD OR A SMILE…
AN ODOR AS VILE…
The answer might come, of course, from the words that follow. Each of the two options above will fit into a different sentence structure. But as you can see, things can get complicated really fast.
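The smudge-recovery step is also easy to sketch: substitute each of the 26 letters in turn and keep only the guesses that still break into dictionary words. Again, the small dictionary is an illustrative stand-in.

```python
import string

# Sketch of the smudge-recovery step: try each of the 26 letters in the
# unreadable position, keeping only guesses that split into dictionary words.
# The toy DICTIONARY is an illustrative stand-in for a full dictionary.
DICTIONARY = {"a", "an", "as", "nod", "odor", "or", "mile", "smile", "vile"}

def splits_into_words(text):
    """True if `text` can be broken entirely into dictionary words."""
    if not text:
        return True
    return any(text[:i] in DICTIONARY and splits_into_words(text[i:])
               for i in range(1, len(text) + 1))

smudged = "anodoras_ile"  # '_' marks the unreadable letter
viable = [smudged.replace("_", c) for c in string.ascii_lowercase
          if splits_into_words(smudged.replace("_", c))]
print(viable)
```

With this toy dictionary, only two substitutions survive, corresponding to the “smile” and “vile” readings above; choosing between them is again the Language Model’s job.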
From Audio to Text: How Does it Really Work?
So is this really what ASR is? In a nutshell, yes. But of course, the devil (or rather, the art and the science) is in the details.
When we human beings talk, we do so without pausing between words. Go ahead and try it. If you make a deliberate effort to pause between words as you speak, it will sound strange, stilted, and forced. It’s no surprise, then, that ancient texts saw no need to separate words with spaces: people don’t do that when they speak.
The first step when conducting ASR is to use an Acoustic Model (AM) to turn tiny slivers of audio into a sequence of Phonemes. You can think of phonemes as the building blocks or “letters” of spoken language, as opposed to the familiar letters of written language. While there is some relationship between phonemes and actual letters, the two are very different. On the one hand, some letters (like “c”) can sound different depending on the context; on the other hand, some atomically spoken sounds (like the English “ch”) have no single-letter representation. You can find more discussion of English phonemes on Wikipedia and elsewhere on the web.
Each sequence of phonemes is then converted to a set of “humanly spelled” words using a Pronunciation Dictionary (PD). Most of the time, the AM and PD together will produce several possible word sequences to choose from, as in our example above. This is where the Language Model (LM) comes in, helping to determine which word sequence is the most likely.
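A hedged sketch of the PD step, assuming rough, ARPAbet-style pronunciations invented for this example (real pronunciation dictionaries, such as the CMU dictionary, are far larger and messier): it carves a phoneme sequence into known words, just as the dictionary technique carved letters.

```python
# Sketch of the Pronunciation Dictionary step: map a phoneme sequence to
# candidate words. Both the phoneme spellings and the tiny PD below are
# rough, illustrative guesses, not real dictionary entries.
PD = {
    ("AH",): "a",
    ("AH", "N"): "an",
    ("N", "AA", "D"): "nod",
    ("OW", "D", "ER"): "odor",
    ("AO", "R"): "or",
    ("S", "M", "AY", "L"): "smile",
}

def words_from_phonemes(phones):
    """All ways to carve the phoneme sequence into PD entries."""
    if not phones:
        return [[]]
    out = []
    for i in range(1, len(phones) + 1):
        chunk = tuple(phones[:i])
        if chunk in PD:
            out.extend([PD[chunk]] + rest
                       for rest in words_from_phonemes(phones[i:]))
    return out

phones = ["AH", "N", "AA", "D", "AO", "R", "AH", "S", "M", "AY", "L"]
for words in words_from_phonemes(phones):
    print(" ".join(words))
```

Notice that in this toy example the phonemes happen to rule out “an odor” (its vowels differ from “a nod”), which hints at how the AM helps; real audio is noisy, so in practice several candidates usually survive for the LM to arbitrate.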
How do we create such an LM, though? Since the computer does not “know” any human language, we provide it with a large body of texts and let it learn the language from them. By “learn” we don’t mean “comprehend”, mind you – a computer cannot understand things the way humans do, since it has no real reference point – no “world” to associate words with and give them meaning. Instead, all it does is extract word patterns from those texts and give each a “weight” that corresponds to its prevalence in the texts. With the LM at hand, we can sift through the possible ways to string those phonemes together and make an educated guess at the most likely word sequence.
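The “weights” idea can be sketched as a toy bigram model: count adjacent word pairs in a corpus, turn the counts into conditional probabilities, and score a candidate sentence by multiplying the weights along it. The corpus and the smoothing floor below are illustrative assumptions, not how production LMs are built.

```python
from collections import Counter

# Sketch of "learning" a toy language model: turn word-pair counts into
# weights (conditional probabilities), then score a sentence by multiplying
# the weights of its adjacent pairs. The corpus is a made-up stand-in.
corpus = "a nod or a smile a smile or a nod an odor so vile".split()

pair_counts = Counter(zip(corpus, corpus[1:]))
word_counts = Counter(corpus)

def weight(w1, w2):
    """Estimate P(w2 | w1) from the corpus; the small floor in the
    numerator and the +1 in the denominator keep unseen pairs from
    being scored as outright impossible."""
    return (pair_counts[(w1, w2)] + 0.01) / (word_counts[w1] + 1)

def score(sentence):
    words = sentence.split()
    s = 1.0
    for a, b in zip(words, words[1:]):
        s *= weight(a, b)
    return s

print(score("a nod or a smile") > score("an odor a smile"))  # True
```

Pair (bigram) weights are the simplest case; real language models weigh much longer and richer patterns, but the principle of scoring a candidate by the prevalence of its patterns in text is the same.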
Wait, is that all there is?
Challenges and new advances in this field abound. We’re constantly figuring out new ways to represent words so computers can process them in a meaningful way, and creative ways to define what a “word pattern” means when building our different models. Then there are the challenges of tackling missing or unclear audio (the smudged letters in our example above). All of these factors and more make perfect ASR something that is still beyond the horizon. There are still things that we humans do better than the machines we train. This is why Verbit fuses AI with Human Intelligence (HI) to provide high-quality transcription: because understanding the broad meaning of the spoken word is something humans do instinctively better than machines.
Till next time, when we will talk about the significance and power of content adaptation in our quest for high-quality ASR!
About the Author
Dr. Elisha Rosensweig received his Ph.D. in Computer Science from UMass Amherst in 2012. Since then, he’s worked as a software engineer and researcher and has held various management positions in several hi-tech companies. He is currently the Head of Data Science at Verbit, where he gets to work on fascinating projects with people who are equally as fascinating.