Will AI Solutions Replace Stenographers?


Let’s start with a basic but important question: What is AI? Essentially, it’s the broad discipline of creating smart machines that can perform tasks reminiscent of human capabilities, such as recognizing objects, text, sounds, and speech, and solving different types of problems. A closely related, more technical term is machine learning: the subfield of AI in which systems learn patterns from data rather than following explicitly programmed rules.

AI is often described as superhuman in its capabilities, so a quick reality check on what AI can and can’t do wouldn’t hurt. In 1997, IBM’s Deep Blue program defeated world champion Garry Kasparov in a chess match, a feat that shattered the way people thought about machine capabilities. Fast forward to 2016, when another huge success in the domain took place: AlphaGo, designed by Google’s DeepMind group, defeated world champion Lee Sedol at Go, a game far more open-ended than chess. It was predicted that it would take many more years for a machine to beat a top human at Go, but it happened.

But which situations result in AI underperforming? What is different about those tasks, compared to a task where the AI excels? The difference is real-world knowledge. For example, accurately identifying an image requires an understanding of how things exist in reality, such as how objects look when they’re rotated in three dimensions. It’s a totally different set of skills than logic-based problems like chess or Go. For situations that involve real-world knowledge, humans maintain a huge advantage over machines. 

Therefore, in the world of court reporting, we’re faced with a dilemma: How can we still leverage machine capabilities in a domain that is so obviously dependent on knowledge of the real world? The answer lies in the hybrid model, which combines the strengths of artificial and human intelligence for optimal results. This means deploying both machines and people, each in their own element, to carry out tasks in the most efficient way: the AI is assigned the easier, more repetitive tasks, which it can perform accurately, quickly, and at low cost, while the more difficult, complex, and creative tasks are left to highly skilled individuals.

Let’s dive into the example of legal transcription in greater detail, and how both humans and machines are involved in the process. A computer-generated transcript can be produced using speech-to-text technology. However, in the legal domain, there are often complex and industry-specific terms mentioned, meaning the computer may mistranscribe some of these terms. That’s where the element of human expertise comes in. Human transcribers can go over the automatically-generated text and make any necessary corrections. Legal transcription is a perfect example of the fusion of artificial and human intelligence, as the computer does most of the simple work, with human expertise coming in later to perform tougher terminology and context-related corrections. 
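As a rough sketch of how this division of labor can work, imagine the ASR engine attaching a confidence score to each word, with anything below a threshold routed to the human transcriber. The data format, the threshold, and the scores below are illustrative assumptions, not any particular vendor’s actual pipeline:

```python
# A minimal sketch of the hybrid workflow: the ASR engine emits each word
# with a confidence score, and words below a threshold are flagged for a
# human transcriber to review.

def words_needing_review(asr_output, threshold=0.85):
    """Return the indices of words whose ASR confidence is too low."""
    return [i for i, (word, conf) in enumerate(asr_output) if conf < threshold]

draft = [("the", 0.99), ("plaintiff", 0.97), ("alleges", 0.95),
         ("subrogation", 0.41), ("of", 0.99), ("the", 0.99), ("claim", 0.92)]

flagged = words_needing_review(draft)
# The rare legal term gets flagged; the human corrects only that span.
print([draft[i][0] for i in flagged])
```

The machine handles the bulk of the easy words; the human reviews only the handful of uncertain spans.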

Even with complex subject matter, there are ways to ensure that automatically produced transcripts are of high quality. The last decade has seen tremendous advances in speech recognition, driven by the huge amounts of data that have become available. Developments in machine learning, deep learning, and neural networks have enabled significant improvement, although a gap in understanding still remains. Then there is the issue of audio quality: if the audio is difficult, obtaining an accurate transcription is challenging. Similarly, if the speakers have accents the machine hasn’t previously been exposed to, it will not perform at its best.

The best way to minimize these issues and ensure the highest level of precision is to train the machine on as many elements as possible, most notably on taxonomy. This is where AI really shines. In the domain of court reporting, if the topic of the hearing is known in advance, such as a medical issue or an insurance claim, then this data can be fed to the AI. This will bring in the necessary terminology, have a specific model for the case and, consequently, produce much higher accuracy for these terms than a human, who likely will not be particularly familiar with these concepts.
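One simple way to picture this effect: real systems bias the language model itself during decoding, but the toy rescoring below conveys the idea. The candidate transcripts, scores, and term list are all invented for the illustration:

```python
# A toy illustration of domain adaptation: given several candidate
# transcripts for the same audio, boost the score of candidates that
# contain terms from a case-specific vocabulary fed in beforehand.

def rescore(candidates, domain_terms, boost=0.1):
    """Add `boost` to a candidate's score for each domain term it contains."""
    rescored = []
    for text, score in candidates:
        hits = sum(term in text.lower() for term in domain_terms)
        rescored.append((text, score + boost * hits))
    return max(rescored, key=lambda pair: pair[1])

medical_terms = {"myocardial", "infarction", "stent"}
candidates = [
    ("the patient suffered a mild cardial in function", 0.52),
    ("the patient suffered a myocardial infarction", 0.48),
]
best, _ = rescore(candidates, medical_terms)
print(best)
```

Without the domain vocabulary, the garbled first candidate would have won; with it, the correct medical phrasing comes out on top.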

Beyond the process of transcription, AI can be practically applied to a variety of tasks related to court reporting. Think of a live court reporting session. Suppose there is a need to go back to what someone said earlier in the proceeding. A stenographer would have to search their notes for that statement, while a digital reporter would have to quickly scan the log notes, hoping they wrote something relevant. An AI program, on the other hand, could simply search for the phrase that is entered and find the exact place in the audio where it was stated. Even if the original transcription contained an error, the technology would allow searching the audio itself, which could then be played back.
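A sketch of how such a search could work, assuming the ASR output pairs each word with its start time in seconds (a common output shape, though the exact schema varies by system):

```python
# Phrase search over a timestamped transcript: finding a phrase yields
# the exact audio position to play back, even if a nearby word was
# mistranscribed.

def find_phrase(transcript, phrase):
    """Return the start time of the first occurrence of `phrase`, or None."""
    words = [w for w, _ in transcript]
    target = phrase.lower().split()
    for i in range(len(words) - len(target) + 1):
        if [w.lower() for w in words[i:i + len(target)]] == target:
            return transcript[i][1]
    return None

transcript = [("objection", 12.4), ("your", 13.0), ("honor", 13.3),
              ("the", 15.1), ("witness", 15.4), ("already", 15.9),
              ("answered", 16.4)]
print(find_phrase(transcript, "the witness"))  # start time in seconds
```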

Let’s look at another scenario. It’s the digital reporter’s duty to make an accurate record of the court proceeding, so if someone speaks unclearly, they must note it. This can sometimes go unnoticed by a busy human but is easily detected by a machine, which can pick up on it and alert the reporter. The same goes for going off the record, ambient noise, and overlapping conversation. All of these scenarios can be detected by machines, which can then alert the reporter.
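For instance, overlapping conversation can be spotted directly from speaker-diarization output. The segment format below (speaker label with start and end times) is an assumption for illustration:

```python
# Detect overlapping speech from diarization segments: whenever two
# different speakers' time intervals intersect, record the overlap so
# the reporter can be alerted.

def overlaps(segments):
    """Return (speaker_a, speaker_b, start, end) for overlapping speech."""
    found = []
    for i, (spk1, s1, e1) in enumerate(segments):
        for spk2, s2, e2 in segments[i + 1:]:
            start, end = max(s1, s2), min(e1, e2)
            if start < end and spk1 != spk2:
                found.append((spk1, spk2, start, end))
    return found

diarization = [("counsel", 0.0, 4.2), ("witness", 3.8, 7.5), ("counsel", 8.0, 9.1)]
print(overlaps(diarization))
```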

Therefore, going back to the question in the title: Will AI replace stenographers? The short answer is no, as human expertise remains essential to work alongside the technology. However, to make the most of their unique and specialized skills, stenographers should target situations where the technology cannot be applied, such as settings where there is no digital setup. 

The rise of AI technology has significantly impacted almost every aspect of daily life and nearly every professional industry. AI offers distinct advantages and strengths in many domains, and legal is no exception. In particular, AI and machine learning technology have the potential to achieve faster transcription turnaround and a high level of accuracy for court reporters, as well as assist with other elements of a court proceeding.

ASR 101: Introduction to Speech Recognition

Imagine that you are an alien visiting Earth for the first time. As you circle the planet in high orbit, you intercept a random transmission in English, which reads:

ANODORASMILE

You quickly realize that, due to some technical glitch, all spaces and punctuation are missing from the text. Being the curious and resourceful alien that you are, you hack the Library of Congress, and download a comprehensive dictionary of all the words in the English language, as well as thousands of other books to aid in your research. Wrapping your slimy head in your alien tentacles, you sit down to figure out what the original message was – a message written in a language you have never seen before, written by creatures you’ve never encountered, discussing a world you’re totally unfamiliar with.


How Would You Accomplish This Task?

Perhaps this seems like a contrived scenario to you, but if you happen to be interested in deciphering ancient texts of lost civilizations, it might feel less so. Classical Greek, for example, used to be written in Scriptio continua, or “Continuous Script”, a writing style where there are no spaces between words. 

This is also, incidentally, analogous to what a computer program is expected to do when asked to automatically transcribe an audio recording of someone speaking. This technical challenge is a central part of what is now known as Automatic Speech Recognition or ASR for short.


The Language Model: Context is Key 

To get a sense of how a computer might tackle this problem, let us return to our alien persona and try to decipher the intercepted message:

ANODORASMILE

Since you have a dictionary, your first step is to search for words that start with “A”. You quickly determine that, for this sequence, the only candidates are “A” or “An”. Encouraged by your progress, you try to decipher the message by reading on, looking for possible candidates for the next set of letters. (Take a stab at it yourself – it’s fun!)

After some time, you conclude there are three ways to break down the sequence into legitimate English words:

  1. An odor as mile
  2. An odor, a smile
  3. A nod or a smile
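The alien’s dictionary search can be sketched as a recursive segmentation. The tiny dictionary below stands in for the full Library of Congress download, and depending on the dictionary used, a few extra near-duplicate splits can surface as well:

```python
# Recursive word segmentation: try every dictionary word that prefixes
# the remaining text, then recurse on what is left.

DICTIONARY = {"a", "an", "as", "nod", "odor", "or", "mile", "smile"}

def segmentations(text, dictionary=DICTIONARY):
    """Return every way to split `text` into dictionary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results

for parse in segmentations("anodorasmile"):
    print(" ".join(parse))
```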

Which is the correct one? You realize now that your “dictionary technique”, which focuses on words in isolation, does not take context into account: the grammatical structure of the language, the meaning of the words, and their relationship to one another. This context is what is known as the Language Model of the language.

Unperturbed, you turn to your vast library of books to derive this model. You start with a naïve approach, searching the library for instances where the candidate words (“odor”, “nod”…) appear. When you find them, you look at the surrounding words, to see if you can find patterns like 1, 2 and 3 above. For each time the pattern appears, you mark down a point in favor of that specific sequence. You will quickly find that “an odor as mile” is not a valid sentence, and you can drop it from the list. However, you’re still left with the choice between 2 and 3.

After completing this extensive literature survey, you sum up the points you’ve given for each sequence and note that “A nod or a smile” got the most points – this is the more likely candidate. Huzzah!
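This point-counting survey is essentially a bigram count: for each candidate sentence, sum how often its adjacent word pairs occur in the library. A minimal sketch, with a two-sentence corpus standing in for the thousands of downloaded books:

```python
# Score candidate sentences by how often their adjacent word pairs
# (bigrams) appear in a reference corpus; the highest total wins.

from collections import Counter

CORPUS = ("she greeted him with a nod or a smile "
          "a smile costs nothing but a nod will do").split()

bigram_counts = Counter(zip(CORPUS, CORPUS[1:]))

def score(sentence):
    """Total corpus count of the sentence's adjacent word pairs."""
    words = sentence.split()
    return sum(bigram_counts[pair] for pair in zip(words, words[1:]))

candidates = ["an odor a smile", "a nod or a smile"]
best = max(candidates, key=score)
print(best)
```

Real language models are far more sophisticated, but the principle is the same: the sequence whose patterns are more prevalent in the training texts wins.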

Of course, things can get more complex very easily. Everything changes if there happened to be one smudged letter in the original text, which you cannot recover. It then reads 


After a brief pang of annoyance, you realize there are only 26 options to replace the missing letter, and you decide to check each one. But now, following the process outlined above, it becomes more difficult to determine if it’s more likely the sentence reads


Or rather


The answer might come, of course, from the words that follow. Each of the two options above will fit into a different sentence structure. But as you can see, things can get complicated really fast.


From Audio to Text: How Does it Really Work?

So is this really what ASR is? In a nutshell, yes. But of course, the devil (or rather, the art and the science) is in the details.

When we human beings talk, we do so without pausing between words. Go ahead and try it. If you actually make an effort to pause between words as you speak, it will sound strange, stilted, and forced. It’s no surprise that ancient texts did not see a need to separate words with spaces, since people don’t do that when they speak.

The first step when conducting ASR is to use an Acoustic Model (AM) to turn tiny slivers of audio into a sequence of Phonemes. You can think of phonemes as the building blocks or “letters” of spoken language, as opposed to the familiar letters of written language. While there is some relationship between phonemes and actual letters, the two are very different. On the one hand, some letters (like “c”) can sound different depending on the context; on the other hand, some atomically spoken sounds (like the English sound “ch”) do not have a single-letter representation. You can find more discussion of the phonemes of English on Wikipedia and elsewhere on the web.

Each sequence of phonemes is then converted to a set of “humanly spelled” words, using a Pronunciation Dictionary (PD). Most times, the AM and PD together will produce several possible word sequences to choose from, as in our example above. This is where the Language Model (LM) comes in, to help determine which word sequence is the most likely. 
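A toy version of such a dictionary, using ARPAbet-style phoneme symbols (the symbols and the greedy decoding below are illustrative simplifications; real decoders weigh many competing hypotheses at once):

```python
# A miniature pronunciation dictionary: phoneme sequences map to spelled
# words. The decoder greedily matches the longest known phoneme chunk at
# each position.

PRONUNCIATIONS = {
    ("AH",): "a",
    ("AE", "N"): "an",
    ("N", "AA", "D"): "nod",
    ("OW", "D", "ER"): "odor",
    ("AO", "R"): "or",
    ("S", "M", "AY", "L"): "smile",
}

def lookup(phonemes):
    """Greedily convert a phoneme sequence into words (a toy decoder)."""
    words, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):
            chunk = tuple(phonemes[i:i + length])
            if chunk in PRONUNCIATIONS:
                words.append(PRONUNCIATIONS[chunk])
                i += length
                break
        else:
            raise ValueError(f"no word matches at position {i}")
    return words

print(lookup(["AH", "N", "AA", "D", "AO", "R", "AH", "S", "M", "AY", "L"]))
```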

How do we create such an LM, though? Since the computer does not “know” any human language, we provide it with a bunch of texts and let it learn the language from them. By “learn” we don’t mean “comprehend”, mind you – a computer cannot understand things the way humans do, since it has no real reference point – no “world” to associate words with and give them meaning. Instead, all it does is extract word patterns from those texts and give each a “weight” that corresponds to its prevalence in the texts. With the LM at hand, we can sift through the possible ways to string those phonemes together and make an educated guess at the most likely word sequence.


Wait, is that all there is?

Challenges and new advances in this field abound. We’re constantly figuring out new ways to represent words so computers can process them in a meaningful way, and creative ways to define what a “word pattern” means when building our different models. Then there are the challenges of tackling missing or unclear audio (the smudged letters in our example above). All of these factors and more make perfect ASR something that is still beyond the horizon. There are still things that we humans do better than the machines we train. This is why Verbit fuses AI with Human Intelligence (HI) to provide high-quality transcription: because understanding the broad meaning of the spoken word is something humans do instinctively better than machines.

Till next time, when we will talk about the significance and power of content adaptation in our quest for high-quality ASR!


About the Author

Dr. Elisha Rosensweig received his Ph.D. in Computer Science from UMass Amherst in 2012. Since then, he’s worked as a software engineer and researcher and has held various management positions in several hi-tech companies. He is currently the Head of Data Science at Verbit, where he gets to work on fascinating projects with people who are equally fascinating.
