AI for Speech Recognition and Transcription in Law and Legal


Have you ever been frustrated with how Alexa or Siri don’t always understand your verbal requests? If so, then you already understand the problem that our guest struggles with. He’s Tom Livne, co-founder and CEO of Verbit.ai.

Verbit is a company that focuses on AI transcription services, specifically for the law and legal space. They use a combination of machine learning and human experts to transcribe audio in different accents, in different noise environments, with different diction, to give people more accurate results and hopefully help the process scale.

In this episode, Livne explains six different factors that go into getting transcription right and getting AI to be able to aid in the process. In addition, he talks about some of the critical factors for where transcription will come into play in terms of bringing value into business.


Guest: Tom Livne, co-founder and CEO – Verbit.ai

Expertise: Entrepreneurship/tech startup life cycle

Brief Recognition: Livne holds an MBA in Business Administration from Yale


Interview Highlights


(03:00) Give us an understanding of what’s possible with transcription today?

TL: Think about this podcast. We are recording this episode and let’s assume we want to get a professional transcript. When I’m referring to a professional transcript, I mean 100% accuracy. And the way it’s been done today, it’s fully manual, right? People are listening and typing it from scratch and it creates a limited capacity of scale and low gross margin.

On the other hand, speech recognition technology can reach only 70 to 80%. If we’re going to court and give the automated transcript only, this is not good enough. So the way we solve it at Verbit is [with] the approach of the machine-human hybrid.

So we have our own speech recognition technology we’ve developed in-house. We have patents registered for our technology. We have a team of nine PhDs working on it. We have the combination of our network and platform of freelance transcribers from all around the globe that take in the automated output of the machine and correct it in order to bring it to 100%.

So regarding what is possible, I mentioned that the technology is not there, and I’ll explain why. There are a few parameters that affect the accuracy of the speech [recognition], and this is the reason the machine [falls short]. And also, in my point of view, even 10 years from now, we won’t be able to get to 100% [with the] machine only.

So the parameters that affect the accuracy of the speech recognition are, one, the language model. So think about if you go to legal transcription or medical transcription: there is a lot of specific jargon and specific words that are relevant for this use case. For the machine, it’s really, really hard to do it, also to get the names of people, also to get specific terminology, so this affects the accuracy.

The second thing is the acoustic model. So if you’re talking in an open space or if you’re talking via phone or if you have a courtroom, et cetera, all of these [call for a] different acoustic model, [which] also affects the accuracy of the speech.

And the third one, as you can hear my terrible Israeli accent, so usually accents affect the accuracy of the voice-to-text. So you need to tune it to train the machine for a specific accent. Then you have the fourth one: background noise. Overlapping of people, all the background noise, is really damaging the quality of the output of the machine.

And the fifth one is the pace of when you talk. You talk really, really fast or you’re talking slowly, then it also affects the accuracy.

And the last one will be the diction. If there are people, young people or children talking or elder people talking, this is also really specific diction that affects the accuracy of the speech. So if you combine…all the parameters in a different use case, it’s really, really hard, almost impossible to get all of this correctly. Unless you have specific data for this specific use case, combine all these parameters together for this specific customer, this will enable you to get 90 plus percent accuracy.

Our work in Verbit is not to replace the humans, [but] actually to help the human to do a better job and to make their life easier.


(08:30) These are challenging factors here. I’m wondering which of these is the most insurmountable.

TL: I think each one of those is very tough in its own unique way, but if you ask me, I think the acoustic model and the background noise and the ability to identify different speakers, et cetera, this is very hard to tackle, and to make adaptation for different acoustic environments and…controlling the quality of the audio recording.

To be able to adapt the algorithm accordingly, this is something that is very challenging and with all of the neural nets and the ability to train, still it’s having a hard time to understand sometimes when you put to the machine something with bad recording and bad acoustic…I think this is the toughest one.


(10:30) In other words, is that still where human intuition might still have a bastion of specialness, even if algorithms are trained to…take poor audio and fill in the blanks, is that still something where you think humans are gonna have the edge?

TL: I do believe so because they have the ability to hear it again and again and get the input in to understand the context of what has been said.

So I guess a courtroom…will never be satisfied with a machine only because they are required by law to have the 100% [accuracy] and this is going to take a lot of time and a leap of faith until they would be able to believe that the machine would be able to just get the perfect output for them to submit…You have Google, you mention Baidu…they are building something very generic. Something that should be suitable for everyone and…because we are taking more the vertical approach, this allows us to be much more tailor-made for any of the customers and will give us the advantage to get better results.

Because at the end of the day…what is speech recognition technology? Speech recognition is trying to identify what has been said and there are very complex statistical models that give the ranking, in showing you the best probability of the best guess for the machine what has been said. You have a lot of parameters that try to guess in the best way what has been said there. And this [is] actually because you think about Verbit as a contextual layer there. When you are in a generic engine, speech recognition engine, you just put the input, which is audio, and output will be text based on the same algorithm that everyone used for speech recognition.

If you think about Verbit…you need to use this contextual layer that gives you [information such as] the person that talked, and you have this accent and this is the jargon that he is talking about, legal space, in this acoustic environment. So use all these parameters in order to give better accuracy in the transformation from voice-to-text before you do it. This is something that helps us because we are not trying to be generic, we are trying to be very tailor-made.
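As a rough illustration of the contextual layer Livne describes, the sketch below re-ranks a generic engine's candidate transcripts using vocabulary expected in a given domain. It is not Verbit's method, which the interview does not detail. The `rescore` function, the `boost` value, the candidate sentences and their probabilities are all hypothetical.

```python
import math


def rescore(hypotheses, domain_terms, boost=2.0):
    """Return the best transcript after rewarding domain vocabulary.

    hypotheses: list of (text, log_probability) pairs from a generic engine.
    domain_terms: set of lowercase words expected in this use case.
    boost: hypothetical log-score bonus added for each matching word.
    """
    def score(item):
        text, log_prob = item
        matches = sum(1 for word in text.lower().split() if word in domain_terms)
        return log_prob + boost * matches

    return max(hypotheses, key=score)[0]


# Made-up candidate list: the generic engine prefers the first guess.
n_best = [
    ("the plane tiff filed a motion", math.log(0.5)),
    ("the plaintiff filed a motion", math.log(0.3)),
    ("the plaintiff filed emotion", math.log(0.2)),
]
legal_terms = {"plaintiff", "motion", "deposition"}

generic_choice = rescore(n_best, set())       # no context: highest raw probability wins
legal_choice = rescore(n_best, legal_terms)   # legal context: jargon is rewarded
```

With no domain terms the engine's top guess is kept. With the legal vocabulary, the candidate containing both "plaintiff" and "motion" wins, which is the kind of gain a tailor-made model is meant to deliver over a generic one.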


(14:30) When you think about what we’ll be able to do five years from now that we can’t do now with transcription, where are you most hopeful that real traction will be made in terms of improvement?

TL: So the way we are thinking about it in Verbit is to be much beyond just transcription. We think that transcription just got much smarter, and what do I mean by that? Think about the use-case of…calls? When you have publicly traded companies…at the end [of the] quarter talking to the analyst about the company results.

Think about having an automated transcription for it, and then you already have the pace data and you can create actionable links and intents and you know let’s say Apple is talking about iPhone X, so you can identify in your transcription that this is what has been said and you can…click…and go directly to the website and buy the iPhone X. You can do a comparison, take all the numbers that you just automatically transcribe and create a graph and create a visualization and compare it to past results because you already have the transcription of the past results. And to get much more insights from the data.

Because we are allowing people to get more value out of their verbal assets so all this verbal communication and information that has been exchanged we want to allow our customer to get more value.


(17:30) Can you talk about the business value of transcription?

TL: Think about once you have the examination of a witness and then you can see if in his past testimonial does he contradict himself? Maybe he’s lying [so we can] try to analyze in his voice to get some realization of the text. You have many things that you can extract, so the speech and the transcription is the first layer. You can do on top of it many, many things. We think that the transcription market is very, very big. Once we would be able to increase the accuracy and we would be able to allow more people to get more value out of their verbal assets.

Will artificial intelligence make the college classroom more accessible?

New tools designed to help institutions meet accessibility requirements stand to personalize learning for all students.

Artificial intelligence (AI) has seeped into almost every corner of higher education, popping up in the classroom, administrative offices, and even in dorm rooms and on campus grounds — all with the promise to streamline tasks and create a more personalized college experience for students.

As the technology steers colleges away from a one-size-fits-all approach, it is helping them make progress on one of their most long-running goals: making higher ed more accessible to all types of learners.

It is doing that in several ways, among them scanning class materials for accessibility issues, improving learning tools for students with disabilities and offering personalized resources for learners who may need additional support, such as those who speak English as a second language.

AI stands to open the door to levels of accessibility that weren’t possible before, and its effects extend to the entire student body.

“So many of the barriers that are in the (college) environment are due to technology,” said Cynthia Curry, director of the National Center on Accessible Educational Materials for Learning. “If there can be systems built within technology to automatically, accurately and consistently make sure that the technology is being delivered in a way that’s inherently accessible to all learners, that’s really exciting.”

Accessibility at scale

Speech-to-text software is perhaps one of the most prominent examples of AI being used to assist students with disabilities on campus. As the name suggests, the software can take audio and translate it into written word, helping those who can’t or may have difficulty taking notes or hearing an instructor during class.

The Americans with Disabilities Act (ADA) and the Rehabilitation Act require public colleges and institutions that receive federal funds to provide transcription services to students who need them. But doing so is no small feat; it can take hours for service providers to transcribe and write captions for all of a class’s materials.

That’s an issue Brigham Young University-Idaho ran up against several years ago, said Valerie Sturm, the university’s deaf and hard of hearing services coordinator. As the use of online learning tools, such as movies, YouTube clips and TED Talks, grew in the classroom, the university’s backlog of transcription requests for media swelled.

“We were behind in media by several thousand pieces — not minutes, but pieces,” Sturm said.

In 2017, Sturm shopped around for a speech-to-text service to cut down on the mounting work, eventually settling on Verbit, a start-up that also claims Harvard University, Stanford University and Coursera as customers. Verbit quickly proved itself, chopping the media backlog down within a matter of weeks, Sturm said.

The company now does most of the post-production captioning for the university’s classroom media and performs some of its transcriptions for live lectures, though on-site service providers are still used as well, Sturm said. While some students prefer the university’s human providers, which are either community or student employees, she believes Verbit can be more reliable.

“It really is a viable option for students who have a concern about speed and accuracy because (for) the on-site service providers, … it depends on their individual response that day,” Sturm said. “Are they feeling good? Are they tired? Have they got their own class project to worry about so their mind isn’t really on their business? The artificial intelligence doesn’t have those kinds of issues.”

Speech recognition has seen vast improvements in recent years. Heavyweights in the field such as Google and Microsoft, for instance, now boast speech-to-text accuracy rates that hover around the 95% mark, which is around what human transcribers can do.

But in the classroom — where background noise, cheap recording equipment and the use of jargon and other uncommon words can make speech harder to understand — the technology’s accuracy can dip.

Christopher Phillips, electronic and information technology accessibility coordinator at Utah State University, said he sees most companies advertising their accuracy rate at roughly 80% to 90%. “(That) sounds high,” he said, “but you’re saying one out of five or one out of 10 words is incorrect, which is pretty crummy service.”

However, the industry standard for closed captioning is a 99% or better accuracy rate, meaning a human still has to check and correct the work of AI-powered speech-to-text tools. Even so, the technology improves with time, and can even point out to instructors where in their lectures they’re not enunciating or if they’re using odd phrases that could make their speech difficult for the software to pick up.
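One common way to put a number on transcription accuracy is word error rate (WER): the word-level edit distance between a trusted reference transcript and the machine's output, divided by the length of the reference. The sketch below is a minimal version under that assumption. The article does not say how any vendor calculates its advertised figures, and the sample sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: two of the ten reference words are misrecognized.
reference = "please read chapter four on cellular respiration before next lecture"
machine = "please read chapter for on cellular restoration before next lecture"

wer = word_error_rate(reference, machine)
accuracy = 1 - wer
```

Here two of ten words are wrong, so the transcript would count as roughly 80% accurate, the "one out of five" case Phillips describes. Reaching the 99% captioning benchmark would leave room for only one error in every 100 words.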

Speech technology still needs to improve before it can be used without a person checking its work for accuracy, Curry said, adding that trouble can arise when instructors put “too much faith” into the technology.

Automatic speech recognition “is a promising step,” she said, “but we still need to think of it as a first step.”

Still, speech to text can reduce the time it takes to get a document up to the 99% accuracy benchmark because transcribers no longer have to start working on a file from scratch.

That process can lead to cost savings for colleges, said Fred Singer, CEO of Echo360, a video platform that provides automated closed-captioning services. “If you want to get to 99%, the only cost-effective way is to start with this (automated) transcript and then pay somebody to fix it.”

Working hand in hand with AI

Much in the way AI provides a faster pathway to transcribing speech, it can help colleges pinpoint and convert thousands of inaccessible documents into more easily used formats.

To do so, Utah State uses Blackboard Ally, a tool embedded within the learning management system that automatically checks documents instructors have uploaded for accessibility issues.

When Ally’s scanner finds issues — such as a PDF with hard-to-read text, for example — it flags them and gives the document an accessibility score.

It then offers suggestions to instructors about how to improve those documents and can convert them into alternative formats, such as an audio file that a student can listen to on the go.
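The article does not describe how Ally's checks work internally. As one concrete example of an automated check for hard-to-read text, the sketch below computes the color contrast ratio defined in the Web Content Accessibility Guidelines (WCAG) 2.x and compares it with the 4.5:1 minimum for normal-size text at level AA. The function names are illustrative, not part of any product.

```python
def relative_luminance(rgb):
    """Relative luminance of an sRGB color given as (r, g, b) in 0-255."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, ranging from 1 (identical) to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # minimum ratio for normal-size text at WCAG level AA


def passes_aa(foreground, background):
    """True when the color pair meets the level AA minimum for normal text."""
    return contrast_ratio(foreground, background) >= AA_NORMAL_TEXT


white = (255, 255, 255)
black_on_white = contrast_ratio((0, 0, 0), white)
light_gray_ok = passes_aa((119, 119, 119), white)  # gray #777777 text
dark_gray_ok = passes_aa((118, 118, 118), white)   # gray #767676 text
```

A check like this can be run automatically on every document, but it only covers one narrow property. Issues such as missing headings or untagged PDFs need different tests, and many fixes still require a person.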

Although the process uses AI to identify issues, it still relies on human intervention for course correction. And Curry cautions that such services can give instructors a “false sense of security,” because the technology may miss or be unable to fix issues on its own.

Even so, Ally and other services like it have been game changers for colleges making content more accessible, Phillips said.

“It’s very hard to justify the cost to go back and do more individualized document-by-document accessibility work, (which) can now be automated to a certain degree,” he said. “Some of that information that was fairly locked up to users with disabilities (is now) more accessible than it might have otherwise been.”

And because the service flags issues that pose problems for students with disabilities, such as scanned documents or untagged PDFs that can be difficult for those using screen readers to understand, it can teach instructors what kinds of materials are more accessible.

“That’s an important step,” Curry said. “A lot of instructors may not have considered before (or) even thought that their material could be presented in a different format, so just having that in front of (them) can be really powerful.”

Blackboard’s review of Ally analytics indicates slight progress on some of these issues. Five years ago, for example, 52.5% of all PDFs in the LMS were untagged, compared to about 44.6% today. Yet during that time, the number of documents in the system with contrast issues and without headings increased, according to the company.

What’s needed to foster more widespread adoption of accessible course materials, Curry said, is for colleges to provide instructors with professional development opportunities that help them understand why accessibility is essential for students with disabilities.

Phillips predicts higher ed’s use of AI will only grow. “We can start to, not hand over the reins to artificial intelligence, but figure out good and healthy partnerships,” he said.

Accessibility for all students

Through the use of AI, some online education tools have been able to adapt to the unique needs of each student, a benefit that extends beyond those with disabilities.

Take Voxy, a web and mobile application that has made strides in increasing students’ English language proficiency by generating personalized lessons for them that adapt to their individual skill levels and interests. Colleges can offer the platform either to supplement their own courses and programs or as a study tool.

“One size doesn’t fit everyone,” said Katie Nielson, chief education officer at Voxy. “In fact, with language learning, the more personalized you can make content recommendations, the better.”

The AI behind the platform has cut down on the time needed to personalize each student’s program, with Nielson estimating the platform does in 10 minutes what would take an instructor about six hours.

That kind of service would be nearly impossible to replicate in a typical English-as-a-second-language class at a community college, she said, in which a professor would have to craft learning plans for upward of 20 students, all from different backgrounds.

A recent study by the American Institutes for Research backs up Voxy’s claims that its AI-powered service offers a better way of learning English. Out of 317 students enrolled at Miami Dade College’s language labs, those who used Voxy learned more English than those who didn’t use the platform. Students who used the service didn’t engage with it for the recommended amount of time, the study notes, though they did use it outside of class time.

Additionally, some of the technology that benefits students with specific learning needs has broader uses. For instance, students in classes that use Echo360’s closed captioning service have access to transcripts they can search by keyword when they study or revisit the material. Conveying content in multiple ways can also increase the likelihood students will learn new information, Curry said.

What’s more, researchers can comb big data sets produced from technology, such as automatically generated lecture transcripts, in order to better understand how students learn and interact with the material.

“There’s going to be all kinds of relationships we had no idea really existed about how learning takes place,” Singer said. “You’re going to get to this whole other level, and it’s because we’re starting to automate this whole process instead of it all just being random and not data driven.”
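The keyword-searchable transcripts mentioned earlier can be pictured with a small sketch. It assumes a transcript stored as timestamped segments. The `Segment` structure, the function names and the sample lecture are hypothetical and do not reflect Echo360's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_seconds: float  # where the segment begins in the recording
    text: str             # caption text for that stretch of audio


def search_transcript(segments, keyword):
    """Return the segments whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [segment for segment in segments if needle in segment.text.lower()]


def format_timestamp(seconds):
    """Render a position in the recording as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Invented lecture transcript.
lecture = [
    Segment(0.0, "Welcome back, today we cover cellular respiration."),
    Segment(95.0, "Glycolysis is the first stage of respiration."),
    Segment(310.5, "Next week we move on to photosynthesis."),
]

hits = search_transcript(lecture, "respiration")
jump_points = [format_timestamp(segment.start_seconds) for segment in hits]
```

A student searching for "respiration" would get two places to jump to in the recording. The same structured text is what would let researchers query many lectures at once.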
