Have you ever been frustrated with how Alexa or Siri don’t always understand your verbal requests? If so, then you already understand the problem that our guest struggles with. He’s Tom Livne, co-founder and CEO of Verbit.ai.
Verbit is a company that focuses on AI transcription services, specifically for the law and legal space. They use a combination of machine learning and human experts to transcribe audio in different accents, in different noise environments, with different diction, to give people more accurate results and hopefully help the process scale.
In this episode, Livne explains six different factors that go into getting transcription right and getting AI to be able to aid in the process. In addition, he talks about some of the critical factors for where transcription will come into play in terms of bringing value to business.
Guest: Tom Livne, co-founder and CEO – Verbit.ai
Expertise: Entrepreneurship/tech startup life cycle
Brief Recognition: Livne holds an MBA in Business Administration from Yale
(03:00) Give us an understanding of what’s possible with transcription today?
TL: Think about this podcast. We are recording this episode and let’s assume we want to get a professional transcript. When I’m referring to a professional transcript, I mean 100% accuracy. And the way it’s been done today, it’s fully manual, right? People are listening and typing it from scratch and it creates a limited capacity of scale and low gross margin.
On the other hand, speech recognition technology can reach only 70 to 80%. If we’re going to court and give the automated transcript only, this is not good enough. So the way we solve it at Verbit is [with] the approach of the machine-human hybrid.
So we have our own speech recognition technology we’ve developed in-house. We have patents registered for our technology. We have a team of nine PhDs working on it. We have the combination of our network and platform of freelance transcribers from all around the globe that take in the automated output of the machine and correct it in order to bring it to 100%.
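The accuracy figures quoted here (70 to 80% for the machine alone, 100% after human correction) are not tied to a specific metric in the interview. A common way to measure transcript accuracy is word accuracy, defined as one minus the word error rate (WER). The sketch below assumes that metric; the function names and the sample sentences are invented for illustration and do not come from Verbit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# Invented example: a machine draft with two misheard words,
# then the same draft after a human transcriber corrects it.
reference = "the witness was present at the scene on friday"
machine = "the witness was pleasant at the seen on friday"
human_corrected = "the witness was present at the scene on friday"

print(f"machine only: {word_accuracy(reference, machine):.0%}")          # 78%
print(f"after human:  {word_accuracy(reference, human_corrected):.0%}")  # 100%
```

In a hybrid workflow like the one described, the human pass is what closes the gap between the machine draft and the reference-quality transcript.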
So regarding what is possible, I mentioned that the technology is not there, and I’ll explain why. There are a few parameters that affect the accuracy of the speech recognition, and this is the reason the machine [falls short]. And also, in my point of view, even 10 years from now, we won’t be able to get to 100% with the machine only.
The first parameter that affects the accuracy of the speech recognition is the language model. So think about if you go to legal transcription or medical transcription, there is a lot of specific jargon and specific words that are relevant for this use case. For the machine, it’s really, really hard to get it right, also to get the names of people, also to get specific terminology, so this affects the accuracy.
The second thing is the acoustic model. So if you are talking in an open space or if you’re talking via phone or if you are in a courtroom, et cetera, all of these different acoustic environments also affect the accuracy of the speech recognition.
And the third one, as you can hear my terrible Israeli accent, so usually accents affect the accuracy of the voice-to-text. So you need to tune it to train the machine for a specific accent. Then you have the fourth one: background noise. Overlapping of people, all the background noise, is really damaging the quality of the output of the machine.
And the fifth one is the pace of when you talk. You talk really, really fast or you’re talking slowly, then it also affects the accuracy.
And the last one will be the diction. If there are young people or children talking, or elderly people talking, this is also really specific diction that affects the accuracy of the speech recognition. So if you combine…all the parameters in a different use case, it’s really, really hard, almost impossible, to get all of this correct, unless you have specific data for this specific use case. Combining all these parameters together for this specific customer will enable you to get 90-plus percent accuracy.
Our work at Verbit is not to replace the humans, [but] actually to help the human to do a better job and to make their life easier.
(08:30) These are challenging factors here. I’m wondering which of these is the most insurmountable.
TL: I think each one of those is very tough in its own unique way, but if you ask me, I think the acoustic model and the background noise and the ability to identify different speakers, et cetera, this is very hard to tackle, along with making adaptations for different acoustic environments and…controlling the quality of the audio recording.
To be able to adapt the algorithm accordingly, this is something that is very challenging, and even with all of the neural nets and the ability to train, the machine still sometimes has a hard time understanding when you give it something with a bad recording and bad acoustics…I think this is the toughest one.
(10:30) In other words, is that where human intuition might still have a bastion of specialness? Even if algorithms are trained to…take poor audio and fill in the blanks, is that still something where you think humans are gonna have the edge?
TL: I do believe so because they have the ability to hear it again and again and get the input in to understand the context of what has been said.
So I guess a courtroom…will never be satisfied with a machine only because they are required by law to have the 100% [accuracy] and this is going to take a lot of time and a leap of faith until they would be able to believe that the machine would be able to just get the perfect output for them to submit…You have Google, you mention Baidu…they are building something very generic. Something that should be suitable for everyone and…because we are taking more the vertical approach, this allows us to be much more tailor-made for any of the customers and will give us the advantage to get better results.
Because at the end of the day…what is speech recognition technology? Speech recognition is trying to identify what has been said, and there are very complex statistical models that give the ranking, showing you the machine’s best guess, the one with the best probability, of what has been said. You have a lot of parameters that try to guess in the best way what has been said there. And this [is] actually because you think about Verbit as a contextual layer. When you are in a generic speech recognition engine, you just put in the input, which is audio, and the output will be text based on the same algorithm that everyone uses for speech recognition.
If you think about Verbit…you need to use this contextual layer that gives you [information such as] the person that talked, and you have this accent and this is the jargon that he is talking about, legal space, in this acoustic environment. So use all these parameters in order to give better accuracy in the transformation from voice-to-text before you do it. This is something that helps us because we are not trying to be generic, we are trying to be very tailor-made.
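One simple way to picture a contextual layer on top of a generic engine is N-best rescoring: the recognizer proposes several ranked guesses, and domain knowledge (here, legal jargon) re-ranks them. The sketch below is only an illustration of that general idea under invented assumptions; the `Hypothesis` class, the scores, the vocabulary, and the bonus weight are all hypothetical, and the interview does not describe how Verbit actually implements its contextual layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    text: str
    score: float  # log-probability from a generic recognizer (made-up values)


def rescore(hypotheses, domain_terms, bonus=1.0):
    """Re-rank hypotheses, rewarding each word found in the domain vocabulary."""
    def contextual_score(hypothesis):
        words = hypothesis.text.lower().split()
        hits = sum(1 for word in words if word in domain_terms)
        return hypothesis.score + bonus * hits

    return sorted(hypotheses, key=contextual_score, reverse=True)


# Hypothetical N-best list for one stretch of courtroom audio.
n_best = [
    Hypothesis("the plane tiff filed a motion", -4.1),
    Hypothesis("the plaintiff filed a motion", -4.6),
    Hypothesis("the plaintiff filed emotion", -5.0),
]
# Hypothetical legal vocabulary supplied as context.
legal_terms = {"plaintiff", "motion", "testimony", "deposition"}

generic_best = max(n_best, key=lambda h: h.score)
contextual_best = rescore(n_best, legal_terms)[0]

print("generic engine:   ", generic_best.text)
print("with legal context:", contextual_best.text)
```

The generic ranking picks the acoustically plausible but wrong "plane tiff", while the vocabulary bonus promotes the hypothesis that fits the legal setting. Speaker, accent, and acoustic environment could feed the score in the same way, though that is an extrapolation rather than something stated in the interview.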
(14:30) When you think about what we’ll be able to do five years from now that we can’t do now with transcription, where are you most hopeful that real traction will be made in terms of improvement?
TL: So the way we are thinking about it at Verbit is to be much beyond just transcription. We think that transcription just got much smarter, and what do I mean by that? Think about the use-case of…calls? When you have publicly traded companies…at the end [of the] quarter talking to the analyst about the company results.
Think about having an automated transcription for it, and then you already have the pace data and you can create actionable links and intents. Let’s say Apple is talking about the iPhone X, so you can identify in your transcription that this is what has been said and you can…click…and go directly to the website and buy the iPhone X. You can do a comparison: take all the numbers that you just automatically transcribed, create a graph and a visualization, and compare it to past results, because you already have the transcription of the past results. And you get many more insights from the data.
Because we are allowing people to get more value out of their verbal assets. All this verbal communication and information that has been exchanged, we want to allow our customers to get more value from it.
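The comparison described here, pulling the numbers out of this quarter's transcript and setting them against a past transcript, can be sketched in a few lines. This is a toy illustration only: the regular expression, the metric names, and the sample sentences with their figures are all invented, and a real system would need far more robust extraction.

```python
import re

# Hypothetical pattern: matches phrases like "revenue of 50 billion".
AMOUNT = re.compile(
    r"(?P<metric>revenue|profit) of (?P<value>\d+(?:\.\d+)?) (?P<unit>million|billion)"
)
SCALE = {"million": 1e6, "billion": 1e9}


def extract_figures(transcript: str) -> dict[str, float]:
    """Collect each mentioned metric and its value from a transcript."""
    figures = {}
    for match in AMOUNT.finditer(transcript.lower()):
        figures[match["metric"]] = float(match["value"]) * SCALE[match["unit"]]
    return figures


def compare(current: dict[str, float], previous: dict[str, float]) -> dict[str, float]:
    """Relative change for every metric present (and non-zero) in both periods."""
    return {
        metric: (current[metric] - previous[metric]) / previous[metric]
        for metric in current
        if previous.get(metric)
    }


# Invented transcripts of two earnings calls.
last_quarter = "We reported revenue of 40 billion and profit of 8 billion."
this_quarter = "This quarter we saw revenue of 50 billion and profit of 10 billion."

changes = compare(extract_figures(this_quarter), extract_figures(last_quarter))
for metric, change in changes.items():
    print(f"{metric}: {change:+.0%}")
```

Once the figures are in a dictionary like this, charting them or comparing across many past calls is a matter of feeding them to whatever plotting or reporting tool is at hand.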
(17:30) Can you talk about the business value of transcription?
TL: Think about once you have the examination of a witness, and then you can see, in his past testimony, does he contradict himself? Maybe he’s lying, [so we can] try to analyze his voice to get some realization of the text. You have many things that you can extract, so the speech and the transcription is the first layer. You can do many, many things on top of it. We think that the transcription market is very, very big, once we are able to increase the accuracy and allow more people to get more value out of their verbal assets.