
Verbit raises $23M for its transcription service


Verbit, a transcription startup with offices in Tel Aviv and New York, today announced that it has raised a $23 million Series A round led by Viola Ventures. Vertex Ventures, HV Ventures, Oryzn Capital, Vintage Venture Partners and ClalTech also participated in this round. The company, which currently focuses on the legal and academic sectors, uses both its custom machine learning models and freelancers to offer an accuracy guarantee of over 99 percent. In total, the company has now raised $34 million.

Tom Livne, Verbit’s CEO and co-founder, told me that he used to be a lawyer and saw how the quality and turnaround time of traditional transcription services could be improved with the help of machine learning. While this is a huge but very fragmented market, Livne argues there hasn’t been a lot of innovation here. “There is no innovation and technology in this market,” he said. “So we came up with this idea to build a technological transcription company.”

The company started out with the three founders, but today, Verbit has more than 70 employees and more than 100 customers. These include a number of law firms, but Livne also found that there is a big market for good transcription in academia, where accessibility laws often require these institutions to provide transcriptions of their classes and lectures. Coursera, Stanford and Harvard now use its service. Livne says Verbit now has millions of dollars in revenue, and by the end of the year he hopes to get to tens of millions of dollars.

Today, Verbit’s automated system — which the company customizes and retrains for all new customers based on their specific needs and contexts — gets to about 90 percent accuracy. Then, its army of freelancers sets to work on those automated transcripts to look for mistakes — and fixes them. All of those fixes then flow back into the model, which then (ideally) gets better over time.
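The accuracy figures above can be made concrete with a small sketch. It assumes that "accuracy" means word-level accuracy (one minus the word error rate) of the machine draft measured against the human-corrected transcript. The article does not say how Verbit measures accuracy, so the function names and the sample sentences here are illustrative only.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the draft
                            curr[j - 1] + 1,      # extra word in the draft
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference_text, hypothesis_text):
    """Return 1 - WER, clamped at 0, using whitespace-split lowercase words."""
    ref = reference_text.lower().split()
    hyp = hypothesis_text.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    errors = word_edit_distance(ref, hyp)
    return max(0.0, 1.0 - errors / len(ref))


# Hypothetical example: a machine draft and the freelancer's corrected version.
machine_draft = "the witness said he saw the defended leave at nine"
corrected = "the witness said he saw the defendant leave at nine"

accuracy = word_accuracy(corrected, machine_draft)
print(f"{accuracy:.0%}")  # one wrong word out of ten
```

In a loop like the one described, each (machine draft, corrected transcript) pair could also be stored as training data, which is one plausible way the fixes might "flow back into the model."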

Livne stressed that he believes that his company is not setting out to destroy jobs but that Verbit is creating thousands of new jobs for the freelancers that support its service. “We are not here to replace the human,” he said. “We are here to give them tools to make their job better and easier and we are actually reducing the barrier to entry to be a transcriber. Think about Verbit as an Uber for transcription.”

Recently, Verbit also launched a live transcription service for media firms that also uses a human-in-the-loop process to offer transcriptions with a delay of only a few seconds. It’s no surprise, then, that the company plans to add new verticals to its lineup as well, though it’s still considering its options. Livne noted that the company is looking at insurance and financial firms, as well as media and medical use cases. “But right now, we have very high demand from academia and law, so we need to support it on a larger scale,” he said. The company is also looking at adding support for other languages.

That’s where the new funding comes in. Verbit plans to hire aggressively, especially in its New York office, with a focus on sales, marketing and customer success.


AI for Speech Recognition and Transcription in Law and Legal

Have you ever been frustrated with how Alexa or Siri don’t always understand your verbal requests? If so, then you already understand the problem that our guest struggles with. He’s Tom Livne, co-founder and CEO of Verbit.ai.

Verbit is a company that focuses on AI transcription services, specifically for the law and legal space. They use a combination of machine learning and human experts to transcribe audio in different accents, in different noise environments, with different diction, to give people more accurate results and hopefully help the process scale.

In this episode, Livne explains six different factors that go into getting transcription right and getting AI to be able to aid in the process. In addition, he talks about some of the critical factors for where transcription will come into play in terms of bringing value into business.


Guest: Tom Livne, co-founder and CEO – Verbit.ai

Expertise: Entrepreneurship/tech startup life cycle

Brief Recognition: Livne holds an MBA from Yale


Interview Highlights


(03:00) Give us an understanding of what’s possible with transcription today?

TL: Think about this podcast. We are recording this episode and let’s assume we want to get a professional transcript. When I’m referring to a professional transcript, I mean 100% accuracy. And the way it’s been done today, it’s fully manual, right? People are listening and typing it from scratch and it creates a limited capacity of scale and low gross margin.

On the other hand, speech recognition technology can reach only 70 to 80%. If we’re going to court and give the automated transcript only, this is not good enough. So the way we solve it at Verbit is [with] the approach of the machine-human hybrid.

So we have our own speech recognition technology we’ve developed in-house. We have patents registered for our technology. We have a team of nine PhDs working on it. We have the combination of our network and platform of freelance transcribers from all around the globe that take in the automated output of the machine and correct it in order to bring it to 100%.

So regarding what is possible, I mentioned that the technology is not there, and I’ll explain why. There are a few parameters that affect the accuracy of the speech [recognition], and this is the reason the machine…and also, in my point of view, even 10 years from now, we won’t be able to get to 100% with the machine only.

So the parameters that affect the accuracy of the speech recognition are: one, the language model. So think about if you go to legal transcription or medical transcription, there is a lot of specific jargon and specific words that are relevant for this use case. For the machine, it’s really, really hard to do it, also to get the names of people, also to get specific terminology, so this affects the accuracy.

The second thing is the acoustic model. So if you do it talking in an open space or if you’re talking via phone or if you have a courtroom, et cetera, so all of this different acoustic model that also affects the accuracy of the speech.

And the third one, as you can hear my terrible Israeli accent, so usually accents affect the accuracy of the voice-to-text. So you need to tune it to train the machine for a specific accent. Then you have the fourth one: background noise. Overlapping of people, all the background noise, is really damaging the quality of the output of the machine.

And the fifth one is the pace of when you talk. You talk really, really fast or you’re talking slowly, then it also affects the accuracy.

And the last one will be the diction. If there are people, young people or children talking or elder people talking, this is also really specific diction that affects the accuracy of the speech. So if you combine…all the parameters in a different use case, it’s really, really hard, almost impossible to get all of this correctly. Unless you have specific data for this specific use case, combine all these parameters together for this specific customer, this will enable you to get 90 plus percent accuracy.

Our work in Verbit is not to replace the humans, [but] actually to help the human to do a better job and to make their life easier.


(08:30) These are challenging factors here. I’m wondering which of these is the most insurmountable.

TL: I think each one of those is very tough in its own unique way, but if you ask me, I think all the acoustic model and the background noise and the ability to identify different speakers, et cetera, this is very hard to tackle and to make adaptation for different acoustic environments and…controlling the quality of the audio recording.

To be able to adapt the algorithm accordingly, this is something that is very challenging and with all of the neural nets and the ability to train, still it’s having a hard time to understand sometimes when you put to the machine something with bad recording and bad acoustic…I think this is the toughest one.


(10:30) In other words, is that still where human intuition might still have a bastion of specialness, even if algorithms are trained to…take poor audio and fill in the blanks, is that still something where you think humans are gonna have the edge?

TL: I do believe so because they have the ability to hear it again and again and get the input in to understand the context of what has been said.

So I guess a courtroom…will never be satisfied with a machine only because they are required by law to have the 100% [accuracy] and this is going to take a lot of time and a leap of faith until they would be able to believe that the machine would be able to just get the perfect output for them to submit…You have Google, you mention Baidu…they are building something very generic. Something that should be suitable for everyone and…because we are taking more the vertical approach, this allows us to be much more tailor-made for any of the customers and will give us the advantage to get better results.

Because at the end of the day…what is speech recognition technology? Speech recognition is trying to identify what has been said, and there are very complex statistical models that give the ranking, showing you the best probability of the machine’s best guess at what has been said. You have a lot of parameters that try to guess in the best way what has been said there. And this [is] actually because you think about Verbit as a contextual layer. When you are in a generic engine, speech recognition engine, you just put the input, which is audio, and output will be text based on the same algorithm that everyone used for speech recognition.

If you think about Verbit…you need to use this contextual layer that gives you [information such as] the person that talked, and you have this accent and this is the jargon that he is talking about, legal space, in this acoustic environment. So use all these parameters in order to give better accuracy in the transformation from voice-to-text before you do it. This is something that helps us because we are not trying to be generic, we are trying to be very tailor-made.
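As a rough illustration of the "ranking" and "contextual layer" ideas in this answer, the sketch below rescores a recognizer's candidate transcripts using a domain vocabulary. This is a generic n-best rescoring technique, not a description of Verbit's system: the scores, the bonus weight, the term list and the function name are all assumptions made up for the example.

```python
def rescore(hypotheses, domain_terms, bonus=1.0):
    """Pick the best transcript after adding a bonus per domain term.

    hypotheses: list of (text, score) pairs, where a higher score means the
    generic recognizer considers the text more likely.
    """
    rescored = []
    for text, score in hypotheses:
        words = text.lower().split()
        hits = sum(1 for word in words if word in domain_terms)
        rescored.append((text, score + bonus * hits))
    return max(rescored, key=lambda pair: pair[1])[0]


# Hypothetical candidates from a generic engine, with made-up log scores.
n_best = [
    ("the plain tiff filed a motion", -4.0),
    ("the plaintiff filed a motion", -4.5),
]
legal_terms = {"plaintiff", "defendant", "motion"}

best_generic = max(n_best, key=lambda pair: pair[1])[0]
best_legal = rescore(n_best, legal_terms)

print(best_generic)  # the generic engine's top guess
print(best_legal)    # the top guess once legal jargon is taken into account
```

The point of the sketch is only that the same audio can yield a different top-ranked transcript once domain context is added to the scoring, which is the advantage the vertical approach is claimed to offer.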


(14:30) When you think about what we’ll be able to do five years from now that we can’t do now with transcription, where are you most hopeful that real traction will be made in terms of improvement?

TL: So the way we are thinking about it is for Verbit to be much beyond just transcription. We think that transcription just got much smarter, and what do I mean by that? Think about the use-case of…calls? When you have publicly traded companies…at the end [of the] quarter talking to the analyst about the company results.

Think about having an automated transcription for it, and then you already have the pace data and you can create actionable links and intents and you know let’s say Apple is talking about iPhone X, so you can identify in your transcription that this is what has been said and you can…click…and go directly to the website and buy the iPhone X. You can do a comparison, take all the numbers that you just automatically transcribe and create a graph and create a visualization and compare it to past results because you already have the transcription of the past results. And to get much more insights from the data.

Because we are allowing people to get more value out of their verbal assets so all this verbal communication and information that has been exchanged we want to allow our customer to get more value.


(17:30) Can you talk about the business value of transcription?

TL: Think about once you have the examination of a witness and then you can see if in his past testimony he contradicts himself. Maybe he’s lying [so we can] try to analyze in his voice to get some realization of the text. You have many things that you can extract, so the speech and the transcription is the first layer. You can do on top of it many, many things. We think that the transcription market is very, very big. Once we would be able to increase the accuracy and we would be able to allow more people to get more value out of their verbal assets.
