Verbit.ai raises $23 million to automate transcription and captioning


There’s a lot of value inherent in services that can capture and automatically transcribe phone calls, keynotes, and recorded content — if nothing else, they save an enormous amount of manual labor. According to a recent report published by Grand View Research, the speech and voice recognition market is estimated to reach $31.82 billion by 2025, driven by new and “rising” applications in banking, health care, and automotive sectors.

Tom Livne, who cofounded Verbit.ai with Eric Shellef and Kobi Ben Tzvi in 2017, has high hopes that the Tel Aviv and New York-based startup will contribute substantially to the industry’s growth in the years ahead. Verbit’s adaptive speech recognition tech, which it claims can generate “detailed” transcriptions with over 99 percent accuracy at “record” speed, recently attracted the attention of VCs at the likes of Vertex Ventures and Oryzn Capital, which both participated in the startup’s Series A round.

Verbit today announced that it has raised $23 million in a round led by Viola Ventures, with the aforementioned investors and HV Ventures, Vintage Venture Partners, and ClalTech chipping in. This comes less than a year after the firm’s $11 million seed round and follows a fiscal year in which total revenue grew by 300 percent. The new funding brings Verbit’s total capital raised to $34 million.

As part of the round, Viola Ventures’ Ronen Nir will join the board of directors, and Livne said the capital will be used to jumpstart global growth of Verbit’s sales, marketing, and product teams, with a particular emphasis on stateside expansion.

“I am lucky to work with such a talented team that is devoted to customer experience, company growth, and product innovation,” he said. “It’s been only eight months since our last round of funding, and this latest infusion of capital is a testament to the strong demand for an AI solution in such a manual and traditional space.”

Voice transcription and captioning isn’t exactly novel — it’s a decades-old industry with well-established players, like Nuance and Google. Enterprise platforms like Microsoft 365 offer AI-powered speech-to-text, along with Cisco and startups such as Otter and Voicera.

But, according to Livne, what sets Verbit apart is its reliance on “cutting-edge” advances in deep learning, neural networks, and natural language understanding.

Three models — an acoustic model, linguistic model, and contextual events model — inform Verbit’s captioning, first by filtering out background noise and echo and identifying speakers, and next by detecting domain-specific terms, recognizing accents and dialects, and incorporating current events and updates. In practice, clients upload an audio or video file to the cloud for processing. A team of thousands of human freelancers in over 20 countries then edits and reviews the machine output, taking into account any customer-supplied notes and guidelines, before the finished transcription is made available for export to platforms like Blackboard, Vimeo, YouTube, Canvas, and Brightcove.
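At a very high level, the machine-plus-human pipeline described above can be sketched as a sequence of passes over a draft transcript. This is an illustrative toy, not Verbit's actual code; every function name and both glossaries are hypothetical stand-ins:

```python
def clean_audio(samples):
    """Acoustic step: stand-in for noise/echo filtering (identity here)."""
    return samples

def asr_decode(samples):
    """Linguistic step: stand-in for the speech-to-text model.
    A real system would run neural acoustic and language models;
    here we just return a draft containing typical ASR mistakes."""
    return "the plaintif moved for sumary judgment"

def apply_glossary(draft, glossary):
    """Contextual step: correct domain-specific terms the model misheard."""
    for wrong, right in glossary.items():
        draft = draft.replace(wrong, right)
    return draft

def human_review(draft, corrections):
    """Freelance editors fix whatever the machine still got wrong."""
    for wrong, right in corrections.items():
        draft = draft.replace(wrong, right)
    return draft

LEGAL_GLOSSARY = {"sumary": "summary"}  # invented example entry

draft = apply_glossary(asr_decode(clean_audio([])), LEGAL_GLOSSARY)
final = human_review(draft, {"plaintif": "plaintiff"})
print(final)  # the plaintiff moved for summary judgment
```

The point of the sketch is the division of labor: the machine produces a fast draft, domain context fixes predictable jargon errors, and humans close the gap to the accuracy a courtroom or classroom requires.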

Verbit’s cloud dashboard shows progress throughout each job and lets users edit and share files or add inline comments, request reviews, and update files every step of the way, as needed. A forthcoming feature — Verbit Express — will allow clients to drag files in need of transcription to a folder on a desktop PC, where they’ll be automatically uploaded and processed.

Livne claims the platform can reduce operating costs by up to 50 percent and deliver results 10 times faster than the competition. In any case, it was enough to woo a healthy client base of educational institutions and commercial customers, including the London Business School, Fashion Institute of Technology, Utah State University, University of Utah, Southern Utah University, University of Vermont, Auburn University, Western Governors University, University of California Santa Barbara, Oakland University, Stanford, Coursera, Panopto, Kaltura, and close to 100 others (up from 50 in May 2018).

Customers have to make a minimum commitment of $10,000 worth of work, a pricing structure that has apparently paid dividends. Verbit.ai isn’t disclosing exact revenue but says it’s in the “millions” and that the company is cash flow positive.

“We have been closely following Verbit for the past two years. The disruption it brings to the market, both in its technological superiority, as well as market traction, are really exceptional,” Nir said. “We are excited to partner with the Verbit team to accelerate this journey.”

Verbit currently employs a team of over 30 across its Tel Aviv and New York offices, and it hopes to bump that number to around 60 this year.


AI for Speech Recognition and Transcription in Law and Legal

Have you ever been frustrated by how Alexa or Siri don’t always understand your verbal requests? If so, then you already understand the problem that our guest grapples with. He’s Tom Livne, co-founder and CEO of Verbit.ai.

Verbit is a company that focuses on AI transcription services, specifically for the law and legal space. They use a combination of machine learning and human experts to transcribe audio in different accents, in different noise environments, with different diction, to give people more accurate results and hopefully help the process scale.

In this episode, Livne explains the different factors that go into getting transcription right and getting AI to aid in the process. In addition, he talks about some of the critical factors for where transcription will come into play in terms of bringing value to business.


Guest: Tom Livne, co-founder and CEO – Verbit.ai

Expertise: Entrepreneurship/tech startup life cycle

Brief Recognition: Livne holds an MBA from Yale


Interview Highlights


(03:00) Give us an understanding of what’s possible with transcription today.

TL: Think about this podcast. We are recording this episode, and let’s assume we want to get a professional transcript. When I’m referring to a professional transcript, I mean 100% accuracy. The way it’s done today, it’s fully manual, right? People listen and type it from scratch, and that creates limited capacity to scale and low gross margins.

On the other hand, speech recognition technology can reach only 70 to 80%. If we’re going to court and give the automated transcript only, this is not good enough. So the way we solve it at Verbit is [with] the approach of the machine-human hybrid.

So we have our own speech recognition technology we’ve developed in-house. We have patents registered for our technology. We have a team of nine PhDs working on it. We have the combination of our network and platform of freelance transcribers from all around the globe who take the automated output of the machine and correct it in order to bring it to 100%.

So regarding what is possible, I mentioned that the technology is not there, and I’ll explain why. There are a few parameters that affect the accuracy of the speech recognition, and this is the reason why, in my point of view, even 10 years from now, we won’t be able to get to 100% machine-only.

So the first parameter that affects the accuracy of the speech recognition is the language model. Think about legal transcription or medical transcription: there is a lot of specific jargon, and specific words that are relevant for that use case. For the machine, it’s really, really hard to get those, and also to get the names of people and specific terminology, so this affects the accuracy.

The second thing is the acoustic model. Whether you’re talking in an open space, or talking via phone, or in a courtroom, et cetera, each of these different acoustic environments also affects the accuracy of the speech.

And the third one, as you can hear from my terrible Israeli accent: accents usually affect the accuracy of voice-to-text, so you need to tune and train the machine for a specific accent. Then you have the fourth one: background noise. Overlapping speakers and all the background noise really damage the quality of the machine’s output.

And the fifth one is the pace at which you talk. If you talk really, really fast, or you’re talking slowly, that also affects the accuracy.

And the last one will be the diction. Young people or children talking, or elderly people talking: that is also really specific diction that affects the accuracy of the speech. So if you combine…all the parameters in a different use case, it’s really, really hard, almost impossible, to get all of this correct. Unless you have specific data for the specific use case, combining all these parameters together for a specific customer, that will enable you to get 90-plus percent accuracy.
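As a back-of-the-envelope illustration of how the factors Livne lists compound, one could model each as an independent penalty on some assumed baseline accuracy. The baseline and all the penalty values below are invented for illustration and are not Verbit's figures:

```python
BASE_ACCURACY = 0.95  # assumed machine-only accuracy under ideal conditions

# Hypothetical per-factor accuracy penalties, one per factor from the interview.
PENALTIES = {
    "domain_jargon":     0.05,  # language model: legal/medical terminology
    "poor_acoustics":    0.08,  # acoustic model: open space, phone, courtroom
    "strong_accent":     0.04,  # accent not covered by training data
    "background_noise":  0.07,  # noise and overlapping speakers
    "fast_or_slow_pace": 0.03,  # speaking rate far from average
    "unusual_diction":   0.03,  # children or elderly speakers
}

def estimated_accuracy(conditions):
    """Subtract a penalty for each adverse condition present."""
    acc = BASE_ACCURACY
    for c in conditions:
        acc -= PENALTIES[c]
    return max(acc, 0.0)

print(round(estimated_accuracy([]), 2))                                      # 0.95
print(round(estimated_accuracy(["poor_acoustics", "background_noise"]), 2))  # 0.8
```

Even this crude model shows his point: a couple of adverse factors together quickly drag a strong baseline below what a courtroom would accept, which is why the human-in-the-loop step remains necessary.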

Our work in Verbit is not to replace the humans, [but] actually to help the human to do a better job and to make their life easier.


(08:30) These are challenging factors here. I’m wondering which of these is the most insurmountable.

TL: I think each one of those is very tough in its own unique way, but if you ask me, I think the acoustic model and the background noise, and the ability to identify different speakers, et cetera: this is very hard to tackle, to make adaptations for different acoustic environments and…to control the quality of the audio recording.

To be able to adapt the algorithm accordingly, this is something that is very challenging. Even with all of the neural nets and the ability to train, the machine still sometimes has a hard time understanding when you feed it something with a bad recording and bad acoustics…I think this is the toughest one.


(10:30) In other words, is that where human intuition might still have a bastion of specialness? Even if algorithms are trained to…take poor audio and fill in the blanks, is that still something where you think humans are going to have the edge?

TL: I do believe so because they have the ability to hear it again and again and get the input in to understand the context of what has been said.

So I guess a courtroom…will never be satisfied with a machine only because they are required by law to have the 100% [accuracy] and this is going to take a lot of time and a leap of faith until they would be able to believe that the machine would be able to just get the perfect output for them to submit…You have Google, you mention Baidu…they are building something very generic. Something that should be suitable for everyone and…because we are taking more the vertical approach, this allows us to be much more tailor-made for any of the customers and will give us the advantage to get better results.

Because at the end of the day…what is speech recognition technology? Speech recognition is trying to identify what has been said, and there are very complex statistical models that rank hypotheses, showing you the probability of the machine’s best guess for what has been said. You have a lot of parameters that try to guess, in the best way, what was said there. And this is actually where Verbit’s contextual layer comes in. With a generic speech recognition engine, you just put in the input, which is audio, and the output will be text, based on the same algorithm that everyone uses for speech recognition.

If you think about Verbit…you need to use this contextual layer that gives you [information such as]: the person that talked has this accent, this is the jargon he is talking about, the legal space, this acoustic environment. So you use all these parameters in order to get better accuracy in the transformation from voice to text before you do it. This is something that helps us, because we are not trying to be generic, we are trying to be very tailor-made.
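A minimal sketch of what such a contextual layer could look like, assuming the generic engine exposes an n-best list of scored hypotheses (a common ASR pattern; the scores, boost value, and example sentences below are all invented):

```python
def rerank(hypotheses, domain_terms, boost=0.2):
    """Pick the best hypothesis after boosting those that contain
    vocabulary expected in this domain (e.g. legal jargon)."""
    def score(hyp):
        text, base = hyp
        bonus = sum(boost for term in domain_terms if term in text)
        return base + bonus
    return max(hypotheses, key=score)

# Two candidate transcriptions of the same audio, with made-up engine scores.
n_best = [
    ("the defendant waves his right", 0.60),   # generic engine's top guess
    ("the defendant waives his right", 0.55),  # correct in a legal context
]

best = rerank(n_best, domain_terms={"waives"})
print(best[0])  # the defendant waives his right
```

The generic engine's acoustically likeliest guess loses to the domain-aware one, which is the "tailor-made beats generic" argument in miniature.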


(14:30) When you think about what we’ll be able to do five years from now that we can’t do now with transcription, where are you most hopeful that real traction will be made in terms of improvement?

TL: So the way we are thinking about it is for Verbit to be much more than just transcription. We think that transcription just got much smarter. What do I mean by that? Think about the use case of…calls, when you have publicly traded companies…at the end [of the] quarter talking to the analysts about the company’s results.

Think about having an automated transcription for it. Then you already have the pace data, and you can create actionable links and intents. Let’s say Apple is talking about iPhone X; you can identify in your transcription that this is what has been said, and you can…click…and go directly to the website and buy the iPhone X. You can do comparisons: take all the numbers that you just automatically transcribed, create a graph, create a visualization, and compare them to past results, because you already have the transcriptions of the past results. And get much more insight from the data.
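Once the call exists as text, pulling out the figures to graph or compare quarter-over-quarter is a simple extraction step. A minimal sketch, with an invented one-line transcript:

```python
import re

# Hypothetical snippet of an earnings-call transcript.
transcript = "Revenue was $84.3 billion this quarter, up from $62.9 billion."

# Capture every dollar figure quoted in billions.
figures = [float(x) for x in re.findall(r"\$([\d.]+) billion", transcript)]
print(figures)  # [84.3, 62.9]
```

A production system would need a proper named-entity and numeric normalizer rather than one regex, but the principle is the same: accurate transcription turns spoken numbers into data you can chart and compare.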

We are allowing people to get more value out of their verbal assets. All this verbal communication and information that has been exchanged: we want to allow our customers to get more value from it.


(17:30) Can you talk about the business value of transcription?

TL: Think about once you have the examination of a witness: you can see whether, in his past testimony, he contradicts himself. Maybe he’s lying, [so we can] try to analyze his voice to get some read on the text. There are many things you can extract, so the speech and the transcription are the first layer. You can do many, many things on top of it. We think that the transcription market is very, very big, once we are able to increase the accuracy and allow more people to get more value out of their verbal assets.
