Captioning in the Age of AI: Highlights from Slator’s Recent Report 

By: Sarah Roberts
Research and analysis company Slator published a report that spotlights use cases where automatic speech recognition (ASR) technology might be adequate on its own, versus those that need human involvement. As more professionals invest in captioning their content for a multitude of use cases, this report may prove useful as they look for ways to do so faster and more cheaply.

However, while Slator provides some insights for when and how to use ASR technology, Verbit’s Director of Product Management, Adi Margolin, cautions that it’s not so simple. For every “rule,” there’s a caveat. Instead, you must look at the individual use case, the goals and all surrounding circumstances to decide on the right approach, Margolin said. Since, at their core, captions are an accessibility solution to assist individuals with disabilities, being too quick to adopt fully automated processes can lead to inequitable experiences, and sometimes even dangerous ones.

Here’s a helpful overview of the usefulness of ASR in different environments as highlighted by Slator, coupled with some of our experts’ recommendations.  

ASR for short-form content and meetings 

The report on the state of captioning and subtitling in 2023 made it clear that human captioners are still necessary in many settings. If the project is a mainstream documentary, a major sporting event or a television series that will stream on Netflix, humans should either be captioning or correcting the ASR output. However, the report identified two use cases for ASR-only captioning: short-form content and remote meetings. In both cases, it’s worth examining whether ASR captions would be enough. 

Short-form content 

Short videos, defined as 60 seconds or less, tend to be aimed at social media platforms like LinkedIn. ASR may work in this setting because these videos often deal with one subject and have a limited number of speakers using one language. The simplicity of such videos, and the fact that they’re recorded in controlled settings, means that the audio is less likely to present challenges for the ASR. Also, short-form content may only appear for a limited time, making it less worthwhile to invest in more costly captioning solutions.  

Remote meetings with few participants 

In the case of routine virtual meetings, ASR might be enough to support teams and help them stay focused and catch up if they missed something. If there are few participants, the audio may be less complicated and easier for the technology to interpret.  

However, experts caution against choosing to rely on ASR just because a use case fits into these categories. There are many factors at play that can influence ASR’s effectiveness at producing accurate captions.  

The nuanced reality of ASR 

ASR is undoubtedly useful for streamlining the captioning process. Still, deciding whether ASR, a combination of ASR and humans or human captioners is the right choice for a project is complicated.  

“Having those rules of thumb is not accurate,” said Margolin. “There are no set rules here. You should think about what is affecting the decision between ASR and humans. It’s a matter of price, accuracy and adapting to what you care about most. For example, do you care about formatting or not?” 

Margolin pointed out that even in a small one-on-one meeting where ASR can perform well, using it might not be enough. If one of the parties needs captions because they are Deaf or hard of hearing, this could be an accessibility issue. Also, she said the subject can influence the decision, as sometimes the stakes are just too high.  

“Let’s say this is a medical discussion between a doctor and a patient who is hard of hearing,” she said. “You don’t want to use ASR. If I get the name of the prescription wrong, if I prescribe you the wrong dose… if I say 3500 milligrams and you heard grams, that’s huge, it’s going to kill you.” 

ASR might be enough for a similar meeting that covers a lower-stakes topic. In such cases, if you carefully consider ASR’s capabilities and eliminate conditions that might impede its accuracy, it could be a cost-effective, adequate solution. 

Realistic outlooks on ASR and accuracy 

Margolin pointed out that people don’t often understand the myriad factors that contribute to ASR captioning results. Consequently, when someone boasts about the high accuracy level of their ASR tools, it’s critical to investigate the circumstances under which they tested the technology. For instance, distinctions like whether it’s working in a live setting or on a recording can have extreme impacts on the output. 

“If your ASR is working live or on a file, the benchmark for accuracy is different,” said Margolin. 

Many factors that influence ASR quality in a live setting can be corrected in a recording. Even within those broad categories, other circumstances can impact the accuracy of ASR.  

“What is your use case?” said Margolin. “Is it a football match? Does it include songs? Are the speakers professional? Is the audio quality good or sketchy?”  

Each of these things can impact the quality. A football match is full of crowds yelling in the background, which can affect the ASR’s accuracy. ASR is also less effective at transcribing song lyrics. If speakers use clear, audible voices, the output will be better than if they mumble or have strong accents. Poor-quality audio feeds will cause more problems than high-quality recordings. It’s never reasonable to assume that even the best ASR can offer quality results in all settings. Therefore, it’s necessary to have a human captioning solution, or one that combines ASR and human captioners, available for the cases ASR can’t handle. 
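To make the point about accuracy claims concrete, here is a minimal Python sketch of word error rate (WER), a common way to score ASR output against a human-verified transcript. The source doesn’t say which metric Slator or Verbit uses, so WER is an assumption here. The function name and sample sentences are hypothetical, and the sentences are loosely modeled on Margolin’s dosage example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: real evaluations also normalize punctuation,
    numerals and casing in ways this sketch does not.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: the ASR output differs by a single word.
reference = "take 3500 milligrams of the medication twice a day"
hypothesis = "take 3500 grams of the medication twice a day"

wer = word_error_rate(reference, hypothesis)
print(f"Word error rate: {wer:.1%}")
```

A single wrong word out of nine yields a fairly low error rate, yet it is the one word that changes the dose by a factor of a thousand. That is why a headline accuracy figure says little without knowing the use case and test conditions behind it.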

Human involvement remains imperative  

Margolin brought up a specific example where users found that ASR couldn’t make the cut. The technology was interpreting an earnings call, and the speakers were using numbers to discuss shares, dollars and percentages. The ASR couldn’t keep up with those different designations. However, she pointed out that in a post-production setting, these errors are quick fixes.  

Often, having a human editor is the best way to resolve these issues. A human captioner or transcriptionist is best suited to understand what the speakers are referencing when they say a number. 

Still, even here, there is a qualifier — for now.  

“Technology is also advancing, so things that were true yesterday will not be true tomorrow,” said Margolin. 

For instance, large language models, which are the technology behind generative AI tools like ChatGPT, are getting better at connecting the dots and picking up on context. Such advancements might allow the technology to tell the difference between when someone’s speaking about 15% versus $15. As a result, the process could become even more automated.  

How to select the right captioning strategy 

Incorporating ASR into captioning has clear advantages like speed and cost-effectiveness. Yet, it’s vital to recognize that ASR’s effectiveness varies widely. Factors like audience, audio quality, live vs. recorded content and subject matter will influence the outcome. While ASR can excel in some scenarios, human intervention remains essential, especially for accuracy and contextual understanding. As ASR technology evolves, the balance between automation and human expertise will continue to change. 

When it comes to finding the right solution for your needs, there’s no substitute for expertise. Verbit works closely with its partners to provide the right level of captioning support for every project. Reach out to connect with a captioning expert who can help you find the best option for your needs.