Our Research Expert’s Take: How Speech Recognition Works in Multilingual Environments

By: Verbit Editorial

This blog is guest-authored by Irit Opher, VP of Research at Verbit. She has a PhD in physics from Tel Aviv University. Irit leads Verbit’s research team, driving algorithm development and exploration across diverse domains to enhance modern transcription systems. 


Speech recognition technology has become ingrained into our daily lives and routines. Whether you’re dictating a text message while driving, telling Alexa to play your favorite song or using your voice to tell your smart TV what show to play, voice recognition technology is all around you.  
Its usage in the workplace is growing substantially as well. From using it to dictate thoughts rather than taking notes or applying it to create word-for-word records or transcripts of important meetings, the use cases are seemingly endless. Speech recognition technology continues to improve with each use, opening the door for even more opportunities to consider it.  
However, automatic speech recognition (ASR) software still struggles in certain environments and use cases. One of the most common is capturing speakers who alternate between different languages as they talk. This ‘multilingual environment’ can present a hurdle for even today’s best ASR tools. 

Switching from one language to another, called “code switching,” is common in homes and workplaces alike. Yet, for speech recognition technology, this phenomenon can make it challenging to accurately capture what the speakers are saying with all the back and forth.  
As the VP of Research at Verbit, I’ve been exploring this technology’s ability to adapt to multilingual environments and how it’s improving. Here are my thoughts on the current state of this technology and insights for both everyday users of voice recognition in multilingual households and leaders in global businesses. 


First, an explanation of ‘code switching’ 

Code switching occurs when people switch between languages throughout a conversation. Oftentimes, this happens in homes where immigrant parents speak one language while their children learn another. Communication in those households can involve fluid transitions in and out of the two languages, perhaps even within a single sentence. Imagine parents who move from Mexico to the US. The parents speak Spanish, but the children grow up speaking English. You might hear a conversation that fluidly switches between the two languages. In fact, there are several types of code switching that might occur in these multilingual environments. Here are a few categories of code switching: 

  • Inter-sentential code switching happens when the speaker switches between languages but keeps each sentence or clause in one language. For instance, they would speak a sentence in Spanish, followed by one in English. 
  • Intra-sentential code switching involves language switches within a single sentence or utterance. 
  • Extra-sentential code switching involves the addition of a word from one language into a sentence in another language.  

While multilingual families can switch back and forth between two languages effortlessly, ASR tools may struggle to adjust instantly to these transitions, especially when they are frequent and occur within an utterance or sentence, interrupting the ‘context flow.’ This can result in misrecognized or missing words.  
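To make the categories above concrete, here is a toy sketch of how one might label the kind of switching in an utterance, assuming each token has already been tagged with a language ID (real systems have to infer that from the audio or text, which is the hard part). This is purely my illustrative example, not Verbit code, and it lumps extra-sentential insertions together with intra-sentential switches, since both mix languages inside one sentence:

```python
# Toy classifier for code-switching type. Assumes tokens arrive pre-tagged
# with a language ID and that "." marks a sentence boundary -- a
# hypothetical simplification for illustration only.

def classify_switching(tagged_tokens):
    """tagged_tokens: list of (word, lang) pairs.
    Returns 'monolingual', 'inter-sentential', or 'intra-sentential'."""
    sentences, current = [], []
    for word, lang in tagged_tokens:
        current.append(lang)
        if word == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)

    langs_per_sentence = [set(s) for s in sentences]
    # A language switch inside any single sentence -> intra-sentential.
    if any(len(s) > 1 for s in langs_per_sentence):
        return "intra-sentential"
    # Each sentence is monolingual, but sentences differ -> inter-sentential.
    if len(set().union(*langs_per_sentence)) > 1:
        return "inter-sentential"
    return "monolingual"
```

For example, a Spanish sentence that borrows the English word “meeting” — `[("vamos","es"), ("a","es"), ("la","es"), ("meeting","en"), (".","en")]` — would be labeled intra-sentential, while a Spanish sentence followed by an English one would be labeled inter-sentential.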

Code switching in professional settings 

In professional environments, code switching is increasingly common. Today’s international corporations involve teams from all over the world. Those teams may communicate with each other in different languages. Perhaps one team regularly speaks their native language with one another in the office but switches to English during international conference calls. These teams often jump back and forth between the two languages based on the context. The teams also may embed professional terminology in one language within sentences in another language.  

Code switching is common in the high-tech industry, and Verbit is no different. For example, at our headquarters in Israel, it’s common for teams to switch from Hebrew to English when discussing more professional and technical subjects.  
In the Verbit setting, team members might be having a conversation in Hebrew, but rather than using or coining a Hebrew word for a type of technology under discussion, we might borrow the technical term from English. Transcription gets even harder for an ASR system when we merge terms from two languages, as happens in Israel when we take an English verb or noun and apply a Hebrew conjugation. The result isn’t a word in either language, but we understand what it means, and it helps us communicate.  


Other challenges for speech recognition in multilingual environments  

Different languages aren’t the only hurdle speech recognition tools must overcome. In multilingual environments, people often also have different accents. As a result, even when people are speaking the same language, the words might sound completely different.  
Humans vary in their ability to understand different accents. Individuals who are exposed to an accent more often are likely better at interpreting the speech correctly. With heavily accented speech, ASR is typically less adept than people at using context to fill in the gaps, making it less successful at accurately interpreting the words.  

Slang and informal language also get added into the mix at times. Like humans, AI will take time to master new terms. In a multilingual environment, this could be even more complicated as the ASR must determine whether a specific acoustic signal represents a term in a different language, a new slang term it hadn’t previously encountered or both. 

Using speech recognition in specialized contexts, such as medical or legal transcripts full of technical terms, adds another potential complication when code switching occurs. For example, ASR might properly interpret terms like “headache” or “sore throat.” However, when faced with the medical terminology for those conditions, “cephalalgia” and “pharyngitis,” the results might be less accurate. Add this complication to multilingual environments, heavy accents and countless industries, and the challenges facing ASR become clear. 

Fortunately, there are ways to help improve speech recognition in these environments, starting with more and better data.  


How we’re building better data sets to improve Verbit’s ASR 

Data is critical for improving ASR’s ability to interpret speech in multilingual environments. At Verbit, we’re finding that expanding language model datasets can improve the technology’s performance in these settings. Unfortunately, it’s not always possible to gather as much data as is needed. One way we’re tackling this is with generative AI, which lets us augment the data we do have before feeding it to the model. With more data to work with and learn from, the ASR can more accurately interpret speech across different languages, diverse accents and highly technical fields.  
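A simple way to picture this kind of augmentation is template-based generation: take a handful of seed sentences in one language, each with a slot for a technical term in another, and cross them to synthesize many code-switched training sentences. The sketch below is a deliberately minimal, hypothetical illustration (the templates and terms are made up, and Verbit’s actual pipeline uses generative models rather than fixed templates):

```python
import itertools

# Hypothetical seed data: Spanish carrier sentences with a slot for an
# English technical term, mimicking intra-sentential code switching.
TEMPLATES = [
    "Necesitamos revisar el {term} antes de la reunión.",
    "El {term} no funciona en producción.",
]

# Illustrative English technical terms to splice in.
TERMS = ["deployment", "firewall", "dashboard"]

def synthesize_code_switched(templates, terms):
    """Cross every template with every term, expanding a small seed set
    into a larger pool of synthetic code-switched sentences."""
    return [t.format(term=term)
            for t, term in itertools.product(templates, terms)]

augmented = synthesize_code_switched(TEMPLATES, TERMS)
# 2 templates x 3 terms -> 6 synthetic sentences
```

The design point is the multiplicative effect: a few dozen templates and a domain glossary can yield thousands of varied code-switched examples, which is why augmentation helps when naturally occurring multilingual data is scarce.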

The reason this can help is that general data sets may lack input from some languages, accents and highly specialized fields. For example, far more people speak about having headaches than complain about “cephalalgia.” Additionally, there will be more data from single language settings than from ones where code switching is common. As a result, the model may work well in most settings, but fail when speakers are using uncommon terminology or more than one language. 

Generative AI offers advantages over other, earlier forms of AI for this purpose because these tools can intuitively learn. A generative AI model shares similarities with the way that a child learns a language. For example, a child learning English will figure out that most plural words have an “s” at the end. If that child knows that the plural of “tree” is “trees” and “chair” is “chairs,” he can guess that the plural of “rock” is “rocks.” Generative AI can learn similarly, meaning that it can help us create larger data sets to train ASR in a wider range of environments, including those that are multilingual.  

It’s worth noting that while data sets produced by generative AI are useful, they still have limitations. Generative AI can’t create something very different from the data it has encountered. At some point, having more complete data matters, even if we use generative AI to get the most out of the data we have.  

In my time at Verbit, navigating multilingual speech recognition continues to be a fascinating journey. While code switching poses challenges for ASR tools, we’re finding solutions like generative AI that can enrich our data diversity and improve our accuracy. As we make these changes, Verbit will be even better positioned to support diverse, multilingual communication in workplaces and beyond. 

Our committed researchers are constantly working so that Verbit’s solutions stay at the forefront of speech recognition technology. These ongoing efforts are why professionals in demanding fields like legal, corporate, media and education rely on Verbit for their captioning and transcription. Reach out to learn more about Verbit’s innovative solutions.