Expert Advice on Capturing Audio for More Accurate Transcription

This post was contributed by Dr. Elisha Rosensweig, Head of Data Science at Verbit. Elisha holds a Ph.D. in Computer Science from the University of Massachusetts Amherst and an M.Sc. in Computer Science from Tel Aviv University. Prior to his four years at Verbit, he served as Head of Israel Engineering at Chorus.ai and as an R&D Director at Nokia, formerly Alcatel-Lucent. He regularly speaks on panels and is an expert in data and its implications for AI, transcription, captioning and more.

Creating transcripts from audio files is becoming a necessary task across many industries. Transcripts preserve and transmit information from legal proceedings and important corporate meetings. They also serve as an accessibility tool, helping to make events and experiences more equitable. In these professional settings, the availability of a highly accurate transcript is often crucial for the success of the event or process.

Traditionally, transcription was done solely by trained human transcribers, but in recent years we have seen the emergence of powerful tech tools that perform Automatic Speech Recognition (ASR). Today, one can select a standard human transcription, a pure machine transcription (e.g., Siri), or a hybrid solution that employs the skills of both machines and humans. At Verbit, we provide a variety of hybrid solutions tailored to different use cases, where we strive to combine the different strengths of machines and human transcribers in an optimal way.

Still, whatever method of transcription is used to produce the desired text, one major factor that will impact the quality of the transcript is the quality of the original audio. The clearer and cleaner the audio, the more reliable and coherent the transcript will be. With that in mind, here is the advice we give all our customers when they ask how they can improve the quality of their audio files to make the transcription process faster and the transcript better. By following these guidelines, we hope you can obtain more accurate transcripts while avoiding multiple rounds of time-consuming edits.

Preventing factors that lead to poor-quality audio files

Several common issues can degrade the quality of an audio file, and we review them in what follows. First though, let’s start with a good rule of thumb you can use when in doubt:

If you find a recording hard or annoying to transcribe, the computer will likely have difficulty as well.

Now, to the details!

Environmental considerations

Ever tried to talk to someone at a music festival with the music blaring all around and other people talking over you? Similarly, when capturing audio from a crowded arena at a sporting event, there is often a great deal of background noise that can degrade the intelligibility of the audio. Cheering fans, loud music, announcers and the ongoing conversations of hundreds or thousands of people make for a lot of noise. Since many of these sounds are human voices, they can make it extra hard for both humans and machines to know which voice to transcribe. However, even less obvious sounds can impact audio quality. For example, the ambient noise of an AC unit or a fan might degrade the recording. These environmental conditions can make it difficult for any transcription system to capture and transcribe the relevant discussion.

Another type of disturbance that stems from the environment is reverberation, or echo. For example, if a person speaks in an auditorium or lecture hall, the sound tends to bounce off many surfaces before reaching the microphone, creating a poor recording with a strong echo. These problems often occur in courtrooms and classrooms.

One experiment we conducted here at Verbit highlights the impact of noisy backgrounds on transcription times. We found that audio with medium levels of background noise can take a transcriber 30% longer to complete than a recording without such noise. At higher noise levels, the time to transcribe can even double.

[Image: speakers using a microphone while addressing a crowd in a lecture hall]

Bad recording equipment and settings

Another issue that can greatly affect the quality of a recording is the type and setup of the recording equipment. Equipment problems may come in the form of low-quality microphones, the wrong type of microphone or the way that the recording party shares the file. A low-resolution audio file can sound scratchy, or the speakers can sound unclear, so a good rule of thumb is to keep the recording sampling frequency at 16 kHz or above. While transcribers might still be able to overcome the degraded quality, ASR will struggle with these files, making the transcription process less efficient and more labor-intensive.
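If your recordings are WAV files, the 16 kHz rule of thumb is easy to check before you send anything in. Below is a minimal sketch using Python's standard `wave` module. The helper names `check_sample_rate` and `make_silent_wav` are our own illustrations, and the silent in-memory files simply stand in for real recordings.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # rule of thumb from the text; adjust if your provider advises otherwise


def check_sample_rate(wav_file, min_rate_hz=MIN_SAMPLE_RATE_HZ):
    """Return (sample_rate_hz, ok) for a WAV file path or file-like object."""
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
    return rate, rate >= min_rate_hz


def make_silent_wav(rate_hz, seconds=1):
    """Build a tiny in-memory mono 16-bit WAV (demo stand-in for a real recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * rate_hz * seconds)
    buffer.seek(0)
    return buffer


for rate in (8_000, 16_000, 44_100):
    measured, ok = check_sample_rate(make_silent_wav(rate))
    verdict = "ok" if ok else "below the 16 kHz guideline"
    print(f"{measured} Hz: {verdict}")
```

With a real file, you would pass its path instead, for example `check_sample_rate("meeting.wav")` (a hypothetical file name). Note that the `wave` module reads uncompressed WAV only, so other formats would need a different tool.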

Even if the equipment captures good-quality recordings, compressing the file (for example, to make it easier to send) might cause the quality of the ASR to degrade. The reason for this is that compression tends to be lossy – that is, it does not retain all the information embedded in the original audio file. Thus, it’s always best to provide the original audio for transcription when possible.
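To see why lossy compression cannot be undone, consider the toy sketch below. It is not how real codecs such as MP3 work (they rely on perceptual models). It only illustrates the principle: once information is discarded, it cannot be recovered, whereas a lossless round trip returns the exact original. The test tone and the bit-dropping scheme are assumptions chosen for illustration.

```python
import math
import struct
import zlib


def sine_samples(freq_hz=440, rate_hz=16_000, count=160, amplitude=12_000):
    """Generate a short 16-bit test tone (stand-in for recorded speech)."""
    return [
        round(amplitude * math.sin(2 * math.pi * freq_hz * n / rate_hz))
        for n in range(count)
    ]


def lossy_round_trip(samples, kept_bits=8):
    """Keep only the top `kept_bits` of each 16-bit sample, then expand back."""
    shift = 16 - kept_bits
    return [(s >> shift) << shift for s in samples]


def lossless_round_trip(samples):
    """Compress and decompress with zlib; every sample comes back unchanged."""
    packed = struct.pack(f"<{len(samples)}h", *samples)
    restored = zlib.decompress(zlib.compress(packed))
    return list(struct.unpack(f"<{len(samples)}h", restored))


original = sine_samples()
degraded = lossy_round_trip(original)
max_error = max(abs(a - b) for a, b in zip(original, degraded))

print("lossless round trip identical:", lossless_round_trip(original) == original)
print("lossy round trip identical:", degraded == original)
print("largest sample error after lossy round trip:", max_error)
```

The lossy version still resembles the original tone, but the dropped detail is gone for good, which is why the original file is the safer thing to send.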

Human recording errors

Even the best-quality microphones won’t produce high-quality recordings if they are not positioned correctly to capture the speaker(s). Placing the mic too far from the speaker or too close to the speaker’s mouth can cause the audio to become echoey, muffled or distorted.

One interesting example, and one of the most challenging scenarios, is that of people speaking over one another. Even with the equipment in the right place, if multiple people are speaking at the same time, it can be challenging for both humans and machines to transcribe the audio. The reasons are manifold: there are acoustic and algorithmic difficulties, as well as the simple practical question of which speakers to transcribe and which to ignore.

Despite the challenges low-quality recordings pose, it is still possible to produce a high-quality, accurate transcript from them. However, this will usually require extra time and effort, which will naturally translate into higher costs. With extremely poor recordings, it’s also likely that the final result will include some errors and inaudible sections, where even the best ASR and the most experienced transcribers can’t tell what the speakers are saying.

Best practices for improving audio quality

Now that we’ve listed the challenges bad audio poses, let’s move on to the good news: there are plenty of ways to capture better audio recordings. Let’s review some tips and tricks to achieve the best audio quality.

Take time to get the setup right

The right setup will depend on the type of event, the location, the number of speakers and where they’ll be presenting from. Since each space will have unique challenges, it’s a good idea to make a test recording before the event, listen to it and make changes if the quality is poor.

If the venue is a large room, give a microphone to each speaker. However, don’t leave all the microphones on at the same time if the speakers are near one another. Instead, switch each microphone on only while its speaker is presenting. Although this arrangement is often the best, it isn’t always feasible. Another option is to have one mic and have the speakers take turns speaking into it when they have the stage.

[Image: a light that says "recording" indicates that an audio recording is in progress]

Educate the speakers

Instruct your speakers to keep their microphones between 6 and 15 inches from their mouths. At this range, the ASR output can be 100% accurate, whereas the same speaker talking ten feet from the mic can produce an error rate of almost 20%, meaning about one in every five words would be incorrect.
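Error rates like the one above are commonly expressed as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the ASR output into the correct transcript, divided by the number of words in the correct transcript. Here is a minimal sketch of that calculation. The function name and the example sentences are our own, and this is not a description of how Verbit measures accuracy internally.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from the output
            insertion = current[j - 1] + 1   # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "please keep the microphone close to your mouth while speaking"
hypothesis = "please keep the microphone clothes to your mouse while speaking"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # 2 wrong words out of 10 -> 20%
```

In this made-up example, two of the ten words are wrong, which is the "one in every five words" situation described above.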

The second major tip in this category is to better control the dynamic of the conversation. Specifically, speakers should avoid talking over one another, and if audience members can ask questions, the speaker should repeat those questions to ensure that the audio file captures the questions as well as the answers.

How to select the right microphone

Choosing the best microphone isn’t just about quality. There are three main categories of microphones: omnidirectional, bidirectional and unidirectional. Each category serves a different purpose.

Omnidirectional microphones capture sound from all directions. If multiple people are speaking, these microphones will pick up every voice, whether the speakers are in front of, next to or across from the microphone and main speaker.
Bidirectional microphones capture in two opposite directions. For example, if two people speak on a podcast or interview, a bidirectional microphone can record both equally.
Unidirectional microphones will only capture audio from one direction. This type of microphone can help eliminate background noises and may be a good option for a setting like a lecture. However, it’s critical that the unidirectional microphone is in the correct place to capture the intended person’s speech.

Regardless of the type of microphone, test it before the recording. Also, Verbit is happy to offer more information or recommendations for microphones, so don’t hesitate to reach out.

Don’t mess with the audio files

Processing the audio files after recording can cause more problems than it solves. Professional transcription companies have their own ways of optimizing audio files for their ASR and transcription processes.

Even an AI microphone with background noise cancellation can cause issues with the recording. Fancy processing tools and software can hinder the performance of ASR because they change the original audio signal, which is what the ASR is optimized for.

Another thing to avoid at all costs is playing a recording into a microphone. For instance, don’t use a computer to record audio that is playing from a pre-recorded file on a smartphone. While this sometimes happens in courtrooms, it can create very poor-quality audio files that are extremely difficult to transcribe.

Regardless of how carefully a recording is made, it’s still critical to listen to the file before sending it. It’s worth knowing what the actual recording sounds like to avoid sending in something that is unintelligible.

[Image: a microphone in front of a crowd of listeners]

How Verbit copes with bad audio

Sometimes, it’s necessary to transcribe audio files that aren’t the best quality. Fortunately, Verbit works to make it possible to create transcriptions even if the audio isn’t great. First, we train our ASR using examples from lots of acoustic conditions and noisy backgrounds to make it more robust and able to handle challenging audio files.
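For readers curious about what training on noisy examples can look like in practice, one common approach in speech research is to mix background noise into clean recordings at a chosen signal-to-noise ratio (SNR). The sketch below illustrates that general idea only. It is not a description of Verbit's training pipeline, and the function names, the test tone standing in for speech and the random noise are all assumptions made for the example.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it to `speech`."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise must not be silent")
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean, mixed):
    """Recover the SNR of a mix, given the clean signal it was built from."""
    residual = [m - c for c, m in zip(clean, mixed)]
    return 20 * math.log10(rms(clean) / rms(residual))


rng = random.Random(0)  # fixed seed so the demo is repeatable
speech = [0.5 * math.sin(2 * math.pi * 220 * n / 16_000) for n in range(16_000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16_000)]

for snr_db in (20, 10, 0):
    noisy = mix_at_snr(speech, noise, snr_db)
    print(f"target {snr_db} dB, measured {measured_snr_db(speech, noisy):.2f} dB")
```

A lower SNR means louder noise relative to the speech, so generating the same sentence at several SNR levels gives a model examples that range from nearly clean to very noisy.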

Verbit also uses a dual approach. The ASR creates a first draft, but human transcribers then review the file to make edits and corrections. The transcribers can often fix most of the ASR errors.

Additionally, Verbit gives its transcribers the ability to reduce background noise and adjust the sound quality to hear the conversation or lecture clearly. This, in turn, ensures more precise transcription and faster work by the transcribers.

Customers can also send in their files to help us train the AI to accommodate their needs. With these sample files, Verbit can offer faster, more accurate transcripts. Even in the case of live events, Verbit can work with customers to identify their issues and help them fix their audio output to ensure the best live transcription or captioning results.

Partnering with Verbit means having a team of experts ready to help you through each step of the process. To learn more about our transcription solutions and how to capture the best possible audio quality, reach out to Verbit today.