Transcription of audio files with OpenAI’s Whisper

Photo by Will Francis on Unsplash.

OpenAI recently open-sourced a neural network called Whisper. It allows you to transcribe (large) audio files like mp3 offline. OpenAI claims Whisper approaches human-level robustness and accuracy in English speech recognition.

Since there are already existing (open-source) models or packages like Vosk or NVIDIA NeMo out there, I was wondering how well Whisper can transcribe audio files.

This article shows you how to make use of Whisper and compares its performance with Vosk, another offline open-source speech recognition toolkit.

  • Whisper is an open-source, multilingual, general-purpose speech recognition model by OpenAI.
  • It needs only three lines of code to transcribe an (mp3) audio file.
  • A quick comparison with Vosk (another open-source toolkit) has shown that Whisper transcribes the audio of a podcast excerpt slightly better. The main difference is that Whisper offers punctuation. This makes the transcription easier to understand.
  • Scroll down to “Whisper” or click here (Gist) if you are interested in the code only.

Before we start with the transcription and comparison of the two models, we have to make sure that a few prerequisites are met.

Install ffmpeg

To be able to read audio files, we have to install a command line tool named ffmpeg first. Depending on your OS, you have the following options:

# Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

Install Whisper

Use the following pip command to install Whisper:

pip install git+https://github.com/openai/whisper.git

Install Vosk (optional)

Since this article compares Whisper and Vosk, I will also show you how to install and configure Vosk. In case you are only interested in using Whisper, you can skip this part. The following command installs Vosk:

pip install vosk

Install pydub (optional)

To use Vosk, we first have to convert audio files into .wav files with one channel (mono) and a 16,000 Hz sample rate. Whisper does this conversion as well, but we do not have to write any extra code for it. The command below installs pydub:

pip install pydub

For Mac Users: Install Certificates

Whisper will load specific language models to transcribe the audio. In case you are a Mac user, you might later get the following message:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)>

To fix this issue, please go into the folder of your Python installation (usually Applications > Python 3.x) and run the “Install Certificates.command” file. More information can be found here.

To compare both approaches, I will use the same audio source that I used in my article about Vosk: a random podcast episode.

Audio file from the “Opto Sessions – Interviews with extraordinary investors” podcast. Episode #69 on Spotify.

Please note: This was a random choice; I do not have any connection with the creators, nor do I get paid for naming them. In case you see an error (e.g., 502 Bad Gateway) here, please reload the page.

Since the transcribed outcome of the 68-minute podcast episode would be quite long, I decided to slice out 60 seconds (otherwise, this article would contain more transcribed text than explanations). The challenge here is that the first 30 seconds are mainly part of the intro, which contains quotes from different audio sources.

In case you just want to see how to transcribe audio files with Whisper, you do not have to apply the code below. You can read .mp3 files with Whisper directly.
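The preparation step can be sketched with pydub as follows (file names are placeholders); we cut out the first 60 seconds and, for Vosk, export them as a mono, 16,000 Hz .wav file:

```python
from pydub import AudioSegment

# Load the full episode (mp3 decoding requires ffmpeg).
audio = AudioSegment.from_mp3("podcast.mp3")

# pydub slices in milliseconds: keep the first 60 seconds.
excerpt = audio[:60 * 1000]

# Whisper can read this mp3 excerpt directly.
excerpt.export("excerpt.mp3", format="mp3")

# Vosk needs mono, 16,000 Hz WAV input.
excerpt.set_channels(1).set_frame_rate(16000).export("excerpt.wav", format="wav")
```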

With our prepared audio file, we can start transcribing it with Whisper and Vosk.

Whisper

Before we transcribe the respective audio file, we have to download a pre-trained model first. Currently, five model sizes are offered (table 1).

Table 1. Overview of Whisper’s different models (Whisper’s GitHub page).

The authors mention on their GitHub page that for English-only applications, the .en models tend to perform better, especially tiny.en and base.en, while the difference becomes less significant for small.en and medium.en.

Whisper’s GitHub page contains more information about the performance of their models.

The code below shows how to download the language model and run the transcription:
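A minimal sketch of those three lines, assuming the excerpt is stored as excerpt.mp3 (any model size from table 1 can be passed to load_model):

```python
import whisper

# Downloads the model weights on first use.
model = whisper.load_model("base")

# Whisper converts and resamples the mp3 internally via ffmpeg.
result = model.transcribe("excerpt.mp3")

print(result["text"])
```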

With only three lines of code, we are able to transcribe an audio file. The transcription took about 65 seconds; the result can be seen below.

I am using an old MacBook Pro (Mid 2015, 2.2 GHz Quad-Core Intel Core i7, 16 GB RAM).

Right, that's the number one technical indicator. You do best by investing for the warm return. If you can't explain what the business is doing, then that is a huge red flag. Some technological change is going to put you out of business. It really is a genuinely extraordinary situation. Hi everyone, I'm Ed Goffam and welcome to OptoSessions where we interview the top traders and investors from around the world uncovering the secrets to success. On today's show, I'm delighted to introduce Beth Kindig, a technology analyst with over a decade of experience in the private markets. She's now the co founder of I.O. Fund, which specialises in helping individuals gain a competitive advantage when investing in tech growth stocks. How does Beth do this? She's gained hands on experience over the years. Here's fast eye Organul..

The transcribed outcome is pretty good and comprehensive. As mentioned above, the intro of the podcast contains audio cuts (e.g., quotes) varying in their quality from different sources. Nevertheless, Whisper transcribes the audio pretty well and also takes care of punctuation (e.g., commas, full stops and question marks).

Vosk

Similar to Whisper, we also have to download a pre-trained model for Vosk. A list of all available models can be found here. I decided to go with one of the largest ones (1.8 GB): vosk-model-en-us-0.22.

The code below shows how to make use of Vosk:
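The original snippet is embedded as a Gist; the following is a sketch of the usual Vosk recognition loop, assuming the excerpt has already been converted to a mono, 16,000 Hz .wav file:

```python
import json
import wave

from vosk import KaldiRecognizer, Model

wf = wave.open("excerpt.wav", "rb")        # mono, 16,000 Hz WAV
model = Model("vosk-model-en-us-0.22")     # path to the unzipped model folder

rec = KaldiRecognizer(model, wf.getframerate())
transcription = []

while True:
    data = wf.readframes(4000)             # read 4,000 frames per iteration
    if len(data) == 0:                     # no more frames: stop
        break
    if rec.AcceptWaveform(data):           # an utterance is complete
        result_dict = json.loads(rec.Result())
        transcription.append(result_dict["text"])

# FinalResult() returns the remaining text and flushes the pipeline.
final_dict = json.loads(rec.FinalResult())
transcription.append(final_dict["text"])

print(" ".join(transcription))
```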

Even though Whisper has a similar approach (sliding 30-second window), we have to do a bit more manual coding here.

First, we load our audio file and the model. Then we read the audio in chunks of 4,000 frames and hand each chunk over to the recognizer. The model returns its outcome in JSON format, which is stored as a dict in result_dict. We then extract the text value only and append it to our transcription list.

When there are no more frames to read, the loop stops and we catch the final results by calling the FinalResult() method. This method also flushes the whole pipeline.

After 35 seconds, the following result was created:

the price is the number one technical indicator you do best by investing for the longer term if you can't explain what the business is doing then that is a huge red flag some technological changes puts you out of business it really is a genuine be extraordinary situation hi everyone i am a gotham and welcome to opto sessions where we interview top traders and investors from around the world uncovering the secrets to success on today's show i'm delighted to introduce beth can dig a technology analyst with over a decade of experience in the private markets she's now the co-founder of iowa fund which specializes in helping individuals gain a competitive advantage when investing in tech growth stocks how does beth do this well she's gained hands-on experience over the years whilst i

To better compare it with Whisper’s output, I paste it here again:

Right, that's the number one technical indicator. You do best by investing for the warm return. If you can't explain what the business is doing, then that is a huge red flag. Some technological change is going to put you out of business. It really is a genuinely extraordinary situation. Hi everyone, I'm Ed Goffam and welcome to OptoSessions where we interview the top traders and investors from around the world uncovering the secrets to success. On today's show, I'm delighted to introduce Beth Kindig, a technology analyst with over a decade of experience in the private markets. She's now the co founder of I.O. Fund, which specialises in helping individuals gain a competitive advantage when investing in tech growth stocks. How does Beth do this? She's gained hands on experience over the years. Here's fast eye Organul..

We can see that Vosk does a good job transcribing most of the words correctly. However, its biggest weak point is that it does not use punctuation. This makes it quite hard to read the transcription.

