
How to Perform Speech-to-Text and Translate Any Speech to English With OpenAI’s Whisper | by Zoumana Keita | Dec, 2022



Image by Jonathan Velasquez on Unsplash

OpenAI is a leading player in the field of Artificial Intelligence and has made many AI models, including GPT and CLIP, accessible to the community.

Open-sourced by OpenAI, the Whisper models are reported to approach human-level robustness and accuracy in English speech recognition.

This article walks you through all the steps to transform long pieces of audio into textual information with OpenAI’s open-source Whisper library.

By the end of this article, you will be able to transcribe both English and non-English audio into text, and to translate non-English speech into English.

Whisper models were developed to study the capabilities of speech-processing systems for speech recognition and translation tasks. They can transcribe speech audio directly into text.

Whisper was trained on 680,000 hours of labeled audio data, which the authors report to be one of the largest datasets ever created for supervised speech recognition. The model’s performance was also evaluated by training a series of medium-sized models on subsampled versions of the data corresponding to 0.5%, 1%, 2%, 4%, and 8% of the full dataset size, as shown below.

5 different subsampled versions of the original training data (Image by Author)

This section covers all the steps from installing and importing the relevant modules to implementing the audio transcription and translation cases.

Installation and initializations

To begin, you need Python installed on your computer along with the Whisper library. The latest version can be installed from GitHub using the Python package manager pip as follows:

!pip install git+https://github.com/openai/whisper.git 

Now, we need to install the ffmpeg tool, which is used for audio and video processing. The installation process differs depending on your operating system.

Since I am using a Mac, here is the corresponding command:

# Installation on macOS
brew install ffmpeg

Please refer to the snippet matching your platform:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
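Whichever route you take, it is worth confirming that the installation worked. The small sketch below, using only the Python standard library, checks whether the ffmpeg binary is visible on your PATH:

```python
import shutil

def ffmpeg_available():
    """Return the path to the ffmpeg binary, or None if it is not on PATH."""
    return shutil.which("ffmpeg")

path = ffmpeg_available()
print(path if path else "ffmpeg not found -- install it before running Whisper")
```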

What if you don’t want to bother with all of these configs?

→ Google Colab can save your life in such a situation, and it also provides a free GPU that you can enable as follows:

Runtime configuration to use GPU on Google Colab (Image by Author)

Using the nvidia-smi command, we can get information about the GPU allocated to us; here is mine.

!nvidia-smi
GPU information on my Google Colab (Image by Author)

Once you have everything installed, you can import the modules and load the model. In our case, we will use the large model, which has 1,550M parameters and requires ~10 GB of VRAM. Processing will be considerably faster on a GPU than on a CPU.

# Import the libraries 
import whisper
import torch

# Initialize the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model
whisper_model = whisper.load_model("large", device=device)

  • In the load_model() function, we pass the device initialized on the line before. By default, new tensors are created on the CPU unless specified otherwise.

Now is the time to start extracting audio files…

Audio Transcription

This section illustrates the strengths of Whisper for transcribing audio in different languages.

The general workflow in this section is the following.

Speech to text workflow of the article (Image by Author)

The first two steps are performed with the helper function below. But before that, we need to install the pytube library with the following pip statement, so that we can download the audio from YouTube.

# Install the module
!pip install pytube

# Import the module
from pytube import YouTube

Then, we can implement the helper function as follows:

import os

def video_to_audio(video_URL, destination, final_filename):

    # Get the video
    video = YouTube(video_URL)

    # Keep only the audio stream
    audio = video.streams.filter(only_audio=True).first()

    # Save to destination
    output = audio.download(output_path=destination)

    # Rename the downloaded file with an .mp3 extension
    _, ext = os.path.splitext(output)
    new_file = os.path.join(destination, final_filename + '.mp3')
    os.rename(output, new_file)

The function takes three parameters:

  • video_URL: the full URL of the YouTube video.
  • destination: the location where the final audio is saved.
  • final_filename: the name to give the final audio file.

Finally, we can use the function to download the video and convert it into audio.

English transcription

The video used here is a 30-second motivational speech on YouTube from Motivation Quickie. Only the first 17 seconds correspond to actual speech; the rest is noise.

# Video to Audio
video_URL = 'https://www.youtube.com/watch?v=E9lAeMz1DaM'
destination = "."
final_filename = "motivational_speech"
video_to_audio(video_URL, destination, final_filename)

# Audio to text
audio_file = "motivational_speech.mp3"
result = whisper_model.transcribe(audio_file)

# Print the final result
print(result["text"])

  • video_URL is the link to the motivational speech.
  • destination is my current folder, corresponding to `.`
  • motivational_speech will be the final name of the audio file.
  • whisper_model.transcribe(audio_file) applies the model to the audio file to generate the transcription.
  • The transcribe() function preprocesses the audio with a sliding 30-second window and performs autoregressive sequence-to-sequence predictions on each window.
  • Finally, the print() statement generates the following result.
I don't know what that dream is that you have. 
I don't care how disappointing it might have been as you've
been working toward that dream.
But that dream that you're holding in your mind that it's possible.

Below is the corresponding video you can play to check the previous output.
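To build intuition for the sliding 30-second window mentioned earlier, here is a simplified sketch. It is an illustration only, not Whisper’s actual implementation; the 16 kHz sample rate matches what Whisper resamples audio to.

```python
SAMPLE_RATE = 16_000      # Whisper resamples all audio to 16 kHz
WINDOW_SECONDS = 30

def chunk_audio(samples, sample_rate=SAMPLE_RATE, window_s=WINDOW_SECONDS):
    """Split a flat list of samples into consecutive 30-second windows."""
    window = sample_rate * window_s
    return [samples[i:i + window] for i in range(0, len(samples), window)]

# A fake 75-second signal yields three windows of 30 s, 30 s, and 15 s
fake_audio = [0.0] * (SAMPLE_RATE * 75)
print([len(c) // SAMPLE_RATE for c in chunk_audio(fake_audio)])  # → [30, 30, 15]
```

The model then decodes each window in turn, which is how it handles recordings far longer than 30 seconds.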

Non-English transcription

In addition to English, Whisper can also deal with non-English languages. Let’s have a look at Alassane Dramane Ouattara’s interview on YouTube.

Similarly to the previous approach, we get the video, convert it to audio, and transcribe the content.

URL = "https://www.youtube.com/watch?v=D8ztTzHHqiE"
destination = "."
final_filename = "discours_ADO"
video_to_audio(URL, destination, final_filename)

# Run the test
audio_file = "discours_ADO.mp3"
result_ADO = whisper_model.transcribe(audio_file)

# Show the result
print(result_ADO["text"])
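Besides the "text" field, the dictionary returned by transcribe() also contains timestamped "segments", each with a start time, an end time, and the text spoken in that span. A small helper can render them; the demo segments below are illustrative, not real model output.

```python
def format_segments(segments):
    """Render Whisper-style segments as '[start -> end] text' lines."""
    return [f"[{s['start']:5.1f}s -> {s['end']:5.1f}s] {s['text'].strip()}"
            for s in segments]

# Illustrative segments mimicking the shape of result_ADO["segments"]
demo = [
    {"start": 0.0, "end": 4.5, "text": " Le Front CFA, vous l'avez toujours défendu,"},
    {"start": 4.5, "end": 7.2, "text": " bec et ongle."},
]
print("\n".join(format_segments(demo)))
```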

→ Video discussion:

President Alassane Ouattara’s discussion about the Franc CFA on YouTube

→ Model result from the print() statement.

Below is the final result, and it is mind-blowing 🤯. The only mistranscription is “Franc CFA”, which the model recognized as “Front CFA” 😀.

Le Front CFA, vous l'avez toujours défendu, bec et ongle, est-ce que vous 
continuez à le faire ou est-ce que vous pensez qu'il faut peut-être changer
les choses sans rentrer trop dans les tailles techniques? Monsieur Perelman,
je vous dirais tout simplement qu'il y a vraiment du n'importe quoi dans ce
débat. Moi, je ne veux pas manquer de modestie, mais j'ai été directeur des
études de la Banque Centrale, j'ai été vice-gouverneur, j'ai été gouverneur
de la Banque Centrale, donc je peux vous dire que je sais de quoi je parle.
Le Front CFA, c'est notre monnaie, c'est la monnaie des pays membres et nous
l'avons acceptée et nous l'avons développée, nous l'avons modifiée. J'étais
là quand la reforme a eu lieu dans les années 1973-1974, alors tout ce débat
est un nonsense. Maintenant, c'est notre monnaie. J'ai quand même eu à
superviser la gestion monétaire et financière de plus de 120 pays dans le
monde quand j'étais au Fonds Monétaire International. Mais je suis bien placé
pour dire que si cette monnaie nous pose problème, écoutez, avec les autres
chefs d'État, nous prendrons les décisions, mais cette monnaie est solide,
elle est appropriée. Les taux de croissance sont parmi les plus élevés sur le
continent africain et même dans le monde. Le Côte d'Ivoire est parmi les dix
pays où le taux de croissance est le plus élevé. Donc c'est un nonsense,
tout simplement, de la démagogie et je ne souhaite même pas continuer ce débat
sur le Front CFA. C'est la monnaie des pays africains qui ont librement
consenti et accepté de se mettre ensemble. Bien sûr, chacun de nous aurait pu
avoir sa monnaie, mais quel serait l'intérêt? Pourquoi les Européens ont
décidé d'avoir une monnaie commune et que nous les Africains ne serons pas en
mesure de le faire? Nous sommes très fiers de cette monnaie, elle marche bien,
s'il y a des adaptations à faire, nous le ferons de manière souveraine.

Non-English transcription into English

In addition to speech recognition, spoken-language identification, and voice activity detection, Whisper can also translate speech from many languages into English.

In this last section, we will generate the English translation of the following French comedy video.

Comic video from YouTube

The process does not change much from what we have seen above. The major change is the use of the task parameter in the transcribe() function.

URL = "https://www.youtube.com/watch?v=hz5xWgjSUlk"
final_filename = "comic"
video_to_audio(URL, destination, final_filename)

# Run the test
audio_file = "comic.mp3"
french_to_english = whisper_model.transcribe(audio_file, task = 'translate')

# Show the result
print(french_to_english["text"])

  • task='translate' means that we are performing a translation task. Below is the final result.
I was asked to make a speech. I'm going to tell you right away, 
ladies and gentlemen, that I'm going to speak without saying anything.
I know, you think that if he has nothing to say, he would better shut up.
It's too easy. It's too easy. Would you like me to do it like all those who
have nothing to say and who keep it for themselves? Well, no, ladies and
gentlemen, when I have nothing to say, I want people to know. I want to make
others enjoy it, and if you, ladies and gentlemen, have nothing to say, well,
we'll talk about it. We'll talk about it, I'm not an enemy of the colloquium.
But tell me, if we talk about nothing, what are we going to talk about? Well,
about nothing. Because nothing is not nothing, the proof is that we can
subtract it. Nothing minus nothing equals less than nothing. So if we can find
less than nothing, it means that nothing is already something. We can buy
something with nothing by multiplying it. Well, once nothing, it's nothing.
Twice nothing, it's not much. But three times nothing, for three times nothing,
we can already buy something. And for cheap! Now, if you multiply three times
nothing by three times nothing, nothing multiplied by nothing equals nothing,
three multiplied by three equals nine, it's nothing new. Well, let's talk
about something else, let's talk about the situation, let's talk about the
situation without specifying which one. If you allow me, I'll briefly go over
the history of the situation, whatever it is. A few months ago, remember,
the situation, not to be worse than today's, was not better either. Already,
we were heading towards the catastrophe and we knew it. We were aware of it,
because we should not believe that the person in charge of yesterday was more
ignorant of the situation than those of today. Besides, they are the same.
Yes, the catastrophe where the pension was for tomorrow, that is to say that
in fact it should be for today, by the way. If my calculations are right,
but what do we see today? That it is still for tomorrow. So I ask you the
question, ladies and gentlemen, is it by always putting the catastrophe that
we could do the day after tomorrow, that we will avoid it? I would like to
point out that if the current government is not capable of taking on the
catastrophe, it is possible that the opposition will take it.
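If you want to apply the same treatment to a whole folder of audio files, the calls can be wrapped in a small loop. The sketch below stubs out the model call so it stays self-contained; in practice you would pass whisper_model.transcribe in place of the fake function.

```python
from pathlib import Path

def transcribe_folder(folder, transcribe_fn, task="transcribe"):
    """Apply a transcription function to every .mp3 file in a folder."""
    results = {}
    for audio in sorted(Path(folder).glob("*.mp3")):
        results[audio.name] = transcribe_fn(str(audio), task=task)["text"]
    return results

# Stub standing in for whisper_model.transcribe, for demonstration only
def fake_transcribe(path, task="transcribe"):
    return {"text": f"({task}) transcript of {Path(path).name}"}
```

Calling transcribe_folder("./audios", whisper_model.transcribe, task="translate") would then translate every .mp3 file under ./audios in one pass.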

Congratulations 🎉! You have just learned how to perform speech-to-text and have also applied machine translation! So many use cases can be tackled with this model.

If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!

GitHub of Whisper

Robust Speech Recognition via Large-Scale Weak Supervision



