
Build a Speech to Text Web App using Node.js

Let’s build a web app which transcribes and translates audio using OpenAI’s Whisper Model

Photo by AltumCode on Unsplash

Hello folks! I hope you all are doing well. Today, we will build a Speech to Text web app using Node.js and OpenAI’s API. We will use OpenAI’s API to access its Whisper model, which lets us upload audio files in mp3 format and returns their transcripts. It can even translate audio in other languages into English text, which is incredible.

First of all, we will set up a new Node.js project so that we can start building our application. We create a folder for the project, move into it using the command line, and then set up a new Node.js project using the following command:

npm init

After running this command, it will ask several questions, such as the app’s name, entry point, etc. We can keep the defaults for now. After this, you will see that it has created a package.json file. This file contains information about our application and the packages we have installed for it.

So, the next step is to install the necessary node modules, i.e. packages, into our application so that we are ready to start building the application. We can do that by running the below command:

npm install express multer openai cors --save

We install these four packages and also use --save to add them to the package.json file (recent versions of npm do this by default, so the flag is optional). It makes it easier for someone cloning the repository to install all the required packages by running the npm install command once.

We also want to use the nodemon package in our application to automatically restart the server whenever it detects changes in the code, so that we do not have to restart it manually after each change. We will add it as a development dependency, since it is only needed during development and is not used directly in the code. We can install it with the following command:

npm install --save-dev nodemon
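To make nodemon convenient to use, we can also add npm scripts to package.json. The article does not show this step, so treat the script names below as a suggestion rather than part of its code:

"scripts": {
  "start": "node index.js",
  "dev": "nodemon index.js"
},

With this in place, npm run dev starts the server with automatic reloads during development, while npm start runs it normally.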

We now have all the necessary packages to start our development work. The package.json file now lists all the modules and packages we installed, along with a few details about the application. It should look like this:

{
  "name": "speechtext",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "cors": "^2.8.5",
    "express": "^4.18.2",
    "multer": "^1.4.5-lts.1",
    "openai": "^3.2.1"
  },
  "devDependencies": {
    "nodemon": "^2.0.22"
  }
}

So, as we can see, index.js is set as the main field, signifying that the index.js file is the entry point for our application. If you remember, this was asked during the setup process when we ran the npm init command. If you left it as the default, you will have the same entry point; otherwise, it will be whatever you defined at that time.

Now, we will create a new file named index.js. You may name the file differently if you defined a different entry point; we will use index.js here.

index.js

We will now start building the index.js file, beginning by importing the required modules. For the index file, we need express and cors, so we start by requiring these two modules:

const express = require('express');
const cors = require('cors');

Next, we create a new instance of the Express application. We also set up our application to use cors, parse JSON data and serve static files from the public folder, so that they can be accessed by the client side, i.e. the frontend.

const app = express();
app.use(express.static('public'));
app.use(express.json());
app.use(cors());

Next, we would like to have a separate file where we define the APIs. We will create a folder named routes and, inside it, a file named api.js where we will define the GET and POST APIs needed in the application. To inform the application about this, we add the following line of code, which specifies the base URL and the location of the file where all the APIs are defined. It is middleware that sets up the routing for the application.

app.use('/', require('./routes/api'));
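If you want to start the server before the real routes file exists, a minimal placeholder for routes/api.js such as the one below keeps the require call from failing. It is only a sketch for testing the setup; we will replace it with the full file shortly:

const express = require('express');
const router = express.Router();

// Temporary route so the server has something to respond with.
router.get('/', (req, res) => {
  res.send('API is up');
});

module.exports = router;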

Next, we add an error-handling middleware function, which will handle any errors occurring in the application.

app.use(function(err, req, res, next){
  res.status(422).send({error: err.message});
});
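As a quick illustration (not part of the final app), any route handler that passes an error to next() will land in this middleware, and the client will receive a 422 response containing the error message:

// Hypothetical route, for illustration only:
app.get('/boom', (req, res, next) => {
  next(new Error('something went wrong')); // handled above -> { "error": "something went wrong" }
});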

Finally, we set up the application to listen for incoming requests on a specified port number, which we can either set by using environment variables or simply define.

app.listen(process.env.PORT || 4000, function(){
  console.log('Ready to Go!');
});

We have used port 4000 for our application. We also have a simple console.log inside the callback, which prints a message to the console when the application is ready to receive requests.

The complete index.js file:

const express = require('express');
const cors = require('cors');

const app = express();
app.use(express.static('public'));
app.use(express.json());
app.use(cors());

app.use('/', require('./routes/api'));

app.use(function(err, req, res, next){
  res.status(422).send({error: err.message});
});

app.listen(process.env.PORT || 4000, function(){
  console.log('Ready to Go!');
});

Next, we will be moving to the api.js file, which we created inside of the routes folder.

api.js

So, we will now start building the api.js file. We will start by importing the required modules: the express, multer and openai libraries, along with Node’s built-in path module, which we will need later to build the file path for sendFile.

const express = require("express");
const multer = require("multer");
const { Configuration, OpenAIApi } = require("openai");

Multer is a middleware which we are using to handle multipart/form-data, as we would be dealing with the upload of audio files.

From openai, we require the Configuration and the OpenAIApi modules that we would use to post API requests to the Whisper model.

We will then set up the express router and create an instance of the multer middleware.

const router = express.Router();
const upload = multer();

Next, we configure OpenAI by creating a new Configuration instance. It requires an OpenAI secret key as the API key, which we read from the OPENAI_KEY environment variable rather than hard-coding it. You can generate a secret key from the API keys page of your OpenAI account dashboard.

const configuration = new Configuration({
  apiKey: process.env.OPENAI_KEY,
});
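Since the key is read from an environment variable, it has to be set before the server starts. A minimal sketch, assuming a Unix-like shell and the variable name OPENAI_KEY used above (the dotenv package is another common option, but it is not among the packages we installed):

# Replace the placeholder with your actual secret key
export OPENAI_KEY="sk-..."
npx nodemon index.js   # or: node index.js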

Now, we create an async function which accepts a buffer containing the audio data and returns the response received from OpenAI’s Whisper model when we call its API.

async function transcribe(buffer) {
  const openai = new OpenAIApi(configuration);
  const response = await openai.createTranscription(
    buffer,      // The audio file to transcribe.
    "whisper-1", // The model to use for transcription.
    undefined,   // The prompt to use for transcription.
    'json',      // The format of the transcription.
    1,           // Temperature
    'en'         // Language
  );
  return response;
}

As you can see above, we first create a new instance of the OpenAIApi class using the configuration we defined earlier in our code. We then call OpenAI’s createTranscription function, using the await keyword so that we wait for the response before moving ahead.

We pass the required parameters to the function: the buffer containing the audio data and the model to use for the transcription, which is whisper-1 in our case. We leave the prompt undefined; you can provide one if you like, which helps the model transcribe the audio in a style similar to the prompt you provide. We set the response format to json, set the temperature to 1 and define the language in which we want the output.
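For reference, the value this function resolves to is the HTTP response object returned by the openai v3 SDK (which uses axios under the hood), so the transcribed text lives under data.text. Roughly:

// Approximate shape of the resolved response (other fields omitted):
// {
//   status: 200,
//   data: {
//     text: "The transcribed text of the uploaded audio..."
//   }
// }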

Next, we define the GET route. We use sendFile to send an HTML file which contains the form where users can upload their audio files (we will build the HTML files later), and we serve it on the base URL. Note that this uses the path module we imported above.

router.get("/", (req, res) => {
res.sendFile(path.join(__dirname, "../public", "index.html"));
});

Next, we define the POST route, which will handle the upload of audio files. We use the multer middleware to manage the file upload part. We then take the buffer from the uploaded file, which contains the audio data in a format that can be sent to the OpenAI API, and set a name on the buffer using the original name of the uploaded audio file.

We then call the transcribe function, and once we get the response, we send a JSON back to the client. We send the transcription and the audio file name back to the frontend. We also have a catch method to handle any errors.

router.post("/", upload.any('file'), (req, res) => {
audio_file = req.files[0];
buffer = audio_file.buffer;
buffer.name = audio_file.originalname;
const response = transcribe(buffer);
response.then((data) => {
res.send({
type: "POST",
transcription: data.data.text,
audioFileName: buffer.name
});
}).catch((err) => {
res.send({ type: "POST", message: err });
});
});
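Before the frontend is in place, you can test this endpoint directly. A quick sketch using curl, assuming the server is running on port 4000 and there is a sample.mp3 in the current directory:

curl -X POST http://localhost:4000/ -F "file=@sample.mp3"
# Expected response shape: {"type":"POST","transcription":"...","audioFileName":"sample.mp3"}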

Finally, we export the router modules, which would then allow other files to import them.

module.exports = router;

So, the complete code for the api.js file:

const express = require("express");
const multer = require("multer");
const { Configuration, OpenAIApi } = require("openai");

const router = express.Router();
const upload = multer();

const configuration = new Configuration({
apiKey: process.env.OPENAI_KEY,
});

async function transcribe(buffer) {
const openai = new OpenAIApi(configuration);
const response = await openai.createTranscription(
buffer, // The audio file to transcribe.
"whisper-1", // The model to use for transcription.
undefined, // The prompt to use for transcription.
'json', // The format of the transcription.
1, // Temperature
'en' // Language
)
return response;
}

router.get("/", (req, res) => {
res.sendFile(path.join(__dirname, "../public", "index.html"));
});

router.post("/", upload.any('file'), (req, res) => {
audio_file = req.files[0];
buffer = audio_file.buffer;
buffer.name = audio_file.originalname;
const response = transcribe(buffer);
response.then((data) => {
res.send({
type: "POST",
transcription: data.data.text,
audioFileName: buffer.name
});
}).catch((err) => {
res.send({ type: "POST", message: err });
});
});

module.exports = router;

Now, we have completed all the backend parts. We will now be writing the HTML files and writing some frontend javascript code to handle form submissions and data saving in local storage and retrieving data from local storage.

We create a public folder, inside of which we will create two HTML files — index.html and transcribe.html.
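For orientation, the final project layout will look roughly like this (the root folder name is whatever you chose during npm init):

speechtext/
├── index.js
├── package.json
├── routes/
│   └── api.js
└── public/
    ├── index.html
    └── transcribe.html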

We will start with the index.html file:

index.html

So, in this file, we build the page where we display the form to upload an audio file. We will be using Bootstrap CSS, which we import through a CDN, and we also include Bootstrap JS through a CDN at the end of the HTML file.

We then create a simple card that asks the user to upload the audio file. The file input only accepts the .mp3 format, one of the audio formats supported by OpenAI’s API and the one this app expects. We display a button which, on clicking, submits the form.

We then have the javascript code, which we use to handle the form submission. So, at first, we stop the page from refreshing by preventing the default behaviour of the form submission event. We then take in the form data, i.e. the audio file and send it as a POST request to the backend. We then wait for a response and store it in a data variable.

If the data has the transcription available, we store the transcription and the audio file name in Local Storage so that we can access them on the next page, where we need to show the transcript. There are other ways to pass this information, such as through URL parameters, but here we are using Local Storage.
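For comparison, passing the result through the URL instead of Local Storage would look something like this (not used in this app):

// Alternative (not used here): pass the data in the query string
window.location.href = "/transcribe.html?text=" + encodeURIComponent(data.transcription);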

After saving the data to local storage, we change the window location to load the transcribe.html file.

<!DOCTYPE html>
<html>
<head>
<title>Speech to Text</title>
<link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-GLhlTQ8iRABdZLl6O3oVMWSktQOp6b7In1Zl3/Jr59b6EGGoI1aFkw7cmDA6j6gD" crossorigin="anonymous">
</head>

<body style="background-color: #f2f2f2;">
<div class="container mt-5">
<div class="row justify-content-center">
<div class="col-md-6">
<div class="card">
<div class="card-header">
Upload Audio File
</div>
<div class="card-body">
<form id="transcription-form" enctype="multipart/form-data">
<div class="form-group">
<label for="file-upload"><b>Select file:</b></label>
<input id="file-upload" type="file" name="file" class="form-control-file" accept=".mp3" style="margin-bottom: 20px">
</div>
<input type="submit" value="Transcribe" class="btn btn-primary"></input>
</form>
</div>
</div>
</div>
</div>
</div>

<script>
document.getElementById("transcription-form").addEventListener("submit", async function (event) {
event.preventDefault();

const formData = new FormData(event.target);
const response = await fetch("/", {
method: "POST",
body: formData,
});
const data = await response.json();

if (data.transcription) {
localStorage.setItem("transcription", data.transcription);
localStorage.setItem("audioFileName", data.audioFileName);
window.location.href = "/transcribe.html";
}
else {
console.error("Error:", data.message);
}
});
</script>

<script data-src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js" integrity="sha384-w76AqPfDkMBDXo30jS1Sgez6pr3x5MlQ1ZAGC+nuZB+EYdgRZgiwxhTBTkF7CXvN" crossorigin="anonymous"></script>
</body>
</html>

So, the above code builds the index.html file, which will display the form to the user where the user can upload the audio file.

Here is one screenshot of how that looks:

Audio Upload page — index.html

Next, we will build the transcribe.html file.

transcribe.html

So, in this file, we display the transcript of the audio file uploaded by the user. We will again be using Bootstrap CSS and JS, which we include via CDN.

We then define some custom CSS to stylize the elements to make them look better. We then display the audio file name and the transcription of that audio file in a container.

In the javascript code at the bottom of this page, we get the audio file name and the transcript from the local storage and push that data to the respective HTML elements using the ids.

<!DOCTYPE html>
<html>
<head>
<title>Transcription</title>
<link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-GLhlTQ8iRABdZLl6O3oVMWSktQOp6b7In1Zl3/Jr59b6EGGoI1aFkw7cmDA6j6gD" crossorigin="anonymous">

<style>
h1 {
margin-top: 20px;
margin-bottom: 10px;
font-size: 2.5rem;
font-weight: bold;
color: #333;
}

p {
font-size: 1.2rem;
color: #333;
margin-bottom: 30px;
}

.container {
margin-top: 50px;
margin-bottom: 50px;
max-width: 600px;
padding: 30px;
background-color: #fff;
box-shadow: 0 0 10px rgba(0,0,0,0.2);
border-radius: 5px;
}
</style>
</head>

<body style="background-color: #f2f2f2;">
<div class="container">
<h1>Audio File:</h1>
<p id="audioFileName"></p>

<h1>Transcription:</h1>
<p id="transcription"></p>
</div>

<script data-src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js" integrity="sha384-w76AqPfDkMBDXo30jS1Sgez6pr3x5MlQ1ZAGC+nuZB+EYdgRZgiwxhTBTkF7CXvN" crossorigin="anonymous"></script>

<script>
const audioFileName = localStorage.getItem("audioFileName");
const transcription = localStorage.getItem("transcription");
document.getElementById("audioFileName").innerHTML = audioFileName;
document.getElementById("transcription").innerHTML = transcription;
</script>
</body>
</html>

I have tried transcribing two small audio files which I recorded personally, one in English and the other in Hindi. Though the second audio file was recorded in Hindi, I wanted the output in English, thus testing its translation abilities too. It was highly accurate in transcribing both audio files. Across multiple runs it occasionally produced a vague or incorrect transcription, but most of the time the transcriptions were largely correct.

I am attaching the transcription screenshots below. Those are not entirely correct, but I would say it is about 85–90% correct in transcribing what I actually recorded in the audio files.

Transcription for an English Audio file
Transcription in English for a Hindi audio file

So, we have successfully built a speech-to-text web app using OpenAI’s API and Node.js. I hope you enjoyed building it and learnt something new from this article. You can also alter the parameters to play around with it and compare the results to get a better idea of what works well in which scenario.

Thank you for reading the article.

