Deploy a Voice-Based Chatbot with BentoML, LangChain, and Gradio | by Ahmed Besbes | May, 2023

By Jessie Hobb On May 2, 2023

Here’s a one-minute demo of the app.

Video by the author — A quick demo

With the ever-increasing number of open-source ML models that solve a huge variety of tasks, software applications will gradually become some sort of AI application that integrates pre-trained models, self-trained models, or models accessed through APIs.

Given that many SOTA models are large and require powerful hardware and distributed deployment, fitting everything in one machine will not be a practical solution, especially if the application combines at least 2 or 3 models.

→ BentoML is a framework that helps solve this problem by letting users write simple Python code yet deploy models as distributed microservices.

I’ve been playing and experimenting with BentoML for a while now and it’s definitely my go-to solution to deploy machine learning models and services. With its own distribution format known as a bento, this library makes it easy to package everything ML-related into one place: source code and dependencies, API definitions, model weights, Docker image, etc.

Deploying is even easier since it comes down to pushing that bento to the cloud.
In this tutorial, we’ll first prototype the app, build the bento locally and push it to BentoCloud for deployment.

You can perform this last step on your own by self-managing a deployment platform (check the Yatai project for more details) or using a deployment utility called bentoctl that deploys your bento to a variety of cloud services.

If you want to learn more about BentoML and the different deployment strategies, you can have a look at my previous posts ⏬

The code of this project is available on GitHub. You can clone it and run the app locally or build a self-contained bento for later deployment.

We’ll use transformers with a couple of other libraries that process audio data and the popular LangChain package to easily integrate with Large Language Models (LLMs).

We’ll use poetry to manage the project’s dependencies.

git clone git@github.com:ahmedbesbes/BentoChain.git
cd BentoChain/
poetry install

After installing the packages, you’ll need to generate an SSL key and certificate. These enable HTTPS, which modern browsers require before they allow access to the microphone.

mkdir ssl
cd ssl
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -nodes
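
Before serving, it can be worth checking that the generated pair actually loads. The snippet below is a minimal sketch that uses only Python’s standard ssl module. The helper name check_cert_pair is made up for this illustration and is not part of the project.

```python
import ssl


def check_cert_pair(certfile: str, keyfile: str) -> bool:
    """Return True if the certificate and private key load together.

    Hypothetical helper for illustration; it is not part of the BentoChain repo.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        # Raises ssl.SSLError if the key does not match the certificate,
        # and OSError (e.g. FileNotFoundError) if a file is missing.
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    except (ssl.SSLError, OSError):
        return False
    return True
```

Called from the project root as check_cert_pair("ssl/cert.pem", "ssl/key.pem"), it should return True for the pair generated above.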

This project assumes no training. We will simply download the models’ weights from HuggingFace’s hub and save them as BentoML Models.

Saving the models as BentoML artifacts helps incorporate them in the bento archive so that they don’t constitute an external dependency.

After figuring out what models are needed precisely, you can first download them locally by simply initializing them.

import logging

import bentoml
from transformers import (
    SpeechT5Processor,
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

logging.basicConfig(level=logging.WARN)

if __name__ == "__main__":
    t5_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    t5_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    t5_vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
    whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
    whisper_model = WhisperForConditionalGeneration.from_pretrained(
        "openai/whisper-tiny"
    )
    whisper_model.config.forced_decoder_ids = None

Then, you can save them as BentoML models by calling the bentoml.transformers.save_model function:

    saved_t5_processor = bentoml.transformers.save_model(
        "speecht5_tts_processor", t5_processor
    )
    print(f"Saved: {saved_t5_processor}")

    saved_t5_model = bentoml.transformers.save_model(
        "speecht5_tts_model",
        t5_model,
        signatures={"generate_speech": {"batchable": False}},
    )
    print(f"Saved: {saved_t5_model}")

    saved_t5_vocoder = bentoml.transformers.save_model(
        "speecht5_tts_vocoder", t5_vocoder
    )
    print(f"Saved: {saved_t5_vocoder}")

    saved_whisper_processor = bentoml.transformers.save_model(
        "whisper_processor",
        whisper_processor,
    )
    print(f"Saved: {saved_whisper_processor}")

    saved_whisper_model = bentoml.transformers.save_model(
        "whisper_model",
        whisper_model,
    )
    print(f"Saved: {saved_whisper_model}")

The full code is available in the train.py script and should be run once:

poetry shell
python train.py

Saving models ✅ — Screenshot by the author

Before going into more detail, let’s first clarify the data workflow to understand how the app works:

The user sends an audio message to the API server over an HTTP POST request
The API server redirects the audio message to the speech2text runner that transcribes it into text and sends it back
The API server takes the transcribed text message as input, passes it through a LangChain agent, generates a response, and sends it to the text2speech runner
The text2speech runner generates an audio clip from the input text and returns it to the API server which in turn sends it back to the user

The following diagram summarizes these steps.

The architecture of the app ✅ — Image by the author
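
The steps listed above can also be sketched as plain Python. This is only an illustration of the data flow: the three stages are passed in as ordinary functions, and the fake_* stand-ins are hypothetical placeholders so the sketch runs without any model or BentoML runner.

```python
from typing import Callable, List


def handle_voice_message(
    audio: List[float],
    transcribe: Callable[[List[float]], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], List[float]],
) -> List[float]:
    """Run one request through the three stages of the workflow."""
    text = transcribe(audio)  # speech2text runner
    reply = respond(text)  # LangChain agent on the API server
    return synthesize(reply)  # text2speech runner


# Stand-in stages (hypothetical) so the sketch runs without any model.
def fake_transcribe(audio: List[float]) -> str:
    return "hello"


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> List[float]:
    # One dummy sample per character, just to return something audio-like.
    return [0.0] * len(text)


clip = handle_voice_message([0.1, 0.2], fake_transcribe, fake_respond, fake_synthesize)
print(len(clip))
```

In the real app, the first and last stages are calls to the two runners, and the middle one is the LangChain call made inside the API server.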

What’s interesting about BentoML is that, when deploying to BentoCloud (or to any self-managed platform), the two runners and the API server can each be deployed on their own Kubernetes pod, which makes three pods in total.

This provides 3 main benefits:

Separation of concerns: runners are focused on the compute and are decoupled from web serving
Customization: Each runner can have a specific hardware configuration depending on the task it’s performing: for example, the text2speech runner’s config will have a GPU while the speech2text runner won’t require one
Auto-scale: Runners will also auto-scale independently based on resource usage

To learn more about BentoML runners, have a look at this page.

Now that we have a global picture of the app, let’s focus on each runner:

The speech2text runner relies on OpenAI’s Whisper model to transcribe audio to text. Specifically, it uses the tiny checkpoint (openai/whisper-tiny).

This model will receive a tensor of input features and will generate a transcription.

The code is pretty straightforward: it just defines a Speech2TextRunnable class that inherits from bentoml.Runnable, instantiates the model and the processor, and defines the inference method.

import torch
import bentoml

s2t_processor_ref = bentoml.models.get("whisper_processor:latest")
s2t_model_ref = bentoml.models.get("whisper_model:latest")


class Speech2TextRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.processor = bentoml.transformers.load_model(s2t_processor_ref)
        self.model = bentoml.transformers.load_model(s2t_model_ref)
        self.model.to(self.device)

    @bentoml.Runnable.method(batchable=False)
    def transcribe_audio(self, tensor):
        if tensor is not None:
            predicted_ids = self.model.generate(tensor.to(self.device))
            transcriptions = self.processor.batch_decode(
                predicted_ids, skip_special_tokens=True
            )
            transcription = transcriptions[0]
            return transcription

The text2speech runner performs the opposite task: it takes text as input and generates speech, returned as a NumPy array.

Note that a device must be declared as an attribute of the Text2SpeechRunnable class to support the GPU acceleration when it’s available.

import bentoml
import torch
from datasets import load_dataset

t2s_processor_ref = bentoml.models.get("speecht5_tts_processor:latest")
t2s_model_ref = bentoml.models.get("speecht5_tts_model:latest")
t2s_vocoder_ref = bentoml.models.get("speecht5_tts_vocoder:latest")


class Text2SpeechRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.processor = bentoml.transformers.load_model(t2s_processor_ref)
        self.model = bentoml.transformers.load_model(t2s_model_ref)
        self.vocoder = bentoml.transformers.load_model(t2s_vocoder_ref)
        self.embeddings_dataset = load_dataset(
            "Matthijs/cmu-arctic-xvectors",
            split="validation",
        )
        self.speaker_embeddings = torch.tensor(
            self.embeddings_dataset[7306]["xvector"]
        ).unsqueeze(0)
        self.model.to(self.device)
        self.vocoder.to(self.device)

    @bentoml.Runnable.method(batchable=False)
    def generate_speech(self, inp: str):
        inputs = self.processor(text=inp, return_tensors="pt")
        speech = self.model.generate_speech(
            inputs["input_ids"].to(self.device),
            self.speaker_embeddings.to(self.device),
            vocoder=self.vocoder,
        )
        return speech.cpu().numpy()
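
The runner returns a raw waveform, so listening to it outside the Gradio interface requires wrapping it in an audio container. Here is a minimal sketch that encodes a sequence of float samples as a WAV file using only the standard library. The function name is hypothetical, and both the [-1.0, 1.0] sample range and the 16 kHz default rate are assumptions about the model output rather than values taken from the project.

```python
import array
import io
import sys
import wave


def float_samples_to_wav(samples, sample_rate: int = 16000) -> bytes:
    """Encode float samples in [-1.0, 1.0] as 16-bit mono PCM WAV bytes.

    The 16 kHz default is an assumption, not a value taken from the project.
    """
    pcm = array.array("h")
    for sample in samples:
        # Clip to the valid range, then scale to signed 16-bit integers.
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(clipped * 32767))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return buffer.getvalue()
```

A one-dimensional NumPy array can be passed directly as samples, since the function only iterates over it and converts each element with float().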

In this section, we will create a service that defines the API routes that can be accessed when the bento is deployed.

We start by initializing the two runners defined above:

import bentoml
import gradio as gr
from chatbot import create_block, ChatWrapper
from fastapi import FastAPI
from speech2text_runner import s2t_processor_ref, s2t_model_ref, Speech2TextRunnable
from text2speech_runner import (
    t2s_processor_ref,
    t2s_model_ref,
    t2s_vocoder_ref,
    Text2SpeechRunnable,
)

speech2text_runner = bentoml.Runner(
    Speech2TextRunnable,
    name="speech2text_runner",
    models=[s2t_processor_ref, s2t_model_ref],
)
text2speech_runner = bentoml.Runner(
    Text2SpeechRunnable,
    name="text2speech_runner",
    models=[t2s_processor_ref, t2s_model_ref, t2s_vocoder_ref],
)

Then, we create a Service object that depends on them:

svc = bentoml.Service(
    "voicegpt",
    runners=[
        text2speech_runner,
        speech2text_runner,
    ],
)

Once the service is created, we will define two API routes:

generate_text: this route will take an array as input and generate a text by calling the speech2text_runner

@svc.api(input=bentoml.io.NumpyNdarray(), output=bentoml.io.Text())
def generate_text(tensor):
    text = speech2text_runner.transcribe_audio.run(tensor)
    return text

generate_speech: this route will take a text as input and generate an array as output by calling the text2speech_runner


@svc.api(input=bentoml.io.Text(), output=bentoml.io.NumpyNdarray())
def generate_speech(inp: str):
    return text2speech_runner.generate_speech.run(inp)

We are not done yet with the service source code.

In this section, we will mount a FastAPI app as an HTTP endpoint on the “/chatbot” path.

This app will serve a Gradio chatbot interface that will interact with the two previously defined API routes: generate_text and generate_speech.

chat = ChatWrapper(generate_speech, generate_text)
app = FastAPI()
app = gr.mount_gradio_app(app, create_block(chat), path="/chatbot")
svc.mount_asgi_app(app, "/")

The “chat” variable is an object that gets the user’s audio input, transcribes it into text, passes it to LangChain, extracts the response, and returns a bunch of data that update the app’s interface and state.

The chat object is a callable that expects the following parameters:

api_key: OpenAI API key
audio_path: temporary file location when an audio file is recorded with the microphone
text_message: a text message sent instead of an audio file
history: a tuple of questions and the corresponding responses ((“Hello”, “hi”), (“How are you?”, “Fine, thank you. What about you?”))
chain: a ConversationChain object from LangChain
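
To make the shape of the history parameter concrete, here is a small sketch. It assumes the history is kept as a list of (question, response) pairs that starts out empty when nothing has been said yet. The History alias and the append_turn helper are hypothetical names used only for this illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical alias: one (question, response) pair per turn.
History = List[Tuple[str, str]]


def append_turn(history: Optional[History], question: str, answer: str) -> History:
    """Return the history with one more (question, answer) pair.

    A missing history (None) is treated as an empty conversation.
    """
    history = history or []
    history.append((question, answer))
    return history


history = append_turn(None, "Hello", "hi")
history = append_turn(history, "How are you?", "Fine, thank you. What about you?")
print(history)
```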

In the following code snippet, the ChatWrapper’s __call__ method first checks the input data. If it’s audio, it transcribes it using the generate_text method; otherwise, it keeps the text as is.

Then, it checks whether the OpenAI key is correctly loaded. If it’s not the case, it returns the message “Please paste your Open AI key.”, along with the audio transcription.

If the key is correctly loaded, the LangChain agent runs and outputs a message that is then passed to the generate_speech method to produce the output audio.

from threading import Lock
from typing import Optional, Tuple

import bentoml
from datasets import Audio, Dataset
from langchain.chains import ConversationChain

# PLAYBACK_SAMPLE_RATE is a module-level constant that is not shown in this excerpt.


class ChatWrapper:
    def __init__(self, generate_speech, generate_text):
        self.lock = Lock()
        self.generate_speech = generate_speech
        self.generate_text = generate_text
        self.s2t_processor_ref = bentoml.models.get("whisper_processor:latest")
        self.processor = bentoml.transformers.load_model(self.s2t_processor_ref)

    def __call__(
        self,
        api_key: str,
        audio_path: str,
        text_message: str,
        history: Optional[Tuple[str, str]],
        chain: Optional[ConversationChain],
    ):
        """Execute the chat functionality."""
        self.lock.acquire()
        try:
            # Default to "nothing to process" so both names are always bound.
            transcription = None
            speech = None

            if audio_path is None and text_message is not None:
                # Text input: use it as is.
                transcription = text_message
            elif audio_path is not None and text_message in [None, ""]:
                # Audio input: load it at 16 kHz and transcribe it.
                audio_dataset = Dataset.from_dict({"audio": [audio_path]}).cast_column(
                    "audio",
                    Audio(sampling_rate=16000),
                )
                sample = audio_dataset[0]["audio"]
                if sample is not None:
                    input_features = self.processor(
                        sample["array"],
                        sampling_rate=sample["sampling_rate"],
                        return_tensors="pt",
                    ).input_features
                    transcription = self.generate_text(input_features)

            if transcription is not None:
                history = history or []
                # If chain is None, that is because no API key was provided.
                if chain is None:
                    response = "Please paste your Open AI key."
                    history.append((transcription, response))
                    speech = (PLAYBACK_SAMPLE_RATE, self.generate_speech(response))
                    return history, history, speech, None, None
                # Set OpenAI key
                import openai

                openai.api_key = api_key
                # Run chain and append input.
                output = chain.run(input=transcription)
                speech = (PLAYBACK_SAMPLE_RATE, self.generate_speech(output))
                history.append((transcription, output))
        except Exception as e:
            raise e
        finally:
            self.lock.release()
        return history, history, speech, None, None
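
The input check at the top of __call__ is easier to reason about in isolation. The sketch below reproduces just that decision as a pure function; pick_input is a hypothetical name and is not part of the project. It makes one edge case visible: when both an audio path and a non-empty text message arrive together, neither branch applies.

```python
from typing import Optional, Tuple


def pick_input(
    audio_path: Optional[str], text_message: Optional[str]
) -> Optional[Tuple[str, str]]:
    """Decide which input to use, following the two branches in __call__.

    Returns ("text", message), ("audio", path), or None when neither applies.
    """
    if audio_path is None and text_message is not None:
        return ("text", text_message)
    if audio_path is not None and text_message in (None, ""):
        return ("audio", audio_path)
    return None
```

For example, pick_input(None, "hi") selects the text branch, while pick_input("/tmp/clip.wav", "") selects the audio branch.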

Remember the create_block function we saw earlier? This one takes a ChatWrapper instance as input and produces the UI.

chat = ChatWrapper(generate_speech, generate_text)
app = FastAPI()
app = gr.mount_gradio_app(app, create_block(chat), path="/chatbot")
svc.mount_asgi_app(app, "/")

Let’s break the UI into pieces to understand how the data flows exactly.

openai_api_key_textbox: This textbox expects you to paste your OpenAI key in it.

with block:
    with gr.Row():
        gr.Markdown("<h3><center>BentoML LangChain Demo</center></h3>")

        openai_api_key_textbox = gr.Textbox(
            placeholder="Paste your OpenAI API key (sk-...)",
            show_label=False,
            lines=1,
            type="password",
        )

When the user pastes their key, it is passed to the set_openai_api_key function. This function returns the loaded chain, which is stored in the app’s state. That way, the chain object is not None when it is passed to the chat object.

def set_openai_api_key(api_key: str):
    if api_key:
        os.environ["OPENAI_API_KEY"] = api_key
        chain = load_chain()
        os.environ["OPENAI_API_KEY"] = ""
        return chain


agent_state = gr.State()
openai_api_key_textbox.change(
    set_openai_api_key,
    inputs=[openai_api_key_textbox],
    outputs=[agent_state],
    show_progress=False,
)
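
The set_openai_api_key function sets the environment variable, loads the chain, and then blanks the variable again. One alternative way to express that "set, use, clean up" pattern is a context manager, which also cleans up if loading the chain raises. This is a sketch of an alternative design, not the author’s code: temporary_env is a hypothetical helper, and unlike the original it restores the previous value instead of writing an empty string.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable for the duration of a with-block."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier value, or remove the variable if there was none.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

With this helper, the body of the function could be written as a with temporary_env("OPENAI_API_KEY", api_key): block that calls load_chain() inside it.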

Here are the other UI components:

chatbot: It displays a chatbot output showing both user-submitted messages and responses.
audio: A widget that plays the user’s recorded audio clip
state: a global state of the app
audio_message: user’s submitted audio
text_message: user’s submitted text

Now, what happens when a user records audio from the microphone? (The same happens when they send a text message.)

The chat object gets executed with a list of inputs from the UI, [openai_api_key_textbox, audio_message, text_message, state, agent_state], and returns a list of outputs that update the following components: [chatbot, state, audio, audio_message, text_message].

audio_message.change(
    chat,
    inputs=[
        openai_api_key_textbox,
        audio_message,
        text_message,
        state,
        agent_state,
    ],
    outputs=[chatbot, state, audio, audio_message, text_message],
    show_progress=False,
)

To put it simply, this allows displaying the user’s questions and the bot’s answers as well as the history of the chat and the audio of the last response.
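
Gradio assigns the values returned by the callback to the output components by position. The sketch below pairs the five returned values with the five component names to show which widget receives what. The history and speech values are placeholders invented for the example, including the sample rate.

```python
# Names of the output components, in the order passed to `outputs=[...]`.
output_components = ["chatbot", "state", "audio", "audio_message", "text_message"]

# Example return value, shaped like `history, history, speech, None, None`.
history = [("Hello", "hi")]
speech = (16000, [0.0, 0.1])  # (sample rate, samples); placeholder values
returned = (history, history, speech, None, None)

# Pair each returned value with the component at the same position.
updates = dict(zip(output_components, returned))
for name, value in updates.items():
    print(f"{name} <- {value!r}")
```

Read this way, the chatbot display and the state both receive the updated history, the audio widget receives the spoken response, and the two input widgets receive None, which presumably resets them for the next message.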

To serve the app locally, run the following command:

poetry shell 
bentoml serve service:svc --reload --ssl-certfile ssl/cert.pem --ssl-keyfile ssl/key.pem

This starts a SwaggerUI from which you can try the two endpoints.
This also serves the Gradio app on the “/chatbot” path.
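
Besides the Swagger UI, the routes can be called from a script. Below is a minimal sketch using only the standard library. It assumes the server listens on https://localhost:3000 (BentoML’s default port) and that the route is exposed under the function name, /generate_speech. Both function names are hypothetical. Because the certificate is self-signed, the sketch disables certificate verification, which is only acceptable against a local development server.

```python
import ssl
import urllib.request


def build_speech_request(
    text: str, base_url: str = "https://localhost:3000"
) -> urllib.request.Request:
    """Build a POST request for the generate_speech route.

    The base URL and route path are assumptions; adjust them to your setup.
    """
    return urllib.request.Request(
        url=f"{base_url}/generate_speech",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def call_generate_speech(text: str) -> bytes:
    """Send the request and return the raw response body."""
    # Self-signed certificate: skip verification (local testing only).
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = build_speech_request(text)
    with urllib.request.urlopen(request, context=context) as response:
        return response.read()
```

With the server running, call_generate_speech("Hello there") should return the serialized array produced by the text2speech runner, most likely as JSON.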