Anatomy of LLM-Based Chatbot Applications: Monolithic vs. Microservice Architectural Patterns | by Marie Stephen Leo | May, 2023
A Practical Guide to Building Monolithic and Microservice Chatbot Applications with Streamlit, Huggingface, and FastAPI
With the advent of OpenAI’s ChatGPT, chatbots are exploding in popularity! Every business seeks ways to incorporate ChatGPT into its customer-facing and internal applications. Further, with open-source chatbots catching up so rapidly that even Google engineers seem to conclude they and OpenAI have “no moat,” there’s never been a better time to be in the AI industry!
As a Data Scientist building such an application, one of the critical decisions is choosing between a monolithic and microservices architecture. Both architectures have pros and cons; ultimately, the choice depends on the business’s needs, such as scalability and ease of integration with existing systems. In this blog post, we will explore the differences between these two architectures with live code examples using Streamlit, Huggingface, and FastAPI!
First, create a new conda environment and install the necessary libraries.
# Create and activate a conda environment
conda create -n hf_llm_chatbot python=3.9
conda activate hf_llm_chatbot

# Install the necessary libraries
pip install streamlit streamlit-chat "fastapi[all]" "transformers[torch]"
Monolithic architecture is an approach that involves building the entire application as a single, self-contained unit. This approach is simple and easy to develop but can become complex as the application grows. All application components, including the user interface, business logic, and data storage, are tightly coupled in a monolithic architecture. Any change made to one part of the app can have a ripple effect on the entire application.
Let’s use Huggingface and Streamlit to build a monolithic chatbot application below. We’ll use Streamlit to build the frontend user interface, while Huggingface provides an extremely easy-to-use, high-level abstraction to various open-source LLM models called pipelines.
First, let’s create a file utils.py containing three helper functions common to the front end in monolithic and microservices architectures.
- clear_conversation(): This function deletes all the stored session_state variables in the Streamlit frontend. We use it to clear the entire chat history and start a new chat thread.
- display_conversation(): This function uses the streamlit_chat library to create a beautiful chat interface frontend with our entire chat thread displayed on the screen from the latest to the oldest message. Since the Huggingface pipelines API stores past_user_inputs and generated_responses in separate lists, we also use this function to create a single interleaved_conversation list that contains the entire chat thread so we can download it if needed.
- download_conversation(): This function converts the whole chat thread to a pandas dataframe and downloads it as a CSV file to your local computer.
# %%writefile utils.py
from datetime import datetime

import pandas as pd
import streamlit as st
from streamlit_chat import message


def clear_conversation():
    """Clear the conversation history."""
    if (
        st.button("🧹 Clear conversation", use_container_width=True)
        or "conversation_history" not in st.session_state
    ):
        st.session_state.conversation_history = {
            "past_user_inputs": [],
            "generated_responses": [],
        }
        st.session_state.user_input = ""
        st.session_state.interleaved_conversation = []


def display_conversation(conversation_history):
    """Display the conversation history in reverse chronology."""
    st.session_state.interleaved_conversation = []
    for idx, (human_text, ai_text) in enumerate(
        zip(
            reversed(conversation_history["past_user_inputs"]),
            reversed(conversation_history["generated_responses"]),
        )
    ):
        # Display the messages on the frontend
        message(ai_text, is_user=False, key=f"ai_{idx}")
        message(human_text, is_user=True, key=f"human_{idx}")
        # Store the messages in a list for download
        st.session_state.interleaved_conversation.append([False, ai_text])
        st.session_state.interleaved_conversation.append([True, human_text])


def download_conversation():
    """Download the conversation history as a CSV file."""
    conversation_df = pd.DataFrame(
        reversed(st.session_state.interleaved_conversation), columns=["is_user", "text"]
    )
    csv = conversation_df.to_csv(index=False)
    st.download_button(
        label="💾 Download conversation",
        data=csv,
        file_name=f"conversation_{datetime.now().strftime('%Y%m%d%H%M%S')}.csv",
        mime="text/csv",
        use_container_width=True,
    )
Next, let’s create a single monolith.py file containing our entire monolithic application.
- OpenAI’s ChatGPT API costs money for every token in both the question and response. Hence for this small demo, I chose to use an open-source model from Huggingface called “facebook/blenderbot-400M-distill”. You can find the entire list of over 2000 open-source models trained for the conversational task at the Huggingface model hub. For more details on the conversational task pipeline, refer to Huggingface’s official documentation. When open-source models inevitably catch up to the proprietary models from OpenAI and Google, I’m sure Huggingface will be THE platform for researchers to share those models, given how much they’ve revolutionized the field of NLP over the past few years!
- main(): This function builds the frontend app’s layout using Streamlit. We’ll have a button to clear the conversation and one to download it. We’ll also have a text box where the user can type their question; upon pressing enter, we call the monolith_llm_response function with the user’s input. Finally, we display the entire conversation on the front end using the display_conversation function from utils.
- monolith_llm_response(): This function is responsible for the chatbot logic using Huggingface pipelines. First, we create a new Conversation object and initialize it with the entire conversation history up to that point. Then, we add the latest user_input to that object, and finally, we pass this conversation object to the Huggingface pipeline that we created two steps back. Huggingface automatically adds the user input and the generated response to the conversation history!
# %%writefile monolith.py
import streamlit as st
import utils
from transformers import Conversation, pipeline

# https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/pipelines#transformers.Conversation
chatbot = pipeline(
    "conversational", model="facebook/blenderbot-400M-distill", max_length=1000
)


@st.cache_data()
def monolith_llm_response(user_input):
    """Run the user input through the LLM and return the response."""
    # Step 1: Initialize the conversation history
    conversation = Conversation(**st.session_state.conversation_history)
    # Step 2: Add the latest user input
    conversation.add_user_input(user_input)
    # Step 3: Generate a response
    _ = chatbot(conversation)
    # User input and generated response are automatically added to the conversation history
    # print(st.session_state.conversation_history)


def main():
    st.title("Monolithic ChatBot App")
    col1, col2 = st.columns(2)
    with col1:
        utils.clear_conversation()

    # Get user input
    if user_input := st.text_input("Ask your question 👇", key="user_input"):
        monolith_llm_response(user_input)

    # Display the entire conversation on the frontend
    utils.display_conversation(st.session_state.conversation_history)

    # Download conversation code runs last to ensure the latest messages are captured
    with col2:
        utils.download_conversation()


if __name__ == "__main__":
    main()
That’s it! We can run this monolithic application with streamlit run monolith.py and interact with it in a web browser! We could also quickly deploy this application as-is to a cloud service like Google Cloud Run, as described in my previous blog post, and interact with it over the internet!
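The previous blog post walks through the Cloud Run deployment in detail. As a rough, hypothetical sketch (the base image, port, and dependency list here are my assumptions, not the exact file from that post), a Dockerfile for the monolith might look like this:

```dockerfile
# Hypothetical Dockerfile for the monolithic app (assumed base image and port)
FROM python:3.9-slim
WORKDIR /app
COPY . .
RUN pip install streamlit streamlit-chat "transformers[torch]"
# Cloud Run routes traffic to the port your server listens on (8080 by default)
CMD ["streamlit", "run", "monolith.py", "--server.port", "8080", "--server.address", "0.0.0.0"]
```

Binding to 0.0.0.0 matters here because Cloud Run sends requests to the container from outside localhost.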
Microservices architecture is an approach that involves breaking down the application into smaller, independent services. Each application component, such as the user interface, business logic, and data storage, is developed and deployed independently. This approach offers flexibility and scalability as we can modularly add more capabilities and horizontally scale each service independently of others by adding more instances.
Let’s split the Huggingface model inference from our monolithic app into a separate microservice using FastAPI and the Streamlit frontend into another microservice below. Since the backend in this demo only has the LLM model, our backend API server is the same as the LLM model from the picture above. We can directly re-use the utils.py file we created above in the frontend microservice!
First, let’s create a backend.py file that will serve as our FastAPI microservice that runs the Huggingface pipeline inference.
- We first create the pipeline object with the same model we chose earlier, “facebook/blenderbot-400M-distill”.
- We then create a ConversationHistory Pydantic model so that we can receive the inputs required for the pipeline as a payload to the FastAPI service. For more information on the FastAPI request body, please look at the FastAPI documentation.
- It’s a good practice to reserve the root route in APIs for a health check, so we define that route first.
- Finally, we define a route called /chat, which accepts the API payload as a ConversationHistory object and converts it to a dictionary. Then we create a new Conversation object and initialize it with the conversation history received in the payload. Next, we add the latest user_input to that object and pass this conversation object to the Huggingface pipeline. Finally, we return the latest generated response to the front end.
# %%writefile backend.py
from typing import List, Optional

from fastapi import FastAPI
from pydantic import BaseModel, Field
from transformers import Conversation, pipeline

app = FastAPI()

# https://huggingface.co/docs/transformers/v4.28.1/en/main_classes/pipelines#transformers.Conversation
chatbot = pipeline(
    "conversational", model="facebook/blenderbot-400M-distill", max_length=1000
)


class ConversationHistory(BaseModel):
    past_user_inputs: Optional[List[str]] = []
    generated_responses: Optional[List[str]] = []
    user_input: str = Field(example="Hello, how are you?")


@app.get("/")
async def health_check():
    return {"status": "OK!"}


@app.post("/chat")
async def llm_response(history: ConversationHistory) -> str:
    # Step 0: Receive the API payload as a dictionary
    history = history.dict()
    # Step 1: Initialize the conversation history
    conversation = Conversation(
        past_user_inputs=history["past_user_inputs"],
        generated_responses=history["generated_responses"],
    )
    # Step 2: Add the latest user input
    conversation.add_user_input(history["user_input"])
    # Step 3: Generate a response
    _ = chatbot(conversation)
    # Step 4: Return the last generated result to the frontend
    return conversation.generated_responses[-1]
We can run this FastAPI app locally using uvicorn backend:app --reload, or deploy it to a cloud service like Google Cloud Run, as described in my previous blog post, and interact with it over the internet! You can test the backend using the API docs that FastAPI automatically generates at the /docs route by navigating to http://127.0.0.1:8000/docs.
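Alternatively, you can smoke-test the /chat route with a short requests script. The seeded history and question below are made up for illustration, and the script assumes the backend is running locally on port 8000:

```python
import requests

# Assumed local URL of the backend started with `uvicorn backend:app --reload`
api_url = "http://127.0.0.1:8000/chat"

# The payload mirrors the fields of the ConversationHistory Pydantic model
payload = {
    "past_user_inputs": ["Hello, how are you?"],
    "generated_responses": ["I'm doing well, thanks for asking!"],
    "user_input": "What do you like to eat?",
}


def chat(url: str, payload: dict) -> str:
    """POST the conversation history and return the latest generated response."""
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()  # FastAPI serializes the returned string as JSON


if __name__ == "__main__":
    print(chat(api_url, payload))
```

Each call sends the full history, so the stateless backend can be scaled horizontally without any session affinity.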
Finally, let’s create a frontend.py file that contains the frontend code.
- main(): This function is nearly identical to main() in the monolithic application, except that we call the microservice_llm_response() function when the user enters any input.
- microservice_llm_response(): Since we split the LLM logic into a separate FastAPI microservice, this function uses the conversation history stored in the session_state to post a request to the backend FastAPI service, then appends both the user’s input and the backend’s response to the conversation history to preserve the memory of the entire chat thread.
# %%writefile frontend.py
import requests
import streamlit as st
import utils

# Replace with the URL of your backend
app_url = "http://127.0.0.1:8000/chat"


@st.cache_data()
def microservice_llm_response(user_input):
    """Send the user input to the LLM API and return the response."""
    payload = st.session_state.conversation_history
    payload["user_input"] = user_input
    response = requests.post(app_url, json=payload)
    # Manually add the user input and generated response to the conversation history
    st.session_state.conversation_history["past_user_inputs"].append(user_input)
    st.session_state.conversation_history["generated_responses"].append(response.json())


def main():
    st.title("Microservices ChatBot App")
    col1, col2 = st.columns(2)
    with col1:
        utils.clear_conversation()

    # Get user input
    if user_input := st.text_input("Ask your question 👇", key="user_input"):
        microservice_llm_response(user_input)

    # Display the entire conversation on the frontend
    utils.display_conversation(st.session_state.conversation_history)

    # Download conversation code runs last to ensure the latest messages are captured
    with col2:
        utils.download_conversation()


if __name__ == "__main__":
    main()
That’s it! We can run this frontend application with streamlit run frontend.py and interact with it in a web browser! As described in my previous blog post, we could also quickly deploy it to a cloud service like Google Cloud Run and interact with it over the internet!
The answer depends on the requirements of your application. A monolithic architecture can be a great starting point for a Data Scientist to build an initial proof-of-concept quickly and get it in front of business stakeholders. But if you plan to productionize the application, a microservices architecture is generally a better bet than a monolithic one because it allows for more flexibility and scalability and lets different specialized developers focus on building the various components. For example, a frontend developer might use React to build the frontend, a Data Engineer might use Airflow to write the data pipelines, and an ML engineer might use FastAPI or BentoML to deploy the model serving API with custom business logic.
Additionally, with microservices, chatbot developers can easily incorporate new features or change existing ones without affecting the entire application. This level of flexibility and scalability is crucial for businesses that want to integrate the chatbot into existing applications. Dedicated UI/UX, data engineers, data scientists, and ML engineers can each focus on their areas of expertise to deliver a polished product!
In conclusion, monolithic and microservices architectures have pros and cons, and the choice between the two depends on the business’s specific needs. However, I prefer microservices architecture for chatbot applications due to its flexibility, scalability, and the fact that I can delegate frontend development to more qualified UI/UX folk 🤩.