
10 Ways BentoML Can Help You Serve and Scale Machine Learning Models
By Ahmed Besbes, November 2022



Moving from Jupyter notebooks to production is not that difficult after all

Photo by Fran Jacquier on Unsplash

If you’re a data scientist, you probably spend a lot of time developing intricate Jupyter notebooks to perform data analysis, build complex training pipelines, or compute statistics.

Jupyter notebooks are great for this and allow us to prototype ideas in no time.

But what happens once you’re done with this work and you’re satisfied with your saved ML models? 🤔

This is where you start to think about deploying them to production. Did you think this through when you started working?

Probably not. And you’re not to blame, as this is not a data scientist’s core expertise (although the industry is currently moving in that direction).

In this tutorial, I will show you how to use a Python library called BentoML to package your machine learning models and deploy them very easily.

I will first introduce you to the concept of production ML. Then, I will introduce you to the tool and cover 10 ways BentoML can make your life easier.

PS: This is not a clickbait article: each of these reasons is valid and documented. For each one, I will share the code, explanations, and my impressions.

Without further ado, let’s have a look 🔍

Once you train a model, you need to start thinking about sharing it with other teams.

If other developers on your team (e.g. backend or frontend devs) want to use it, they need to interact with an API of some sort that wraps it. This API must be clear and documented, with explicit error logging and data validation.

If the DevOps team wants to manage the deployment of your model, it needs to handle its dependencies. It typically expects, at least, a Docker image that runs and serves your model.

If the product team wants to stress-test your model or showcase it to the client, then the API must scale to many concurrent requests.

What happens next? — Image by the author

Let’s sum it up here.

Bringing an ML model to life (i.e. production) comes with a lot of constraints:

  • the use and support of multiple ML frameworks (duh!)
  • creating an API and serving it with a minimum level of performance
  • reproducibility and dependency management
  • API documentation
  • monitoring, logging, metrics, etc.

Overwhelming, isn’t it?

In the following 10 sections, we’ll discover how BentoML addresses these constraints through its concepts, useful commands, and ML-specific features.

BentoML is an end-to-end solution for model serving and deployment. It’s designed to help data scientists build production-ready endpoints with common MLOps best practices.

Is it yet another web framework?

Not exactly. BentoML packages everything you need in an ML project into a distribution format called a bento 🍱 (this is where the analogy makes sense, as a bento is originally a Japanese lunch box holding a single-portion meal that consists of a main dish and some sides).

More precisely, a bento is a file archive with all the source code of your model training and the APIs you defined for serving, the saved binary models, the data files, the Dockerfiles, the dependencies, and the additional configurations.

Everything fits together into a unit and is packaged into a standardized format.

You can think of a bento as a Docker image, but for ML.

BentoML — Image by the author

A Bento is also self-contained. This simplifies model serving and deployment to any cloud infrastructure.

When your bento is built (we’ll see what that means in the following section), you can either turn it into a Docker image that you can deploy on the cloud or use bentoctl, which relies on Terraform under the hood and deploys your bento to any supported cloud service and infrastructure (AWS Lambda or EC2, GCP Cloud Run, Azure Functions, and more).

How to deploy a Bento — Image by the author

First, you need to run: pip install bentoml

Once installed, the bentoml command is added to your shell: this will be useful in the next sections.

You typically start using BentoML once your model is done training (your training code itself is not impacted).

In fact, instead of saving your model somewhere on your filesystem, you can use BentoML to save it in a specific folder (called a model store). This helps provide a unique tag for each version of your models and ensures reproducibility.

In the following example, we save an SVC model trained on the iris dataset.
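
Here is a minimal sketch of what that could look like (the model name iris_clf is an arbitrary choice for this illustration):

import bentoml
from sklearn import datasets, svm

# Train a simple SVC classifier on the iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC(gamma="scale")
clf.fit(X, y)

# Save the model to BentoML's local model store under a chosen name
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
print(saved_model.tag)  # e.g. iris_clf:<generated_version>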

This generates a unique model tag that allows you to later fetch the corresponding model.
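
For instance, assuming the iris_clf name used above, you can reload that exact model from the store in Python, or list the stored versions from the CLI (a quick sketch):

import bentoml

# Load the latest version of the model back from the model store
model = bentoml.sklearn.load_model("iris_clf:latest")

# The stored models can also be listed from the command line with:
#   bentoml models list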

It also creates a folder named after the model tag. If you look inside this folder, you’ll find the serialized model and a YAML file named model.yaml that describes the model’s metadata.

Image by the author

Once a model is created and saved in the model store, you can turn it into an API endpoint that you can request. To do this, you first create a runner from your saved model, wrap it in a Service, and then use that service’s api decorator on your prediction function.

In the following example, the classify function, decorated with the service’s api method, runs whenever payload data (of type NumpyNdarray) is sent in an HTTP POST request to the /classify endpoint.
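
Here is a sketch of what such a service.py could look like, reusing the iris_clf model saved earlier (names are illustrative):

# service.py
import numpy as np

import bentoml
from bentoml.io import NumpyNdarray

# Create a runner from the saved model and wrap it in a Service
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    # Delegate the actual inference to the model runner
    return iris_clf_runner.predict.run(input_series)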

You can then serve the model locally by running the service with the following command:

bentoml serve service:svc --reload

This will spin up a local HTTP server that you can request from Python (see the sketch below) or through the web interface, directly accessible at http://localhost:3000.
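
A minimal sketch of such a request, sending a single iris sample to the endpoint defined above:

import requests

# POST one sample (4 features) to the /classify endpoint of the local server
response = requests.post(
    "http://localhost:3000/classify",
    headers={"Content-Type": "application/json"},
    data="[[5.1, 3.5, 1.4, 0.2]]",
)
print(response.json())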

Request through Swagger UI — Images by the author

This is where you get to build a bento and see what files it contains. This provides a standardized structure for all your projects, independently of the underlying tooling.

To build your bento, you first need to create a file called bentofile.yaml.

This file configures how the bento is built: it includes metadata, lists the source files to include, and defines the Python dependencies.
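
A minimal bentofile.yaml for the iris service above could look roughly like this (the labels and the package list are illustrative):

service: "service:svc"        # entry point: the Service object defined in service.py
labels:                       # arbitrary metadata attached to the bento
  owner: data-science-team
  project: iris-classifier
include:
  - "*.py"                    # source files to package into the bento
python:
  packages:                   # Python dependencies of the service
    - scikit-learn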

To build the bento, run the following command inside the folder that contains bentofile.yaml:

bentoml build
Image by the author

Now, if we look inside the bento, we’ll find a folder structure that contains the following:

  • The description and the schema of the API
  • A Dockerfile needed to build the Docker image
  • The Python requirements
  • The trained model as well as its metadata
  • The source code responsible for training the model and defining the API routes
  • A configuration file (bentoml.yaml) that specifies the bento build options
Image by the author

Once a bento is created, you can use the containerize command to build a Docker image from it. This is a very useful feature provided by BentoML: if you were using FastAPI, you’d have to do this manually.

BentoML provides this simple command that builds the image for you.

 bentoml containerize iris_classifier:latest
Containerize the bento — Image by the author

Once the image is built, you can check it out on your system:
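
For example, listing the local Docker images filtered by repository name should show it (the exact tag will differ on your machine):

docker images iris_classifier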

Image by the author

This Docker image is self-contained and is used to serve the bento locally or deploy it to the cloud.

docker run -it --rm -p 3000:3000 iris_classifier:jclapisz2s6qyhqa serve --production
Serve the bento from the container to the host — Image by the author

Runners are ML-specific units of computation that allow the inference process to be scaled independently from the web server.

This means that inference and web request handling run on two independent processes.

This also means that your inference pipeline can have an arbitrary number of runners and can be scaled vertically (by allocating more CPU). Each runner can also have a specific configuration (RAM, CPU vs GPU, etc.)

This architecture allows you, for example, to have a service that depends on three independent runners that handle different things.

In the following example, two runners (one that performs an OCR task, the other a text classification task) are run sequentially over an input image.
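
Here is a rough sketch of such a service; the two models (ocr_model, a PyTorch model, and text_classifier, e.g. a scikit-learn pipeline with a vectorizer) are hypothetical and assumed to already be in the model store:

# service.py
import bentoml
from bentoml.io import Image, JSON

ocr_runner = bentoml.pytorch.get("ocr_model:latest").to_runner()
text_clf_runner = bentoml.sklearn.get("text_classifier:latest").to_runner()

svc = bentoml.Service("document_pipeline", runners=[ocr_runner, text_clf_runner])

@svc.api(input=Image(), output=JSON())
def process_document(image):
    # 1. extract the text from the input image
    #    (preprocessing the PIL image into the tensor the OCR model expects is omitted)
    text = ocr_runner.run(image)
    # 2. classify the extracted text
    label = text_clf_runner.predict.run([text])
    return {"label": label[0]}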

You can learn more about runners here.

Batching refers to running a prediction on a group of N inputs instead of launching N sequential predictions.
This has clear advantages: it increases performance and throughput and takes advantage of acceleration hardware (GPUs are well known for speeding up vectorized operations).

Web frameworks such as FastAPI, Flask, or Django don’t have a built-in mechanism to handle batching.

BentoML, on the other hand, provides a nice solution for this.

It goes like this:

  • Multiple input requests are run in parallel
  • A proxy (i.e. a load balancer) distributes requests between workers (a worker is a running instance of an API server)
  • Each worker distributes the requests to the model runners that are in charge of inference
  • Each runner dynamically groups the requests in batches by finding a tradeoff between latency and throughput
  • Runners make predictions on each batch
  • Batch predictions are then split and released as individual responses

The beauty of this is that it’s completely transparent to a client sending multiple parallel requests.

Image modified by the author

To enable batching, you need to set the batchable argument to True. Example:
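
A minimal sketch, reusing the iris_clf model from earlier: the batchable flag is declared in the model’s signatures when saving it.

import bentoml

# Mark the "predict" method as batchable so the runner can group
# concurrent requests into a single batched call at serving time
bentoml.sklearn.save_model(
    "iris_clf",
    clf,  # a trained scikit-learn estimator, as in the earlier example
    signatures={
        "predict": {
            "batchable": True,
            "batch_dim": 0,  # inputs are stacked along the first axis
        }
    },
)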

Learn more about batching here.

There’s something neat about runners. You can compose them as you want to create customizable inference graphs. In the previous example, we looked at two runners running sequentially. (OCR -> Text classification).

In this example, we show that runners can also run concurrently by leveraging async requests.
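
Here is a rough sketch with two hypothetical classifiers (model_a and model_b, assumed to be in the model store) queried concurrently and combined into one response:

# service.py
import asyncio

import bentoml
from bentoml.io import JSON, NumpyNdarray

model_a_runner = bentoml.sklearn.get("model_a:latest").to_runner()
model_b_runner = bentoml.sklearn.get("model_b:latest").to_runner()

svc = bentoml.Service("parallel_inference", runners=[model_a_runner, model_b_runner])

@svc.api(input=NumpyNdarray(), output=JSON())
async def predict(input_array):
    # Fire both inference calls at the same time and wait for both results
    result_a, result_b = await asyncio.gather(
        model_a_runner.predict.async_run(input_array),
        model_b_runner.predict.async_run(input_array),
    )
    return {"model_a": result_a.tolist(), "model_b": result_b.tolist()}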

Parallel inference — Image by the author

Think about how many times you had to combine ML models into a single prediction pipeline. With BentoML, you can run these models concurrently and collect the result at the end.

How to deploy a Bento — Image by the author

The great thing about a bento is that, when it’s built, you can deploy it in two ways:

  1. By pushing the Docker image to a registry and deploying it to the cloud
  2. By using bentoctl, a utility library developed by the BentoML team to speed up the process of deployment

I strongly recommend using bentoctl. It helps deploy any built bento as a production-ready API endpoint on the cloud. It supports many cloud providers (AWS, GCP, Azure, Heroku) as well as multiple services within each one (AWS Lambda, EC2, etc.).

I’ve recently deployed a (serverless) API endpoint to AWS Lambda with a spaCy model. It worked like a charm and all I needed was a built bento.

You can go through the documentation to get this working, but in a nutshell, all you need to do is the following:

  • install bentoml
  • install terraform (check this link)
  • setup the AWS CLI and configure your credentials (see the installation guide)
  • install bentoctl (pip install bentoctl)
  • build your bento
  • install the aws-lambda operator that allows deploying on AWS Lambda (bentoctl supports other operators as well): bentoctl operator install aws-lambda
  • generate the deployment files by running bentoctl init. This step interactively asks you to configure the deployment of your Lambda function (setting the region, the memory, the timeout, etc.)
  • build the image needed for deployment by running bentoctl build
    This step prepares a Docker image and pushes it to a deployment registry.
  • deploy to Lambda by running bentoctl apply -f deployment_config.yaml 🚀. This step relies on Terraform to apply the deployment configuration

Once the deployment is done, you’ll be prompted with an API URL that you can request to interact with your model.

Screenshot by the author

To delete your Lambda function, just run bentoctl destroy -f deployment_config.yaml and you’re good to go.

When you deploy a BentoML service or serve it locally, you have access to a Swagger UI that lets you visualize and interact with the API’s resources without having any of the implementation logic in place.

This UI is generated from the OpenAPI specification, with visual documentation that makes it easier for both back-end implementation and client-side consumption.

Example of a Swagger UI — Image by the author

