
Load Testing Simplified With SageMaker Inference Recommender | by Ram Vegiraju | Mar, 2023



Image from Unsplash by Amokrane Ait-Kaci

In the past I’ve written extensively about the importance of load testing your Machine Learning models before deploying them into production. For real-time inference use cases in particular, it’s essential to ensure your solution meets your target latency and throughput. We’ve also explored how we can use the Python library Locust to define scripts that simulate our expected traffic patterns.

While Locust is an incredibly powerful tool, it can be difficult to set up, and it requires many iterations across the different hyperparameters and hardware you may be testing to identify the proper configuration for production. For SageMaker Real-Time Inference, a key tool to take a look at is SageMaker Inference Recommender. Rather than repeatedly running a Locust script across different configurations, you can essentially pass in an array of EC2 instance types to test your endpoint against, as well as hyperparameters for your specific model container for more advanced deployments. In today’s blog we’ll take a look at how we can configure this feature and how it can simplify load testing SageMaker Real-Time endpoints.

NOTE: This article assumes basic knowledge of AWS, SageMaker, and Python. To understand what SageMaker Real-Time Inference is, please take a look at the following starter blog.

Setup & Locally Test Inference

For development you can utilize either a SageMaker Classic Notebook Instance or a SageMaker Studio Kernel. For our environment we utilized a TensorFlow 2.0 Kernel with Python 3 on an ml.t3.medium base instance.
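The snippets later in this post also assume that a SageMaker session, execution role, region, and Model Registry group name have already been defined. A minimal setup sketch, assuming you are running inside a SageMaker notebook environment, might look like this (the model package group name is purely illustrative):

import boto3
import sagemaker

# Minimal environment setup assumed by later snippets (names are illustrative)
sagemaker_session = sagemaker.Session()
region = boto3.Session().region_name
role = sagemaker.get_execution_role()  # IAM role attached to the notebook environment
model_package_group_name = "resnet50-image-classification"  # hypothetical model package group name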

The model we will be utilizing today is a pre-trained TensorFlow ResNet50 for image classification. We can retrieve this pre-trained model directly from Keras Applications within TensorFlow and pull it into our notebook.

import os
import tensorflow as tf
from tensorflow.keras.applications import resnet50
from tensorflow.keras import backend
import numpy as np
from tensorflow.keras.preprocessing import image

Before we get to testing on SageMaker, we want to test this model locally so we can get an idea of the input format we will need to configure for our endpoint. For our sample data point, we will use a picture of my dog Milo when he was a puppy (he’s a behemoth now).

Milo (Picture by Author)
#model = tf.keras.applications.ResNet50()
tf.keras.backend.set_learning_phase(0)
model = resnet50.ResNet50()

# Load the image file, resizing it to 224x224 pixels (required by this model)
img = image.load_img("dog.jpg", target_size=(224, 224))
# Convert the image to a numpy array
x = image.img_to_array(img)
# Add a fourth dimension since Keras expects a batch of images
x = np.expand_dims(x, axis=0)

# Scale the input image to the range used in the trained network
x = resnet50.preprocess_input(x)

print("predicting model")
predictions = model.predict(x)
predicted_classes = resnet50.decode_predictions(predictions, top=9)
print(predicted_classes)

Model Results (Screenshot by Author)

We’ve now verified the input format our model expects for inference, so we can focus on configuring it for SageMaker.

Prepare Model and Payload

SageMaker Inference Recommender expects two mandatory inputs: the model data and a sample payload. It expects both as tarballs, so we take our artifacts and compress them into a format the service will understand.

For our model we can either take the model we already loaded earlier in the notebook or instantiate a new copy. We save the model artifacts into a local directory with the structure and metadata that TensorFlow Serving expects.

export_dir = "00001"
tf.keras.backend.set_learning_phase(0)
model = tf.keras.applications.ResNet50()

if not os.path.exists(export_dir):
    os.makedirs(export_dir)
    print("Directory ", export_dir, " Created ")
else:
    print("Directory ", export_dir, " already exists")

# Save to SavedModel format
model.save(export_dir, save_format="tf", include_optimizer=False)

We can then tar this into a model.tar.gz and upload it to an S3 bucket that we can point Inference Recommender to.

!tar -cvpzf model.tar.gz ./00001

#upload data to S3
model_url = sagemaker_session.upload_data(
    path="model.tar.gz", key_prefix="resnet-model-data"
)

We then take our sample image, convert it into JSON for our model, and save it to a tarball in the same manner as the model artifacts.

import json
payload = json.dumps(x.tolist())
payload_archive_name = "payload.tar.gz"

with open("payload.json", "w") as outfile:
    outfile.write(payload)

#create payload tarball
!tar -cvzf {payload_archive_name} payload.json

#upload sample payload to S3
sample_payload_url = sagemaker_session.upload_data(
    path=payload_archive_name, key_prefix="resnet-payload"
)

Now that we have our inputs configured we can move onto the SageMaker portion of the project.

Create SageMaker Model & Track With Model Registry

SageMaker has a few objects specific to its service; an important one for us in this case is the SageMaker Model entity. This entity consists of two core pieces: model data and a container/image. The model data consists of the trained or pre-trained model artifacts that you provide in an S3 bucket. The container is essentially the serving framework for your model. In this case we can retrieve the managed SageMaker TensorFlow image, but you can also build and push your own container if your framework is unsupported by AWS Deep Learning Containers. Here we define this SageMaker Model object utilizing the SageMaker Python SDK.

import sagemaker
from sagemaker.model import Model
from sagemaker import image_uris

model = Model(
    model_data=model_url,
    role=role,
    image_uri=sagemaker.image_uris.retrieve(
        framework="tensorflow", region=region, version="2.1", py_version="py3",
        image_scope="inference", instance_type="ml.m5.xlarge"
    ),
    sagemaker_session=sagemaker_session
)

An optional step is registering your model with SageMaker Model Registry. Tracking hundreds of models can be a difficult process, and with Model Registry you can simplify model versioning and lineage so that all of your model entities live in one central place. We can register a model with the following API call.

model_package = model.register(
    content_types=["application/json"],
    response_types=["application/json"],
    model_package_group_name=model_package_group_name,
    image_uri=model.image_uri,
    approval_status="Approved",
    framework="TENSORFLOW"
)
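One thing to keep in mind: register assumes that the model package group referenced by model_package_group_name already exists. If it does not, a quick sketch for creating it with boto3 might look like the following (the description is illustrative):

import boto3

sm_client = boto3.client("sagemaker", region_name=region)

# Create the model package group that model.register() points to
# (no idempotency handling here; skip this call if the group already exists)
sm_client.create_model_package_group(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageGroupDescription="ResNet50 image classification models",
)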

We can also view the model package that we just created in the SageMaker Studio Console.

Model Package (Screenshot by Author)

In a real-world use case you may have multiple model versions within a single model package group, and you can approve the one that you choose to deploy to production.
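As a rough sketch of how that approval could be done outside of the console, the approval status of a specific version can also be flipped with boto3; this is purely illustrative here, since we already registered the model as Approved above.

# sm_client is the boto3 SageMaker client created in the earlier snippet
# Approve a specific model package version (illustrative; the ARN comes from the register() call)
sm_client.update_model_package(
    ModelPackageArn=model_package.model_package_arn,
    ModelApprovalStatus="Approved",
)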

Now that we have our Model object prepared we can run an Inference Recommender Job on this entity.

Inference Recommender Job

There are two types of Inference Recommender jobs: Default and Advanced. With a Default job, we simply pass in our sample payload along with an array of EC2 instance types that we want to test our model against. Under the hood, Inference Recommender will test your model against all of these instances and track throughput and latency for you. We can utilize the right_size API call to kick off an Inference Recommender job.

model_package.right_size(
    sample_payload_url=sample_payload_url,
    supported_content_types=["application/json"],
    supported_instance_types=["ml.c5.xlarge", "ml.c5.9xlarge", "ml.c5.18xlarge", "ml.m5d.24xlarge"],
    framework="TENSORFLOW",
)

This job will take approximately 35–40 minutes to complete as it will iterate across the different instance types that you have provided. We can then view these results in the SageMaker Studio UI.

Default Job Results (Screenshot by Author)

Here you can weight cost, latency, and throughput by importance and get the optimal hardware configuration. You can also create your endpoint directly from the console if you are happy with the performance shown by the tests.
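If you would rather pull these results into your notebook instead of the Studio UI, they can also be retrieved programmatically with boto3. The sketch below assumes you know the name of the job that right_size created (the job name shown here is hypothetical), and it simply prints a few of the returned metrics:

import boto3

sm_client = boto3.client("sagemaker", region_name=region)

# Fetch recommendations for a completed Inference Recommender job (job name is hypothetical)
response = sm_client.describe_inference_recommendations_job(
    JobName="resnet50-default-recommender-job"
)

for rec in response.get("InferenceRecommendations", []):
    endpoint_config = rec["EndpointConfiguration"]
    metrics = rec["Metrics"]
    print(
        endpoint_config["InstanceType"],
        metrics["MaxInvocations"],  # sustained invocations per minute observed during the test
        metrics["ModelLatency"],    # model latency as reported by the job
        metrics["CostPerHour"],
    )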

Lastly, if you want to test different hyperparameters for your container, this is also available through the Advanced Inference Recommender Job. Here you can specify hyperparameters that are tunable for your specific model container.

from sagemaker.parameter import CategoricalParameter
from sagemaker.inference_recommender.inference_recommender_mixin import (
    Phase,
    ModelLatencyThreshold
)

hyperparameter_ranges = [
    {
        "instance_types": CategoricalParameter(["ml.m5.xlarge", "ml.g4dn.xlarge"]),
        "OMP_NUM_THREADS": CategoricalParameter(["1", "2", "3"]),
    }
]

Along with this, you can also configure the traffic pattern for your load test; for example, you can scale up the number of simulated users across different phases.

phases = [
    Phase(duration_in_seconds=120, initial_number_of_users=2, spawn_rate=2),
    Phase(duration_in_seconds=120, initial_number_of_users=6, spawn_rate=2)
]

You can also set thresholds. For example, if you have a strict latency requirement, this can be set as a stopping condition that halts the test when a configuration is not achieving those results.

model_latency_thresholds = [
    ModelLatencyThreshold(percentile="P95", value_in_milliseconds=300)
]

You can then kick off and view the results of the Advanced Job in a similar fashion to the default job.

model_package.right_size(
    sample_payload_url=sample_payload_url,
    supported_content_types=["application/json"],
    framework="TENSORFLOW",
    job_duration_in_seconds=3600,
    hyperparameter_ranges=hyperparameter_ranges,
    phases=phases,  # TrafficPattern
    max_invocations=100,  # StoppingConditions
    model_latency_thresholds=model_latency_thresholds
)

Additional Resources & Conclusion

You can find the code for this example and more at the link above. SageMaker Inference Recommender is a powerful tool that can automate the difficult portion of load testing setup. It’s important to note, however, that at the moment there is no support for advanced hosting options such as Multi-Model and Multi-Container Endpoints, so for those use cases a third-party framework such as Locust will still be necessary. As always, any feedback is appreciated; feel free to reach out with any questions or comments, and thank you for reading!

