Load Testing SageMaker Multi-Model Endpoints | by Ram Vegiraju


Image from Unsplash by Luis Reyes

Productionizing machine learning models is a complicated practice. There’s a lot of iteration across different model parameters, hardware configurations, and traffic patterns that you have to test before you can finalize a production-grade deployment. Load testing is an essential software engineering practice, and it’s just as crucial in the MLOps space for seeing how performant your model is under real-world conditions.

How can we load test? A simple yet highly effective framework is the Python package Locust. Locust can be run in either vanilla or distributed mode to simulate thousands of Transactions Per Second (TPS). For today’s blog we will assume a basic understanding of this package and only cover the fundamentals briefly; for a more general introduction, please reference this article.

What model/endpoint will we be testing? SageMaker Real-Time Inference is one of the best options for serving ML models on REST endpoints tailored for low-latency, high-throughput workloads. In this blog we’ll specifically take a look at an advanced hosting option known as SageMaker Multi-Model Endpoints. Here we can host thousands of models behind a single REST endpoint and specify a target model to invoke on each API call. Load testing becomes more challenging here because we are dealing with many points of invocation rather than a single model/endpoint. While it’s possible to randomly generate traffic across all models, users will often want to control which models receive more traffic. In this example, we’ll take a look at how you can distribute traffic weight across specific models so you can simulate your real-world use case as closely as possible.

NOTE: This article assumes basic knowledge of AWS and SageMaker. On the coding side, Python fluency is assumed along with a basic understanding of the Locust package. To get started with load testing single-model SageMaker endpoints with Locust, please reference this article.

Dataset Citation

In this example we’ll be using the Abalone dataset for a regression problem. The dataset is sourced from the UCI ML Repository (CC BY 4.0) and you can find the official citation here.

Creating A SageMaker Multi-Model Endpoint

Before we can get started with load testing, we have to create our SageMaker Multi-Model Endpoint. All development for the creation of the endpoint will occur on a SageMaker Notebook Instance on a conda_python3 kernel.

For this example we’ll utilize the Abalone dataset and run the SageMaker XGBoost algorithm on it to build a regression model. You can download the dataset from the publicly available Amazon sample datasets. We will use this dataset to run training and then create copies of the resulting model artifact to host behind a Multi-Model Endpoint.

#retrieve data
aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train_csv/abalone_dataset1_train.csv .

We can first kick off a training job using the built-in SageMaker XGBoost algorithm; for a full guide on this process, please reference this article.
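The training snippet below references a few variables (default_bucket, s3_prefix, training_instance_type, sagemaker_session, role, and train_input) that aren’t shown being defined in the post. A minimal setup sketch, assuming the CSV downloaded above, might look like this (the prefix and instance type are assumptions, adjust as needed):

import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# SageMaker session, execution role, and default bucket
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()

s3_prefix = "mme-xgboost"                # assumed prefix
training_instance_type = "ml.m5.xlarge"  # assumed training instance type

# Upload the Abalone CSV we just downloaded and wrap it as a training channel
train_uri = sagemaker_session.upload_data(
    "abalone_dataset1_train.csv",
    bucket=default_bucket,
    key_prefix=f"{s3_prefix}/train",
)
train_input = TrainingInput(train_uri, content_type="text/csv")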

model_path = f's3://{default_bucket}/{s3_prefix}/xgb_model'

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role,
)

xgb_train.set_hyperparameters(
    objective="reg:linear",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0,
)

xgb_train.fit({'train': train_input})

After the training job completes, we grab the generated model artifact (packaged as model.tar.gz in SageMaker) and create another copy of it so that we have two models behind our Multi-Model Endpoint. In a real-world use case these models might be trained on different datasets, and you can scale up to thousands of models behind the endpoint.

model_artifacts = xgb_train.model_data
model_artifacts  # S3 URI of the generated model.tar.gz artifact
%%sh

# Assumes the trained model.tar.gz has been copied locally first
# (e.g. aws s3 cp <model_artifacts URI> model.tar.gz)
s3_bucket='sagemaker-us-east-1-474422712127'  # replace with your own bucket

for i in {0..1}
do
    aws s3 cp model.tar.gz s3://$s3_bucket/mme-xgboost/xgboost-$i.tar.gz
done

After we’ve made these two copies, we can point our create_model Boto3 API call at the S3 prefix that contains both model artifacts.

from time import gmtime, strftime
import boto3
import sagemaker

client = boto3.client("sagemaker")

model_name = 'mme-source' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# For a Multi-Model Endpoint, ModelDataUrl is the S3 prefix that holds all of the
# model artifacts (the location we copied the two xgboost-*.tar.gz files to above)
model_url = f's3://{default_bucket}/mme-xgboost/'

print('Model name: ' + model_name)
print('Model data Url: ' + model_url)

create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": image_uri,
            "Mode": "MultiModel",
            "ModelDataUrl": model_url
        }
    ],
    ExecutionRoleArn=sagemaker.get_execution_role(),
)
print("Model Arn: " + create_model_response["ModelArn"])

We can define our instance type and count behind the endpoint in the endpoint configuration object, which we then feed to our create_endpoint API call.

#Step 2: EPC Creation
xgboost_epc_name = "mme-source" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name,
    ProductionVariants=[
        {
            "VariantName": "xgboostvariant",
            "ModelName": model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            #"Environment": {}
        },
    ],
)
print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

#Step 3: EP Creation
endpoint_name = "mme-source" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=xgboost_epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

We can validate that our endpoint works with a sample invocation using a data point from the Abalone dataset. Notice that we specify a target model for the Multi-Model Endpoint: here we pass the model.tar.gz artifact that we want to invoke.

import boto3

runtime = boto3.client("sagemaker-runtime")

resp = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0',
    ContentType='text/csv',
    TargetModel="xgboost-1.tar.gz",
)

print(resp['Body'].read())

This invoke_endpoint API call is essential, as it is the point of contact we are evaluating in our load tests. We now have a functioning Multi-Model Endpoint; let’s get to testing!

Load Testing With Locust

Let’s do a quick primer on Locust before we dive into the setup of our script. Locust is a Python framework that lets you define user behavior with Python code. Locust defines an execution as a Task; a Task in Locust is essentially the API call (in our case, the invoke_endpoint call) that we want to test. Each simulated user runs the tasks that we define for it in a Python script that we build.

Locust has a vanilla mode that uses a single process to run your tests, but when you want to scale up it also has a distributed load generation feature that lets you spread the work across multiple processes and even multiple client machines.
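As a quick refresher, a locustfile boils down to a User class whose @task methods fire request events that Locust aggregates into its statistics. A minimal, generic sketch (illustrative only; the actual script for this post is built up piece by piece below) looks like this:

import time
from locust import User, task, constant, events

class DemoUser(User):
    wait_time = constant(0)  # fire requests back to back

    @task
    def call_my_api(self):
        start = time.perf_counter()
        exception = None
        try:
            pass  # call the API under test here
        except Exception as e:
            exception = e
        # Report the result back to Locust's statistics
        events.request.fire(
            request_type="Custom",
            name="my-api",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=0,
            exception=exception,
            context={},
        )

Distributed mode then runs this same file once with --master and several times with --worker, which is exactly what the shell script later in this post does.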

In this case we want to bombard our Multi-Model Endpoint with over 1,000 TPS, so we need a powerful client machine that can generate that load. We can spin up an EC2 instance (in this case a c5d.18xlarge) and conduct our load testing there to ensure we don’t run out of client-side compute. To understand how to set up an EC2 instance, please read the following documentation. For our AMI we use the “Deep Learning AMI GPU TensorFlow 2.9.1 (Ubuntu 20.04)”; these Deep Learning AMIs come with many ML frameworks pre-installed, so I find them handy in these use cases. Note that while we are using EC2 to test and invoke our endpoint, you can use any other client as long as it has adequate compute power to handle the TPS Locust will generate.

Once you are SSH’d into your EC2 instance, we can get to defining our Locust script. We first define the Boto3 client that will make the invoke_endpoint call we are measuring. A few of these values are parameterized via a distributed shell script that we will cover later.

import os
import time

import boto3
from botocore.config import Config
from locust import task, events

# Assumed to be read from the environment variables exported by distributed.sh
region = os.environ.get("REGION", "us-east-1")
content_type = os.environ.get("CONTENT_TYPE", "text/csv")

class BotoClient:
    def __init__(self, host):

        # Consider removing retry logic to get an accurate picture of failures in Locust
        config = Config(
            retries={
                'max_attempts': 100,
                'mode': 'standard'
            }
        )

        self.sagemaker_client = boto3.client('sagemaker-runtime', config=config)
        self.endpoint_name = host.split('/')[-1]
        self.region = region
        self.content_type = content_type
        self.payload = b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0'

Now is where we get specific to Multi-Model Endpoints. We define two methods, and each method will hit one of our two target models.

#model that receives more traffic
def sendPopular(self):

    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()
    try:
        response = self.sagemaker_client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=self.payload,
            ContentType=self.content_type,
            TargetModel='xgboost-0.tar.gz'
        )
        response_body = response["Body"].read()
    except Exception as e:
        request_meta['exception'] = e

    request_meta["response_time"] = (time.perf_counter() - start_perf_counter) * 1000

    events.request.fire(**request_meta)

#model that receives rest of traffic
def sendRest(self):

    request_meta = {
        "request_type": "InvokeEndpoint",
        "name": "SageMaker",
        "start_time": time.time(),
        "response_length": 0,
        "response": None,
        "context": {},
        "exception": None,
    }
    start_perf_counter = time.perf_counter()

    try:
        response = self.sagemaker_client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=self.payload,
            ContentType=self.content_type,
            TargetModel='xgboost-1.tar.gz'
        )
        response_body = response["Body"].read()
    except Exception as e:
        request_meta['exception'] = e

    request_meta["response_time"] = (time.perf_counter() - start_perf_counter) * 1000

    events.request.fire(**request_meta)

Now, what if you have 200 models in the future: do you need a method for each? Not necessarily; you can build the target-model string to cover the models you need. For example, if you have 200 models and want 5 of them invoked by a specific method, you can set the TargetModel parameter to something like the following snippet.

f'xgboost-{random.randint(0,4)}.tar.gz'  # spreads this method's traffic across 5 models

The more fine-grained you want the distribution to be, the more methods you may have to define, but if you have a general idea that a certain set of models will receive the majority of the traffic, some string manipulation like the above will suffice.
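As a hypothetical sketch of that idea (not part of the original script), you could route a fixed share of traffic to a small set of “hot” models and spread the rest across the long tail with a single helper:

import random

def pick_target_model() -> str:
    # Hypothetical split: ~80% of calls go to the 5 "hot" models,
    # the remaining ~20% are spread uniformly across the other 195.
    if random.random() < 0.8:
        return f"xgboost-{random.randint(0, 4)}.tar.gz"
    return f"xgboost-{random.randint(5, 199)}.tar.gz"

The returned string can then be passed as the TargetModel parameter in a single invoke method instead of hardcoding one artifact per method.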

Finally, we can define each task’s weight via a decorator. Our first model is now three times more likely to receive traffic than the second.

class MyUser(BotoUser):
    # BotoUser (defined in the full locust script) is a Locust User subclass that
    # creates a BotoClient instance as self.client

    # This model is 3 times more likely to receive traffic
    @task(3)
    def send_request(self):
        self.client.sendPopular()

    @task
    def send_request_major(self):
        self.client.sendRest()

With the task decorator we can define the weights, and you can expand and adjust these depending on your traffic pattern.

Lastly, there is also a shell script defined in this repository that you can use to dial traffic up or down.

#replace with your endpoint name in format https://<<endpoint-name>>
export ENDPOINT_NAME=https://$1

export REGION=us-east-1
export CONTENT_TYPE=text/csv
export USERS=200
export WORKERS=40
export RUN_TIME=2m
export LOCUST_UI=false # set to true to use the Locust web UI

#replace with the locust script that you are testing, this is the locust script that will be used to make the InvokeEndpoint API calls
export SCRIPT=locust_script.py

#make sure you are in a virtual environment
#. ./venv/bin/activate

if $LOCUST_UI ; then
    locust -f $SCRIPT -H $ENDPOINT_NAME --master --expect-workers $WORKERS -u $USERS -t $RUN_TIME --csv results &
else
    locust -f $SCRIPT -H $ENDPOINT_NAME --master --expect-workers $WORKERS -u $USERS -t $RUN_TIME --csv results --headless &
fi

for (( c=1; c<=$WORKERS; c++ ))
do
    locust -f $SCRIPT -H $ENDPOINT_NAME --worker --master-host=localhost &
done

Here we are defining the parameters our Locust script reads, and most importantly two Locust-specific parameters: users and workers. You define a number of users, which are then distributed across the worker processes, and you can scale these up or down to reach your target TPS. For example, 200 users across 40 workers is 5 users per worker; if each user sustains, say, around 5 requests per second, the fleet lands near 1,000 TPS. We can execute our distributed test by running the following command.

./distributed.sh <endpoint_name>

Once we kick this off, we can see in our EC2 instance’s terminal that a load test is up and running.

Locust Distributed Load Test (Screenshot by Author)

Monitoring

Before we conclude, there are a few different ways you can monitor your load tests. One is via Locust itself: as seen in the screenshot above, you can track your TPS and latency live. At the end, summary results files are generated containing your end-to-end latency percentiles and TPS. To adjust the duration of the test, change the RUN_TIME variable in your distributed.sh script.
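If you want to slice those numbers yourself, the --csv results flag makes Locust write summary CSVs (for example results_stats.csv). A quick, hedged sketch for inspecting throughput and latency percentiles per request name:

import pandas as pd

stats = pd.read_csv("results_stats.csv")
# Keep the identifying column plus anything that looks like a request count, rate, or percentile
cols = [c for c in stats.columns if c == "Name" or "Request" in c or c.endswith("%")]
print(stats[cols].to_string(index=False))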

Lastly, to validate your load test results you can cross-check them against SageMaker CloudWatch metrics, which can be found in the console.

Monitor Endpoint (Screenshot by Author)

With the invocation metrics we can get an idea of request counts as well as latency numbers. With the instance metrics we can see how saturated our hardware is and whether we need to scale up or down. To fully understand how to interpret these metrics, please reference this documentation.
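If you prefer pulling these numbers programmatically instead of through the console, a hedged sketch using Boto3’s CloudWatch client might look like the following (the namespace and dimensions follow SageMaker’s documented endpoint invocation metrics; endpoint_name and the variant name come from the notebook earlier in this post):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "xgboostvariant"},
    ],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=60,               # one-minute buckets
    Statistics=["Sum"],      # invocations per minute
)
for point in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(point["Timestamp"], point["Sum"])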

Hardware Metrics (Screenshot by Author)
Invocation Metrics (Screenshot by Author)

Here we can see we’ve scaled to nearly 77,000 invocations per minute, which comes to a little over 1,000 TPS, in line with what our Locust metrics showed. It’s best practice to track these metrics at both the instance and invocation level so that you can properly define autoscaling for your hardware as necessary.

Additional Resources & Conclusion

The entire code for the example can be found at the link above. Once again, if you are new to Locust and SageMaker Real-Time Inference, I strongly recommend you check out the starter blogs linked for both. The load test scripts in this repository can easily be adapted not just for SageMaker endpoints, but for any APIs that you host and need to test. As always, any feedback is appreciated; feel free to reach out with any questions or comments, and thank you for reading!

