Common Ways to Reduce ML Prediction Latency to Sub X ms

By Moussa Taifi PhD

Common ways to reduce ML prediction latency. Image by author

Machine Learning (ML) systems don’t exist until they are deployed.

Unfortunately, prediction latency is one of those rough edges that hurt badly.

And, it hurts too late in the product cycle.

Stop optimizing that model! Focus on ML serving latency first.

FOCUS ON THE ML SERVING LATENCY FIRST.

THAT’S WHAT THE CLIENT SEES FIRST.

So what are some common ways to reduce ML Latency?

Here is one way to organize the known patterns of low-latency ML serving; I hope it helps:

(High resolution diagram here)

Architecture patterns to fight ML predictive latency. Image by author. (High resolution diagram here)
  • I. The Not-So-Rare Case of Excessive ML latency
  • II. Online vs Offline and Real-time vs Batch
  • — II.a. What Do Offline Prediction Pipelines Look Like?
  • — II.b. Online Prediction Is Where Latency Really Hurts
  • III. Asynchronous Online Predictions
  • — III.a Option 1: Push
  • — III.b Option 2: Poll
  • IV. Synchronous Online Predictions
  • — IV.a. The Standard ML API
  • — IV.b. Serving vs Constructing Predictions
  • V. Only So Much Can Be Done To Optimize The Model Itself
  • — V.a. Supporting Model Components
  • — V.b. Core Model Components
  • VI. Features Are The Real Drag In This Whole Operation
  • — VI.a. The “Easy” Case With Static Ref Features
  • — VI.b. The “Less-Easy” Case With Dynamic Real-time Features
  • VII. Don’t Forget The Predictions!
  • — VII.a. Precomputing Predictions
  • — — VII.a.i If You Can Use Entities, Use Them!…
  • — VII.b. Caching Predictions
  • — — VII.b.i The Very Special Case of Real-time Similarity Matching
  • — — VII.b.ii Remember that Predicting on Combinations of Feature Values Gets Expensive Quickly

I. The Not-So-Rare Case of Excessive ML Latency

Latency is an all too common issue with ML. Don’t let that be the case with your code. So how do we minimize the prediction serving latency of ML systems? Here are some critical questions to ask yourself before starting your next ML project:

  • Does the prediction need to come back in under 100ms, or can it be computed offline?
  • Do you know your approximate “optimizing” and “satisficing” metrics thresholds?
  • Did you verify that your input features can be looked up in a low-read-latency DB?
  • Could you find anything that can be precomputed and cached?

The image below presents several ways to answer these questions. The following sections then discuss specific applications of them.

Common ways to reduce ML prediction latency, “ranked”. Image by author

II. Online vs Offline and Real-time vs Batch

ML predictions come in two flavors:

  1. Offline predictions: This is used when you need to score a whole batch of data entries, and you have quite a bit of time before having to serve the predictions. In this case, you generate predictions based solely on historical data. Things like generating promotion campaigns “offline” for customers that we think will churn out of our service soon. Not that it’s easy, but you got some time before you need to return the predictions.
  2. Online predictions: This is used to generate predictions on the fly. As new requests come in, the service uses the current context+historical information to generate predictions. That said, context is a broad term. Think of it as a combination of things like the current date+time, the last-N items viewed in a session, the contents of a new shopping basket, user device type/location, and any other helpful pieces of information that are not neatly organized in a historical data warehouse.

II.a. What Do Offline Prediction Pipelines Look Like?

In offline use cases, you don’t score one data point at a time. Instead, you collect many data points in a suitable storage location and generate predictions for all the target data points at once. This usually takes the form of a scheduled batch job, the frequency of which matches the needs of the business.

Some use cases that benefit from this mode of operation:

  • You need to optimize a store inventory. Your prediction job runs at 6:45AM and estimates how much chicken the gyro restaurant will sell, so that you can go pick some up from the local Sam’s Club.
  • You need to understand audiences by building weekly customer segments that group users around beneficial characteristics, to find out who your best customers are. Yes, groups.
  • You need to determine if your customers are delighted. Because people talk about your sandwiches, and they are getting greasier. So you tap into social media to find out.

One way to generate offline batch predictions is to use a “standard” ETL pipeline that includes the smarts to generate the predictions.

Check out this next diagram:

“Standard” offline scoring ETL pipeline. Image by author

The diagram above goes like this:

  1. Upload the data to be scored to your storage.
  2. Preprocess the data into something the model can consume.
  3. Score the preprocessed data.
  4. Store the scores somewhere accessible to the end-users.
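
To make the four steps above concrete, here is a minimal batch-scoring sketch. The parquet paths, column names, and the pandas/joblib/scikit-learn stack are illustrative assumptions, not a prescription from the article:

```python
# Minimal offline batch-scoring job: load -> preprocess -> score -> store.
# The parquet paths, column names, and joblib/scikit-learn stack are
# illustrative assumptions.
import joblib
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Turn raw records into the feature matrix the model expects.
    df = df.fillna(0)
    return df[["feature_a", "feature_b", "feature_c"]]  # hypothetical features

def run_batch_scoring(input_path: str, model_path: str, output_path: str) -> None:
    raw = pd.read_parquet(input_path)                    # 1. load the data to be scored
    features = preprocess(raw)                           # 2. preprocess into model input
    model = joblib.load(model_path)                      # trained binary classifier (assumed)
    raw["score"] = model.predict_proba(features)[:, 1]   # 3. score the batch
    raw[["entity_id", "score"]].to_parquet(output_path)  # 4. store for end-users

if __name__ == "__main__":
    run_batch_scoring("to_score.parquet", "model.joblib", "scores.parquet")
```

Because this runs on a schedule, total wall-clock time matters far less here than it will in the online cases below.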

For the kind of processing outlined above, no one cares if it takes 1 hour or 10 hours. However, since online predictions are those most affected by latency, let’s dig into those deeper.

II.b. Online Prediction Is Where Latency Really Hurts

Image by author

For online predictions, the caller sends us a single data point to score. That caller expects the prediction to come back in less than 10ms. Here are some use cases that have that latency requirement:

  • Generating ad recommendations for an ad request when the browser loads a page.
  • Optimizing a bid in a competitive real-time bidding ad marketplace.
  • Predicting if a critical piece of equipment will fail in the next few seconds (based on sensor data).
  • Predicting the grocery delivery time based on the size of the order, the current traffic situation, and other contextual information about the order.

III. Asynchronous Online Predictions

Asynchronous predictions represent scenarios in which the caller asks for a prediction, but the generated prediction is delivered later. The caller does not have to block while waiting for the prediction to come back.

Image by author

There are two main ways to perform asynchronous online predictions:

III.a. Option 1: Push. The caller sends the required data to generate the predictions but does not wait for the response. For example, when using your credit card, you don’t want to wait for a fraud check response for every transaction. Normally, the bank will push a message to you if they find a fraudulent transaction.

For example, in a customer churn prediction system, the ML system tries to keep users who are likely to quit our paid meal-delivery subscription service:

  • Caller: “Hey, I have this logged-in user. She is currently active but messing with her subscription settings. Should we send her a new retention promotion email before she quits our amazing service?”
  • Model: “Let me think about it. I’ll send you a push notification if we need to.”
  • …time…passes…
  • Model: “ Hey Caller, send this promotion email for 3 free meals to the customer so she keeps her paid subscription to our service.”

III.b. Option 2: Poll. The caller sends the required data and then periodically checks if a prediction is available. The models are set up to generate predictions and store the predictions in a read-optimized low latency DB.

  • Caller: “Hey, I have this logged-in user. She is currently active but messing with her subscription settings. Should we send her a new retention promotion email before she quits our amazing service?”
  • Model: “Let me think about it. I’ll update the prediction in the Predictions DB.”
  • Caller: “Any news?”
  • Caller: “Any news?”
  • Caller: “Any news?”
  • Caller: “Ah, OK, I see that we should send this person a retention promotion email. Thanks.”

For async online predictions, a typical data flow is illustrated below:

Pipeline pattern for generating asynchronous predictions using event messaging and stream processing. Image by author

In the async case, the messaging system helps choreograph the process as follows:

  1. The client talks to the facade prediction API.
  2. The prediction API service sends the available data to an input messaging system.
  3. The event is then cleaned and dynamically enriched by the event processing service. Then the prediction request is forwarded to the scoring service.
  4. The event processing system receives the predictions. After formatting, filtering, and post-processing the predictions, it writes the predictions into an output scored-event messaging system.
  5. Then either the Prediction API keeps polling to get the scores, or there is a method to push notifications to the prediction API.
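
Here is a minimal sketch of the poll variant of this flow. Redis standing in for both the input messaging system and the read-optimized predictions store, and the `score_event` callback standing in for the event-processing and scoring services, are assumptions for illustration:

```python
# Async poll pattern: the caller enqueues a request and later polls a
# read-optimized store for the result. Redis and score_event() are
# illustrative assumptions, not the article's prescribed stack.
import json
import time
import uuid

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def submit_prediction_request(payload: dict) -> str:
    """Facade API: push the raw event onto the input stream, return a request id."""
    request_id = str(uuid.uuid4())
    r.lpush("input_events", json.dumps({"request_id": request_id, **payload}))
    return request_id

def scoring_worker(score_event) -> None:
    """Event processor: clean/enrich the event, score it, write the prediction."""
    while True:
        _, raw = r.brpop("input_events")
        event = json.loads(raw)
        prediction = score_event(event)  # calls the deployed model
        r.setex(f"prediction:{event['request_id']}", 3600, json.dumps(prediction))

def poll_prediction(request_id: str, timeout_s: float = 5.0):
    """Caller: periodically check the predictions DB ('Any news?')."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        value = r.get(f"prediction:{request_id}")
        if value is not None:
            return json.loads(value)
        time.sleep(0.1)
    return None  # caller decides what to do when no prediction arrives in time
```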

IV. Synchronous Online Predictions

IV.a. The Standard ML API

Synchronous online predictions are more latency-sensitive than the asynchronous mode. The basic pattern for synchronous online predictions is:

Serving pattern for generating Sync predictions using an ML gateway. Image by author
  1. The model is deployed as an HTTP REST API.
  2. The online application sends an HTTP request to the model, and blocks, waiting for the predictions to come back “immediately” (e.g., <10ms).
  3. The model starts generating the prediction and sends it back to the caller as soon as the prediction is available.
  4. If the model does not respond within the latency budget (it takes too long), the caller times out, and the online application says something like “prediction took too long, try again later”.
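
A minimal sketch of this synchronous pattern is shown below. FastAPI, the joblib-loaded model, and the feature names are illustrative assumptions; the point is that the caller blocks and enforces its own timeout:

```python
# Minimal synchronous prediction endpoint. FastAPI, the feature names, and the
# joblib-loaded model are illustrative assumptions.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at startup, not per request

class PredictionRequest(BaseModel):
    feature_a: float
    feature_b: float

@app.post("/predict")
def predict(req: PredictionRequest) -> dict:
    score = float(model.predict([[req.feature_a, req.feature_b]])[0])
    return {"score": score}

# A caller enforcing its latency budget (e.g. with the requests library):
#   requests.post("http://ml-gateway/predict", json=payload, timeout=0.010)
# raises a timeout error if the prediction does not come back within ~10ms.
```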

But that’s not all of course. The ML service has to do more tasks.

Namely, generate the prediction, preprocess and enrich the input request, and post-process the output prediction before giving it back to the caller. Even all that would still be manageable with a standard ML gateway that would orchestrate the whole thing.

However, because the ML gateway can be hit by spikes in traffic, the following tasks also become part of your job:

  1. Securing the endpoint
  2. Load balancing the prediction traffic
  3. Auto-scaling the number of ML gateways

Have fun.

IV.b. Serving vs Constructing Predictions

Image by author

We expect synchronous online predictions to return immediately. However, it is essential to realize that reducing latency relies on optimizing two distinct levels:

  1. Prediction construction – This is where you reduce the time it takes a model to construct predictions from a fully formed, well-behaved, enriched, and massaged prediction request.
  2. Prediction serving – This is where the rest of the latency lives. This includes any pre-computing, pre-processing, enriching, massaging of input prediction events as well as any post-processing, caching, and optimizing the delivery of the output predictions.

At the prediction construction level, the optimization focuses on building smaller models with minimal extra cruft and selecting the proper hardware to generate the predictions at the right price/latency point.

At the prediction serving level, the focus is on structuring the supporting historical datasets in quick-enough data stores and computing real-time contextual dynamic features. This usually comes after painfully realizing that reducing the latency of the prediction construction will not move the needle when there are ten other steps before and after involved in serving the fully-functional predictive service.

Finally, as we will discuss, if you are pre-computing predictions for feature combos and caching them for online serving, reducing the model prediction latency by 50% will not be as valuable as a similar reduction in latency for feature and prediction fetching.

Image by author

V. Only So Much Can Be Done To Optimize The Model Itself

V.a. Supporting Model Components

The first step towards reducing latency at the model level is removing any extra model cruft. During model development, experimentation, and tuning, it’s common to add supporting components such as logging, hooks, multiple heads, monitoring, ensembling, and pipeline-transformer code paths to help debug the model. That tooling is helpful during model training, evaluation, and debugging, but it adds accidental complexity to the core model. It goes without saying that removing those components won’t hurt your model’s predictive performance but will improve the latency of predictions.

V.b. Core Model Components

The next step is to look at the core components of your model and decide what needs to go. But what can guide you in this process? The key is to understand the tradeoff between the model’s optimizing metric vs. the satisficing metric.

We usually care about the optimizing metric during the development phase: the model’s predictive ability. That’s your usual MAP, MRR, Accuracy, Precision, MSE, Log Loss, etc. Each new high or new low, in the corresponding direction of each metric, is a win in the offline world.

However, the optimizing metric has to be balanced with the satisficing metric. The satisficing metric cares about the context where this model will run. For example:

  • Is the model going to fit on my device in terms of storage size?
  • Can the model run with the type of CPUs on the device? Does it require GPUs?
  • Can the feature preprocessing finish within specific time bounds?
  • Does the model prediction satisfy the latency limits that our use case requires?

The idea here is to pick an upper bound for the satisficing metric, say 50 milliseconds latency, and use that to filter out models that need more time.

Then comes the part where you start messing with the model. The primary pointers are:

  • The smaller the model, the faster its response time.
  • The lower the number of input features, the faster the response time.

To reduce the size of the model, a couple of options are available:

  • Trim the number of levels in a tree model
  • Trim the number of trees in a random forest and gradient boosting tree model
  • Trim the number of layers in a neural network
  • Trim the number of variables in a logistic regression model

Your task here is to balance the prediction effectiveness with the latency requirements. Here is some guidance on how to do that:

1. Set the satisficing metric threshold.

2. Increase the complexity of the model until it hits the satisficing metric bound.

3. If the model latency meets the requirements, but the predictive effectiveness stays below the optimizing metric requirements, then ask yourself if your app can live with that.

4. If the predictive performance is not acceptable, then either experiment with a lighter model type or re-evaluate the optimizing metric requirements for your app.
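
As a rough sketch of that loop, the snippet below sweeps model size, measures single-row prediction latency against a satisficing bound, and keeps the most accurate model that fits. scikit-learn, the synthetic dataset, and the 50ms threshold are illustrative assumptions:

```python
# Sweep model complexity and keep the most accurate model that still meets the
# satisficing latency bound. The dataset, model family, and threshold are
# illustrative assumptions.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

LATENCY_BUDGET_MS = 50.0  # satisficing metric threshold
best = None

for n_trees in (10, 50, 100, 300):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)

    # Measure single-row prediction latency, the quantity the caller feels.
    single_row = X_test[:1]
    start = time.perf_counter()
    for _ in range(100):
        model.predict(single_row)
    latency_ms = (time.perf_counter() - start) / 100 * 1000

    acc = accuracy_score(y_test, model.predict(X_test))  # optimizing metric
    if latency_ms <= LATENCY_BUDGET_MS and (best is None or acc > best[1]):
        best = (n_trees, acc, latency_ms)

print("chosen model (n_trees, accuracy, latency_ms):", best)
```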

To add to the confusion, two more computational dimensions need to be evaluated:

  1. Try to use custom hardware, such as GPUs or specific inference chips.
  2. Try to use custom compilation methods to optimize the model components.

However, optimizing the model only goes so far because, sooner or later, you realize that a 10% improvement in the latency of the model construction will get crushed by the distributed I/O operations for pre-processing input features, post-processing predictions, and delivering the predictions. For example, if your total latency is 10ms for the prediction construction and 90ms for the prediction serving, then saving 10% on the 10ms is not going to help, since you still have a monster 90ms to deal with in the prediction serving.

Reducing the prediction serving latency is what we cover next.

Image by author

VI. Features Are The Real Drag In This Whole Operation

Features are the lifeblood of models. But unfortunately, the callers will probably not send you a fully built request. Instead, they will be sending whatever is available at the time. This is far from what you used offline to train the model. And quite far from what the model expects.

For example, for a grocery delivery estimation, the model will receive only the order_id. But the model probably will need much more than only the order_id. It will need to fetch information about the order, the customer, and the delivery person. In addition, it might need current traffic conditions around the delivery address and also a bunch of historical values related to previous delivery times in that zip code, previous shopping durations at the current store, and so on. Because all that is included in the model, fetching and processing the features becomes a high-stakes operation where the bulk of the prediction serving latency will come from.

As a mental model for thinking about features, we can split the input features into three camps:

  • User-supplied features: These come directly from the request.
  • Static reference features: These are infrequently updated values.
  • Dynamic real-time features: These values will come from other data streams. They are processed and made available continuously as new contextual data arrives.

VI.a. The “Easy” Case With Static Ref Features

Image by author

Static features come in two flavors: Singles and Aggregates. The singles are attributes of a single entity. The number of rooms in a house, or the ID for the advertiser associated with a campaign. Aggregates are things like the median house price in the zip code or the average ad budget of campaigns targeting a specific audience segment.

These sorts of static features are helpful for predictive use cases such as:

  • Predicting the final sale price of a used car based on the zip code, mileage, model year, model type, and median prices for this model/year combination.
  • Recommending movies similar to the movies the user has previously watched.
  • Ranking which ad creatives to show to the user based on previous purchases and demographic information.

The issue is that raw static features initially live in an enterprise data warehouse. At prediction time, the ML gateway will need to retrieve features and create a request that complies with the needs of the ML model. Unfortunately, the typical data warehouse is not optimized for low latency queries. Instead, data warehouses are optimized for large aggregations, joins, and filtering on extensive star schemas. That’s not going to be suitable for low latency apps.

The ML gateway fetching pattern for static features is: “I need a single row with one column for each of the features of customer X.”

The standard method for low latency static features is to periodically extract the features and aggregates, and place them in a data store optimized for singleton lookup operations.

Standard serving pattern for handling static features using an ML gateway, feature lookup DB, and offline batch jobs. Image by author.

The two parallel flows go like this:

Offline:

The batch jobs do the following:

  1. Read from the data warehouse.
  2. Generate the singles and aggregate static features.
  3. Load the features in the feature lookup API/DB.

Online:

  1. The client sends an entity ID that needs predictions. For example, recommend a list of movies for user_id=”x”.
  2. The entity is enriched/hydrated by the attributes present in the feature lookup API.
  3. The ML gateway then consolidates the input features into a prediction request forwarded to the ML Model API.
  4. When the ML Model API returns predictions, the ML gateway post-processes them and returns them to the client.
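
A minimal sketch of both flows, with Redis standing in for the feature lookup DB and a hypothetical `score_request` call standing in for the ML Model API:

```python
# Static reference features: an offline batch job loads singles and aggregates
# into a low-latency key-value store; the online ML gateway hydrates each
# request with a singleton lookup before calling the model. Redis, the feature
# names, and score_request() are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def load_static_features(rows) -> None:
    """Offline: one key per entity, one hash field per feature."""
    for row in rows:  # rows come from the warehouse extract
        r.hset(f"user_features:{row['user_id']}", mapping={
            "avg_order_value": row["avg_order_value"],
            "orders_last_90d": row["orders_last_90d"],
        })

def handle_prediction_request(user_id: str, score_request) -> dict:
    """Online: singleton lookup, enrich the request, forward to the ML Model API."""
    raw = r.hgetall(f"user_features:{user_id}")
    features = {k.decode(): float(v) for k, v in raw.items()}
    request = {"user_id": user_id, **features}
    return score_request(request)  # hypothetical call to the deployed model
```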

The good thing about static features is that they are… well… static; not something that changes in real-time.

The standard pattern is to set up a batch job to update the static features. Unfortunately, that batch job costs quite a bit of cash if you run it every 15 minutes. So you exponentially lower the frequency of the update until the model’s optimizing metric starts complaining. Then you raise the frequency to its previous value. Automate that. Done.

We need a picture! We have a picture…

Standard serving pattern for handling static features using an ML gateway, feature lookup DB, and offline batch jobs. Image by author.

Here is what is happening in the data flow above:

  1. Get the data from your data warehouse. Use something that can handle warehouse-level queries.
  2. Process the data to do all the usual wrangling (joining, filtering, aggregating some numbers, and so on).
  3. Run the feature engineering (what you are paid to do): Extract the cheapest set of features your model needs. Keep “ablating” the expensive features until the optimizing metric drops below your threshold. Then automate that.
  4. Store each entity you are predicting on. Hopefully, somewhere with a minimal lookup time. Select a database that will give you the best lookup latency but stop optimizing the DB once the model predictive latency performance is doing well.

Now that you optimized the above, you can scale the feature generation in your organization. For example, let’s say that the features you create can be reused in a separate use-case by a random stroke of luck. Instead of having each ML pipeline regenerate the same features and waste time and money, build a feature store that has the following characteristics:

  • Enterprise-wide
  • Centralized
  • Discoverable

Here is what you do. The plan is flawless. Split the feature generation workflow into two parts:

  • Producing ML features (The Givers)
  • Discovering ML features (The Takers)

Producing the features is conceptually similar to the diagram above.

For the discovery, everyone on the ML team will need information about the customer, the product, and the channels. Start there. First, the singles, then the aggregates. Watch the fireworks.

I posit that discovering is more complicated than producing the features. The more complex the features, the harder they are to reuse. The customer min, max, and average ages are easier to share than product image embeddings, or catalog taxonomy.

FOCUS ON THE ML SERVING LATENCY FIRST.

THAT’S WHAT THE CLIENT SEES FIRST.

VI.b. The “Less-Easy” Case With Dynamic Real-time Features

The “Dynamic” nature of this option comes from using events.

Real-time features and output predictions management.

When would this happen? When you want to use the most recent events as features in your model. For example, imagine any of the following:

  • You are using user interactions in a browser session as features: the customer interacts with “N” items, and you want to recommend the next item to watch.
  • You are predicting which “cookie maker” in your factory will fail because the butter’s temperature is not correct.
  • You are predicting delivery times of groceries so that you can tell customers when to be around so that their stuff doesn’t get stolen.

These real-time features pass through an event-stream processing pipeline. Compared to the batch case, you need to find ways to update existing aggregated values immediately, as soon as the incoming data is available. For example, a shopping basket with $1+$1+$1 is different from one with $1+$1+$1000+$1 items.

For this, you need a streaming pipeline that does two things:

  • Generates features dynamically. The pipeline task takes events at one end and generates needed stats as fast as possible.
  • Generates predictions dynamically. The pipeline task takes the fresh features and calls the deployed model(s) to generate predictions.

The recommendation is to place the generated features in a low-latency database that is good at read/write performance. The output predictions should land in two locations:

  • Output predictions database
  • Output predictions stream
Real-time features and output predictions management with steps. Image by author.

So to summarize:

  1. Fresh events land in your favorite messaging system. Then, they get picked up by the streaming pipeline. The generated features, probably aggregated over time windows, land in a low-latency feature store. Existing features are updated with fresh values.
  2. The streaming pipeline generates the predictions using the features and the model API.
  3. The ML gateway receives client prediction requests. The gateway then checks if there are any predictions in the database or the messaging system. Then the gateway returns them to the client. Finally, it optionally pushes them to the messaging system if some other system downstream is interested. Looking at you, governance team.
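
A minimal sketch of the streaming task described above, assuming Redis as the low-latency feature/predictions store and a hypothetical `score_request` callback for the model API; a real pipeline would be fed by Kafka, Kinesis, Pub/Sub, or a similar messaging system:

```python
# Streaming task for dynamic real-time features: update windowed aggregates as
# each event arrives, refresh the feature store, and score with fresh values.
# Redis and score_request() are illustrative assumptions.
import json
from collections import defaultdict, deque

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
WINDOW = 20  # keep the last N basket events per session (hypothetical choice)
session_windows = defaultdict(lambda: deque(maxlen=WINDOW))

def process_event(event: dict, score_request) -> None:
    window = session_windows[event["session_id"]]
    window.append(float(event["item_price"]))

    features = {
        "basket_size": len(window),
        "basket_total": sum(window),   # $1+$1+$1000+$1 shows up immediately
        "max_item_price": max(window),
    }
    # Write fresh features to the low-latency feature store.
    r.hset(f"session_features:{event['session_id']}", mapping=features)

    # Generate the prediction and land it in the predictions DB (and, in a real
    # pipeline, also on an output predictions stream).
    prediction = score_request({"session_id": event["session_id"], **features})
    r.set(f"prediction:{event['session_id']}", json.dumps(prediction))
```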
Image by author

VII. Don’t Forget The Predictions!

If all the techniques so far in this post still do not make your prediction latency low enough, then the next optimization you need is precomputing and caching predictions.

You set up a batch scoring job that stores the predictions in a low read-latency DB. Memoization at its finest. Then the client does not have to call any live prediction service. The client pulls the pre-generated predictions from the DB directly.

Here is a diagram to visualize the pre-computation process:

Pre-computing and caching output predictions with steps. Image by author.

The tasks to be completed, outlined in the diagram above, are as follows:

  1. Get the data to be scored.
  2. Score it offline in a batch job.
  3. Store the predictions in a DB that specializes in key-value records.
  4. The ML gateway gets prediction requests, then fetches and returns the predictions.
  5. The client gets the prediction and moves on.
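
A minimal sketch of this memoization pattern, assuming Redis as the key-value prediction store and a scikit-learn-style model; the entity and feature fields are hypothetical:

```python
# Precomputed predictions: a batch job memoizes a score per entity in a
# key-value store; the online path is a single lookup, no model call at all.
# Redis, the entity fields, and the daily expiry are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def batch_precompute(model, entities) -> None:
    """Offline: score every entity and store the result under its ID."""
    for entity in entities:  # e.g. [{"entity_id": "u1", "features": [...]}, ...]
        score = float(model.predict([entity["features"]])[0])
        r.set(f"pred:{entity['entity_id']}", score, ex=24 * 3600)  # refresh daily

def serve_prediction(entity_id: str):
    """Online: the ML gateway fetches and returns the precomputed prediction."""
    value = r.get(f"pred:{entity_id}")
    return float(value) if value is not None else None
```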
Image by author

You may ask: “This is all well and good, but what about the lookup keys? What should we use there?”

Let’s split the problem into two types:

  • predictions for an entity
  • predictions for a combination of feature values

For the entity case, the prediction service receives a known entity ID. That ID is going to represent a domain entity in your use case. That could be a product_id, movie_id, order_id, device_id, etc. For example, predicting the next ad to show to a user_id that just loaded our product page. That would be a prediction at the Entity level.

For the combination of feature values case, the prediction service receives a combo of feature values. For example, say you are trying to predict if the user will buy something, and we want to show a promotion to nudge them. However, we might only get a combination of location, current shopping cart size, segment information, and product category. For situations like these, you might want to pick the top-N most frequent feature combinations, and generate predictions for those combinations offline. Then, when there is a client request, you can first check if you can just fetch a prediction for that combo of features.

VII.a. Precomputing Predictions

VII.a.i If You Can Use Entities, Use Them!…

Watch out with precomputing predictions! Entity cardinality is the killer here.

You are in good shape if you are generating predictions for low cardinality entities. For example, predicting maintenance needs for hundreds of vehicles in an industrial fleet will probably not break the bank.

However, if you are generating predictions for high cardinality entities, then good luck. For example, if the product catalog is a 100-million-item monster, you will need to lower your expectations and use some tricks. A favorite one is to generate predictions for the top-N most viewed products. Then, for the rest of the long tail of remaining products, you make the client wait while you call the model directly instead of pulling the predictions from the prediction store.
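
A sketch of that top-N-plus-fallback logic might look like the following, with Redis as the prediction cache and a hypothetical `call_model_api` function for the live path:

```python
# Top-N precompute with a live-model fallback for the long tail of products.
# Redis and call_model_api() are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def get_product_prediction(product_id: str, call_model_api):
    cached = r.get(f"pred:product:{product_id}")
    if cached is not None:
        return float(cached)               # top-N popular product: cheap lookup
    score = call_model_api(product_id)     # long-tail product: the client waits
    r.setex(f"pred:product:{product_id}", 3600, score)  # cache for next time
    return score
```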

Here is a visual representation of that workflow:

Precomputing and caching predictions using entities. Image by author.

VII.b. Caching Predictions

Image by author

VII.b.i The Very Special Case Of Real-time Similarity Matching

Say you have too many entities. You tried direct predictions, total precomputed predictions, or partial precomputed predictions, but still nothing works. In that case, similarity matching is worth a try.

Something like the following will improve the prediction latency:

  1. Train a model on the products’ similarity using product-user interactions or product-product co-location.
  2. Extract the embeddings of the products.
  3. Build an index of the embeddings using an approximate nearest neighbor method.
  4. Load the index in the ML prediction service.
  5. Use the index at prediction time to retrieve the similar product IDs.
  6. Periodically update the index to keep things fresh and relevant.
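
A minimal sketch of steps 2 through 5, using scikit-learn’s exact NearestNeighbors as a stand-in for an approximate-nearest-neighbor library (FAISS, Annoy, ScaNN); the embedding matrix and product IDs are placeholders:

```python
# Similarity matching: index product embeddings offline, retrieve neighbors at
# prediction time. Exact NearestNeighbors stands in for an ANN index here;
# product_ids and embeddings are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

product_ids = np.array(["p1", "p2", "p3", "p4", "p5"])  # hypothetical catalog
embeddings = np.random.rand(5, 32).astype(np.float32)   # extracted from the trained model

index = NearestNeighbors(n_neighbors=3, metric="cosine")
index.fit(embeddings)  # built offline, then loaded into the prediction service

def similar_products(query_embedding: np.ndarray) -> list:
    """Prediction time: return the IDs of the most similar products."""
    _, neighbor_idx = index.kneighbors(query_embedding.reshape(1, -1))
    return product_ids[neighbor_idx[0]].tolist()

# Example: find the neighbors of the first product's embedding.
print(similar_products(embeddings[0]))
```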

“But what do I do if the index is too large, or the prediction latency is too high?” Reduce the embedding size to get a smaller index until the optimizing metric starts complaining. If you can’t get an acceptable optimizing+satisficing tradeoff, look elsewhere.

VII.b.ii Remember That Predicting On Combinations of Feature Values Gets Expensive Quickly

If entities are not available for your use case (because people usually like their privacy), try using combinations of feature values. You will need a deterministic hashing method to generate a key for each combination of values.

For example, say you have three features: country, gender, and song_category. Then you would generate hash(country, gender, song_category) as the key. The order is important here: hash(country, gender, song_category) will differ from hash(song_category, country, gender). So pick a particular order and stick with it.

Be careful with the cardinality. The more categories, the higher the number of predictions generated. If you serve 10 countries, 2 genders, and 40 song_categories, then that would mean you make 10x2x40=800 predictions. Just keep that in mind.

After deciding on the key, you precompute predictions for each key. Store each key value in a low-read-latency DB, and you are good to go. Again, remember that even with a solid key-value store, you still need to reduce the number of predictions stored by reducing the number of possible keys. Use the optimizing vs. satisficing metric method here as well. Keep adding categories, features, and keys while the model’s predictive performance increases. But stop when the prediction latency starts to complain.
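
A minimal sketch of this key-hashing and precomputation scheme, with Redis as the low-read-latency store and a hypothetical `predict_combo` model call:

```python
# Precompute predictions for feature-value combinations. A fixed field order is
# hashed into a stable lookup key. Redis and predict_combo() are illustrative
# assumptions.
import hashlib
import itertools

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

KEY_FIELDS = ("country", "gender", "song_category")  # fixed order: never change it

def combo_key(values: dict) -> str:
    ordered = "|".join(str(values[field]) for field in KEY_FIELDS)
    return hashlib.sha256(ordered.encode()).hexdigest()

def precompute(model, countries, genders, song_categories) -> None:
    # 10 countries x 2 genders x 40 song_categories = 800 keys, as in the example.
    for combo in itertools.product(countries, genders, song_categories):
        values = dict(zip(KEY_FIELDS, combo))
        score = model.predict_combo(values)  # hypothetical model call
        r.set(f"pred:combo:{combo_key(values)}", float(score))

def lookup(values: dict):
    cached = r.get(f"pred:combo:{combo_key(values)}")
    return float(cached) if cached is not None else None
```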

Four things to keep in mind:

  1. The DB will have lots of rows, but only a few columns. Choose a DB that handles single key lookups well.
  2. Keep an eye on the categories’ cardinality and the number of keys generated. If you have a batch job doing this, then monitor the cardinality and raise alarms if you get a spike in new categories to count. That will prevent blowing up the DB lookup latency.
  3. Continuous values are going to need to be bucketized. That’s going to be a hyper-parameter that you need to tune.
  4. Any technique that can be used to lower the cardinality of categories is your friend. Lower the cardinality as much as your optimizing metric allows.

That’s it folks! What a trip!

Understanding the options available when working on a low-latency real-time online ML inference product has advantages:

  • First, you get to sound smart when chatting with your product manager.
  • Then you save yourself time by exploring the correct type of ML Latency optimization.
  • Then you don’t lose hope when your Engineering team says that your model takes too long for prod.
  • Finally, your product might succeed. Who knows, right?

Here is the map of patterns for future reference:

(High resolution diagram here)

Architecture patterns to fight ML predictive latency. Image by author. (High resolution diagram here)

And here is a discovery map for future reference:

Common ways to reduce ML prediction latency, “ranked”. Image by author

I hope that you learned something about ML Latency. I sure did 🙂

