Why Accurate Models Aren’t Always Useful | by Siddarth Ramesh | Sep, 2022

By Jessie Hobb On Sep 9, 2022

How economic utility functions help tie your models back to your customers

Photo by Afif Kusuma on Unsplash

Let me begin by saying that there is a lot of excellent technical content around how to evaluate your models. Metrics like F1 score, MSE, MAE, Huber Loss, precision, recall, cross-entropy loss, and many others are terms that have been discussed at length all over the internet. However, these metrics generally focus on fitting your model to your data, not optimizing your model to your business — at least in a direct way.

What’s often missing is a framework of economic analysis to optimize the utility of the model. Utility is defined simply as the amount of enjoyment or value a customer can get from a service — in this case your ML model.

While this concept is not taught in ML classrooms, I contend that economic analysis and utility estimation are among the most important steps in building practical, lasting models in the real world. Until a coalition of technical and nontechnical stakeholders works together to develop an economic layer for your Machine Learning models, the business value and marginal utility of Machine Learning in your organization will not be well defined.

Note: This post is intended for technical ML folks as well as Product Managers and less technical stakeholders who work with AI products. There will be some math in this post, but I have included the high level conceptual steps in the conclusion section of this blog.

Imagine that you have a very indecisive friend who never knows whether a new movie is worth watching. Because you are an unbelievably great friend and an excellent ML practitioner, you decide to build a simple binary classification model to predict whether your friend will like or dislike an upcoming movie.

You do the hard work of labeling your friend’s preferences for many different movies they have watched. You do some feature engineering and extract main actors, genre, director, and other features to add to your training data. You also add a label denoting whether your friend liked the movie (1) or not (0). In the end, you have a dataset like the one below.

Example movie preferences dataset, Image By Author

You follow the normal Machine Learning protocols and train a model with your favorite classifier and test it with a test set of 300 movies.

You find that your model has a pretty high accuracy — 90% of your labels were correctly predicted. Given this, you build an app powered by this model and deliver it to your friend, so they can start being more decisive!

A few weeks later, you get lunch with your friend and ask them whether they are using your app. Your friend hesitates and reveals that while many of the recommendations were accurate, there were still a couple that did not hit the mark and so they stopped using the app.

Classification Outcomes

So what happened? Why did your friend stop using the app?

To answer that question, we have to look a little deeper. In a binary classifier, you realize there are actually 2 ways you can be right and 2 ways you can be wrong.

The image below shows the possible prediction outcomes. To understand False Positives, False Negatives, True Positives, and True Negatives, please check out this link.

In your friend’s case, these are the definitions of TP, FP, FN, and TN:

TP = You correctly recommend a good movie to your friend (label = 1, prediction = 1)

TN = You correctly do not recommend a bad movie to your friend (label = 0, prediction = 0)

FP = You incorrectly recommend a bad movie to your friend (label = 0, prediction = 1)

FN = You fail to recommend a good movie to your friend (label = 1, prediction = 0)

Predictions and Outcomes in dataset, Image By Author
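To make the four outcomes concrete, here is a minimal Python sketch that labels each (label, prediction) pair and counts the results. The function names and the toy data are hypothetical, made up for illustration, and are not from the original dataset.

```python
def classify_outcome(label: int, prediction: int) -> str:
    """Map a (label, prediction) pair to one of TP, TN, FP, FN."""
    if label == 1 and prediction == 1:
        return "TP"  # good movie, recommended
    if label == 0 and prediction == 0:
        return "TN"  # bad movie, not recommended
    if label == 0 and prediction == 1:
        return "FP"  # bad movie, recommended anyway
    return "FN"      # good movie, not recommended


def count_outcomes(labels, predictions):
    """Count how many predictions fall into each outcome."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for label, prediction in zip(labels, predictions):
        counts[classify_outcome(label, prediction)] += 1
    return counts


# Hypothetical toy data: 1 = friend liked the movie, 0 = friend did not.
labels = [1, 0, 0, 1, 0, 1]
predictions = [1, 0, 1, 1, 0, 0]

outcome_counts = count_outcomes(labels, predictions)
print(outcome_counts)
```

Accuracy only looks at TP + TN. Keeping the four counts separate is what lets us attach a different dollar value to each one later.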

An Economic Analysis

At this stage, we begin to build our economic layer. The first phase consists of 2 steps:

List out all the benefits and costs associated with your model
Measure the dollar value of each benefit and cost

In the movie solution you designed for your friend, let’s say that the cost of an FP was $20 for a movie ticket plus 2 hours of wasted time, valued at $10 per hour. There was also the emotional damage of sitting through a bad movie, which you convert into a dollar amount, in this case $50. Watching a good movie was worth about $50 to your friend. Your friend wouldn’t have felt that bad about missing a good movie that your app failed to recommend, which you put at $12, and correctly skipping a bad movie was worth about $5 to them.

Some of these costs are already in dollars, so those are easier to break down. Other costs are harder since they are time and emotional costs, and quantifying them requires a deep understanding of your friend (or customer). After some hard work, you consolidate the above into the following definitions for your friend’s benefits and costs.

A = Time spent = $10 per hour

B = Emotional damage of watching a bad movie = $50

C = Emotional damage of missing a good movie = $12

D = Emotional happiness of watching a good movie = $50

E = Emotional happiness of missing a bad movie = $5

F = Price of movie ticket = $20

This difficult exercise of finding and measuring different costs and benefits and then translating them into a single interpretable currency is the missing but essential step in building your utility optimization layer.

You now construct a function to tie your friend’s quantified costs and benefits back to the classification outcomes. To keep the equations simple, we use the variable labels (A, B, C, etc.) defined above. In this phase, we add up the different benefits and costs to create dollar values for each classification outcome.

For example, False Positive is $80 because you add the price of a movie ticket, the 2 hours of time spent watching the movie, and the emotional damage of watching a bad movie. True Positive is $10 because it was worth $50 to your friend to watch that movie but they spent $40 worth of time and money. To keep things very simple, we assume that every movie is 2 hours.

False Positive = F+2(A)+B = $20 +($10*2) +$50 = $80

False Negative = C = $12

True Positive = D-F-2(A) = $50 -$20 -($10*2) = $10

True Negative = E = $5

Benefit Cost Ratio

Now that all the costs and benefits have been tied back to our classification outcomes, we can bring in a utility function. This is different from tuning the most accurate model.

One example of an economic utility function that might be helpful is the benefit-cost ratio. This is a popular calculation done as part of a general financial and economic analysis. The high level interpretation of this ratio is that a value of more than 1 means that the benefit outweighed the cost and less than 1 means the opposite.

We can perform the following steps:

Count the number of TPs, TNs, FPs, and FNs
Use the costs you calculated for each outcome as weights to compute the benefit-cost ratio

Coming back to the movie example, recall that we had a test set of 300 movies and 90% were correctly predicted. After you follow the above directions, you arrive at the following number of data points for each type of classifier outcome.

TP predictions = 80 predictions

TN predictions = 190 predictions

FP predictions = 30 predictions

FN predictions = 0 predictions

As expected, 270 out of 300 predictions were correct (TP + TN). We can see there were 30 False Positives, and the cost of an FP was $80.

Our benefit-cost ratio is the total amount of expected benefits divided by the total amount of expected costs.

[(Weights * TP) + (Weights * TN)] / [(Weights * FP) + (Weights * FN)]

Our expected benefits come from TP and TN, and our expected costs come from FP and FN. If we apply the weights computed for TP, TN, FP, and FN, we get the following equation.

[($10*TP) + ($5*TN)] / [($80*FP) + ($12*FN)]

If we apply all the computed numbers to the benefit-cost ratio:

[10(80) + 5(190)] / [80(30) + 12(0)] = 1750 / 2400 ≈ 0.729

A value above 1 means that the model is adding value. In our case, the value is 0.729. This means that the model is hurting your friend despite getting 270 out of 300 predictions correct. No wonder your friend stopped using the app!
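Here is a minimal Python sketch of the benefit-cost ratio, using the outcome counts and dollar values from this example. The function name is hypothetical, and returning infinity when there are no costs is my own assumption about how to handle that edge case.

```python
def benefit_cost_ratio(counts, values):
    """Total weighted benefits (TP, TN) divided by total weighted costs (FP, FN)."""
    benefits = values["TP"] * counts["TP"] + values["TN"] * counts["TN"]
    costs = values["FP"] * counts["FP"] + values["FN"] * counts["FN"]
    if costs == 0:
        # Assumption: with no costs at all, treat the ratio as unbounded.
        return float("inf")
    return benefits / costs


# Outcome counts from the 300-movie test set.
counts = {"TP": 80, "TN": 190, "FP": 30, "FN": 0}
# Dollar value of each outcome, as computed earlier in the post.
values = {"TP": 10, "TN": 5, "FP": 80, "FN": 12}

ratio = benefit_cost_ratio(counts, values)
print(round(ratio, 3))  # 0.729
```

Note that accuracy here is 270 / 300 = 90%, yet the ratio is below 1 because each of the 30 False Positives costs far more than any single correct prediction earns.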

Economic Functions Outside of Binary Classification

We can extend this framework to other types of problems as well. For example, in a multi-class classifier, you could measure the correct, incorrect, missed, and out of scope utterances and then apply corresponding weights to those measurements to create your benefits and costs. You could alternatively use a one-vs-all classifier to build deep metrics on a per-class level. High precision might matter more for some class labels than for others, so the economic metric could be set up to optimize for that subset of classes.
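One possible way to sketch the multi-class idea in Python is to give each class its own benefit for a correct prediction and its own cost for an incorrect one. Everything here is hypothetical: the class names, the dollar weights, the toy predictions, and the choice to charge the cost to the predicted class (which penalizes low precision on expensive classes).

```python
# Hypothetical per-class dollar weights (not from the post).
CLASS_WEIGHTS = {
    "comedy": {"correct": 10, "incorrect": 20},
    "horror": {"correct": 10, "incorrect": 80},  # a wrong "horror" call is expensive
    "drama": {"correct": 5, "incorrect": 12},
}


def multiclass_benefit_cost_ratio(true_labels, predicted_labels, weights):
    """Benefit-cost ratio where weights depend on the predicted class."""
    benefits = 0
    costs = 0
    for true_label, predicted in zip(true_labels, predicted_labels):
        if predicted == true_label:
            benefits += weights[predicted]["correct"]
        else:
            costs += weights[predicted]["incorrect"]
    # Assumption: with no costs at all, treat the ratio as unbounded.
    return benefits / costs if costs else float("inf")


# Hypothetical toy predictions.
true_labels = ["comedy", "horror", "drama", "comedy", "drama"]
predicted_labels = ["comedy", "horror", "drama", "horror", "drama"]

multiclass_ratio = multiclass_benefit_cost_ratio(
    true_labels, predicted_labels, CLASS_WEIGHTS
)
print(multiclass_ratio)
```

In this toy run, four of five predictions are correct, but the single wrong "horror" prediction costs more than all the correct ones earn combined.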

No matter what model you are using, it will never hurt to apply this economic analysis to help you decide the right settings for your model in production.

Actionable Steps and Deeper ML Metrics

Now that you have quantified your friend’s preferences, you might decide that you need a model that will produce fewer recommendations for your friend. One way to accomplish this is to set a threshold and measure how that threshold performs against your utility function. In my experience in both larger and smaller companies, I have seen thresholds being set arbitrarily, and an economic layer would help add some context to why a threshold is the way it is. There are also other ways to choose a threshold for your model, which I will not go into here.

Deeper metrics like F1 do take precision and recall (and therefore classification outcomes) into account, which gives a better view into how the model is performing compared to just accuracy. Sensitivity and specificity are well known concepts that overlap with this post.
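As a rough sketch of measuring a threshold against the utility function, the Python below computes the benefit-cost ratio at a few candidate thresholds and picks the best one. The scores, labels, and candidate thresholds are hypothetical toy data. Only the dollar values come from this post.

```python
# Dollar value of each outcome, as computed earlier in the post.
VALUES = {"TP": 10, "TN": 5, "FP": 80, "FN": 12}


def ratio_at_threshold(scores, labels, threshold):
    """Benefit-cost ratio when recommending only movies scoring >= threshold."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, label in zip(scores, labels):
        if score >= threshold:  # model recommends the movie
            counts["TP" if label == 1 else "FP"] += 1
        else:                   # model does not recommend the movie
            counts["FN" if label == 1 else "TN"] += 1
    benefits = VALUES["TP"] * counts["TP"] + VALUES["TN"] * counts["TN"]
    costs = VALUES["FP"] * counts["FP"] + VALUES["FN"] * counts["FN"]
    # Assumption: with no costs at all, treat the ratio as unbounded.
    return benefits / costs if costs else float("inf")


# Hypothetical model scores and true labels for ten movies.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

candidate_thresholds = [0.5, 0.65, 0.75]
for threshold in candidate_thresholds:
    print(threshold, round(ratio_at_threshold(scores, labels, threshold), 3))

best_threshold = max(
    candidate_thresholds,
    key=lambda t: ratio_at_threshold(scores, labels, t),
)
print("best threshold:", best_threshold)
```

On this toy data, the highest threshold wins because it removes every False Positive at the price of one False Negative, and an FN is much cheaper than an FP for your friend.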

Even in these cases, economic functions are still valuable and can be used to assign monetary value to different outcomes and directly tie your models back to your customer’s needs.

Of course, the real world is much messier than the movie example. The process of consolidating benefits and costs might take several iterations and significant time. It is also very hard to measure aspects like “emotional damage”. Getting a decent measure of this will require a very deep understanding of your customers, and even then it may not be 100% accurate. The weights you apply will likely not be exactly right, and they run the risk of being subjective rather than customer centric.

To mitigate bias, the process of creating this economic layer should be shared among Machine Learning folks, Product Managers, and any stakeholders who have insights into the customers.

In this post, the following are the high level conceptual steps we took. Steps 5 through 8 make up the economic layer:

1. Source and prepare your data
2. Build and train a model
3. Generate predictions using a test set
4. Using the predictions, label the classification outcomes (TP, FP, TN, FN in the case of a binary classifier)
5. Understand the costs and benefits of the classification outcomes and translate them to dollars. This requires ML folks to work with stakeholders who have a very good sense of what the customers want.
6. Weigh the classification outcomes with the benefit and cost values computed in step 5
7. Compute an economic utility function (or more than 1). In this example the function was a benefit-cost ratio
8. Optimize your model to your utility function. This might compromise accuracy. In the case of the benefit-cost ratio, we want the ratio to be above 1, and our original model’s ratio was less than 1.

To conclude, while it’s generally agreed upon that Machine Learning and AI add a lot of value to your business, customers, and bottom line, it is important to start measuring and optimizing that value.