# How to frame a regression problem as a classification problem to account for uncertainty | by Jonathan Serrano | Aug, 2022

I bet you have been there: you have some data and a target column that you are required to estimate using some ML technique. You name it: given some financial data, estimate a price; given some customer information, estimate their lifetime value; given some mechanical readings, estimate wear or usage. This is a regression problem: the objective is to pinpoint the value of a continuous variable given a set of features. And here comes the interesting part. How certain are you of this value?

Let’s use the first example. You are presenting a price estimation model to a sales manager and, given the input, it estimates that the price should be $51.53. You are confident because the evaluation RMSE was $1.8; good enough for this price range but quite a problem for a $5 to $10 price range. The issue here is that the error is not evenly distributed across the price ranges. The model will be more accurate in certain ranges and less so in others, and RMSE will not tell you how accurate it is for a particular estimate. In other words, you have no idea about the model’s uncertainty.

A similar situation arises in finance. You have $100 and must decide how to invest it among 3 possible assets. It would be odd if someone came and told you to put all of your money into a single asset; you would expect some advice in the form of “invest 60% here, 30% there, and the last 10% here”.

I bet you can see where this is going. In the given regression problem you can pinpoint a forecasted value, say $10.5, or provide a distribution in the form of “there is a 20% probability of the value being between 0 and 5, a 70% probability of it being between 5 and 10, and just a 10% probability of it being more than 10.”

This is graphically shown below.

There are pros and cons to both approaches.

**Pinpoint a value**

Pro

- Tell someone that the expected value is $10.5 and there will be no doubt about what it means. The pro is that a number is easily understandable.

Con

- As mentioned earlier. How certain are you of this value? The training RMSE does not tell you much about this particular case. The con is you have no idea how accurate a specific prediction is.

**Show a distribution**

Pro

- You know how certain the model is about the prediction. If, for example, the model gives a 90% probability for a bin, you can be confident the value falls in the corresponding range. If, on the other hand, the model gives you values close to 30% for each bin, well, you are certain you know nothing. *And even knowing you know nothing can be useful.* The pro here is that the model accounts for uncertainty.

Con

- The main issue with this approach might be interpretability. Anyone familiar with the concept of a probability distribution will easily understand what to do with the result. If you are lucky enough to deal with savvy stakeholders, you are set; however, you will have a more challenging time if they are not.
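That “you know nothing” case can even be quantified: the entropy of the predicted distribution is a single number summarizing how spread out the probabilities are. This is not part of the original post, just a minimal sketch with a hand-rolled entropy function:

```python
import numpy as np

def dist_entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention 0 * log(0) = 0
    return float(-(p * np.log(p)).sum())

confident = [0.02, 0.90, 0.05, 0.02, 0.01]  # mass concentrated in one bin
uniform = [0.20] * 5                         # "you know nothing"

print(dist_entropy(confident))  # low
print(dist_entropy(uniform))    # maximal for 5 bins: log(5) ≈ 1.609
```

The closer the entropy is to its maximum, the less the distribution tells you, which gives a concrete rule for when to trust a prediction.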

At this point, you may be wondering where the classification part comes in. Well, see the histogram above. Do you get the idea? If you don’t, here it is.

Instead of pinpointing the value, estimate the probability of its value belonging to each bin, that is: classify each sample into a bin.

Now let’s make this idea clear with some code and a real example. First, we will train a regressor to pinpoint the salary, then make the changes to frame the regression problem as a classification problem.

## The regression task

We will use the Data Science Job Salaries [1] available on Kaggle.

The dataset looks like this.

I will not go into the details of cleaning and preparing the data; in general, these steps comprise removing unused columns, dealing with categorical data, and scaling the dataset values. Finally, we will use XGBoost [2] to train a regressor.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Prepare data
X = df.drop(['Unnamed: 0', 'salary', 'salary_currency', 'salary_in_usd'], axis=1)
X = pd.get_dummies(X)
y = df['salary_in_usd']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)

# Fit the model on the training data
model = XGBRegressor()
model.fit(X_train, y_train)

# Make predictions for the test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# Evaluate
rmse = mean_squared_error(y_test, y_pred, squared=False)
# Output: 41524.12746933382
# Not quite good, but enough to compare against the distributive approach
```

If we take a look at the first five predictions they look like this.

`[71284, 44957, 131146, 165507, 100765]`

For each element, the regressor aims to pinpoint a numeric value for the corresponding salary. Just as expected. I don’t know if it is just me, but when I look at those numbers I feel like I am lacking some context. How sure can I be that those numbers are correct?

The regressor achieved an RMSE of 41,524. I bet we can do better; however, the purpose is not to train a super-precise model but to contrast framing the problem as regression versus classification. So now let’s move on to the classification part.

## The classification problem

Earlier we said the core idea of framing a regression problem as a classification problem was to, instead of pinpointing a numeric value, estimate the probability that a sample belongs to a set of fixed bins, i.e. classify each sample into a bin.

So, the first obvious step is determining the set of fixed bins. To accomplish this we will need to take a look at a histogram of the training data we have.

And this is it. Now we have a clear idea of where most of the salaries are located, as well as the minimum and the maximum.

So let’s define the bins now. To do this we will rely on a single principle that makes classification problems easier: make sure the dataset is not heavily imbalanced. We will do this by carefully choosing the bins so that the count of elements belonging to each class is roughly similar.
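The post proceeds with equal-width bins below, but one way to honor this principle directly, not used in the original, is quantile binning: `pd.qcut` places the edges at quantiles, so every class gets roughly the same number of samples. A sketch on synthetic, salary-like data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "salaries" (parameters are illustrative)
rng = np.random.default_rng(0)
salaries = pd.Series(rng.lognormal(mean=11.3, sigma=0.5, size=500))

# qcut chooses edges at quantiles, so the 5 classes are balanced by construction
labels, edges = pd.qcut(salaries, q=5, labels=False, retbins=True)

print(labels.value_counts().sort_index())  # ~100 samples per class
print(edges)                               # the learned bin boundaries
```

The trade-off is that quantile edges are data-driven rather than round numbers, which can make the bins harder to explain to stakeholders.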

For now, let’s choose a count of 5 bins to find each bin’s limits and element count.

```python
# Calculate the numeric histogram from the actual test target
hist, bin_edges = np.histogram(y_test, bins=5)
hist, bin_edges
```

Output

```
(array([33, 45, 28, 15,  1]),
 array([  4000.,  68000., 132000., 196000., 260000., 324000.]))
```

The first bins’ counts are quite balanced; however, we have a problem with the last bins, which we will fix by aggregating the data in those bins.

Note that at this moment our targets in *y* are just a bunch of floating-point numbers (the salaries), which are unsuitable for a classification task, since such a task requires each sample to have a label indicating which class it belongs to. So the task ahead is to write some code that maps a float to an integer label.

So take a look at the following code.
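A sketch of such a mapping function follows. Treat it as a hypothetical reconstruction: only the 196,000 threshold and the aggregation into label 4 are given by the surrounding text; the lower edges here are illustrative placeholders.

```python
def map_float_to_class(salary: float) -> int:
    """Map a salary to an integer bin label (lower edges are illustrative)."""
    if salary < 68_000:
        return 0
    elif salary < 100_000:   # hypothetical edge
        return 1
    elif salary < 132_000:
        return 2
    elif salary < 196_000:
        return 3
    elif salary >= 196_000:  # aggregation: everything above 196,000 is label 4
        return 4
```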

Simple, right? We receive a float and return a label according to its value. The code is self-explanatory; just note the last *elif* statement: this is where the aggregation occurs. It says that everything larger than 196,000 will be assigned a label of 4.

I hear you asking why those values are hard-coded. The answer is that this is just an example. I bet there are wiser ways to choose the bin boundaries; however, I wanted to keep this simple and keep moving fast.

Now let’s train the classifier, using XGBoost again. But first, we need to convert the *float y targets* into a set of *distributive y targets*. This code will do.

```python
# Convert each y in the training and test sets to a class
y_train_distributive = [map_float_to_class(y) for y in y_train]
y_test_distributive = [map_float_to_class(y) for y in y_test]
```

Now let’s train the classifier for real.

```python
# Import the classifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

model = XGBClassifier()

# Train to predict a label
model.fit(X_train, y_train_distributive)

# Get the labels
predictions = model.predict(X_test)

# How well does it work?
accuracy_score(y_test_distributive, predictions)
# Output: 0.5327868852459017
```

Using default parameters we get an accuracy of 0.53; you can bet it can be improved. For now, let’s move on.

If we take a look at the first five predictions they look like this.

`array([1, 1, 2, 3, 2])`

They are just labels… not very informative. We are interested not in which label was selected but in the underlying distribution from which it was picked.

This code will do the trick.

```python
# Instead of the labels, get the probability distributions
predictions = model.predict_proba(X_test)
```

Now the first five predictions look like this.

```
array([[2.76227575e-03, 8.35917711e-01, 1.60156339e-01, 8.88690702e-04,
        2.75029975e-04],
       [2.02982640e-03, 7.47869253e-01, 2.40976274e-01, 8.92745517e-03,
        1.97156842e-04],
       [9.52042465e-04, 4.67287423e-03, 4.85075146e-01, 2.80507535e-01,
        2.28792369e-01],
       [7.99451722e-04, 8.09144005e-02, 1.74629122e-01, 7.32445538e-01,
        1.12115145e-02],
       [1.12233951e-03, 8.29923898e-02, 8.98771644e-01, 1.67797077e-02,
        3.33951320e-04]], dtype=float32)
```

Well… we have more information! Since an image is worth a thousand words (or arrays, in this case), we had better make a plot.

```python
import matplotlib.pyplot as plt

# Plot some of the predicted salary distributions
# (`labels` holds the class names for the x-axis, e.g. [0, 1, 2, 3, 4])
fig, ax = plt.subplots(3, 3)
fig.set_size_inches(16, 6)

# Randomly picked indexes
ax[0, 0].bar(labels, predictions[1])
ax[0, 1].bar(labels, predictions[5])
ax[0, 2].bar(labels, predictions[8])
ax[1, 0].bar(labels, predictions[10])
ax[1, 1].bar(labels, predictions[15])
ax[1, 2].bar(labels, predictions[20])
ax[2, 0].bar(labels, predictions[40])
ax[2, 1].bar(labels, predictions[47])
ax[2, 2].bar(labels, predictions[52])

plt.tight_layout()
fig.suptitle("Some calculated distribution samples")
plt.show()
```

This is the output. I find this solution way more expressive than a simple floating number with no context. For a sample, the model returns a distribution, and from the distribution, you can obtain a lot of information.

Take a look at row 1 and column 0. If you had just used the label provided by the first classifier perhaps you would be committing an error, since the difference between the probability of that particular sample belonging to either of those two larger classes is quite small. In this case, perhaps you should dig a bit more into the candidate’s profile to make a decision.

In contrast, look at row 2 and column 1. The model is quite certain that a salary in this range is correct.

Now, look at row 0 and column 2. The model is 3 times more confident that the salary should be somewhere in the second bin; nevertheless, the third bin is still quite relevant. Which one to choose? You could reach for some other data to make the final decision: perhaps the hiring department has high turnover and a higher salary might help reduce it, or, on the other hand, there are plenty of candidates with this profile and, hence, you can hire at the lower range.

We could make a similar analysis of the remaining distributions, but the bottom line is the same: distributions provide more information for decision-making than single values. Judge for yourself.
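And if a stakeholder still insists on a single number, the two framings can be reconciled: take the probability-weighted average of the bin midpoints as a point estimate and keep the distribution around as its context. A sketch, assuming the five histogram edges shown earlier (the last bin is taken to cover everything above 196,000):

```python
import numpy as np

# Bin edges from the histogram above; midpoints stand in for each class
edges = np.array([4_000, 68_000, 132_000, 196_000, 260_000, 324_000])
midpoints = (edges[:-1] + edges[1:]) / 2

# One predicted distribution (first row of the predict_proba output above)
proba = np.array([2.76e-03, 8.36e-01, 1.60e-01, 8.89e-04, 2.75e-04])

# Expected value of the binned distribution
point_estimate = float(proba @ midpoints)
print(round(point_estimate))
```

This recovers something comparable to the regressor’s output while retaining the full distribution for anyone who wants the context.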

In this post, I argued for the benefits of transforming a regression problem into a classification problem by means of a custom function that maps floating-point values into classes described by integers. Each class represents a range of values, i.e., a fixed support. As a result, instead of a model that pinpoints a value given a sample, we get one that provides a set of probabilities over the fixed support.

The advantage of a distributive model is that it provides more context for decision-making; the disadvantage is that it does not supply one single answer. One cannot have it all.

Do you have an application where this solution might be useful?

