
How To Stratify Data in Machine Learning Projects to Significantly Improve Model Performance | by Graham Harrison | Jun, 2022


How and when to stratify data in machine learning projects to ensure that predictions are accurate and meaningful using just 7 lines of Python code

Photo by Emery Muhozi on Unsplash

Background

I recently worked on a real-world machine learning project which initially produced a set of predictions that were rejected by the domain experts because they could not accept that the future would out-turn in the way the model was predicting.

The causes of the problem revolved around the changing nature of the aspect of the business represented by the data over time i.e. the future was not going to out-turn like the past.

A shift like this can leave a machine learning model too inaccurate to be meaningful. Eventually a solution was found by stratifying the data, which prompted me to write this article and share the approach so that other data scientists who encounter a similar problem can use it.

Getting Started

The first thing we need is some data. The real-world data used in the project cannot be shared so I have created a fictitious dataset from scratch using the Faker library. This means that there are no license restrictions on the data and it may be used or re-used for learning and development purposes.

Let’s start by importing the libraries we need and then reading in the fictitious data sets …

Image by Author

CompletedCustomerLoans.xlsx represents fictitious loans that have completed in one of two statuses or classifications stored in the LoanStatus target feature. “Repaid” loans completed successfully and were repaid by the customer and “Defaulted” loans were not repaid and completed unsuccessfully.

LiveCustomerLoans.xlsx has exactly the same format but does not have a LoanStatus as they are still live and yet to complete, either successfully or unsuccessfully.
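The loading code was published as an image and is not reproduced here. As a stand-in, the following minimal sketch builds two toy DataFrames with the shape described above. Only LoanType and LoanStatus come from the article; the helper make_toy_loans, the LoanAmount column, the row counts and the proportions are all invented for illustration. With the real files, pd.read_excel("CompletedCustomerLoans.xlsx") and pd.read_excel("LiveCustomerLoans.xlsx") would presumably take the place of the toy builder.

```python
import numpy as np
import pandas as pd

LOAN_TYPES = ["Bronze", "Silver", "Gold", "Platinum"]


def make_toy_loans(n_rows, type_probs, with_status, seed):
    """Build a toy loans table; every value is synthetic (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "LoanType": rng.choice(LOAN_TYPES, size=n_rows, p=type_probs),
        # Hypothetical feature, not taken from the article's data.
        "LoanAmount": rng.integers(1_000, 50_000, size=n_rows),
    })
    if with_status:
        # Only completed loans carry the target feature.
        df["LoanStatus"] = rng.choice(["Repaid", "Defaulted"], size=n_rows, p=[0.8, 0.2])
    return df


# Completed loans: mostly Bronze / Silver / Gold, with a known outcome.
df_completed = make_toy_loans(2_000, [0.4, 0.3, 0.2, 0.1], with_status=True, seed=1)
# Live loans: mostly Platinum, with no outcome yet.
df_live = make_toy_loans(1_000, [0.05, 0.10, 0.05, 0.80], with_status=False, seed=2)

print(df_completed.head())
print(df_live.head())
```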

Understanding the Problem

A good use of machine learning would be to train a model on the completed loans that could predict what the future loan status will be for the live loans.

However, we need two things for a model to work in this way –

  • A set of reliable, accurate and consistent historic data
  • For the future to broadly turn out in a similar way to the past

The second factor is where we have some problems with this data set. An important factor in this data is the LoanType which can be Bronze, Silver, Gold or Platinum.

The way in which the makeup of loans is set to change over time can be visualised as follows –

Image by Author
Image by Author

These charts highlight the problem. Historic / completed loans have a high instance of Bronze, Silver and Gold loans but the live / uncompleted loans do not. It looks like our fictitious loans business is switching all future customers into Platinum loans and discontinuing the other loan types.
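The same comparison can be made numerically. The sketch below is a minimal illustration on tiny hand-made data (the variable names completed, live and mix are assumptions, not the article's): it puts the LoanType proportions of both datasets side by side.

```python
import pandas as pd

# Tiny hand-made stand-ins for the completed and live loans.
completed = pd.DataFrame(
    {"LoanType": ["Bronze"] * 4 + ["Silver"] * 3 + ["Gold"] * 2 + ["Platinum"]}
)
live = pd.DataFrame({"LoanType": ["Platinum"] * 8 + ["Silver"] * 2})

# Share of each loan type per dataset; types missing from one side become 0.
mix = pd.concat(
    {
        "completed": completed["LoanType"].value_counts(normalize=True),
        "live": live["LoanType"].value_counts(normalize=True),
    },
    axis=1,
).fillna(0.0)

print(mix)
```

A large gap between the two columns for any loan type is the signal that the training data may need to be stratified.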

If the performance of loans in terms of Repaid vs. Defaulted differs for the different loan types then the machine learning model will produce skewed predictions.

If you have a situation like this you may need to stratify the training data based on the proportions found in the live data to improve the accuracy of the predictions.

The remainder of this article shows how this can be done in fewer than 5 lines of Python code …

Implementing the Solution (The First 5 Lines of Code)

A re-usable function to stratify a dataset based on the proportions of any combination of features found in another dataset can be implemented as follows –

Image by Author
Image by Author
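The listing itself was published as an image and is not reproduced here. The following is a sketch of the same idea, not the author's code: it is named stratify_sketch to mark it as hypothetical, and it uses an explicit loop over the groups instead of the compact groupby and lambda form discussed in the walkthrough, which sidesteps differences in how pandas versions label groups. The parameter names follow the ones the article mentions (population, sample, stratify_cols, random_state).

```python
import pandas as pd


def stratify_sketch(population, sample, stratify_cols, random_state=42):
    """Resample `sample` so its mix over `stratify_cols` matches `population`."""
    # Target number of rows per stratum: population share x size of the sample.
    shares = population[stratify_cols].value_counts(normalize=True)
    targets = {
        (key if isinstance(key, tuple) else (key,)): int(round(share * len(sample)))
        for key, share in shares.items()
    }

    parts = []
    for key, group in sample.groupby(stratify_cols):
        key = key if isinstance(key, tuple) else (key,)
        n = targets.get(key, 0)  # strata absent from the population get no rows
        if n > 0:
            # replace=True lets a small stratum be drawn more often than it has rows.
            parts.append(group.sample(n=n, replace=True, random_state=random_state))

    # Original row labels are kept, so the result has a simple (non-Multi) index.
    return pd.concat(parts) if parts else sample.iloc[0:0]


# Toy usage: the population is 80% Platinum and 20% Silver.
population = pd.DataFrame({"LoanType": ["Platinum"] * 8 + ["Silver"] * 2})
sample = pd.DataFrame({
    "LoanType": ["Bronze"] * 5 + ["Silver"] * 3 + ["Platinum"] * 2,
    "LoanStatus": ["Repaid", "Defaulted"] * 5,
})
result = stratify_sketch(population, sample, ["LoanType"])
print(result["LoanType"].value_counts())
```

Because each target is rounded independently, the total number of rows returned can differ slightly from the size of the original sample.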

Understanding the Solution

The stratify function achieves its purpose in just 3 lines of code which is the Pythonic way but reducing the number of lines of code can decrease the readability so let’s pick it apart to understand what is going on …

This is the first line of code …

stratify_cols contains a list of data fields to use to construct the weights.

The return type of this line of code is a Pandas Series where the index contains the distinct values found in the population data and the value contains the proportion of rows found with this value.

normalize=True tells value_counts to return proportions (fractions that add up to 1) instead of actual counts. These values are then multiplied by the total number of records found in the sample data (sample.shape[0]) to arrive at the number of records required in the sample dataset to reflect the proportions found in the population.

<class 'pandas.core.series.Series'>
MultiIndex([('Platinum',),
            (  'Silver',),
            (  'Bronze',),
            (    'Gold',)],
           names=['LoanType'])
LoanType
Platinum    2032
Silver       367
Bronze       170
Gold         170
dtype: int32
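The weight calculation described above can be sketched on toy data as follows. The rounding and integer conversion are assumptions inferred from the integer output shown; the data is invented.

```python
import pandas as pd

population = pd.DataFrame({"LoanType": ["Platinum"] * 6 + ["Silver"] * 3 + ["Gold"]})
sample = pd.DataFrame({"LoanType": ["Bronze"] * 50})  # only its row count matters here
stratify_cols = ["LoanType"]

# Proportion of each loan type in the population, scaled to the sample size
# and rounded to whole rows (rounding is an assumption, see lead-in).
weights = (
    population[stratify_cols].value_counts(normalize=True) * sample.shape[0]
).round().astype(int)

print(type(weights))
print(weights)
```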

Here is the next line of code. It is doing quite a lot which I will explain in stages …

Let’s start at the beginning. sample.groupby(stratify_cols) returns a DataFrameGroupBy object with a group for each unique combination found in stratify_cols. This can be visualised as follows …

Bronze
Gold
Platinum
Silver

group.sample(n=weights[group.name]) then uses the group name to index the weights Series and extract the number of records to sample …

170

The last part acknowledges that there may be values in the sample data that are not found in the population data, so the inline if statement samples zero rows for any group that matches this condition –

group.sample(n=weights[group.name] if group.name in weights else 0

The parameter replace=True in the sample puts each sampled row back into the data so it can be resampled if necessary and random_state is set to enable reproducible results.
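A minimal sketch of why both parameters matter, using an invented three-row group: without replacement pandas cannot draw more rows than the group holds, and a fixed random_state makes repeated draws identical.

```python
import pandas as pd

group = pd.DataFrame(
    {"LoanType": ["Gold"] * 3, "LoanAmount": [100, 200, 300]},
    index=[10, 11, 12],
)

# Asking for 5 rows from a 3-row group only works with replacement.
first = group.sample(n=5, replace=True, random_state=42)
second = group.sample(n=5, replace=True, random_state=42)

print(first)
print(first.equals(second))  # same seed, same rows

# Without replacement the same request is rejected.
try:
    group.sample(n=5, replace=False, random_state=42)
    oversample_rejected = False
except ValueError:
    oversample_rejected = True
print(oversample_rejected)
```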

Lastly, the samples returned by the lambda function for each group are joined together to produce a single DataFrame.

To explain the 3rd and final line of code let’s take a quick look at df_return

Image by Author

This does not look quite right. Pandas has returned a MultiIndex and this needs to be made to look like the simple index that was found in the original data without losing the index values.

This took quite a bit of research and documentation reading, but it turns out that the index can be restored to its original format by dropping the levels of the MultiIndex that relate to the features used to build it in the first place –

Image by Author
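The index clean-up can be illustrated in isolation. In this sketch the grouped result is imitated with pd.concat (an assumption made to keep the example independent of pandas version), giving a two-level index whose outer level is the grouping key and whose inner level is the original row label. Dropping the outer level restores the original labels.

```python
import pandas as pd

bronze = pd.DataFrame({"LoanType": ["Bronze"] * 2, "LoanAmount": [100, 200]}, index=[3, 7])
gold = pd.DataFrame({"LoanType": ["Gold"], "LoanAmount": [300]}, index=[5])

# Stand-in for the grouped result: outer level = grouping key,
# inner level = original row label.
df_return = pd.concat({"Bronze": bronze, "Gold": gold}, names=["LoanType"])
levels_before = df_return.index.nlevels
print(levels_before)

# Drop the level(s) named after the stratify columns, keeping the original labels.
df_return.index = df_return.index.droplevel(["LoanType"])
print(df_return.index.tolist())
```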

Now we have a function that can stratify any dataset by the proportions found in another dataset but what if we simply know the desired proportions but we do not have a population dataset?

This can be implemented with just one additional line of code in the build_weights function and a small change to stratify to accept the weights as an optional parameter.

Please note that sample_size is set to 0 if values contains the number of rows for each entry. If instead values contains a set of floating point numbers that add up to 1 (i.e. proportions), sample_size contains the total number of records, which is multiplied by each proportion to give the number of rows to sample.

The single line of code in build_weights is fairly self-explanatory. It uses inline if statements to pass the appropriate values for the data and index parameters. It did take a fair bit of research to find out how to construct a MultiIndex for the instance where there is more than one level but the finished code is very clean …

Bronze      1000
Silver       500
Gold         250
Platinum     100
dtype: int64

or …

Bronze      1000
Silver       500
Gold         250
Platinum     100
dtype: int64
Image by Author
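Since the build_weights listing is not reproduced here, the following is a hedged sketch of what such a helper might look like, based only on the description above. The name build_weights_sketch, the parameter order and the "Region" column in the usage example are assumptions; the multi-line body is written for readability rather than to match the single line the article describes.

```python
import pandas as pd


def build_weights_sketch(stratify_cols, index_values, values, sample_size=0):
    """Build a Series of row counts per stratum (hypothetical helper)."""
    # sample_size == 0: `values` already holds row counts.
    # sample_size > 0: `values` holds proportions to scale up to row counts.
    counts = [
        int(v) if sample_size == 0 else int(round(v * sample_size))
        for v in values
    ]
    if len(stratify_cols) == 1:
        # One stratify column: a plain index of its values is enough.
        index = pd.Index(index_values, name=stratify_cols[0])
    else:
        # Several columns: each entry of index_values is a tuple, one item per column.
        index = pd.MultiIndex.from_tuples(index_values, names=stratify_cols)
    return pd.Series(counts, index=index)


loan_types = ["Bronze", "Silver", "Gold", "Platinum"]

# Weights given directly as row counts.
by_count = build_weights_sketch(["LoanType"], loan_types, [1000, 500, 250, 100])
# Weights given as proportions of a 2,000-row sample.
by_share = build_weights_sketch(["LoanType"], loan_types, [0.5, 0.25, 0.15, 0.1], sample_size=2000)
# Two stratify columns ("Region" is invented for this example).
two_level = build_weights_sketch(
    ["LoanType", "Region"], [("Gold", "North"), ("Gold", "South")], [3, 1]
)

print(by_count)
print(by_share)
print(two_level)
```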

Conclusion

This article has made the case for when stratifying data can significantly improve the performance of a machine learning model. The case was presented by citing a real-world project and then demonstrated using fictitious, synthetic data.

A simple function has been developed in just 3 lines of Python code that can stratify any dataset given the proportions found in another dataset, thereby preparing data for modelling that matches the conditions described.

Finally an additional function has been provided that can build the weights to pass into the function that performs the stratify to meet the case where the data scientist knows what the proportions should be but does not have access to a dataset that contains the right proportions.

Thank you for reading!

If you enjoyed reading this article, why not check out my other articles at https://grahamharrison-86487.medium.com/? Also, I would love to hear from you to get your thoughts on this piece, any of my other articles or anything else related to data science and data analytics.

If you would like to get in touch to discuss any of these topics please look me up on LinkedIn — https://www.linkedin.com/in/grahamharrison1 or feel free to e-mail me at [email protected].

If you would like to support the author and thousands of others who contribute to article writing world-wide by subscribing, please use the following link (note: the author will receive a proportion of the fees if you sign up using this link at no extra cost to you).

