
Which Feature Engineering Techniques improve Machine Learning Predictions? | by Suhas Maddali | Nov, 2022



Understanding the various feature engineering techniques can be handy for an ML practitioner. After all, features are one of the most important factors determining how machine learning and deep learning models perform on real-time data.

Photo by Alain Pham on Unsplash

When it comes to machine learning, one of the most effective ways to improve model predictions is to choose the right features and remove the ones that have a negligible effect on performance. Selecting the right features is therefore one of the most important steps for a data scientist or machine learning engineer, especially when building intricate models that must generalize well on the test data set.

Consider, for example, the task of predicting whether a person is going to suffer from heart disease. One of the strongest indicators would be the body mass index (BMI). Failing to include this feature in our dataset, whether for this task or for a related one such as predicting a person's blood pressure (BP), can lead to less accurate results. In both cases, BMI can be a strong indicator of these medical conditions, so it is important to include it as a feature with a strong impact on the outcome.

Consider another case study: predicting whether a person is going to default on a loan. Before lending to a person, the bank would ask a set of questions about their salary, net worth, and credit history. If we gave a human the task of deciding whether a person should be given a loan based on factors such as these, he or she would go over the total salary and the overall credit history.

Similarly, when the data is given to ML models in the same way it is given to a human, they learn the representations needed to decide whether a person will pay back a loan. If we removed a feature such as salary, the model would be missing a key piece of information needed to make that decision, and its predictions would suffer because one of the most important features is absent from the data. This highlights the importance of having the right features for our machine learning and deep learning models to perform well on the test set and on real-time data.

Various Featurization Techniques in Machine Learning

Now that we know how important choosing the right features is to the predictive quality of our models, let us look at various featurization techniques that aid our model predictions and improve their results.

Imputation

Imputation is the process of filling in missing values in the data. Many datasets found on the internet, such as toy datasets, contain almost all features and labels without anomalies or missing data. This is far from true in real life, however, as most real-world data contains missing values. Specific steps must therefore be taken to fill in the values that are missing.

There are various methods by which we can perform imputation. We might fill the missing values with the mean of the feature, or use other methods such as median imputation or mode imputation. By applying these methods, we end up with data that no longer contains missing values.

If we are predicting whether a person will default on a loan, salary would be one of the important features for our machine learning model. However, salary information might not be present for every participant in our data. One reasonable approach is to impute those missing values with the mean of the salary feature.
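As a quick illustration, below is a minimal sketch of mean imputation in Python, assuming a hypothetical salary column; scikit-learn's SimpleImputer is one common way to do it, and a pandas fillna call would work just as well.

```python
# Minimal sketch of mean imputation; the salary values are made up.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"salary": [48000.0, 61000.0, None, 75000.0, None, 52000.0]})

# SimpleImputer replaces missing entries with the column mean;
# strategy="median" or "most_frequent" would give median/mode imputation.
imputer = SimpleImputer(strategy="mean")
df["salary"] = imputer.fit_transform(df[["salary"]]).ravel()

print(df["salary"].tolist())
```

Fitting the imputer on the training data and reusing it with transform on the test data keeps both splits filled with the same training mean.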

Scaling

We give our models a set of features, from which they determine the best ones to use to predict the outcome or target variable. It should be noted, however, that these features can be on very different scales when we initially receive the data.

Take, for example, features used to determine house prices, such as the number of bedrooms and the interest rate. We cannot compare the two features directly: the number of bedrooms is a small count, while the interest rate is expressed on an entirely different scale. If we gave this raw data to our ML models, they might simply treat the feature with the larger numeric values as more important than the other. As we have seen above, this is far from true. It is therefore important to scale the features before giving them to the models for prediction.

Normalization

Normalization is one way of performing scaling: for each individual feature, the maximum and minimum values are taken and used to transform the remaining values, x_scaled = (x - min) / (max - min), so that the feature has a minimum of 0 and a maximum of 1. This helps our models produce better results and good predictions.

Take the example of predicting whether a customer will churn (leave) or stay with an internet service: monthly charges and tenure are two important features. Monthly charges are in dollars ($), while tenure is measured in months or years. Since they are on different scales, normalization can be quite handy in this scenario and helps us get the best model predictions.
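As a sketch, the snippet below applies min-max normalization to two hypothetical churn columns; the column names and values are illustrative assumptions, not real data.

```python
# Minimal sketch of min-max normalization for a hypothetical churn dataset.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "monthly_charges": [29.85, 56.95, 104.80, 42.30],  # dollars
    "tenure": [1, 34, 72, 12],                          # months
})

# MinMaxScaler maps each column to the [0, 1] range using its own min and max.
scaler = MinMaxScaler()
df[["monthly_charges", "tenure"]] = scaler.fit_transform(
    df[["monthly_charges", "tenure"]]
)

print(df)
```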

Standardization

Standardization is similar to normalization, except that the data is transformed so that each individual feature has zero mean and unit variance. As we have already seen, features on different scales can confuse the model into assuming that one feature is more important than another simply because of the magnitude of its values; standardization helps ensure that we get the best possible predictions. It is therefore a step often taken by machine learning practitioners.

When predicting the prices of cars, we take into account features such as the number of cylinders and the mileage. Since these two features are not on a similar scale, we perform standardization to put them on common ground before giving them to the models for prediction.
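A minimal sketch of standardization for this car-price example, assuming made-up cylinder counts and mileage figures:

```python
# Minimal sketch of standardization; the feature values are illustrative.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "cylinders": [4, 6, 8, 4, 6],
    "mileage": [42000, 15500, 88000, 23000, 61000],
})

# StandardScaler subtracts the column mean and divides by the column
# standard deviation, giving zero mean and unit variance per feature.
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

print(scaled.round(2))
```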

One Hot Encoding

Imagine a scenario where there are a large number of categorical features in our data, such as countries, states, or names. These features record only the occurrence of a category and do not have a numerical representation.

For our ML models to work well and make use of this data, categorical features such as these must be converted into numerical features so that the models can perform their computations. One-hot encoding is one way of doing this conversion.

How is this actually done? The algorithm turns each category of a feature into its own column and marks the presence or absence of that category with a 1 or a 0: the value is 1 if the category is present in a given row and 0 otherwise.
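A minimal sketch of one-hot encoding with pandas, assuming a hypothetical country column:

```python
# Minimal sketch of one-hot encoding; the country values are made up.
import pandas as pd

df = pd.DataFrame({"country": ["India", "USA", "Germany", "USA"]})

# Each category becomes its own 0/1 column
# (country_Germany, country_India, country_USA).
encoded = pd.get_dummies(df, columns=["country"], dtype=int)
print(encoded)
```

scikit-learn's OneHotEncoder achieves the same result and fits more naturally into a preprocessing pipeline applied to both training and test data.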

Response Coding

Response coding is another method that, like one-hot encoding, works with categorical data. However, the procedure by which it converts categorical features into numerical features is different.

In response coding, we are mostly interested in the mean value of the target per category. Take the case of determining housing prices: to predict house prices across various localities, we group the data by locality and find the mean house price for each one. We then replace the locality, previously a categorical feature, with that mean house price. As a result, the model can learn how much impact a neighborhood has on housing prices. Response coding can be quite handy in this scenario.

Consider the problem of predicting car prices, where the cars may be SUVs or sedans and the price is partly determined by the car type. Response coding converts this categorical feature by taking the mean price of the SUVs alone and of the sedans alone: if the car type is SUV, we replace it with the mean price of the SUV segment, and if it is sedan, with the mean price of the sedan segment.
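A minimal sketch of this mean-target flavour of response coding, with made-up car types and prices:

```python
# Minimal sketch of response (mean-target) coding; values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "car_type": ["SUV", "Sedan", "SUV", "Sedan", "SUV"],
    "price": [42000, 27000, 51000, 24000, 46000],
})

# Mean price per category, ideally computed on the training split only
# so that the target does not leak into the test features.
category_means = df.groupby("car_type")["price"].mean()

# Replace each category with its mean target value.
df["car_type_encoded"] = df["car_type"].map(category_means)
print(df)
```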

Handling Outliers

Outliers are data points that are considered anomalies in the data. It is important to note that some outliers can be useful and help the model determine the outcome correctly. However, if there are a large number of outliers, they can skew the model toward predicting well for outliers without generalizing well to real-time data. We therefore have to take the right steps to remove them before training the models and putting them into production.

There are various methods for removing outliers from the data. One of them is to compute the mean and standard deviation of each feature: data points that lie more than 3 standard deviations above or below the mean can be classified as outliers and removed so that they do not affect the machine learning model's predictions.

Taking again the problem of whether a person will default on a loan, the data might include the person's salary. Salary information is not always accurate, and this feature can contain quite a lot of outliers. Training our ML model on such data can lead to poor performance on the test set or on unseen data, so a good step is to remove the outliers before giving the data to the model. This can be done by computing the standard deviation of the salaries and removing values that lie more than 3 standard deviations above or below the mean, so that the model makes robust predictions.
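A minimal sketch of the three-standard-deviation rule on a hypothetical salary column containing one extreme value:

```python
# Minimal sketch of outlier removal via the 3-standard-deviation rule.
import pandas as pd

salaries = [48000, 52000, 55000, 58000, 60000, 61000, 63000, 65000,
            67000, 70000, 72000, 75000, 78000, 80000, 82000, 950000]
df = pd.DataFrame({"salary": salaries})

mean, std = df["salary"].mean(), df["salary"].std()

# Keep only rows within 3 standard deviations of the mean.
mask = (df["salary"] - mean).abs() <= 3 * std
filtered = df[mask]

print(len(df), "->", len(filtered))  # the extreme salary is dropped
```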

Log Transformation

Log transformation is a technique that can be used when the data is heavily skewed, i.e. when a large number of values are concentrated in a particular region while a few outliers and data points lie far from the mean. In that situation, there is a higher chance that our model will fail to capture this complex relationship.

We therefore use the log transformation to reduce the skewness, so that the model is more robust to outliers and generalizes well to real-time data. Log transformation can be a handy feature engineering technique that boosts the performance of ML models.

Returning to the problem of predicting whether a person will default on a loan, we can apply a log transformation to salaries, since salary information is generally heavily skewed: a large number of people (around 80 percent) earn modest salaries while a small set of people (around 20 percent) earn very large amounts. Much of this skew can be removed with a log transformation.
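A minimal sketch of a log transformation on a skewed, hypothetical salary column:

```python
# Minimal sketch of a log transformation; the salary values are made up.
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [30000, 35000, 42000, 50000, 65000, 400000, 1200000]})

# log1p computes log(1 + x), which compresses the long right tail
# and handles zero values gracefully.
df["salary_log"] = np.log1p(df["salary"])

print(df.round(2))
```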

Conclusion

After going through this article, I believe you have a better understanding of the various feature engineering techniques that are important for your machine learning models. Using the best feature engineering techniques at the right time can be truly handy and can generate valuable predictions for companies that use artificial intelligence.

If you would like to get more updates about my latest articles and have unlimited access to Medium articles for just 5 dollars per month, feel free to use the link below to support my work. Thanks.

https://suhas-maddali007.medium.com/membership

Below are the ways where you could contact me or take a look at my work.

GitHub: suhasmaddali (Suhas Maddali ) (github.com)

YouTube: https://www.youtube.com/channel/UCymdyoyJBC_i7QVfbrIs-4Q

LinkedIn: Suhas Maddali, Northeastern University, Data Science | LinkedIn

Medium: Suhas Maddali — Medium


