
Boost Machine Learning Model Performance through Effective Feature Engineering Techniques | by Suhas Maddali | Feb, 2023



Photo by Tierra Mallorca on Unsplash

Machine learning and data science are used across a large number of industries, and finance is one of the most popular application areas. Many companies are trying to automate tasks ranging from deciding whether to grant a loan to a borrower to determining whether a transaction is fraudulent. In addition, areas such as customer segmentation and credit scoring are used in finance to learn the behavioral traits of customers and determine their overall creditworthiness.

While the technology is impressive at automating a wide variety of tasks, failing to perform the right feature engineering on the dataset used to train ML models can often lead to poor performance on the test set (data not seen by the models). Therefore, effective and efficient feature engineering strategies must be applied to ensure that the models perform well during both the testing and production phases.

This article focuses on applying a machine learning model (XGBoost) to a credit card fraud detection dataset and measuring the difference in model performance before and after feature engineering. In this way, we get to see how much feature engineering contributes to getting the best predictions from the model. We will follow a sequence of steps: reading the data, performing exploratory data analysis (EDA), training an ML model without feature engineering, and finally performing feature engineering to observe the resulting improvement in model performance. Let us now go over each of these steps and highlight some of the key insights along the way.

The first step is to read the dataset used for the fraud analysis. Most of the time, data is stored in CSV format, and the Python library pandas is the standard tool for reading CSV files. We will use this library to read the data as shown below.

Note: The dataset was downloaded from https://www.kaggle.com/datasets/kartik2112/fraud-detection under CC0: Public Domain license
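A minimal sketch of the reading step is shown below. The file names assume the two CSVs shipped with the Kaggle dataset (a training split and a test split); adjust the paths to match your download.

```python
import pandas as pd

# Read the training and test splits of the fraud dataset.
# The file names are assumptions based on the Kaggle download;
# adjust them to match the files on your machine.
train_df = pd.read_csv("fraudTrain.csv")
test_df = pd.read_csv("fraudTest.csv")

print(train_df.shape, test_df.shape)
train_df.head()
```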

Now that the data is read, we take a look at the list of columns and their non-null values. Columns such as ‘Unnamed: 0’ and ‘trans_num’ do not add much information for a machine learning model deciding whether a transaction is fraudulent. In the subsequent sections, we take steps to remove features that are not important, as sketched below.
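A quick way to inspect the columns and drop the uninformative ones is sketched below; the list of columns to drop is an assumption based on the discussion above rather than the author's exact code.

```python
# Inspect column names, non-null counts, and data types.
train_df.info()

# Columns that carry no predictive signal for the fraud label
# (identifiers and the leftover index column).
drop_cols = ["Unnamed: 0", "trans_num"]
train_df = train_df.drop(columns=drop_cols)
test_df = test_df.drop(columns=drop_cols)
```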

Data Information (Image by Author)

As shown in the output, there are no missing values, and the features have different data types, which must be kept in mind when performing feature engineering.

Exploratory data analysis helps us understand and analyze the data more thoroughly. As a result, we can find missing values and outliers in the data; failing to handle them can skew the results and cause a significant drop in ML model performance. It can also introduce bias, as the model would try to learn too much from these data points and lose its ability to generalize. The code sketches and plots below walk through the main steps of the exploratory data analysis (EDA).
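A first step is to look at the class balance of the target. The sketch below assumes ‘is_fraud’ is the label column, as in this dataset.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Class balance of the target variable.
print(train_df["is_fraud"].value_counts())

sns.countplot(x="is_fraud", data=train_df)
plt.title("Fraud vs. Non-Fraud Transactions")
plt.show()
```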

Countplot of Fraud Vs. Non-Fraud (Image by Author)

It can be seen from the plot that there are far more non-fraudulent transactions than fraudulent ones. This reflects real life, where credit card fraud is rare compared to the large volume of legitimate transactions.

State Countplot (Image by Author)

We see that the largest number of transactions comes from the state of Texas (TX), followed by New York (NY) and other states. When doing feature engineering, we can compute the average transaction amount for each state to help determine whether a transaction is fraudulent. Other features, such as the minimum and maximum transaction amounts per state, could also be added to improve ML model performance.
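The average transaction amount per state can be obtained with a simple groupby, as in the sketch below, which is what the following plot is based on.

```python
import matplotlib.pyplot as plt

# Average transaction amount per state, sorted in descending order.
avg_amt_by_state = (
    train_df.groupby("state")["amt"]
    .mean()
    .sort_values(ascending=False)
)

avg_amt_by_state.plot(kind="bar", figsize=(14, 4))
plt.ylabel("Average transaction amount ($)")
plt.show()
```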

Average transaction amount per state plot (Image by Author)

After grouping the data by state, computing the average transaction amount, and sorting the result, the state of Delaware stands out with a significantly higher average transaction amount. This points to an anomaly in the data for Delaware. Therefore, steps must be taken to handle categories that contain significantly higher or lower values.

Categories for fraudulent transaction plot (Image by Author)

From the plot, it is evident that fraudulent transactions occur most often in the grocery category. In other words, fraud is frequently disguised as a grocery purchase. This is useful for feature engineering, as it helps the ML model estimate the likelihood of fraud based on the type of purchase.

Categories for non-fraudulent transaction plot (Image by Author)

Non-fraudulent transactions, on the other hand, occur most often in the ‘gas_transport’ category, followed by ‘home’ and other categories. Looking at the two plots together, we can maintain a count of fraudulent and non-fraudulent transactions for each category, which gives the ML model a good sense of the categories through which fraud is typically carried out; a sketch of how these counts can be computed follows.
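The per-category counts behind the two plots can be obtained directly from the label and category columns, for example:

```python
# Most frequent purchase categories among fraudulent and non-fraudulent
# transactions, matching the two plots discussed above.
fraud_counts = train_df.loc[train_df["is_fraud"] == 1, "category"].value_counts()
nonfraud_counts = train_df.loc[train_df["is_fraud"] == 0, "category"].value_counts()

print(fraud_counts.head())
print(nonfraud_counts.head())
```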

Heatmap of correlation (Image by Author)

The correlation plot shows that the fraud label is most strongly correlated with the transaction amount. This feature could therefore be quite useful when predicting whether a transaction is fraudulent.
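A heatmap like the one above can be generated over the numeric columns of the dataframe, as in this sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over the numeric columns only.
numeric_corr = train_df.select_dtypes(include="number").corr()

plt.figure(figsize=(10, 8))
sns.heatmap(numeric_corr, cmap="coolwarm")
plt.show()
```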

Now that we have explored and understood the data, it is time to move on to the essential feature engineering steps. Since some of the columns contain string values, we cannot feed them to our models directly. Therefore, we apply encoding strategies such as one-hot encoding to convert these categorical values into numerical features.

After the categorical features are converted to numerical values, the original string-valued columns are removed.

We then stack the encoded values alongside the remaining features for both the training and test sets, as sketched below. Note that there are other encoding strategies, such as TF-IDF and Word2Vec, but for now we use a simple bag-of-words (one-hot) representation for the categorical features.

Examining the shape of the data, we can see the additional columns added as a result of one-hot encoding the categorical features.
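The author's exact encoding code is not reproduced here; the sketch below uses scikit-learn's OneHotEncoder to illustrate the same pipeline, and the list of columns to encode is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categorical columns to one-hot encode (an illustrative choice).
cat_cols = ["category", "gender", "state"]

# On scikit-learn < 1.2, use sparse=False instead of sparse_output=False.
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
train_encoded = encoder.fit_transform(train_df[cat_cols])  # fit on train only
test_encoded = encoder.transform(test_df[cat_cols])

# Drop the string columns and keep the numeric features.
train_num = train_df.drop(columns=cat_cols).select_dtypes(include="number")
test_num = test_df.drop(columns=cat_cols).select_dtypes(include="number")

# Stack the numeric features with the encoded categories.
X_train = np.hstack([train_num.drop(columns="is_fraud").values, train_encoded])
X_test = np.hstack([test_num.drop(columns="is_fraud").values, test_encoded])
y_train = train_num["is_fraud"].values
y_test = test_num["is_fraud"].values

print(X_train.shape, X_test.shape)  # extra columns from the one-hot encoding
```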

Now that the categorical features have been converted to numerical features and appended to the original dataset, it is time to apply a machine learning model (XGBoost) to the problem of predicting whether a transaction is fraudulent.

To train the model, we first initialize it and then call its ‘.fit’ method with the input features and the output labels. The model is trained with its default hyperparameters.

Predictions generated on the test set are then compared against the actual labels that were set aside for evaluating the model's performance.
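A minimal sketch of the training and evaluation step with default hyperparameters:

```python
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

# Initialize and train an XGBoost classifier with default hyperparameters.
model = XGBClassifier()
model.fit(X_train, y_train)

# Compare predictions on the held-out test set with the actual labels.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```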

Classification Report (Image by Author)

The report shows high scores for the non-fraudulent class, whereas for the fraudulent class the overall F1-score is about 0.75. We could take steps to improve this, for example by increasing the number of fraudulent samples using various methods, but for now the algorithm does a decent job with its predictions.

There is a list of feature engineering strategies we can explore to improve model performance. The most common approaches are standardization and normalization. These bring all features to a similar scale so that the ML algorithm does not prioritize features simply because they have a larger scale or higher standard deviation.

In addition, more features can be generated from the existing ones to improve model performance. In this way, the model learns important representations that help it determine whether a transaction is fraudulent. Let us go over each of these steps in greater detail in this section.

Scaling

Scaling is an operation in which input features measured on different scales are converted to a common scale so that all features are weighted equally. In this dataset, the features ‘city_pop’ and ‘amt’ have different scales: ‘city_pop’ is a count, while ‘amt’ is measured in dollars ($). Scaling ensures that, after the transformation, features no longer differ in scale.

There are two popular scaling methods: standardization and normalization. Let us apply each of them and compare the performance of the ML model in each case.

Standardization

Standardization rescales each feature so that its values have a mean of 0 and a standard deviation of 1. This is done by subtracting the feature's mean from each data point and dividing the result by the feature's standard deviation.

StandardScaler, from scikit-learn, is a popular class for performing standardization. After importing it, we fit it on the training set and use it to transform both the training and the test set.
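A minimal sketch of the standardization step:

```python
from sklearn.preprocessing import StandardScaler

# Standardization: z = (x - mean) / std, computed column-wise.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # statistics come from the training set
X_test_std = scaler.transform(X_test)        # reuse the training statistics
```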

Normalization

Normalization is another popular option for feature engineering. In this method, the minimum and maximum values of each feature are computed first, and the values are then transformed so that each feature has a minimum of 0 and a maximum of 1.

We apply this min-max normalization so that values lie in the range of 0 to 1, and perform the same operation on the test set. Note that the minimum and maximum values are taken from the training set, not the test set; doing otherwise would lead to data leakage and inflated performance.
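The description above matches min-max scaling; one way to implement it is with scikit-learn's MinMaxScaler, as in the sketch below (an assumption about the implementation, not necessarily the author's exact code).

```python
from sklearn.preprocessing import MinMaxScaler

# Min-max normalization: (x - min) / (max - min), so values fall in [0, 1].
# The minimum and maximum come from the training set only, to avoid leakage.
min_max = MinMaxScaler()
X_train_norm = min_max.fit_transform(X_train)
X_test_norm = min_max.transform(X_test)
```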

Creating New Features

It is now time to add new features that could improve model performance. Adding the features most relevant to the problem helps the most, and domain expertise goes a long way in deriving such features from the existing ones. Based on our knowledge of the problem, let us add a few features that can improve performance.

We add a ‘euclidean_distance’ feature, computed from the latitude and longitude of the merchant and the buyer, as this gives the algorithm a more direct signal than the raw coordinates alone.

In addition, we derive a flag from the transaction amount indicating whether it is a large transaction or not.
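A sketch of these two engineered features is shown below. The coordinate and amount column names follow this dataset, while the $500 threshold for a "large" transaction is an illustrative assumption.

```python
import numpy as np

def add_engineered_features(df, amt_threshold=500):
    """Add a straight-line distance feature and a high-amount flag."""
    df = df.copy()
    # Euclidean distance between the cardholder's and the merchant's
    # coordinates (in degrees; a simple proxy, not a true geographic distance).
    df["euclidean_distance"] = np.sqrt(
        (df["lat"] - df["merch_lat"]) ** 2 + (df["long"] - df["merch_long"]) ** 2
    )
    # Flag transactions with an unusually large amount (illustrative threshold).
    df["large_transaction"] = (df["amt"] > amt_threshold).astype(int)
    return df

train_df = add_engineered_features(train_df)
test_df = add_engineered_features(test_df)
```

After adding these columns, the encoding and stacking step from earlier is re-run so that the new features are included in the model's input.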

Classification Report (Image by Author)

By adding features such as euclidean distance and flagging whether a transaction amount was high or low, we tend to see an improvement in the overall F1-score of the model for the positive class (fraud transactions).

After going through this article, you should have a firm understanding of how feature engineering can improve model performance. While feature engineering had a clear impact on this problem, other problems may not need much feature engineering and may instead require more data or better ML models to predict the target variable. The right feature engineering therefore depends largely on the dataset at hand and the relationship between the features and the target variable.

