
Synthetic Data to Help Fraud Machine Learning Modelling
by Cornellius Yudha Wijaya | Sep 2022



Synthetic data could help mitigate fraud cases

Photo by Maxim Berg on Unsplash

Fraud cases are common in every business industry and cause massive financial loss. Small or big, every business faces the fraud problem whether they like it or not, as long as there are people with bad intentions.

Much effort has gone into machine learning fraud detection research to mitigate the problem, yet there is still no perfect solution. That is understandable: every business has different requirements, and data is constantly evolving.

Even though there is no perfect solution, there are some ways to improve the model. One of the solutions is to use synthetic data. What is synthetic data, and how could it help fraud detection? Let’s get into it.

Synthetic data is data created via computer algorithms and doesn’t exist in the real world. In other words, we can define synthetic data as data produced without being collected directly.
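
For intuition, here is a minimal sketch of the idea (my own toy illustration, not a production recipe): fit a simple distribution to data we actually collected, then sample new, never-observed records from it. The lognormal model for claim-amount-like data is only an assumption for this example.

import numpy as np

rng = np.random.default_rng(42)

# "Real" data we collected: 1,000 claim amounts
real_amounts = rng.lognormal(mean=4.0, sigma=1.0, size=1_000)

# Fit a simple parametric model to the collected data
mu = np.log(real_amounts).mean()
sigma = np.log(real_amounts).std()

# Draw brand-new, never-collected values with the same statistical shape
synthetic_amounts = rng.lognormal(mean=mu, sigma=sigma, size=10_000)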

Synthetic data isn’t new in the data world. However, as technology has progressed, synthetic data has become more important and has affected various industries. To see why, consider several applications of synthetic data in the data science world:

  • Generating enormous amounts of data without any collection effort
  • Creating datasets that reflect real-world situations
  • Overcoming data privacy restrictions
  • Simulating conditions that have not yet occurred
  • Mitigating data imbalance

The list keeps growing as research on synthetic data is still ongoing. The point is that synthetic data is helpful in data science and has a real impact on industry.

Additionally, synthetic data can be classified by how the data are created and stored:

  • Full synthetic data: generated from the original data, but without including any original records. The dataset contains only synthetic data, yet it shares similar statistical properties with the original.
  • Partial synthetic data: a combination of original and synthetic data at the variable level. This category is often used to replace certain variables, such as sensitive ones, with synthetic values (a toy sketch follows this list).
  • Hybrid synthetic data: created from both real and synthetic data. The underlying distributions and relationships between the variables stay intact, but the dataset contains both original and synthetic records (not only synthetic variables).
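
To make the partial category concrete, here is a toy sketch of my own (an illustrative assumption, not a library recipe): keep the non-sensitive columns as they are and replace a sensitive column with values drawn from its fitted distribution.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_real = pd.DataFrame({
    'age': rng.integers(18, 80, size=100),
    'salary': rng.normal(50_000, 10_000, size=100),  # sensitive variable
})

# Partial synthesis: 'age' stays original, 'salary' becomes synthetic
df_partial = df_real.copy()
df_partial['salary'] = rng.normal(df_real['salary'].mean(),
                                  df_real['salary'].std(),
                                  size=len(df_real))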

We have learned what synthetic data is and why it is useful, but how does it help fraud machine learning development? We need to step back and look at the typical fraud dataset.

Fraud is an act of deception to gain profit without a legal process. Every business faces this problem and the potential loss that comes with it. However, the number of fraud cases in a business is inherently far lower than the number of non-fraud cases. Why? Because most people are honest; if the circumstances were reversed, the business would be destroyed.

The success of a fraud prevention data science project comes down to two things: the business strategy and the fraud model.

As data scientists, we need to understand the business, but the business strategy is someone else’s responsibility. Instead, we need to focus on improving our fraud model. How hard is it, though, to develop one?

As mentioned previously, fraud cases rarely happen, but each case can cause a large loss. This means fraud modelling often involves what we call an imbalanced data problem.

In general, an imbalanced dataset causes the prediction model to predict the majority class all the time when there are no strongly discriminating features, which is true in most cases. This leads to a high accuracy metric but poor precision and recall.
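
A quick sketch with scikit-learn shows why accuracy is misleading here: with a 94:6 class split, a "model" that always predicts non-fraud still scores 94% accuracy while catching zero fraud.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 94 non-fraud cases and 6 fraud cases
y_true = np.array([0] * 94 + [1] * 6)

# A "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))                    # 0.94
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0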

So, how does synthetic data relate to the imbalanced data problem? Research has shown synthetic data can help mitigate the imbalance by oversampling the minority class to create a balanced dataset. For example, a paper by Dina et al. (2022) shows that synthetic data generated by CTGAN improved accuracy by 8% compared to the same ML model trained on the unbalanced data.

The best-known strategy for balancing data with synthetic samples is SMOTE, although the technique provides little advantage on complex data. That is why we would try another approach to data synthesis, mainly involving GAN models, as they have proven helpful in increasing ML performance.
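
For reference, here is a minimal SMOTE sketch, assuming the imbalanced-learn package is installed; the toy dataset is an assumption for illustration only.

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy dataset with roughly the same 94:6 imbalance as our fraud data
X, y = make_classification(n_samples=1_000, weights=[0.94, 0.06],
                           random_state=100)

# SMOTE interpolates new minority samples until the classes are balanced
X_res, y_res = SMOTE(random_state=100).fit_resample(X, y)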

While synthetic data has its advantages, be aware that the research is still young, and watch out for these drawbacks when applying synthetic data to the modelling process:

  • With increased data complexity, the generated synthetic data might not represent the real-world population. This would lead the model to learn false patterns and make faulty predictions.
  • Synthetic data quality depends on the dataset used to generate it. Bad original data produces bad synthetic data, which leads the model to produce inaccurate output.

If you understand the risks and weaknesses of using synthetic data in modelling, let’s try a hands-on approach and see how synthetic data helps a fraud model.

For our example, I would use the Vehicle Insurance Claim Fraud Detection dataset provided on Kaggle by Shivam Bansal (License: CC0: Public Domain). The dataset frames a business problem: detecting which customers would commit fraud in their claims.

To make things easier, I would use Pandas-Profiling for the EDA. I have written a more in-depth article about the package if you want to read further. Let’s start by exploring the dataset to see the overall features and the imbalance.

import pandas as pd
from pandas_profiling import ProfileReport

# Load the Kaggle dataset (adjust the file path to your machine)
df = pd.read_csv('fraud_oracle.csv')

profile = ProfileReport(df)
profile

Overall, we have 33 variables and 15,420 observations, with most of the data being categorical. There is no missing data, so we don’t need any missing-data treatment. Let’s check the target variable to see its distribution.


As the profiling summary shows, the target ‘FraudFound_P’ is severely imbalanced: only about 6% of the data, or 923 observations, are fraud cases.
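
If you want to confirm the imbalance without the profiling report, a one-liner does it:

df['FraudFound_P'].value_counts(normalize=True)
# 0    ~0.94
# 1    ~0.06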

For the next part, let’s build a classifier to predict vehicle insurance fraud. I would not use all the columns, and I would apply categorical encoding for training purposes.

# Selecting some columns I assume would be useful
df = df[['AccidentArea', 'Sex', 'MaritalStatus', 'Age', 'Fault',
         'PolicyType', 'VehicleCategory', 'VehiclePrice', 'Deductible',
         'DriverRating', 'Days_Policy_Accident', 'Days_Policy_Claim',
         'PastNumberOfClaims', 'AgeOfVehicle', 'BasePolicy', 'FraudFound_P']]

# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=df.select_dtypes('object').columns,
                    drop_first=True)

After the data cleaning, I would create the training data and train the model.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Stratified split so train and test keep the same fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('FraudFound_P', axis=1), df['FraudFound_P'],
    train_size=0.7, stratify=df['FraudFound_P'], random_state=100)

model = RandomForestClassifier(random_state=100)
model.fit(X_train, y_train)

Using the Random Forest model, let’s first evaluate the baseline fraud model.

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

As we can see from the report above, the model predicts non-fraud almost every time. We expected this, so next we would try adding synthetic data to increase the model performance.
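
To quantify this, a confusion matrix makes the skew explicit (rows are actual classes, columns are predicted):

from sklearn.metrics import confusion_matrix

# Nearly all predictions land in the "non-fraud" column
print(confusion_matrix(y_test, y_pred))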

First, we need to install the package. To make things easier, I would use a model from the ydata-synthetic package. For this example, I would use the Conditional Wasserstein GAN with Gradient Penalty (CWGAN-GP) model to produce the synthetic data, as it is well suited to balancing a dataset.

pip install ydata-synthetic

After the installation, I would set up the dataset for the CWGAN-GP model to train on, creating the synthesized data based only on the train split. Why? Because I want to avoid data leakage: no information from the test dataset should flow into the synthetic data.

X_train_synth = X_train.copy()
X_train_synth['FraudFound_P'] = y_train

As I want to train the synthesizer only on the minority class, I would create a dataset that consists solely of the fraud cases.

X_train_synth_min = X_train_synth[X_train_synth['FraudFound_P'] == 1].copy()

The next step is developing the CWGAN-GP model. Let’s initiate the model with its parameters before we train it.

from ydata_synthetic.synthesizers.regular import CWGANGP
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Model selection
synth_model = CWGANGP

# Parameters of the CWGAN-GP model; you could experiment with these
noise_dim = 61
dim = 128
batch_size = 128
log_step = 100
epochs = 200
learning_rate = 5e-4
beta_1 = 0.5
beta_2 = 0.9
models_dir = './cache'

# Packaging the parameters
gan_args = ModelParameters(batch_size=batch_size, lr=learning_rate,
                           betas=(beta_1, beta_2), noise_dim=noise_dim,
                           layers_dim=dim)
train_args = TrainParameters(epochs=epochs, sample_interval=log_step)

# Initiate the model ('n_clasess' is spelled as in the package version used here)
synthesizer = synth_model(gan_args, n_critic=10, n_clasess=10)

With all the parameters set and the model initiated, we are ready to train the CWGAN-GP model. If it is hard to train on your local laptop, consider moving to Google Colab for more compute.

To train the model, we must specify which columns are numerical and which are categorical. For this example, I would treat all the columns as numerical.

synthesizer.train(data=X_train_synth_min,
                  train_arguments=train_args,
                  num_cols=list(X_train_synth_min.drop('FraudFound_P',
                                                       axis=1).columns),
                  cat_cols=[],
                  label_col='FraudFound_P')

As our fraud data is small, training should not take too long (unless you increase the epochs). Then we can synthesize data with the trained model; for example, I could generate 100,000 samples from it.

import numpy as np

# Condition on the fraud label (1) and draw 100,000 synthetic rows
synth_data = synthesizer.sample(condition=np.array([1]), n_samples=100000)

With the data synthesized, I would top up our previous training data with synthesized rows to balance out the dataset.

# Sample enough synthetic fraud rows to match the majority class
# (the training split has 9,502 more non-fraud rows than fraud rows)
minority_synth_data = synth_data[synth_data['FraudFound_P'] == 1].sample(9502)

X_train_synth_true = pd.concat([X_train_synth, minority_synth_data]).reset_index(drop=True).copy()
X_train_synth_true['FraudFound_P'].value_counts()

As the value counts above show, the previously imbalanced dataset is now balanced thanks to the synthesized data. Let’s see how the model performs when trained on the balanced data.

model.fit(X_train_synth_true.drop('FraudFound_P', axis=1),
          X_train_synth_true['FraudFound_P'])
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

There is a slight increase in model performance with the balanced dataset compared to the original one. The gain is modest because we have not carefully selected features or experimented with other models. Still, this simple example shows that synthesized data can help improve fraud-modelling performance.

