
How to Solve Every ML Problem with Low-Code
by Patrick Brus, Nov 2022



An Introduction to the Python Library PyCaret

Photo by Arnold Francisca on Unsplash

According to Fortune Business Insights [1], the global machine learning (ML) market size is expected to grow from 15 billion USD in 2021 to 209 billion USD in 2029. This means that the demand for data scientists is rising, while the supply of talent remains scarce.

For this reason, a new role was introduced: the citizen data scientist.

The role of a citizen data scientist can be described as follows [2]:

According to Gartner, a citizen data scientist is a person who creates or generates models that leverage predictive or prescriptive analytics, but whose primary job function is outside of the field of statistics and analytics.

A citizen data scientist is therefore expected to solve ML problems without much knowledge in the fields of statistics and analytics.

To enable these people to solve ML problems without having to run through exploratory data analysis, data preprocessing, data cleaning, model selection and all the other required steps, a low-code, end-to-end machine learning and model management tool is required.

This is where the Python library PyCaret comes into play.

In this article, I will introduce the Python library PyCaret to you. This library allows you to train an ML model easily with only a few lines of code!

I will demonstrate the power of this library directly by solving a concrete ML problem and providing the code snippets along the way.

So in case you are interested in becoming a citizen data scientist, or you are already a data scientist and want to learn how to ease the training process, continue reading!

Let’s first dive deeper into PyCaret. The following quote from the official PyCaret documentation summarizes PyCaret and its purpose quite well [3]:

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.

According to the documentation, PyCaret can be used for classification, regression, clustering, anomaly detection and natural language processing.

The PyCaret team also provides tutorials and sample datasets so that users can quickly try out the library.

As this is a Python library, you can easily install it via pip with the following command:

pip install pycaret
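
In case you want the optional dependencies as well, recent PyCaret versions also document an extended install (check the documentation of your installed version):

# install PyCaret together with its optional dependencies
pip install "pycaret[full]"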

PyCaret itself is a Python wrapper around several other ML libraries like scikit-learn, CatBoost, XGBoost and many more.

The library allows you to solve ML problems without needing to be an experienced data scientist, making it perfectly suited for the citizen data scientist.

But it also helps data scientists to solve ML problems quickly, since individual parts of the library can be used on their own. Data scientists, for example, could prepare the data themselves and then use PyCaret to find the optimal model for that data.

Let’s now demonstrate the power and ease of use of PyCaret by solving a concrete ML problem.

You can find the complete notebook containing all the code in my GitHub repository here.

PyCaret offers several datasets that can be used to quickly test the power of the library.

You can simply get all available datasets by running the following command:

from pycaret.datasets import get_data

# list all datasets that ship with PyCaret
all_datasets = get_data("index")

One of the datasets is the credit dataset, which is the one that I am using in this article. The dataset is called “Default of Credit Card Clients Dataset”. You can also find more information on that dataset on Kaggle.

It basically contains information on default payments of credit card clients in Taiwan from April 2005 to September 2005.

You can load that dataset with the following command:

# load credit dataset
credit_data = get_data("credit")

This dataset contains 24,000 samples in total, each described by 23 features and the binary target column “default”.
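
If you want to verify this yourself, a quick sketch like the following prints the shape of the dataframe and the distribution of the target column “default” (the column name used as the target throughout this article):

# inspect the shape and the balance of the target column "default"
print(credit_data.shape)
print(credit_data["default"].value_counts(normalize=True))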

As a first step, I create a hold-out test set. This set is used at the very end to measure the “real” performance of the final model on unseen data.

For this, I am making use of scikit-learn’s train-test split function:

from sklearn.model_selection import train_test_split

# create a train and test split
data_train, data_test = train_test_split(credit_data, test_size=0.1)
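
Note that this snippet neither fixes a random seed nor stratifies on the target. If you want a reproducible split with the same class ratio in both parts, a variation like the following can be used (this is an optional addition on my side, not part of the original setup):

# optional variation: reproducible, stratified split
data_train, data_test = train_test_split(
    credit_data,
    test_size=0.1,
    random_state=42,
    stratify=credit_data["default"]
)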

Setting up Data Pipeline and Environment

The first step in solving an ML problem is always to set up your data pipeline and your environment. Setting up your data pipeline typically includes running some exploratory data analysis to better understand the data, feature engineering to get the most out of your data, data cleaning to handle missing values, and more.

With PyCaret, all these steps are part of the setup() function. This function basically creates all the transformations for preparing the data for training.

It takes in a Pandas dataframe and the name of the target column.

On top of these values, you can add a lot more optional ones for setting up your data pipeline. For example, you can normalize your data, transform it to be more Gaussian like, ignore features with low variance, remove features that have high inter-correlations, choose a data imputation type for imputing missing data, apply dimensionality reduction, remove outliers, compute group feature statistics for creating new features and many more. You can find all input options for the classification use case here.

So let’s now take the dataframe that is meant for training and run the setup() function with the following options:

# import the classification module of PyCaret (needed for setup and the functions used below)
from pycaret.classification import *

# call the pycaret setup function to set up the transformation pipeline
# (assigned to clf_setup instead of setup so that the setup function is not shadowed and can be called again later)
clf_setup = setup(
    data=data_train,
    target="default",
    normalize=True,
    transformation=True,            # make data more gaussian like
    ignore_low_variance=True,       # ignore features with low variance
    remove_multicollinearity=True,  # remove features with inter-correlations higher than 0.9 -> the feature that is less correlated with the target is removed
    group_features=[
        ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'],  # features with related characteristics
        ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
    ],
    use_gpu=True                    # use GPU when available, otherwise CPU
)

And that’s it. By running this function, you apply preprocessing, normalization, transformation, some feature engineering, and the removal of highly correlated features. All of that with just one function call and a few lines of code. That is awesome!
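
If you are curious what the resulting pipeline looks like, PyCaret exposes its internal objects via the get_config() function. As a small sketch, assuming the key names of the PyCaret 2.x API (they may differ in other versions), you can inspect the fitted preprocessing pipeline and the transformed training data like this:

# inspect the fitted preprocessing pipeline and the transformed training features
# (key names from the PyCaret 2.x API; they may differ in other versions)
prep_pipeline = get_config("prep_pipe")
X_train_transformed = get_config("X_train")
print(prep_pipeline)
print(X_train_transformed.head())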

Comparing All Models

The next step in solving an ML problem is to compare several different models to find the best option.

PyCaret again offers a function that evaluates the performance of all estimators available in the model library using k-fold cross-validation.

Let’s run this function now and return the three best options:

from pycaret.classification import *

best_3_models = compare_models(n_select=3)

By default, this function runs 10-fold cross-validation to evaluate the performance of several classification models and outputs several metrics.
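
For reference, compare_models() also accepts arguments to control this behavior, for example the number of folds, the metric to sort by, or a restriction to certain estimators. The results shown in Figure 1 were produced with the defaults; the values below are only illustrative:

# illustrative only: compare a subset of estimators with 5 folds, sorted by AUC
best_3_models = compare_models(
    n_select=3,
    fold=5,
    sort="AUC",
    include=["gbc", "rf", "lightgbm"]
)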

You can find the output of running this function for the credit card dataset in Figure 1.

Figure 1: Results of comparing all available classification models (Image by author).

A model with a good AUC score and F1 score is the gradient boosting classifier. I therefore decided to continue with this one for the remainder of this article.

I did not select the model based on accuracy, as the dataset itself is imbalanced and accuracy would therefore not deliver a good estimate of the performance. But I will come back to this point later in this article.

Optimize Final Model

Now that we have a model that we can proceed with, it is time to optimize this model.

PyCaret, of course, again offers a single function for tuning the model. But before tuning, we have to call the create_model() function. This function trains and evaluates a model using cross-validation. However, it only uses the default parameters of the selected classifier.

For tuning the hyperparameters, there is a function called tune_model(). It takes in a trained model and optimizes the hyperparameters based on a selected metric and several other options. One option is the search algorithm, which can be random search, grid search, Bayesian search and many more. The default is a random grid search.

Let’s now train a gradient boosting classifier with 5-fold cross validation and tune the hyperparameters afterwards using F1-Score as the metric to optimize on:

# train a gradient boosting classifier with 5-fold cross validation
final_classifier = create_model("gbc", fold=5)

# tune its hyperparameters, optimizing the F1-Score
tuned_classifier_F1 = tune_model(final_classifier, optimize="F1")

The untuned model has an AUC score of 77.4% and an F1-Score of 36.2%, while the tuned model has an AUC score of 77.3% and an F1-Score of 47%.
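
tune_model() also gives you more control over the search itself. As a sketch, assuming the parameter names of the PyCaret 2.x API (a larger n_iter, or a Bayesian search via Optuna, which requires the optuna package to be installed), this could look as follows:

# illustrative only: more random-search iterations, or Bayesian search via Optuna
tuned_classifier_more_iters = tune_model(final_classifier, optimize="F1", n_iter=25)
tuned_classifier_bayes = tune_model(
    final_classifier,
    optimize="F1",
    search_library="optuna",
    search_algorithm="tpe"
)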

We can now also get all the hyperparameters that led to the best model results by running the following:

plot_model(tuned_classifier_F1, plot="parameter")

PyCaret also offers several other plots out of the box.

One example is the ROC curve, which can be plotted with the following function call:

plot_model(tuned_classifier_F1, plot="auc")

Figure 2 shows the resulting ROC curves.

Figure 2: ROC curves returned by PyCaret function call (Image by author).

You can also plot the importance of your features for predicting the target variable (Figure 3):

plot_model(tuned_classifier_F1, plot="feature")
Figure 3: Feature importance plot returned by PyCaret function call (Image by author).

We can also take a look at the confusion matrix, as the dataset is imbalanced (Figure 4):

plot_model(tuned_classifier_F1, plot="confusion_matrix")
Figure 4: Confusion matrix returned by PyCaret function call (Image by author).

There are also plenty of other plotting options available, which can be found in the official user documentation.
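
When working in a notebook, there is also the evaluate_model() function, which renders an interactive widget from which all available plots can be selected:

# interactive plot selection inside a notebook
evaluate_model(tuned_classifier_F1)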

As you can see in the confusion matrix, the classifier performs well on the majority class (class 0) but not really on the minority class (class 1). This would be a huge issue in the real world, as it is probably far more important to get class 1 right than class 0: predicting that a customer will pay their credit while they actually default would lead to losses.

We can therefore try to fix the imbalance in the dataset to obtain a better model. For this, we only have to slightly adapt the setup() function by setting the fix_imbalance input parameter to True:

# call the pycaret setup function again, now also fixing the class imbalance
clf_setup = setup(
    data=data_train,
    target="default",
    normalize=True,
    transformation=True,            # make data more gaussian like
    ignore_low_variance=True,       # ignore features with low variance
    remove_multicollinearity=True,  # remove features with inter-correlations higher than 0.9 -> the feature that is less correlated with the target is removed
    group_features=[
        ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'],  # features with related characteristics
        ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
    ],
    fix_imbalance=True,             # fix the imbalance in the dataset (SMOTE oversampling by default)
    use_gpu=True                    # use GPU when available, otherwise CPU
)

We can then train and tune the model again, optimizing the F1-Score:

# train baseline model
final_classifier = create_model("gbc", fold=5)

# optimize hyperparameters by optimizing F1-Score
tuned_classifier_F1 = tune_model(final_classifier, optimize="F1")

# plot new confusion matrix
plot_model(tuned_classifier_F1, plot="confusion_matrix")

Figure 5: Confusion matrix of the model trained with fix_imbalance enabled, returned by PyCaret function call (Image by author).

The new model now has an AUC score of 77% and an F1-Score of 53.4%. The F1-Score improved from 47% to 53.4%!

The confusion matrix (Figure 5) now also shows that the performance on class 1 has improved! That’s great. But this comes at the cost of worse performance on class 0, which typically is not as critical as wrongly predicting class 1.

This also becomes clear when comparing the recall scores of both models. The old model has a recall of 37%, while the new model has a recall of almost 60%. This means that the new model finds 60% of all unpaid defaults, while the old one only finds 37%.
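
As a reminder of how that recall number is derived, here is a small worked example with made-up counts (not the actual values from Figure 5): recall is the share of true defaults that the model actually catches.

# illustration with made-up counts, not the actual values from Figure 5
# recall = true positives / (true positives + false negatives)
true_positives = 300   # defaults correctly predicted as defaults
false_negatives = 200  # defaults missed by the model
recall = true_positives / (true_positives + false_negatives)
print(recall)  # 0.6 -> such a model would find 60% of all defaults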

Test Final Model on Hold-Out Test Set

The last step is to test the final model on the hold-out test set to see how it performs on real world and unseen data.

For this, we can now make use of the hold-out test set created at the beginning of this article:

predict_model(tuned_classifier_F1, data=data_test)

This function will take the tuned gradient boosting classifier and make predictions on the hold-out test set. It then returns the metrics of the predictions.

The AUC score is at 77% and the F1-Score at 54.3%. This shows us that the model is not overfitted to our training and validation data and performs similarly on unseen data.
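
If you are satisfied with these numbers, PyCaret also provides functions to retrain the final pipeline on the complete dataset and to persist it to disk. A minimal sketch, with a file name I chose only for illustration, could look like this:

# retrain the pipeline on the complete dataset and save it for later use
final_model = finalize_model(tuned_classifier_F1)
save_model(final_model, "gbc_credit_default_pipeline")

# later: reload the pipeline and use it for predictions on new data
loaded_model = load_model("gbc_credit_default_pipeline")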

In this article, the PyCaret Python library was introduced and applied to a real-world dataset to solve a classification problem. This clearly shows how easily an ML problem can be solved using this library, and I only scratched the surface. There are plenty of other options available that you can explore and evaluate.

This library is therefore a perfect tool for citizen data scientists to solve ML problems without much knowledge of statistics and other relevant fields. But it can also boost the efficiency and speed with which a data scientist arrives at a solution.

I also did not dive deeper into the dataset itself, as the intention of this article was to show the power of this library. Of course, the final estimator could still be optimized further, and PyCaret offers a lot more options that could be investigated.

But that would go beyond the scope of this article.

