
Ace your Machine Learning Interview — Part 6 | by Marcello Politi | Nov, 2022



Dive into Decision Trees using Python

Today I continue with the sixth article of my “Ace your Machine Learning Interview” series, this time talking about Decision Trees!

In case you are interested in the previous articles in this series, I leave the links here:

  1. Ace your Machine Learning Interview — Part 1: Dive into Linear, Lasso and Ridge Regression and their assumptions
  2. Ace your Machine Learning Interview — Part 2: Dive into Logistic Regression for classification problems using Python
  3. Ace your Machine Learning Interview — Part 3: Dive into Naive Bayes Classifier using Python
  4. Ace your Machine Learning Interview — Part 4: Dive into Support Vector Machines using Python
  5. Ace your Machine Learning Interview — Part 5: Dive into Kernel Support Vector Machines using Python

Introduction

We have now reached the time to introduce Decision Trees. Let’s start right away by saying that this algorithm is loved by those who highly value the explainability of models, that is, being able to explain why a Machine Learning model gave a certain output. Imagine you go to the bank to apply for a mortgage and the mortgage is denied. You then ask why it was denied. If the banker answers, “mmm… I don’t know, the computer says the mortgage has been denied, so I can’t do anything,” that would not be too nice. This is what can happen with a Machine Learning algorithm like a neural network. Using a Decision Tree instead, the banker would be able to give you exact reasons, for example: “Your mortgage was denied for 3 main reasons: (1) your salary, (2) your age, (3) your history.”

I want to tell you right away that this algorithm, used on its own, is not very ‘powerful’: it easily overfits the data. But we will see later how to improve it within an ensembling method.

Decision Tree

When I think of a Decision Tree, I always imagine having a dataset, with only 2 features for simplicity of visualization, and slicing this dataset like a pie.

The only constraint is that each slice must be parallel to one of the feature axes.

In the image above we had a two-dimensional dataset composed of several points, each associated with a class: triangle, circle or cross. The decision tree made one cut at a time with the purpose of isolating the points of each class in its own region. This way, at inference time, when we have to classify a new point we only need to see in which region it falls and we can classify it immediately.

More in Detail

But what do those cuts mean? Let us consider the first cut on feature 1, the green cut. Suppose that feature 1 is the petal_length feature of the Iris dataset. Then the green cut that divides the space into two parts at the f1 point of the axis simply asks “is your petal length value greater or less than f1?” After that, we will ask the same kind of question but for another value on feature axis 2. Then again the same thing with feature 1 with the blue cut and so on…

Eventually, we will have classified the points correctly each in its own subspace.

And so this is where the explainability of the model comes from! When we classify a new data item all we are doing is answering a series of such questions. When the point is classified with a certain label A we will be perfectly able to say why!

For ease of visualization, these questions are represented in the form of a tree that we must traverse from top to bottom in order to classify data.

These trees do not have to be binary; they can also be d-ary, but library implementations, as in the case of sklearn, typically use binary trees.

How to find the best cut?

Now the question that arises is: “on which feature should I make the first cut, and at what value of that feature should I make it?”

start at the tree root and split the data on the feature that results in the largest information gain (IG)

Basically, we start from a node in our tree, at first from the root node. Each time we do a split (cut), some data from our dataset will go into the left child of that node and some into the right child. If one of these children now contains only data with the same label, we say that the node is pure. Otherwise, we continue to split the points remaining in the node with further splits. A pure node becomes a leaf of the tree. A node can also become a leaf even if it is not pure; in that case, the label associated with the node is the majority class among the samples it contains.

So if a leaf contains 10 data points that have class A as their label, the value of that leaf will be A. But if a leaf contains 8 points with label A and 2 with label B, we will say that the value of that leaf is still A.

So now at inference time when we are given as input a data item whose label we don’t know, we will just go through the tree and see which leaf we end up in. We will classify the new data item accordingly.

What is purity?

We intuitively understood that we make these cuts or splits to try to increase purity at each node, that is, to have data of a single class at each node. But how is purity, or rather impurity, formally described? Usually in one of two ways: we can use either entropy or Gini impurity.

At each split, we want to reduce the impurity of each node. However, we can also say that with each split we want to increase Information Gain. That is, after each split, I will have more and more information about the node and I will be more and more sure that I am classifying it correctly. So we can say that we want to create a Decision Tree with the goal of maximizing Information Gain (IG).
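To make these measures concrete, here is a minimal sketch (my own illustration, not code from the original article) of how entropy, Gini impurity and information gain can be computed for a single binary split; `parent`, `left` and `right` are hypothetical arrays of class labels.

```python
import numpy as np

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right, impurity=gini):
    # IG = impurity(parent) - weighted impurity of the two children
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# Example: a split that isolates most of class 1 yields a positive gain
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 0, 1]), np.array([1, 1, 1])
print(information_gain(parent, left, right))           # Gini-based gain
print(information_gain(parent, left, right, entropy))  # entropy-based gain
```

The split chosen at each node is simply the one, among all candidate (feature, threshold) pairs, that maximizes this gain.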

Overfitting

Decision Trees are rarely used on their own because they are prone to overfitting. In fact, consider that a tree can fit any kind of dataset: you just have to keep splitting until you have created a leaf node for every single input point. Obviously, such a DT will have very poor generalization capabilities. What can be done then? We can set regularization parameters. For example, we could say that the tree should be no deeper than 5 levels; in this way we can limit overfitting. The most common parameters are the following (the sketch after this list shows how they map onto scikit-learn hyperparameters).

  • max depth: the maximum depth of the tree
  • min samples split: the minimum number of samples a node must have before it can be split
  • min samples leaf: the minimum number of samples a leaf node must have
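For reference, these constraints correspond directly to hyperparameters of scikit-learn’s `DecisionTreeClassifier`; the values below are illustrative examples, not tuned settings.

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative regularization settings (the values are arbitrary examples)
tree = DecisionTreeClassifier(
    max_depth=5,           # the tree can be at most 5 levels deep
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=4,    # every leaf must keep at least 4 samples
    random_state=42,
)
# tree.fit(X_train, y_train) would then be called as usual
```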

Another widely used method is post-pruning. Once the tree has been built, we cut away some leaves or subtrees, so we obtain a model with greater generalization capabilities.
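In scikit-learn, post-pruning is available as minimal cost-complexity pruning through the `ccp_alpha` parameter. A rough, self-contained sketch of how it can be used (the choice of alpha here is arbitrary, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Compute the cost-complexity pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas  # candidate pruning strengths, from 0 (no pruning) upward

# Larger ccp_alpha prunes more aggressively; in practice it is chosen by cross-validation
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alphas[-2]).fit(X_train, y_train)
print(pruned.get_depth(), pruned.score(X_test, y_test))
```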

Let’s code!

As usual, we are going to use the Iris dataset. The dataset is provided by sklearn under an open license and can be found here. The dataset looks as follows.
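The original data-loading snippet is not reproduced here, but a minimal sketch with scikit-learn could look like this (loading Iris as a DataFrame just to take a first look):

```python
from sklearn.datasets import load_iris

# Load Iris as a pandas DataFrame to inspect the features and the target
iris = load_iris(as_frame=True)
df = iris.frame
print(df.head())
```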

We will only use two features of the Iris Dataset for visualization purposes. So let’s load and standardize our data, even though feature scaling is not required for DTs (cool!).
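A minimal sketch of this step, assuming we keep petal length and petal width (columns 2 and 3) as the two features:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Keep only petal length and petal width for 2D visualization
X, y = load_iris(return_X_y=True)
X = X[:, [2, 3]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Standardization is not needed by decision trees, but it keeps the pipeline
# consistent with the previous articles in the series
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
```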

Now, with 2 simple lines of code, we are going to create and fit our DT model. And we are going to plot the decision boundaries to see whether the model worked and was able to classify our data.
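A sketch of those two lines plus a simple decision-boundary plot, continuing from the variables of the previous snippet (the plotting code here is a generic matplotlib grid, not necessarily the helper used in the original article):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

# Create and fit the decision tree (Gini impurity, limited depth)
tree_model = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=1)
tree_model.fit(X_train_std, y_train)

# Evaluate the tree's predictions on a grid covering the two standardized features
x_min, x_max = X_train_std[:, 0].min() - 1, X_train_std[:, 0].max() + 1
y_min, y_max = X_train_std[:, 1].min() - 1, X_train_std[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = tree_model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the decision regions and overlay the training points
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_train_std[:, 0], X_train_std[:, 1], c=y_train, edgecolor="k")
plt.xlabel("petal length (standardized)")
plt.ylabel("petal width (standardized)")
plt.show()
```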

A very cool feature is the one that lets you plot the entire decision tree, so that the model is not a black box and you can see what is happening behind it.
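scikit-learn exposes this through `sklearn.tree.plot_tree` (there is also `export_text` for a plain-text version). A sketch, reusing the `tree_model` fitted above:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the fitted tree: each node shows its split rule, impurity and class counts
plt.figure(figsize=(12, 8))
plot_tree(
    tree_model,
    feature_names=["petal length", "petal width"],
    class_names=["setosa", "versicolor", "virginica"],
    filled=True,
)
plt.show()
```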

Let’s sum up the features of Decision Trees.

Advantages:

  1. Clear visualization: the algorithm is simple to understand, interpret and visualize, and the output of a DT can be interpreted easily by humans.
  2. Decision trees look like if-else statements, which are easy to understand.
  3. Can be used both for regression and classification.
  4. Can handle both categorical and continuous variables.
  5. Automatically handles missing values.
  6. Robust to outliers.
  7. Training is fast.
  8. Feature scaling is not required.

Disadvantages:

  1. It generally leads to overfitting: in order to fit the data (even noisy data), it keeps generating new nodes.
  2. It is unstable: adding new data points can lead to regeneration of the whole tree.
  3. Not suitable for large datasets: the tree may grow too complex and lead to overfitting.

😁

Marcello Politi

Linkedin, Twitter, CV



