How Random Forests & Decision Trees Decide: Simply Explained With An Example In Python | by Serafeim Loukas | Jun, 2022


What is a decision tree? What is a random forest? How do they decide to split the data? How do they predict class labels in classification tasks? Everything explained in 10 minutes.

A visual example of a random forest consisting of 3 decision trees (Source).

Hi folks! I hope you are doing well. I have been meaning to write this post for a long time, so here it is! Let’s get started!

  • Random forests are an ensemble learning method for classification, regression and other tasks [1]. It is a very well-known and widely used machine learning (ML) algorithm!
  • Each random forest consists of multiple decision trees (which makes sense; a forest contains many trees!).
  • In classification tasks, the output of the random forest is the class of the samples. E.g., you could use animal images as inputs (in a vectorized form) and predict the animal species (dog or cat).
  • In classification tasks, we need to have a labeled dataset. This refers to knowing the labels/classes of each sample (cat/dog in our previous example).

Before we can understand how Random Forests (RF) work and make decisions, we first need to understand what Decision Trees are and how they work.

Example 1: The ideal case

Let’s assume that we have a labeled dataset with 10 samples in total. 5 of these samples belong to the dog class (blue) and the remaining 5 to the cat class (red).

Here is some Python code to create the dataset and plot it:

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Create a toy 2-class dataset with 10 samples and 2 features
X, y = make_classification(n_samples=10, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, random_state=0)
# Color each sample by its class label (blue = class 0, red = class 1)
c = ["blue" if i == 0 else "red" for i in y]
plt.figure(dpi=200)
plt.scatter(X[:,0], X[:,1], c=c)
plt.xlabel("x")
plt.ylabel("y")
plt.axvline(x=0.9, label="split at x=0.9", c = "k", linestyle="--")
plt.legend()
plt.show()
Fig 1: Example of a dataset. Figure made in Python by the author.

What Decision Trees do is simple: they find splits of the data that separate the samples of the different classes as much as possible (i.e., they increase the class separability).

In the above example, the perfect split would be at x=0.9, as this would leave the 5 red points on the left side and the 5 blue points on the right side (perfect class separability). Each time we split the space/data like this, we are actually building a decision tree with a specific rule.

The decision tree would look like the following:

Fig 2: The decision tree. Split at x=0.9. Figure made by the author in SmartDraw.

Here we initially have the root node containing all the data; we then split the data at x=0.9, creating two branches that end in two leaf nodes. Since we cannot split the data any further (there is no reason to add new decision nodes because the data are already perfectly separated), the decision tree construction ends here.

There is no need to split the data along the second feature dimension, i.e., the y axis.

Making a new prediction: if we passed a new sample (e.g., from the test set) with a feature/variable value of x=5, it would end up in the left leaf node and would be classified as class “red”.

Every time we split at a specific x or y coordinate, we create a decision node!
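
In practice, we do not need to find such splits by eye. Here is a minimal sketch (assuming the X and y arrays from the snippet above, scikit-learn installed, and the two features named “x” and “y” as on the plot axes) that fits a depth-1 decision tree and prints the rule it learns; the exact threshold depends on the generated data:

from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a decision tree that is allowed to make a single split (depth 1)
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(X, y)

# Print the learned decision rule and leaf classes in a human-readable form
print(export_text(tree, feature_names=["x", "y"]))

# Classify a new, unseen sample (e.g., x=5, y=0; an illustrative point)
print(tree.predict([[5, 0]]))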

Example 2: A real case using the Gini Impurity

Usually, the data cannot be separated so easily and it takes a lot of effort/iterations (this is done during model training/fitting) to find the optimal splits.

But the question is:

How can we find these splits using mathematical formulations?

Let’s look at another example now with 3 classes, each having 5 samples. Here is some Python code to create the dataset and plot it:

# Create a toy 3-class dataset with 15 samples and 2 features
X, y = make_classification(n_samples=15, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, random_state=1, n_classes=3, n_clusters_per_class=1)
# Color each sample by its class label (blue = 0, red = 1, green = 2)
c = ["blue" if i == 0 else "red" if i == 1 else "green" for i in y]
plt.figure(dpi=200)
plt.scatter(X[:,0], X[:,1], c=c)
plt.xlabel("x")
plt.ylabel("y")
plt.axvline(x=0.9, label="split at x=0.9", c = "k", linestyle="--")
plt.axhline(y=0.4, xmin=0, xmax=0.59, label="split at y=0.4", c = "pink", linestyle="--")
plt.legend()
plt.show()
Fig 3: Example of a dataset. Figure made in Python by the author.

Let’s first use the same initial rule as before (a split at x=0.9).

Fig 4: The decision tree. Split at x=0.9. Figure made by the author in SmartDraw.

We start again with a root node containing all the data; we then split the data at x=0.9, creating two branches that end in two leaf nodes. The right leaf contains all the blue samples and the left leaf contains the red and green samples.

Here, we cannot stop; we need to go one step further. We will now split the data along the second feature dimension (the y axis) to try to separate the green dots from the red ones.

The decision tree becomes (for a split at y=0.4):

Fig 5: The decision tree. Second split at y=0.4. Figure made by the author in SmartDraw.

Making a new prediction: if we passed a new sample (e.g., from the test set) with feature values x=5 and y=-2, it would end up in the left leaf node and would be classified as class “green”.

Reminder: every time we split at a specific x or y coordinate, we create a decision node!
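
Here is a similar minimal sketch for this 3-class example (again assuming the X and y arrays generated above): it fits a depth-2 decision tree, draws it with scikit-learn’s plot_tree (instead of SmartDraw), and classifies the new sample; the exact thresholds, and therefore the predicted class, depend on the generated data:

from sklearn.tree import DecisionTreeClassifier, plot_tree

# Fit a decision tree with at most two levels of splits on the 3-class data
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Draw the fitted tree: each decision node shows its split rule and its Gini impurity
plt.figure(dpi=200)
plot_tree(tree, feature_names=["x", "y"], class_names=["blue", "red", "green"], filled=True)
plt.show()

# Classify the new sample with x=5 and y=-2
print(tree.predict([[5, -2]]))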

The main idea behind the Gini Impurity is simple:

  • Assume that you randomly select a sample from our dataset.
  • Next, classify it randomly according to the class distribution in the dataset.

The Gini Impurity tells us the probability that we classify that data point incorrectly [3].

In the above example, we found the perfect split levels via visual inspection. However, in real-world problems this is not possible, because the numbers of samples and features are large. To find the optimal split levels (in each feature dimension), decision trees choose the split that maximizes the Gini Gain [3].

The best possible Gini Impurity score is 0 and it corresponds to the optimal split at the current level/state of the model.

The mathematical formulation is as follows.

For C classes in our data, with p(i) being the probability of picking a data sample that belongs to class i, the Gini Impurity (G) is given by:

G = Σᵢ p(i)*(1 - p(i)) = p(1)*(1 - p(1)) + p(2)*(1 - p(2)) + … + p(C)*(1 - p(C)), where the sum runs over the C classes.
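
As a minimal sketch of this formula (the helper function gini below is illustrative, not from the original post), we can compute the impurity of a node directly from its per-class sample counts:

def gini(counts):
    """Gini Impurity of a node, given the number of samples of each class in it."""
    total = sum(counts)
    probs = [count / total for count in counts]   # p(i) for each class i
    return sum(p * (1 - p) for p in probs)        # G = sum_i p(i)*(1 - p(i))

print(gini([5, 5]))   # balanced 2-class node -> 0.5
print(gini([3, 0]))   # pure node (one class only) -> 0.0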

Now, let’s go back to our first (simplest) example (Example 1), define a new, suboptimal split, and estimate the Gini Impurity and Gini Gain step by step!

First, let’s estimate p(i) for each class (without a split yet):

  • Blue class: 5 out of 10 samples are blue, so there is a p(blue) = 5/10 = 0.5 probability of picking a blue sample by chance.
  • Red class: 5 out of 10 samples are red, so there is a p(red) = 5/10 = 0.5 probability of picking a red sample by chance.

Now, let’s estimate the initial G (without any split yet):

  • G0 = 0.5*(1 - 0.5) + 0.5*(1 - 0.5) = 0.5

(That is a 50% chance of a wrong classification: out of the 4 equally likely pick-and-label combinations (pick blue & label blue, pick blue & label red, pick red & label blue, pick red & label red), only 2 are correct.)

Fig 6: Example of a dataset. Figure made in Python by the author.

Now, let’s estimate the Gini Impurity for the “non-perfect” split at x=0.2.

  • G_left = (3/3)*(1 - 3/3) (for blue) + 0 (no red samples) = 0 (a good value: the left region contains only blue dots and no red ones, i.e., zero impurity).
  • G_right = (2/7)*(1 - 2/7) (for blue) + (5/7)*(1 - 5/7) (for red) = 0.4081 (here the impurity is not 0, which means that this region contains samples from more than one class).

The final Gini Gain is computed by subtracting the weighted impurities (G_left & G_right) from the original impurity G0 (no split):

  • Gini_gain = G0 - (3/10)*G_left - (7/10)*G_right = 0.5 - (3/10)*0 - (7/10)*0.4081 = 0.214

Reminder: The higher the Gini Gain the better the split.

To conclude, the decision tree algorithm tries many different splits, estimates the Gini Gain of each one, and splits according to the maximum Gini Gain. So, we end up with splits that are optimal w.r.t. the Gini Gain.

Task: Now try to estimate the Gini Gain for the perfect split shown in Fig 1.

Tip: You will easily find G0 = 0.5 and G_left = G_right = 0, thus a Gini Gain of 0.5 - (5/10)*0 - (5/10)*0 = 0.5!
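
To double-check these numbers, here is a minimal sketch (reusing the illustrative gini helper defined above) that computes the Gini Gain of a split from the per-class counts of the two resulting regions, exactly as stated in the text:

def gini_gain(parent_counts, left_counts, right_counts):
    """Gini Gain = parent impurity minus the weighted impurities of the two children."""
    n = sum(parent_counts)
    w_left, w_right = sum(left_counts) / n, sum(right_counts) / n
    return gini(parent_counts) - w_left * gini(left_counts) - w_right * gini(right_counts)

# "Non-perfect" split at x=0.2: 3 blue samples on the left, 2 blue + 5 red on the right
print(gini_gain([5, 5], [3, 0], [2, 5]))   # ~0.214

# Perfect split at x=0.9: the two classes are fully separated
print(gini_gain([5, 5], [5, 0], [0, 5]))   # 0.5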

Now that you understand what a decision tree is and how it finds the best splits (based on the Gini Gain), it is time to introduce Random Forests.

Random forests are nothing more than an ensemble of decision trees [1]. One important thing to note is that random forests usually consider only a random subset of the available variables/features (instead of all of them) when looking for the best split, which adds extra randomness across the decision trees in the forest.

E.g., see the input argument “max_features” of the scikit-learn model: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Finally, a random forest usually makes its decision by majority voting over, or by averaging, the predictions of the individual decision trees that constitute the forest.
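
As a minimal sketch (assuming the 3-class X and y arrays from Example 2; the parameter values below are purely illustrative), here is how a small random forest can be fit and used for prediction with scikit-learn:

from sklearn.ensemble import RandomForestClassifier

# A tiny forest of 3 trees; max_features=1 means each split considers 1 randomly chosen feature
forest = RandomForestClassifier(n_estimators=3, max_features=1, random_state=0)
forest.fit(X, y)

# The forest aggregates the trees' outputs (majority vote / averaged class probabilities)
print(forest.predict([[5, -2]]))
print(forest.predict_proba([[5, -2]]))

# The individual fitted decision trees are accessible as well
print(len(forest.estimators_))   # -> 3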

Decision trees and random forests are powerful machine learning models that can be used for regression and classification. In classification problems, the best splits are chosen based on the Gini Gain score, as explained above.

In the near future, I will write a Python-only article showing how to fit a model using sklearn and real data.

Make sure to subscribe to my mailing list so that you don’t miss it!

That’s all folks! Hope you liked this article. Any questions? Post them as a comment and I will reply as soon as possible.

If you liked and found this article useful, follow me and make sure you subscribe to my mailing list and become a member:

