What? When? How?: ExtraTrees Classifier
by Karun Thankachan | Aug, 2022



What is ExtraTrees Classifier? When to use it? How to implement it?


Tree-based models have increased in popularity over the last decade, primarily due to their robustness. They can be used on any type of data (categorical or continuous), do not require the data to be normally distributed, and need little if any data transformation (they handle missing values, scale differences, etc.).

While Decision Trees and Random Forests are often the go-to tree-based models, a lesser-known one is ExtraTrees. (If you are new to tree-based models, do check out the following post.)

Similar to Random Forests, ExtraTrees is an ensemble ML approach that trains numerous decision trees and aggregates their results to output a prediction. However, there are a few differences between Extra Trees and Random Forest.

Random Forest uses bagging to create different variations of the training data so that the decision trees are sufficiently different. Extra Trees, by contrast, trains each decision tree on the entire dataset. To ensure sufficient differences between individual trees, it randomly selects the values at which to split a feature and create child nodes, whereas Random Forest uses a greedy search to find the best value at which to split a feature. Apart from these two differences, Random Forest and Extra Trees are largely the same. So what effect do these changes have?

  • Using the entire dataset (the default setting, which can be changed) allows ExtraTrees to reduce the bias of the model. However, randomizing the feature value at which to split increases the bias and variance. The paper that introduced the Extra Trees model conducts a bias-variance analysis of different tree-based models. On most of the classification and regression tasks analyzed (six in total), ExtraTrees showed higher bias and lower variance than Random Forest. However, the paper goes on to say this is because the randomization in Extra Trees causes irrelevant features to be included in the model. When irrelevant features were excluded, say via a feature-selection pre-modelling step, Extra Trees achieved a bias score similar to that of Random Forest.
  • In terms of computational cost, Extra Trees is much faster than Random Forest, because it randomly selects the value at which to split a feature instead of running the greedy search used in Random Forest, as illustrated in the sketch after this list.
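To make the contrast concrete, here is a minimal sketch (not part of the original article) that fits both ensembles on the same synthetic data and prints their default bootstrap setting and fit time; the dataset sizes are illustrative assumptions, and exact timings will vary by machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic dataset (sizes chosen purely for illustration)
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    model = Model(n_estimators=100, random_state=42)
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    # Random Forest bootstraps by default; Extra Trees uses the whole dataset
    print(f"{Model.__name__}: bootstrap={model.bootstrap}, fit time={elapsed:.2f}s")
```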

Random Forest remains the go-to ensemble tree-based model (with recent competition from XGBoost models). However, from the discussion above we see that ExtraTrees has value, especially when computational cost is a concern. Specifically, when building models that involve substantial feature-engineering/feature-selection pre-modelling steps and computational cost is an issue, ExtraTrees is a good choice over other ensemble tree-based models.

ExtraTrees can be used to build classification or regression models and is available via Scikit-learn. This tutorial covers the classification model, but the code can be adapted for regression with minor tweaks (i.e., switching from ExtraTreesClassifier to ExtraTreesRegressor).

Building a model

We will use make_classification from Scikit-learn to create a dummy classification dataset. To evaluate the model, we will use 10-fold cross-validation with accuracy as the evaluation metric.
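The original code block did not survive aggregation, so the following is a minimal sketch of the setup described above; the dataset parameters (sample and feature counts) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Dummy classification dataset (sizes are assumptions for illustration)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# Extra Trees classifier with default hyperparameters
model = ExtraTreesClassifier(random_state=42)

# 10-fold cross-validation, scored by accuracy
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```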

Hyperparameter Tuning

The detailed list of parameters for the Extra Trees model can be found on the Scikit-learn page. The Extra Trees research paper explicitly calls out three key parameters with the following statement.

“The parameters K, nmin and M have different effects: K determines the strength of the attribute selection process, nmin the strength of averaging output noise, and M the strength of the variance reduction of the ensemble model aggregation.”

Let’s look at these parameters more closely from the implementation perspective.

  • K maps to max_features in the Scikit-learn documentation and is the number of features considered at each decision node. The higher the value of K, the more features are considered at each decision node, and hence the lower the bias of the model. However, too high a value of K reduces randomization, negating the effect of the ensemble.
  • nmin maps to min_samples_leaf and is the minimum number of samples required at a leaf node. The higher its value, the less likely the model is to overfit; smaller values allow more splits and a deeper, more specialized tree.
  • M maps to n_estimators and is the number of trees in the forest. The higher its value, the lower the variance of the model.

The best set of parameters can be selected via GridSearchCV as shown below.
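The original grid-search snippet was also lost in aggregation; the sketch below uses an illustrative (assumed) parameter grid over the three hyperparameters discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# Illustrative grid over the three key hyperparameters (values are assumptions)
param_grid = {
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 5, 10],
    "n_estimators": [100, 300, 500],
}

grid = GridSearchCV(ExtraTreesClassifier(random_state=42),
                    param_grid, cv=10, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")
```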

  • ExtraTrees Classifier is an ensemble tree-based machine learning approach that relies on randomization to reduce variance and computational cost (compared to Random Forest).
  • ExtraTrees can be used for classification or regression, and suits scenarios where computational cost is a concern and features have been carefully selected and analyzed.
  • Extra Trees can be implemented with Scikit-learn. The three important hyperparameters to tune are max_features, min_samples_leaf, and n_estimators.

That’s it! The what, when and how for ExtraTrees!


