Machine learning on multioutput datasets: a quick guide | by Marco vd Boom

How to train and validate ML models on multioutput datasets with minimal coding effort

Introduction

The standard machine learning tasks everyone is familiar with are classification (binary and multiclass) and regression. In these cases, there is one target column that we are trying to predict. In the multioutput case, there is more than one target column, and we want to train a model capable of predicting every one of them at the same time. We recognize three types of multioutput tasks:

Multilabel: Multilabel is a classification task, labeling each sample with m labels from n_classes possible classes, where m can be 0 to n_classes inclusive. This can be thought of as predicting properties of a sample that are not mutually exclusive. For example, prediction of the topics relevant to a text document. The document may be about one of religion, politics, finance or education, several of the topic classes or all of the topic classes.
Multiclass-multioutput: Multiclass-multioutput (also known as multitask classification) is a classification task which labels each sample with a set of non-binary properties. The number of properties is at least 2 and the number of classes per property is greater than 2. This is both a generalization of the multilabel classification task, which only considers binary attributes, and a generalization of the multiclass classification task, where only one property is considered. For example, classification of the properties “type of fruit” and “colour” for a set of images of fruit. The property “type of fruit” has the possible classes: apple, pear and orange. The property “colour” has the possible classes: green, red, yellow and orange. Each sample is an image of a fruit, a label is output for both properties and each label is one of the possible classes of the corresponding property.
Multioutput regression: Multioutput regression predicts multiple numerical properties for each sample. Each property is a numerical variable and the number of properties to be predicted for each sample is >= 2. For example, prediction of both wind speed and wind direction, in degrees, using data obtained at a certain location. Each sample would be data obtained at one location and both wind speed and direction would be output for each sample.
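
To make the three task types concrete, here is a minimal sketch that builds a tiny target array for each and asks sklearn which type it detects. The arrays and their values are made up for illustration, and type_of_target is sklearn's helper, not part of ATOM.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Multilabel: one binary column per topic (education, finance, politics, religion).
y_multilabel = np.array([[0, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0]])

# Multiclass-multioutput: integer-encoded "type of fruit" and "colour" columns.
y_multiclass_multioutput = np.array([[0, 1], [1, 0], [2, 3]])

# Multioutput regression: illustrative wind speed and wind direction (degrees).
y_regression = np.array([[5.2, 270.0], [3.1, 45.5], [7.8, 180.0]])

task_types = {
    "multilabel": type_of_target(y_multilabel),
    "multiclass-multioutput": type_of_target(y_multiclass_multioutput),
    "multioutput regression": type_of_target(y_regression),
}

for name, detected in task_types.items():
    print(f"{name}: {detected}")
```

In all three cases the target is two-dimensional, with one column per output.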

In this story, we’ll explain how the ATOM library can help you speed up your pipelines on multioutput datasets, from data preprocessing to model training, validation and results analysis. ATOM is an open-source Python package designed to help data scientists with the exploration of machine learning pipelines.

Note: This story focuses on using ATOM for multioutput datasets. Teaching the basics of the library lies outside the scope of this story. Read this other story if you want a gentle introduction to the library.

Data preprocessing

Initializing a multioutput dataset in atom works much the same as for every other task, with one remark: you must specify the target columns using the keyword argument y.

atom = ATOMClassifier(X, y=y, verbose=2, random_state=1)

If you don’t provide y=, atom interprets the second argument as the test set, as if you were initializing with atom = ATOMClassifier(train, test), which results in a column mismatch exception.

You can also provide a sequence of column names or positions to specify the target columns in X. For example, to specify the last 3 columns as the target, use:

atom = ATOMClassifier(X, y=(-3, -2, -1), verbose=2, random_state=1)

In all cases, printing atom.y now returns the target as a DataFrame instead of a Series.
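
As a rough illustration of what selecting the target by position amounts to, the sketch below splits a small DataFrame with plain pandas. The column names and values are hypothetical, and this only mimics the selection. It does not show ATOM's internal code.

```python
import pandas as pd

# Hypothetical dataset: two features followed by three target columns.
data = pd.DataFrame(
    {
        "feature_1": [0.1, 0.4, 0.7],
        "feature_2": [1.0, 0.0, 1.0],
        "target_a": [0, 1, 0],
        "target_b": [1, 1, 0],
        "target_c": [0, 0, 1],
    }
)

# Same positions as in y=(-3, -2, -1): the last three columns.
target_positions = [-3, -2, -1]
y = data.iloc[:, target_positions]

# Everything that is not a target column is a feature.
X = data.drop(columns=y.columns)

print(type(y).__name__, list(y.columns))
```

Because more than one column is selected, the target is a DataFrame rather than a Series.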

For multilabel tasks, the target column could look like this.

0                        [politics]
1               [religion, finance]
2    [politics, finance, education]
3                                []
4                         [finance]
5               [finance, religion]
6                         [finance]
7               [religion, finance]
8                       [education]
9     [finance, religion, politics]
Name: target, dtype: object

A model cannot directly ingest a variable number of target classes. Use the clean method to assign a binary output to each class, for every sample. Positive classes are indicated with 1 and negative classes with 0. It is thus comparable to running n_classes binary classification tasks.

atom.clean()

In our example, the target (atom.y) is converted to:

   education  finance  politics  religion
0          0        0         1         0
1          0        1         0         1
2          1        1         1         0
3          0        0         0         0
4          0        1         0         0
5          0        1         0         1
6          0        1         0         0
7          0        1         0         1
8          1        0         0         0
9          0        1         1         1
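
The same kind of conversion can be reproduced outside ATOM with sklearn's MultiLabelBinarizer. This is a sketch of the technique on the example target above. Whether atom.clean uses this transformer internally is an assumption not confirmed here.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# The multilabel target from the example: a list of topics per sample.
target = pd.Series(
    [
        ["politics"],
        ["religion", "finance"],
        ["politics", "finance", "education"],
        [],
        ["finance"],
        ["finance", "religion"],
        ["finance"],
        ["religion", "finance"],
        ["education"],
        ["finance", "religion", "politics"],
    ],
    name="target",
)

# One binary column per class; classes are sorted alphabetically.
binarizer = MultiLabelBinarizer()
encoded = pd.DataFrame(
    binarizer.fit_transform(target),
    columns=binarizer.classes_,
    index=target.index,
)

print(encoded)
```

A sample without any label, such as row 3, becomes a row of zeros.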

Model training and validation

Some models have native support for multioutput tasks. This means that the original estimator is used to make predictions directly on all the target columns.

The majority of the models, however, don’t have integrated support for multioutput tasks. ATOM still makes it possible to use them by wrapping the estimators in a meta-estimator capable of handling multiple target columns. This is done automatically, without any additional code or prior knowledge from the user.
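
To show what such a wrapper does, here is a minimal sketch in plain sklearn. Linear Discriminant Analysis refuses a two-column target on its own, but works once wrapped. The synthetic data and the two wrappers chosen here are for illustration only and are not a statement about which meta-estimators ATOM picks.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

# Synthetic data with two binary target columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
Y = np.column_stack(
    [(X[:, 0] > 0).astype(int), (X[:, 1] + X[:, 2] > 0).astype(int)]
)

# Without a wrapper, the estimator rejects a 2-dimensional target.
try:
    LinearDiscriminantAnalysis().fit(X, Y)
    lda_fails = False
except ValueError:
    lda_fails = True

# Option 1: fit one independent copy of the estimator per target column.
independent = MultiOutputClassifier(LinearDiscriminantAnalysis()).fit(X, Y)

# Option 2: fit a chain where each model also sees the previous targets.
chained = ClassifierChain(LinearDiscriminantAnalysis(), random_state=1).fit(X, Y)

pred_independent = independent.predict(X)
pred_chained = chained.predict(X)
print(lda_fails, pred_independent.shape, pred_chained.shape)
```

Both wrappers take the underlying estimator as their first parameter and return one prediction column per target.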

For multilabel tasks, the default meta-estimator used is:

For multiclass-multioutput and multioutput regression tasks, the
default meta-estimators are respectively:

The multioutput attribute contains the meta-estimator object. Change the attribute’s value to use a custom object. Both classes and instances are accepted, as long as the underlying estimator is the first parameter. For example, to change the meta-estimator for regression models, use:

from sklearn.multioutput import RegressorChain

atom.multioutput = RegressorChain
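
To get a feel for what switching to a chain changes, the sketch below compares sklearn's MultiOutputRegressor and RegressorChain directly, without ATOM. The data is synthetic and SVR is just a convenient single-output regressor picked for the example.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.svm import SVR

# Synthetic data with two correlated numerical targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
Y = np.column_stack(
    [X[:, 0] + 0.1 * rng.normal(size=80), X[:, 0] - X[:, 1]]
)

# Independent models: every target column is predicted from X only.
independent = MultiOutputRegressor(SVR()).fit(X, Y)

# Chain: the second model gets X plus the first target as an extra feature.
chained = RegressorChain(SVR(), order=[0, 1]).fit(X, Y)

pred_independent = independent.predict(X)
pred_chained = chained.predict(X)
print(chained.estimators_[1].n_features_in_)
```

A chain can exploit correlations between targets, at the cost of making the result depend on the column order.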

To check which models have native support for multioutput datasets and which don’t, use:

atom.available_models()[["acronym", "model", "native_multioutput"]]

Now, you can train the models normally.

atom.run(models=["LDA", "RF"], metric="f1")

And inspect the estimators.

Some models, such as the MultiLayer Perceptron, have native support for multilabel, but not for multiclass-multioutput tasks. For that reason, their native_multioutput tag is False, but those models don’t necessarily need a multioutput meta-estimator if you have a multilabel task. In such cases, use atom’s multioutput attribute to tell atom not to use any multioutput wrapper.

atom.multioutput = None  # MLP won't use a meta-estimator wrapper now
atom.run(models=["MLP"])
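
The native multilabel support mentioned above can be checked with sklearn's MLPClassifier on its own. This is a small sketch on synthetic data, and the hyperparameters are arbitrary choices for the example.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic multilabel target: three binary columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Y = np.column_stack(
    [X[:, 0] > 0, X[:, 1] > 0, X[:, 2] + X[:, 3] > 0]
).astype(int)

# No meta-estimator: the network is fitted on all target columns at once.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1)
mlp.fit(X, Y)

predictions = mlp.predict(X)
print(predictions.shape)
```

A single network with one output unit per label is trained, instead of one network per target column.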

Note: sklearn metrics do not support multiclass-multioutput classification tasks. For such tasks, ATOM calculates the metric by taking the mean of the score over every target column.
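
The averaging described in the note can be sketched as follows. The helper mean_columnwise_score is a hypothetical name written for this example, not an ATOM function, and the labels are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Two integer-encoded target columns with more than two classes each.
y_true = np.array([[0, 1], [1, 0], [2, 3], [1, 2], [0, 3]])
y_pred = np.array([[0, 1], [1, 0], [2, 1], [0, 2], [0, 1]])

# Calling the metric on the full 2-dimensional target is rejected by sklearn.
try:
    accuracy_score(y_true, y_pred)
    direct_call_fails = False
except ValueError:
    direct_call_fails = True


def mean_columnwise_score(metric, y_true, y_pred, **kwargs):
    """Score every target column separately and return the mean."""
    scores = [
        metric(y_true[:, i], y_pred[:, i], **kwargs)
        for i in range(y_true.shape[1])
    ]
    return float(np.mean(scores))


score = mean_columnwise_score(accuracy_score, y_true, y_pred)
print(direct_call_fails, score)
```

Here the first column scores 4 out of 5 and the second 3 out of 5, so the mean accuracy is 0.7.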

Results analysis

Models with multioutput estimators can be called normally for analysis methods and plots. Use the target parameter in plots to specify which target column to use.

atom.plot_roc(target=2)

When the target parameter also specifies the class, use the format (column, class).

atom.plot_probabilities(models="MLP", target=(2, 1))
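
To illustrate what a (column, class) pair refers to, the sketch below pulls the matching probabilities out of a wrapped sklearn classifier. It shows how sklearn's MultiOutputClassifier exposes probabilities and assumes nothing about how ATOM's plots retrieve them. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data with three binary target columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = np.column_stack([(X[:, i] > 0).astype(int) for i in range(3)])

clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)

# predict_proba returns one (n_samples, n_classes) array per target column.
probabilities = clf.predict_proba(X)

# Equivalent of target=(2, 1): third target column, class at index 1.
column, class_index = 2, 1
selected = probabilities[column][:, class_index]
print(len(probabilities), selected.shape)
```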

with atom.canvas(figsize=(900, 600)):
    atom.plot_calibration(target=0)
    atom.plot_calibration(target=1)

Conclusion

We have shown how easy it is to use the ATOM package to quickly explore machine learning pipelines on multioutput datasets. Full examples are available for a multioutput regression task and for a multilabel classification task.

For further information about ATOM, have a look at the package’s documentation. For bugs or feature requests, don’t hesitate to open an issue on GitHub or send me an email.

Photo by Victor Barrios on Unsplash


Machine learning on multioutput datasets: a quick guide | by Marco vd Boom | Mar, 2023
