
Improving TabTransformer Part 1: Linear Numerical Embeddings

By Anton Rubert, October 2022



Deep learning for tabular data with FT-Transformer

Photo by Nick Hillier on Unsplash

In the previous post about TabTransformer, I described how the model works and how it can be applied to your data. This post builds on it, so if you haven’t read it yet, I highly recommend starting there and returning to this post afterwards.

TabTransformer was shown to outperform traditional multi-layer perceptrons (MLPs) and came close to the performance of Gradient Boosted Trees (GBTs) on some datasets. However, there is one noticeable drawback with the architecture — it doesn’t take numerical features into account when constructing contextual embeddings. This post dives into the paper by Gorishniy et al. (2021), which addresses this issue by introducing the FT-Transformer (Feature Tokenizer + Transformer).

Both models use Transformers (Vaswani et al., 2017) as their backbone, but there are two main differences:

  • Use of numerical embeddings
  • Use of CLS token for output

Numerical Embeddings

Traditional TabTransformer takes categorical embeddings and passes them through the Transformer blocks to transform them into contextual ones. Then, numerical features are concatenated with these contextual embeddings and are passed through the MLP to get a prediction.

TabTransformer diagram. Image by author.

Most of the magic happens inside the Transformer blocks, so it’s a shame that numerical features are left out and are only used in the final layers of the model. Gorishniy et al. (2021) propose to address this issue by embedding numerical features as well.

The embeddings that the FT-Transformer uses are linear, meaning that each feature gets transformed into a dense vector after passing through a simple fully connected layer. It should be noted that these dense layers don’t share weights, so there is a separate embedding layer per numeric feature.

Linear Numerical Embeddings. Image by author.
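To make the idea concrete, here is a minimal Keras sketch of per-feature linear embeddings. This is not the tabtransformertf implementation; the layer name, embedding dimension, and example shapes are my own choices.

```python
import tensorflow as tf

class LinearNumericalEmbedding(tf.keras.layers.Layer):
    """One Dense layer per numeric feature, with no weight sharing."""

    def __init__(self, n_features, embedding_dim=32):
        super().__init__()
        # A separate fully connected layer for each numeric feature
        self.feature_layers = [
            tf.keras.layers.Dense(embedding_dim) for _ in range(n_features)
        ]

    def call(self, x):
        # x: (batch, n_features) -> (batch, n_features, embedding_dim)
        embedded = [layer(x[:, i:i + 1]) for i, layer in enumerate(self.feature_layers)]
        return tf.stack(embedded, axis=1)


# Example: 3 numeric features, each mapped to its own 32-dimensional token
tokens = LinearNumericalEmbedding(n_features=3)(tf.random.normal((8, 3)))
print(tokens.shape)  # (8, 3, 32)
```

The resulting tokens have the same shape as categorical embeddings, which is exactly what allows them to enter the Transformer blocks.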

You might find yourself asking — why would you do that if these features are already numeric? The main reason is that numerical embeddings can be passed through the Transformer blocks together with the categorical ones. This adds more context to learn from and hence improves the representation quality.

Transformer with Numerical Embeddings. Image by author.

Interestingly, it was demonstrated (e.g. here) that the addition of these numerical embeddings can improve the performance of various deep learning models (not only TabTransformer), so they can be applied even to simple MLPs.

MLP with Numerical Embeddings. Image by author.

CLS Token

The usage of the CLS token is adapted from the NLP domain, but it translates quite nicely to tabular tasks. The basic idea is that after we’ve embedded our features, we append to them another “embedding” which represents the CLS token. This way, categorical, numerical, and CLS embeddings all get contextualised by passing through the Transformer blocks. Afterwards, the contextualised CLS token embedding serves as the input to a simple MLP classifier which produces the desired output.
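A minimal sketch of the mechanism (variable names and dimensions are assumptions, not the package’s API): prepend a learnable CLS embedding to the feature tokens, run the Transformer blocks, then feed only the contextualised CLS vector to the classifier head.

```python
import tensorflow as tf

embedding_dim = 32

# Learnable CLS embedding, shared across all samples
cls_embedding = tf.Variable(tf.random.normal((1, 1, embedding_dim)), trainable=True)

def prepend_cls(feature_tokens):
    # feature_tokens: (batch, n_tokens, embedding_dim)
    batch_size = tf.shape(feature_tokens)[0]
    cls = tf.tile(cls_embedding, [batch_size, 1, 1])   # (batch, 1, embedding_dim)
    return tf.concat([cls, feature_tokens], axis=1)    # (batch, n_tokens + 1, embedding_dim)

# After the Transformer blocks, only the contextualised CLS position feeds the MLP head:
#   contextual = transformer_blocks(prepend_cls(feature_tokens))
#   prediction = mlp_head(contextual[:, 0, :])
```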

FT-Transformer

By augmenting TabTransformer with numerical embeddings and a CLS token, we get the final proposed architecture.

FT-Transformer. Image by author.
Reported results for FT-Transformer. Source: Gorishniy et al. (2021)

From the results we can see that FT-Transformer outperforms gradient boosting models on a variety of datasets. In addition, it outperforms ResNet, which is a strong deep learning baseline for tabular data. Interestingly, hyperparameter tuning doesn’t change the FT-Transformer results that much, which might indicate that it is not very sensitive to hyperparameters.

This section is going to show you how to use FT-Transformer by validating the results for the Adult Income Dataset. I’m going to use a package called tabtransformertf which can be installed using pip install tabtransformertf. It allows us to use the tabular transformer models without extensive pre-processing. Below you can see the main steps and results of the analysis, but make sure to look into the supplementary notebook for more details.

Data pre-processing

Data can be downloaded from here or using a number of APIs. Data pre-processing steps are not that relevant for this post, so you can find a full working example on GitHub. FT-Transformer-specific pre-processing is similar to that of TabTransformer, since we need to create the categorical preprocessing layers and transform the data into TF Datasets.
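The package provides helpers for these steps, but conceptually they look roughly like the plain Keras/tf.data sketch below. The file and column names are hypothetical stand-ins for the Adult dataset, and the helper function here is my own, not the package’s.

```python
import pandas as pd
import tensorflow as tf

train_df = pd.read_csv("adult_train.csv")  # hypothetical file name

CATEGORICAL = ["workclass", "education", "marital-status"]  # subset, for illustration
NUMERIC = ["age", "capital-gain", "hours-per-week"]
LABEL = "income_over_50k"                                   # hypothetical 0/1 label column

# One StringLookup layer per categorical feature (the category_lookup preprocessing layers)
category_lookup = {
    col: tf.keras.layers.StringLookup(vocabulary=train_df[col].unique())
    for col in CATEGORICAL
}

def df_to_tf_dataset(df, batch_size=256, shuffle=True):
    features = {col: df[col].values for col in CATEGORICAL + NUMERIC}
    ds = tf.data.Dataset.from_tensor_slices((features, df[LABEL].values))
    if shuffle:
        ds = ds.shuffle(len(df))
    return ds.batch(batch_size)

train_ds = df_to_tf_dataset(train_df)
val_ds = df_to_tf_dataset(pd.read_csv("adult_val.csv"), shuffle=False)  # hypothetical split
```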

FT-Transformer Initialisation

Initialisation of the model is relatively straightforward, and each of the parameters is commented on. Three FT-Transformer-specific parameters are numerical_embeddings, numerical_embedding_type, and explainable; a rough initialisation sketch follows the list below.

  • numerical_embeddings — similar to category_lookup, these are preprocessing layers. They are set to None for FT-Transformer because we don’t pre-process numerical features.
  • numerical_embedding_type — set to linear for linear embeddings. More types will be covered in the next post.
  • explainable — if set to True, the model will output feature importances for each row. They’re inferred from the attention weights.
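As a rough illustration, initialisation could look like the sketch below. The constructor signature here is an assumption pieced together from the parameters discussed in this post and may differ from the current tabtransformertf API; the supplementary notebook contains the exact working call.

```python
from tabtransformertf.models.fttransformer import FTTransformer  # import path may differ

ft_model = FTTransformer(
    numerical_features=NUMERIC,          # numeric column names (assumed parameter name)
    categorical_features=CATEGORICAL,    # categorical column names (assumed parameter name)
    category_lookup=category_lookup,     # categorical preprocessing layers from the previous step
    numerical_embeddings=None,           # no numeric preprocessing for FT-Transformer
    numerical_embedding_type="linear",   # linear numerical embeddings
    embedding_dim=32,                    # token dimension (assumed value)
    depth=4,                             # number of Transformer blocks (assumed value)
    heads=8,                             # number of attention heads (assumed value)
    explainable=True,                    # also output attention-based feature importances
    out_dim=1,                           # single sigmoid output for binary classification
    out_activation="sigmoid",
)
```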

Model Training

The training procedure is similar to that of any Keras model. The only thing to watch out for: if you’ve specified explainable as True, you need two losses and two metrics instead of one.
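A compile-and-fit sketch is shown below, assuming the model exposes one output head for predictions and one for importances; the output names and the way the second (importance) loss is skipped here are assumptions, and the package may expect a slightly different specification.

```python
ft_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    # Two outputs: binary cross-entropy for the predictions, nothing to optimise for the importances
    loss={"output": tf.keras.losses.BinaryCrossentropy(), "importances": None},
    metrics={"output": [tf.keras.metrics.AUC(curve="PR", name="pr_auc")]},
)

history = ft_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=1000,  # early stopping halts training long before this
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
)
```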

Training takes roughly 70 epochs; below you can see the progress of the loss and metric values. To speed up training, you can reduce the number of early stopping rounds or simplify the model further (e.g. fewer attention heads).

Training/Validation loss and metric. Plots by author.

Evaluation

The test dataset is evaluated using ROC AUC and PR AUC, since it’s an imbalanced binary classification problem. To validate the reported results, I’m also including the accuracy metric, assuming a threshold of 0.5.
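A minimal sketch of the evaluation with scikit-learn metrics; how the probabilities are extracted from the model output, as well as the test_ds and y_test objects, are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

test_preds = ft_model.predict(test_ds)  # test_ds built like train_ds, without shuffling
# If explainable=True the model returns several outputs; assume the prediction head is "output"
probs = np.ravel(test_preds["output"] if isinstance(test_preds, dict) else test_preds)

print("ROC AUC :", roc_auc_score(y_test, probs))
print("PR AUC  :", average_precision_score(y_test, probs))
print("Accuracy:", accuracy_score(y_test, (probs >= 0.5).astype(int)))
```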

The resulting accuracy score is 0.8576, which is just slightly below the reported score of 0.86. This difference might be due to random variation during training or due to different hyperparameters. Still, the results are close enough to the reported ones, which is a good sign that the research is reproducible.

Explainability

One of the biggest advantages of FT-Transformer is its in-built explainability. Since all the features are passed through a Transformer, we can get their attention maps and infer feature importances. These importances are calculated using the following formula:

Feature importances formula. Source: Gorishniy et al. (2021)

where p_ihl is the h-th head’s attention map for the [CLS] token from the forward pass of the l-th layer on the i-th sample. The formula sums up all the attention scores for the [CLS] token across the different attention heads (heads parameter) and Transformer layers (depth parameter) and then divides them by heads × depth. Local importances (p_i) can be averaged across all rows to get the global importances (p).
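Written out, the formula described above is the following (a reconstruction from that description, with H the number of heads, L the depth, and N the number of rows):

```latex
p_{i} = \frac{1}{H \cdot L} \sum_{h=1}^{H} \sum_{l=1}^{L} p_{ihl},
\qquad
p = \frac{1}{N} \sum_{i=1}^{N} p_{i}
```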

Now, let’s see what the importances look like for the Adult Income dataset.

From the code above you can see that the model already outputs most of the information we need. Processing and plotting it gives the following results.
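Roughly, the processing looks like the sketch below; the "importances" output key and the column ordering are assumptions about what the model returns.

```python
import matplotlib.pyplot as plt
import pandas as pd

test_out = ft_model.predict(test_ds)

# Local importances: one row per sample, one column per feature plus the CLS token
local_importances = pd.DataFrame(
    test_out["importances"],                  # assumed output key
    columns=NUMERIC + CATEGORICAL + ["cls"],  # assumed column order
)

# Global importances: average the local ones over all rows (CLS excluded)
global_importances = (
    local_importances.drop(columns="cls").mean().sort_values(ascending=False)
)
global_importances.plot(kind="bar", title="Global feature importances")
plt.tight_layout()
plt.show()
```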

Feature importances. Plot by author.

The top 5 features indeed make sense, since people with larger incomes tend to be older, married, and more educated. We can sense-check the local importances as well by looking at the importances for the largest and smallest predictions.
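Continuing the sketch above, pulling out those two rows could look like this:

```python
# Rows with the highest and lowest predicted probability of earning >50K
highest, lowest = probs.argmax(), probs.argmin()
per_feature = local_importances.drop(columns="cls")

print("Top 3 features, highest prediction:")
print(per_feature.iloc[highest].nlargest(3))

print("Top 3 features, lowest prediction:")
print(per_feature.iloc[lowest].nlargest(3))
```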

Top 3 contributions. Created by author.

Again, the importances make intuitive sense. The person with the largest probability of earning more than 50K has large capital gains, 15 years of education, and is older. The person with the lowest chances is just 18 years old, has 10 years of education, and works 15 hours a week.

In this post you saw what the FT-Transformer is, how it differs from the TabTransformer, and how it can be trained using the tabtransformertf package.

Overall, the FT-Transformer is a promising addition to the deep tabular learning domain. By embedding not only categorical but also numerical features, the model significantly improves on TabTransformer and further reduces the gap between deep models and gradient boosted models like XGBoost. In addition, the model is explainable, which is beneficial in many domains.

My next post is going to cover different numerical embedding types (not just linear) which improve the performance even further. Stay tuned!

  • Adult Income Dataset (Creative Commons Attribution 4.0 International license (CC BY 4.0)) — Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • Gorishniy, Y., et al., 2021, Revisiting Deep Learning Models for Tabular Data, https://arxiv.org/abs/2106.11959
  • Vaswani, A., et al., 2017, Attention Is All You Need, https://arxiv.org/abs/1706.03762


