Transformers for Tabular Data (Part 3): Piecewise Linear & Periodic Encodings | by Anton Rubert | Nov, 2022

By Jessie Hobb On Nov 22, 2022

Advanced numerical embeddings for better performance

This is the third part in my exploration of Transformers for Tabular Data.

In the Part 2 I’ve described linear numerical embeddings and how they are used in the FT-Transformer model. This post is going to explore more complex versions of the numerical embeddings, so if you haven’t read the previous part, I highly recommend starting there and returning to this post afterwards.

As a reminder, above you can see the architecture for previously explored FT-Transformer. This model first embeds both numerical and categorical features and then passes these embeddings through the Transformer layers to obtain final CLS token representation.

Embedding of numerical features is a relatively new research topic, and this post is going to deep-dive into two newly proposed numerical embedding methodologies — Piecewise Linear Encoding and Periodic Encoding. Both of them were described in the paper by Gorishniy et al. (2022) called On Embeddings for Numerical Features in Tabular Deep Learning. Make sure to check it out after going through this post!

If you’re interested in simply applying these methods, then head over to the practical notebook where I show how to use them with tabtransformertf package. If you’re interested in how these methods actually work, then keep on reading!

Numerical embedding layers transform a single float into a dense numerical representation (embedding). This transformation is useful because these embeddings can be passed through Transformer blocks together with the categorical ones which adds more context to learn from.

Linear Embeddings

As a quick recap, Linear embedding layers are simple fully connected layers (optionally with ReLU activation). It’s important that these layers don’t share weights between each others, so there’s one embedding layer per numerical feature. For more information, read the previous post about the FT-Transformer.

Periodic Embeddings

The idea of periodic activations is quite prevalent in ML right now. For example, periodic encodings in the Transformer architecture allow the model to represent position of words in a sentence (you can read more about it e.g. here). But how exactly can it be applied to the tabular data? Gorishniy et al. (2022) propose the following equation to encode a feature x:

Periodic encoding equation. Source: Gorishniy et al. (2022)

Let’s try to unpack this approach. There are three main steps in the encoding process:

Transformation into pre-activation values (v) using a learned vector (c)
Activation of values (v) using Sine and Cosine
Concatenation of Sine and Cosine values

The first step is where the learning happens. The raw values of a feature get multiplied by learned parameters c_i where i is a dimensionality of embeddings. So, if we choose embedding dimensionality to be 3, there will be 3 parameters to learn per feature.

For an illustrative example, consider a randomly generated feature below.

Random feature distribution. Plot by author.

Using three different c parameters, we can transform it into three pre-activation values (i.e. embeddings with dimensionality of 3).

Periodic pre-activation embedding of the random feature. Plot by author.

Then, these pre-activation values get transformed into post-activation values using Sine and Cosine operations.

Periodic post-activation embeddings of the random feature. Plot by author.

As you can see, the slope affects periodic activations frequency. Pre-activations with larger slope (blue line) have post-activation values that have higher frequency. On the other hand, pre-activation values with small slope (green and orange lines) result in low-frequency activations. Judging from the diagram above, a feature value of 1 would be encoded approximately as [-0.98, -0.90, 0.85] and -1 would be encoded as [0.97, 0.8, -0.9] .

The authors also suggest adding an additional linear layer on top of the periodic encoding, so the final embedding diagram looks as displayed above.

Piecewise Linear Encoding (Quantile Binning)

This embedding method takes inspiration from one-hot-encoding, a popular categorical encoding methodology, and adapts it to the numerical features. The first step in this process is to split a feature into t bins. The authors suggest two splitting methods — quantile and target binning. This section describes the first method and the second method will be covered later.

Quantile binning is relatively straight forward —we split our feature into t equal width bins. For example, if we want to end up with 3 bins (i.e. t = 3), our quantiles to calculate are — 0, 0.33, 0.66, 1.0.

Quantile binning of the random feature. Plot by author.

Each quantile (Bt) gets represented as a tuple — [bin_start, bin_end) , so in this case we end up with 3 bins — [-3.85, -0.41), [-0.41, 0.44), [0.44. 3.26) . Formula notation for this representation is as follows:

Bins formula notation. Source: Gorishniy et al. (2022)

Once we have obtained these bins, we can start encoding the feature. Formula for encoding is presented below.

PLE formula. Source: Gorishniy et al. (2022)

As you can see, for each value we’re going to end up with a t dimensional embedding. There are 3 overall options — 0, 1, or something in-between. After applying this formula for each bin and for each value, our embeddings end up looking like this:

PLE embeddings of random the feature. Plot by author.

As you can see, smaller values have only one “active embedding” (PLE 1), in the middle we get PLE 2 active as well. Finally, in the last bin, all three embeddings get activated. This way, value of -1 turns approximately into [0.8, 0.0, 0.0] and 1 transforms into [1.0, 1.0, 0.2] .

PLE embeddings with quantile binning. Image by author.

Target Binning Approach

Target binning involves using the decision tree algorithm to assist in construction of the bins. As we saw above, the quantile approach splits our feature into the bins of equal width but it might be suboptimal in certain cases. A decision tree would be able to find the most meaningful splits with regards to the target. For example, if there was more target variance towards the larger values of the feature, majority of the bins could move to the right.

PLE embeddings with target binning. Image by author.

Extract from results reported by Gorishniy et al. (2022)

The paper did an extensive comparative study of all the proposed embedding methods combined with MLP, ResNet, and Transformer architectures. In this table L stands for Linear, Q stands for Quantile, T stands for target, LR stands for Linear with ReLu, and P stands for periodic.

As can be seen from the table above, there’s no single winner across the dataset (No Free Lunch Theorem in action), hence the embedding type might be treated as yet another hyperparameter to tune. Nevertheless, most of the times we see a significant improvement in performance when we compare Periodic and PLE encodings with simple linear embeddings.

Let’s see if we can re-create the results from this paper on a popular toy dataset — California Housing. You can see the full working notebook here, whereas below I’ll cover the main parts necessary for modelling. Like in the previous posts, I’ll be using my tabtrasnformertf package (please give it a star ⭐️ if you like it) which you can easily install (or update) using the command pip install -U tabtransformertf .

Data Download and Pre-processing

We can download the data using sklearn repository of toy datasets. The pre-processing procedure is quite simple — doing train/val/test split, scaling the data and transforming it into the TF Dataset.

Validation loss history. Generated by author.

Advanced numerical embeddings for better performance

This is the third part in my exploration of Transformers for Tabular Data.

Linear Embeddings

Periodic Embeddings

Let’s try to unpack this approach. There are three main steps in the encoding process:

Transformation into pre-activation values (v) using a learned vector (c)
Activation of values (v) using Sine and Cosine
Concatenation of Sine and Cosine values

For an illustrative example, consider a randomly generated feature below.

Using three different c parameters, we can transform it into three pre-activation values (i.e. embeddings with dimensionality of 3).

Then, these pre-activation values get transformed into post-activation values using Sine and Cosine operations.

The authors also suggest adding an additional linear layer on top of the periodic encoding, so the final embedding diagram looks as displayed above.

Piecewise Linear Encoding (Quantile Binning)

Once we have obtained these bins, we can start encoding the feature. Formula for encoding is presented below.

Target Binning Approach

Data Download and Pre-processing

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.

Transformers for Tabular Data (Part 3): Piecewise Linear & Periodic Encodings | by Anton Rubert | Nov, 2022

Advanced numerical embeddings for better performance

Linear Embeddings

Periodic Embeddings

Piecewise Linear Encoding (Quantile Binning)

Target Binning Approach

Data Download and Pre-processing

Periodic Embeddings

PLE with Quantile Binning Embeddings

PLE with Target Binning Embeddings

Advanced numerical embeddings for better performance

Linear Embeddings

Periodic Embeddings

Piecewise Linear Encoding (Quantile Binning)

Target Binning Approach

Data Download and Pre-processing

Periodic Embeddings

PLE with Quantile Binning Embeddings

PLE with Target Binning Embeddings