
Joining the Transformer Encoder and Decoder Plus Masking



Last Updated on October 26, 2022

We have arrived at a point where we have implemented and tested the Transformer encoder and decoder separately, and we may now join the two together into a complete model. We will also see how to create padding and look-ahead masks by which we will suppress the input values that will not be considered in the encoder or decoder computations. Our end goal remains to apply the complete model to Natural Language Processing (NLP).

In this tutorial, you will discover how to implement the complete Transformer model and create padding and look-ahead masks. 

After completing this tutorial, you will know:

  • How to create a padding mask for the encoder and decoder
  • How to create a look-ahead mask for the decoder
  • How to join the Transformer encoder and decoder into a single model
  • How to print out a summary of the encoder and decoder layers

Let’s get started. 

Joining the Transformer encoder and decoder and Masking
Photo by John O’Nolan, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  • Recap of the Transformer Architecture
  • Masking
    • Creating a Padding Mask
    • Creating a Look-Ahead Mask
  • Joining the Transformer Encoder and Decoder
  • Creating an Instance of the Transformer Model
    • Printing Out a Summary of the Encoder and Decoder Layers

Prerequisites

For this tutorial, we assume that you are already familiar with:

  • The Transformer model
  • The Transformer encoder
  • The Transformer decoder

Recap of the Transformer Architecture

Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need”

In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

You have seen how to implement the Transformer encoder and decoder separately. In this tutorial, you will join the two into a complete Transformer model and apply padding and look-ahead masking to the input values.  

Let’s start first by discovering how to apply masking. 

Masking

Creating a Padding Mask

You should already be familiar with the importance of masking the input values before feeding them into the encoder and decoder. 

As you will see when you proceed to train the Transformer model, the input sequences fed into the encoder and decoder will first be zero-padded up to a specific sequence length. The padding mask ensures that these zero values are not processed along with the actual input values by either the encoder or the decoder.

Let’s create the following function to generate a padding mask for both the encoder and decoder:
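
A minimal sketch of such a function in TensorFlow could look as follows; it simply marks every zero-valued entry with a 1.0:

import tensorflow as tf

def padding_mask(input):
    # Mark the zero padding values in the input with a 1.0
    mask = tf.math.equal(input, 0)
    mask = tf.cast(mask, tf.float32)
    return mask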

Upon receiving an input, this function generates a tensor that marks with a value of one every position where the input contains a zero.

Hence, if you input the following array:
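
For example (the values here are arbitrary; only the trailing zeros matter):

from numpy import array

input = array([1, 2, 3, 4, 0, 0, 0])
print(padding_mask(input))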

Then the output of the padding_mask function would be the following:
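
With the sketch above, this prints:

tf.Tensor([0. 0. 0. 0. 1. 1. 1.], shape=(7,), dtype=float32)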

Creating a Look-Ahead Mask

A look-ahead mask is required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.

For this purpose, let’s create the following function to generate a look-ahead mask for the decoder:
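
A minimal sketch builds a strictly upper-triangular matrix of ones with tf.linalg.band_part:

import tensorflow as tf

def lookahead_mask(shape):
    # Mask out future entries by marking them with a 1.0
    mask = 1 - tf.linalg.band_part(tf.ones((shape, shape)), -1, 0)
    return mask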

You will pass to it the length of the decoder input. Let’s make this length equal to 5, as an example:
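
Continuing with the sketch above:

print(lookahead_mask(5))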

Then the output that the lookahead_mask function returns is the following:
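
With the sketch above, this prints:

tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)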

Again, the one values mask out the entries that should not be used. In this manner, the prediction of every word only depends on those that come before it. 

Joining the Transformer Encoder and Decoder

Let’s start by creating the class, TransformerModel, which inherits from the Model base class in Keras:
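
A sketch of the class skeleton is shown below. The constructor arguments and the Encoder and Decoder signatures are assumed to match the implementations from the earlier tutorials; adapt them if your own code differs:

from tensorflow.keras import Model
from tensorflow.keras.layers import Dense

# Assumes the Encoder and Decoder classes live in encoder.py and decoder.py
from encoder import Encoder
from decoder import Decoder


class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length,
                 h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super().__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v,
                               d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v,
                               d_model, d_ff_inner, n, rate)

        # Final dense layer that maps the decoder output onto the target vocabulary
        self.model_last_layer = Dense(dec_vocab_size)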

Our first step in creating the TransformerModel class is to initialize instances of the Encoder and Decoder classes implemented earlier and assign them to the variables encoder and decoder, respectively. If you saved these classes in separate Python scripts, do not forget to import them. I saved my code in the Python scripts encoder.py and decoder.py, so I need to import them accordingly.

You will also include one final dense layer that produces the final output, as in the Transformer architecture of Vaswani et al. (2017). 

Next, you shall create the class method, call(), to feed the relevant inputs into the encoder and decoder.

A padding mask is first generated to mask the encoder input, as well as the encoder output when the latter is fed into the decoder's second attention block (the encoder-decoder attention):
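
Sketched as an excerpt of call() (padding_mask here is the class method shown in the complete listing below):

def call(self, encoder_input, decoder_input, training):

    # Padding mask for the encoder input; it is also reused to mask the
    # encoder output inside the decoder's second attention block
    enc_padding_mask = self.padding_mask(encoder_input)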

A padding mask and a look-ahead mask are then generated to mask the decoder input. These are combined through an element-wise maximum operation:
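
Continuing the excerpt:

    # Padding and look-ahead masks for the decoder input, combined through
    # an element-wise maximum
    dec_in_padding_mask = self.padding_mask(decoder_input)
    dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
    dec_in_lookahead_mask = tf.maximum(dec_in_padding_mask, dec_in_lookahead_mask)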

Next, the relevant inputs are fed into the encoder and decoder, and the Transformer model output is generated by feeding the decoder output into one final dense layer:
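
Continuing the excerpt (the Encoder and Decoder call signatures are assumed to match the earlier implementations):

    # Feed the inputs into the encoder and decoder, then project the decoder
    # output onto the target vocabulary
    encoder_output = self.encoder(encoder_input, enc_padding_mask, training)
    decoder_output = self.decoder(decoder_input, encoder_output,
                                  dec_in_lookahead_mask, enc_padding_mask, training)
    model_output = self.model_last_layer(decoder_output)

    return model_output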

Combining all the steps gives us the following complete code listing:
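
A complete sketch, under the same assumptions about the Encoder and Decoder interfaces, might read:

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense

# Assumes the Encoder and Decoder classes live in encoder.py and decoder.py
from encoder import Encoder
from decoder import Decoder


class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length,
                 h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super().__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v,
                               d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v,
                               d_model, d_ff_inner, n, rate)

        # Final dense layer that maps the decoder output onto the target vocabulary
        self.model_last_layer = Dense(dec_vocab_size)

    def padding_mask(self, input):
        # Mark the zero padding values in the input with a 1.0, shaped so that
        # it broadcasts over the attention weights: (batch, 1, 1, seq_length)
        mask = tf.cast(tf.math.equal(input, 0), tf.float32)
        return mask[:, tf.newaxis, tf.newaxis, :]

    def lookahead_mask(self, shape):
        # Mask out future entries by marking them with a 1.0
        return 1 - tf.linalg.band_part(tf.ones((shape, shape)), -1, 0)

    def call(self, encoder_input, decoder_input, training):

        # Padding mask for the encoder input; also reused to mask the encoder
        # output inside the decoder's second attention block
        enc_padding_mask = self.padding_mask(encoder_input)

        # Padding and look-ahead masks for the decoder input, combined through
        # an element-wise maximum
        dec_in_padding_mask = self.padding_mask(decoder_input)
        dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
        dec_in_lookahead_mask = tf.maximum(dec_in_padding_mask, dec_in_lookahead_mask)

        # Feed the inputs into the encoder and decoder, then project the decoder
        # output onto the target vocabulary
        encoder_output = self.encoder(encoder_input, enc_padding_mask, training)
        decoder_output = self.decoder(decoder_input, encoder_output,
                                      dec_in_lookahead_mask, enc_padding_mask, training)
        model_output = self.model_last_layer(decoder_output)

        return model_output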

Note that the output returned by the padding_mask function differs slightly here from the earlier standalone sketch: its shape is made broadcastable to the shape of the attention weight tensor that it will mask when you train the Transformer model.

Creating an Instance of the Transformer Model

You will work with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
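
The base-model values from the paper are:

h = 8               # Number of self-attention heads
d_k = 64            # Dimensionality of the linearly projected queries and keys
d_v = 64            # Dimensionality of the linearly projected values
d_model = 512       # Dimensionality of the model sub-layers' outputs
d_ff = 2048         # Dimensionality of the inner fully connected layer
n = 6               # Number of layers in the encoder and decoder stacks

dropout_rate = 0.1  # Frequency of dropping input units in the dropout layers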

As for the input-related parameters, you will work with dummy values for now until you arrive at the stage of training the complete Transformer model. At that point, you will use actual sentences:
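
For example (the specific numbers below are arbitrary placeholders):

enc_vocab_size = 20  # Vocabulary size for the encoder (dummy value)
dec_vocab_size = 20  # Vocabulary size for the decoder (dummy value)

enc_seq_length = 5   # Maximum length of the input sequence (dummy value)
dec_seq_length = 5   # Maximum length of the target sequence (dummy value)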

You can now create an instance of the TransformerModel class as follows:
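
Using the parameters defined above:

training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length,
                                  dec_seq_length, h, d_k, d_v, d_model, d_ff, n,
                                  dropout_rate)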

The complete code listing is as follows:
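
A consolidated sketch of these steps (the module name, model, is an assumption; import TransformerModel from wherever you saved the class):

from model import TransformerModel

# Model hyperparameters from "Attention Is All You Need" (base model)
h = 8               # Number of self-attention heads
d_k = 64            # Dimensionality of the linearly projected queries and keys
d_v = 64            # Dimensionality of the linearly projected values
d_model = 512       # Dimensionality of the model sub-layers' outputs
d_ff = 2048         # Dimensionality of the inner fully connected layer
n = 6               # Number of layers in the encoder and decoder stacks
dropout_rate = 0.1  # Frequency of dropping input units in the dropout layers

# Dummy input-related parameters, to be replaced with real data later
enc_vocab_size = 20
dec_vocab_size = 20
enc_seq_length = 5
dec_seq_length = 5

# Create an instance of the Transformer model
training_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length,
                                  dec_seq_length, h, d_k, d_v, d_model, d_ff, n,
                                  dropout_rate)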

Printing Out a Summary of the Encoder and Decoder Layers

You may also print out a summary of the encoder and decoder blocks of the Transformer model. Printing them out separately allows you to see the details of their individual sub-layers. To do so, add the following line of code to the __init__() method of both the EncoderLayer and DecoderLayer classes:
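
In the implementation followed here, that line builds the layer for its expected input shape (it assumes sequence_length and d_model are passed to, and stored by, the layer's constructor):

self.build(input_shape=[None, sequence_length, d_model])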

Then you need to add the following method to the EncoderLayer class:
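
A sketch of this helper, assuming the EncoderLayer's call() signature is call(self, x, padding_mask, training), and that Model and Input are imported from tensorflow.keras and tensorflow.keras.layers, respectively:

def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))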

And the following method to the DecoderLayer class:
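
And a corresponding sketch, assuming the DecoderLayer's call() signature is call(self, x, encoder_output, lookahead_mask, padding_mask, training); the same input placeholder stands in for both the target sequence and the encoder output:

def build_graph(self):
    input_layer = Input(shape=(self.sequence_length, self.d_model))
    return Model(inputs=[input_layer],
                 outputs=self.call(input_layer, input_layer, None, None, True))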

This results in the EncoderLayer class being modified as follows (the three dots under the call() method indicate that it remains the same as the one implemented in the earlier encoder tutorial):
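
In outline (the constructor signature is assumed to match the earlier encoder tutorial; the commented ellipsis stands for the sub-layer definitions, which also remain unchanged):

class EncoderLayer(Layer):
    def __init__(self, sequence_length, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super().__init__(**kwargs)
        # Newly added: build the layer for its expected input shape
        self.build(input_shape=[None, sequence_length, d_model])
        self.d_model = d_model
        self.sequence_length = sequence_length
        # ... sub-layer definitions as implemented earlier ...

    def build_graph(self):
        input_layer = Input(shape=(self.sequence_length, self.d_model))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))

    def call(self, x, padding_mask, training):
        ...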

Similar changes can be made to the DecoderLayer class too.

Once you have the necessary changes in place, you can proceed to create instances of the EncoderLayer and DecoderLayer classes and print out their summaries as follows:
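
For example, reusing the hyperparameters and dummy sequence lengths defined earlier (the constructor argument order is assumed to match the earlier tutorials):

encoder_layer = EncoderLayer(enc_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
encoder_layer.build_graph().summary()

decoder_layer = DecoderLayer(dec_seq_length, h, d_k, d_v, d_model, d_ff, dropout_rate)
decoder_layer.build_graph().summary()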

The resulting summary for the encoder is the following:

While the resulting summary for the decoder is the following:

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

  • Vaswani et al., “Attention Is All You Need,” 2017. https://arxiv.org/abs/1706.03762

Summary

In this tutorial, you discovered how to implement the complete Transformer model and create padding and look-ahead masks.

Specifically, you learned:

  • How to create a padding mask for the encoder and decoder
  • How to create a look-ahead mask for the decoder
  • How to join the Transformer encoder and decoder into a single model
  • How to print out a summary of the encoder and decoder layers

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

