
Generating Sentence-Level Embedding Based on the Trends in Token-Level BERT Embeddings
By Yuli Vasiliev, January 2023



Photo by Igor Shabalin

How to derive sentence-level embedding from word embeddings

Sentence-level (phrase- or passage-level) embeddings are often used as input in NLP classification problems, for example, in spam detection and question-answering (QA) systems. In my previous post, Discovering Trends in BERT Embeddings of Different Levels for the Task of Semantic Context Determining, I discussed how you might generate a vector representation that captures how a token’s contextual embedding values change relative to its static embedding, and how that vector can then serve as a component of a sentence-level embedding. This article expands on that topic, exploring which tokens in a sentence you need to derive such trend vectors from in order to generate an efficient embedding for the entire sentence.

The first question that arises here is: from how many tokens in a sentence do you need to derive such trend vectors in order to generate an efficient embedding for the entire sentence? Recall from the previous post that we obtained a single vector, derived for the most important word in a sentence, that includes information about the context of the entire sentence. However, to get a fuller picture of the sentence context, it would also be useful to have such a vector for the word that is most syntactically related to that most important word. Why do we need this?

A simple analogy from everyday life can help answer this question: if you admire the surrounding scenery while sitting, say, in a restaurant located inside a tower, the views you contemplate will not include the tower itself. To take a photo of the tower, you first need to step outside it.

So how can we determine the word that is most syntactically related to the most important word in the sentence (in terms of the previous analogy, how do we decide on the best place from which to photograph the tower)? The answer: with the help of the attention weights, which you can also obtain from the BERT model.

Before you can follow the code discussed in the rest of this article, you’ll need to refer to the example provided in the previous post (we’re going to use the model and the derived vector representations defined in that example). The only change you need to make is the following: when creating the model, be sure to make it return not only the hidden states but also the attention weights:

model = BertModel.from_pretrained('bert-base-uncased',
    output_hidden_states = True,  # so that the model returns all hidden states
    output_attentions = True      # so that the model also returns the attention weights
)

Everything else, including sample sentences, can be used without modification. Actually, we’re going to use only the first sample sentence: ‘I want an apple.’
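
For reference, the overall setup assumed from the previous post might look roughly like the sketch below. The variable names outputs and hidden_states (and the Want trend vector l0_12_1 used later) are assumptions inferred from how they are used in this article; the actual code in the previous post may differ in its details.

# A minimal sketch of the setup assumed from the previous post (not a verbatim copy).
import numpy as np
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
    output_hidden_states = True,
    output_attentions = True
)
model.eval()

sents = ['I want an apple.']  # only the first sample sentence is used here
outputs = []
hidden_states = []
with torch.no_grad():
    for sent in sents:
        inputs = tokenizer(sent, return_tensors='pt')
        out = model(**inputs)
        outputs.append(out)
        # out.hidden_states is a tuple of 13 tensors: the layer-0 (static) input
        # embeddings plus the outputs of the 12 encoder layers
        hidden_states.append(out.hidden_states)
# l0_12_1, the trend vector for Want, is assumed to have been computed here
# exactly as shown in the previous post.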

Below, we determine the word that is syntactically closest to the most important word (Want, in this particular example). For that, we check the attention weights in all 12 layers. To start with, we create an empty array whose width equals the number of tokens, not counting the special first and last tokens ([CLS] and [SEP]):

# the width equals the number of tokens between [CLS] and [SEP]
a = np.empty([0, len(np.sum(outputs[0].attentions[0][0][11].numpy(), axis=0)[1:-1])])

Next, we fill in the matrix of attention weights:

for i in range(12):
    # total attention received by each token, excluding [CLS] and [SEP]
    a = np.vstack([a, np.sum(outputs[0].attentions[0][0][i].numpy(), axis=0)[1:-1]])

We are not interested in the punctuation token, so we’ll remove the last column of the matrix:

a = np.delete(a, -1, axis=1)

Our matrix now looks as follows (12×4, i.e., 12 layers and 4 words):

print(a)
[[0.99275106 1.00205731 0.76726311 0.72082734]
[0.7479955 1.16846883 0.63782167 1.39036024]
[1.23037624 0.40373796 0.57493907 0.25739866]
[1.319888 1.21090519 1.37013197 0.7479018 ]
[0.48407069 1.15729702 0.54152751 0.57587731]
[0.47308242 0.61861634 0.46330488 0.47692096]
[1.23776317 1.2546916 0.92190945 1.2607218 ]
[1.19664812 0.51989007 0.48901123 0.65525496]
[0.5389185 0.98384732 0.8789593 0.98946768]
[0.75819892 0.80689037 0.5612824 1.10385513]
[0.14660755 1.10911655 0.84521955 1.00496972]
[0.77081972 0.79827666 0.45695013 0.36948431]]

Let’s now determine in which layers Want (the second column, with index 1) drew the most attention:

print(np.argmax(a,axis=1))
array([1, 3, 0, 2, 1, 1, 3, 0, 3, 3, 1, 1])

b = a[np.argmax(a,axis=1) == 1]

Next, we can determine which token draws the most attention after Want in the layers where Want is in the lead. For that, we first delete the Want column and then examine the remaining three:

c = np.delete(b, 1, axis=1)
d = np.argmax(c, axis=1)
print(d)
[0 2 2 2 0]

counts = np.bincount(d)
print(np.argmax(counts))
2

The above shows that the word Apple (after deleting the Want column, Apple’s index is 2) is the one most syntactically related to the word Want. This is quite expected, because these words represent the direct object and the transitive verb, respectively. Let’s now derive the trend vector for Apple (token index 4 in the tokenized sentence), just as we did for Want in the previous post:

_l12_1 = hidden_states[0][12][0][4][:10].numpy()  # layer-12 (contextual) embedding of Apple, first 10 values
_l0_1 = hidden_states[0][0][0][4][:10].numpy()    # layer-0 (static) embedding of Apple, first 10 values
_l0_12_1 = np.log(_l12_1/_l0_1)
_l0_12_1 = np.where(np.isnan(_l0_12_1), 0, _l0_12_1)  # zero out NaNs produced by negative ratios

Let’s now compare the vectors derived from the embeddings of the words Apple and Want.

print(_l0_12_1)
array([ 3.753544 , 1.4458075 , -0.56288993, -0.44559467, 0.9137548 ,
0.33285233, 0. , 0. , 0. , 0. ],
dtype=float32)

print(l0_12_1) # this vector has been defined in the previous post

array([ 0. , 0. , 0. , 0. , -0.79848075,
0.6715901 , 0.30298436, -1.6455574 , 0.1162319 , 0. ],
dtype=float32)

As you can see, in most pairs of matching elements across the two vectors, one value is zero while the other is non-zero; that is, the vectors look complementary. (Remember the tower analogy: the neighboring sights are visible from the tower, but to see the tower itself, perhaps the main attraction, you need to leave it.) So you can safely sum these vectors elementwise to combine the available information into a single vector:

s = _l0_12_1 + l0_12_1
print(s)
array([ 3.753544 , 1.4458075 , -0.56288993, -0.44559467, 0.11527407,
1.0044425 , 0.30298436, -1.6455574 , 0.1162319 , 0. ],
dtype=float32)

The above vector can next be used as input for sentence-level classification.
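
For illustration, here is a minimal sketch of how such per-sentence vectors could feed a downstream classifier. It uses scikit-learn’s LogisticRegression, which is not part of the original example, and the training data below is purely hypothetical; in practice you would stack one vector per training sentence.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training set: each row is a sentence-level vector built with the
# procedure above (such as s); each label is the class of the corresponding sentence.
X_train = np.vstack([s, np.zeros_like(s)])  # placeholder second example
y_train = np.array([1, 0])                  # hypothetical binary labels

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(s.reshape(1, -1)))        # predicted class for our sentence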

This article provides the intuition, along with the code, for generating a sentence-level embedding based on the trends in token-level BERT embeddings as they move from static to contextual. This sentence-level embedding can then be used as an alternative to the [CLS] token embedding generated by BERT for sentence classification, meaning you can try both to see which one works best for your particular problem.

