
Generating Word Embeddings from Text Data using Skip-Gram Algorithm and Deep Learning in Python
By Angel Das, November 2022



Introduction to embeddings in natural language processing using Artificial Neural Networks and Gensim


The biggest challenge in NLP is devising a way to represent a word’s meaning. This is a critical component since different individuals communicate the same message differently. E.g., “I liked the food here” and “Quality of food seems good. I loved it!” At the end of the day, both texts reflect an idea or a thought represented by a sequence of words or phrases. In distributional semantics, the meaning of a word is given by the words that most frequently appear close by. When a word (w) appears in a sentence, its context is the set or sequence of words that appear nearby (within a fixed window size). E.g.,

  1. “We can use a traditional machine learning algorithm to create a driver model.”
  2. “Building deep learning frameworks is the need of the hour.”
  3. “Can we use a traditional approach and transfer learning towards improving performance?”

In NLP, we can use the context words of learning (e.g., (machine, algorithm), (deep, frameworks), (transfer, towards)) to build a vector representation of learning. We will talk about word vectors a little later on.

Word embedding is a technique used in natural language processing (NLP) to represent words for text analysis. Typically, this representation takes the form of a real-valued vector that encodes the word’s meaning, with the expectation that words closer to one another in the vector space have similar meanings. In word embedding, each word is mapped to a real-valued vector in a predetermined vector space. The approach is often grouped with deep learning because each word is assigned a single vector whose values are learned in a way that resembles training a neural network (Brownlee, 2017).

2.1 A Simple Example of an Embedding

Employing a dense, distributed representation for each word is essential to producing embeddings. Each word is represented by a real-valued vector with often tens, hundreds, or even thousands of dimensions. In contrast, sparse word representations, like one-hot encoding, need thousands or millions of dimensions.

This is a quick example of how the representation of two different words in a three-dimensional space would look. Note that a vector notation and hypothetical values represent the words “Angel” and “Eugene.”

Angel = 3i + 4j + 5k, Eugene = 2i + 4j + 4k

The beauty of embeddings is that they allow us to calculate the similarity of words. Similarity here reflects context, not necessarily meaning. If we use cosine similarity, then Similarity(Angel, Eugene) = (3×2 + 4×4 + 5×4)/(√(3²+4²+5²) × √(2²+4²+4²)) = 42/42.43 ≈ 0.99, or 99% similarity. You are probably wondering why this similarity matters in NLP. It allows models to contextualize words so that words appearing in a similar context are treated as having similar meaning. To validate this, let’s look at the two sentences below.

  1. “Angel and Eugene are working on a project proposal and plan to roll it out tomorrow.”
  2. “Eugene’s input on the model validation seems correct. Angel, can you implement the changes?”

Although we are looking at two sentences here, it appears that Angel and Eugene are teammates; hence the similarity remains high. Again, it’s not whether the words have a similar meaning but whether they appear in a similar context or co-occur.
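As a quick check on the arithmetic above, here is a minimal sketch using NumPy with the hypothetical vectors for “Angel” and “Eugene”:

import numpy as np

angel = np.array([3, 4, 5])     # hypothetical embedding for "Angel"
eugene = np.array([2, 4, 4])    # hypothetical embedding for "Eugene"

# Cosine similarity = dot product divided by the product of the vector norms
cos_sim = np.dot(angel, eugene) / (np.linalg.norm(angel) * np.linalg.norm(eugene))
print(round(cos_sim, 2))        # ~0.99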

3.1 Simple Probability Models

A simple probability model can use the chain rule to calculate the probability of words occurring in a text. It can use syntagmatic relationships to identify co-occurring words and calculate their probability. An example is given below.

Formula 1: Simple probability models. Image created by the Author using Latex & Jupyter Notebook.

p(Angel plays well) = p(Angel) × p(plays | Angel) × p(well | Angel plays). Computing these probabilities requires counting word and phrase occurrences. Despite increasing sophistication and accuracy, direct probability models like these have yet to achieve great performance.
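As an illustration only (not the author's code), here is a minimal sketch of estimating these chain-rule terms by counting occurrences in a tiny, hypothetical corpus:

from collections import Counter

corpus = ["angel plays well", "angel plays football", "eugene plays well"]   # hypothetical corpus
tokens = [sentence.split() for sentence in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
trigrams = Counter(tuple(sent[i:i + 3]) for sent in tokens for i in range(len(sent) - 2))
total_words = sum(unigrams.values())

# p(angel) x p(plays | angel) x p(well | angel plays), estimated by maximum-likelihood counts
p = (unigrams["angel"] / total_words) \
    * (bigrams[("angel", "plays")] / unigrams["angel"]) \
    * (trigrams[("angel", "plays", "well")] / bigrams[("angel", "plays")])
print(p)   # 2/9 * 1 * 1/2 ≈ 0.111

Even on this toy corpus, the counts become sparse quickly, which is one reason such direct counting models struggle at scale.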

3.2 Word2vec

Word2vec is a framework for learning word vectors. It is a shallow neural network that “vectorizes” words: it accepts a text corpus as input and produces a set of feature vectors representing the words in that corpus. The goal and benefit of Word2vec is that similar words end up with similar vectors, grouped together in vector space. Word2Vec comes in two distinct architectures: Skip-Gram and Continuous Bag of Words (CBOW).

Assume we have a substantial corpus of text and a fixed vocabulary in which each word is represented by a vector. We learn the embeddings using the following stepwise approach:

  • Go through each position t in the text; the word at position t is the center word c
  • The words appearing in its context (or “outside” words) are denoted o
  • Calculate the probability of o given c (or vice versa) using the similarity of the word vectors for c and o
  • Keep adjusting the word vectors to maximize this probability

3.2.1 Skip Gram Model

In a skip-gram model, we try to model the contextual words (e.g., word ± window size) given a particular word.

E.g., if we have the text “Angel plays football well” and consider “football” as the center word (hypothetically), the remaining words become its context. The window size determines how many context words we consider. Figures 1 and 2 below illustrate the window size and the process of computing probabilities using two different examples.

Figure 1: Example of a Skip-Gram model. The diagram shows the concept of center and context words using two different examples. t here denotes the position of a center word, whereas t-1 and t+1 represent the position of context words. Image created by the Author using PowerPoint.
Figure 2. Illustrates the likelihood function for the Skip-Gram model. In practice, an algorithm is trained to maximize this likelihood, which results in a more accurate representation of words through embeddings. Image created by the Author using Latex & Jupyter Notebook.
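Since the formula image may not render here, the standard form of the Skip-Gram likelihood (for a corpus of length T, window size m, and model parameters \theta) can be written as:

L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P\left(w_{t+j} \mid w_t ; \theta\right)

In practice, the model is trained by minimizing the equivalent average negative log-likelihood:

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\left(w_{t+j} \mid w_t ; \theta\right)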

3.2.1.1 Example of generating Skip-Gram-based embedding using Gensim

Gensim is an open-source framework that uses modern statistical machine learning for unsupervised topic modeling, document indexing, embedding creation, and other NLP features. We will use some standard texts from Gensim to create word embeddings here.

#----------------------------------------Install and import necessary libraries-----------------------------------
# !pip install contractions
import re, string, unicodedata                          # Import Regex, string and unicodedata.
import contractions # Import contractions library.
from bs4 import BeautifulSoup # Import BeautifulSoup.
import numpy as np # Import numpy.
import pandas as pd # Import pandas.
import nltk # Import Natural Language Tool-Kit.
# nltk.download('stopwords') # Download Stopwords.
# nltk.download('punkt')
# nltk.download('wordnet')
from nltk.corpus import stopwords # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize # Import Tokenizer.
from nltk.stem.wordnet import WordNetLemmatizer # Import Lemmatizer. For removing stem words
import unicodedata # Removing accented characters
from nltk.stem import LancasterStemmer
from IPython.display import display

These packages are useful for data cleaning and preprocessing.

import gensim
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Use sg=1 within Word2Vec() to create a Skip-Gram model. vector_size sets the dimensionality of each word's embedding vector. window sets the window size, i.e., the maximum distance between the center word and a context word used to generate the embeddings.

model = Word2Vec(sentences=common_texts, vector_size=3, window=5, min_count=1, workers=4, sg=1)
model.save("word2vec.model")

print("Embeddings of a Word computer with 3 as Embedding Dimension")
print(model.wv['computer'])
print("Embeddings of a Word Tree with 3 as Embedding Dimension")
print(model.wv['trees'])
print("Embeddings of a Word Graph with 3 as Embedding Dimension")
print(model.wv['graph'])

Embeddings of a Word computer with 3 as Embedding Dimension
[ 0.19234174 -0.2507943  -0.13124168]
Embeddings of a Word Tree with 3 as Embedding Dimension
[ 0.21529572  0.2990996  -0.16718094]
Embeddings of a Word Graph with 3 as Embedding Dimension
[ 0.3003091  -0.31009832 -0.23722696]
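Gensim also lets us query the trained model directly for similar words; a minimal sketch using the model trained above (results will vary between runs because the toy corpus is tiny and training is stochastic):

# Words most similar (by cosine similarity) to 'graph' in the embedding space
print(model.wv.most_similar('graph', topn=3))

# Pairwise cosine similarity between two words
print(model.wv.similarity('user', 'human'))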

We will now take the embedding vectors of a few words and use matplotlib to create a visualization. The raw values of the embeddings may make little sense when you examine them, but when plotted, words used in similar contexts should appear closer to each other.

human_array = model.wv['human']
user_array = model.wv['user']
interface_array = model.wv['interface']

from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

labels = ['human', 'user', 'interface']

fig = plt.figure(figsize=(10, 8))
ax = plt.axes(projection='3d')

# Defining all 3 axes
x = [human_array[0], user_array[0], interface_array[0]]
y = [human_array[1], user_array[1], interface_array[1]]
z = [human_array[2], user_array[2], interface_array[2]]

# Plotting each word as a point in 3D space
ax.scatter(x, y, z, color=['red', 'green', 'blue'])

# Annotating each point with its word
ax.text(human_array[0], human_array[1], human_array[2], "human")
ax.text(user_array[0], user_array[1], user_array[2], "user")
ax.text(interface_array[0], interface_array[1], interface_array[2], "interface")

ax.set_title('3D Representation of the words human, user, and interface')
plt.show()
Figure 3. 3-Dimensional representation of words “interface”, “user”, and “human” using embedding generated from Gensim. Image created by the author using Jupyter Notebook.

3.2.2 Continuous Bag-of-Words (CBOW)

In a continuous bag-of-words (CBOW) model, we try to model the center word given its surrounding words (e.g., word ± window size).

Figure 4: Example of a CBOW model. Similar to Skip Gram, the likelihood function for CBOW is displayed above. Image created by the Author using Jupyter Notebook and Latex.
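For comparison with the Skip-Gram example in Section 3.2.1.1, Gensim trains a CBOW model when sg=0 (which is also the default); a minimal sketch on the same common_texts corpus:

from gensim.models import Word2Vec
from gensim.test.utils import common_texts

# sg=0 selects the CBOW architecture; the remaining parameters mirror the Skip-Gram example
cbow_model = Word2Vec(sentences=common_texts, vector_size=3, window=5, min_count=1, workers=4, sg=0)
print(cbow_model.wv['computer'])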

We will use randomly generated data about Messi and Ronaldo and try to develop word embeddings by training a Neural Network model (Bujokas, 2020). We will keep the size of the embeddings to two, which allows us to create a plot in Python and visually inspect which words are similar in a given context. You can find the sample data generated by the Author here.

4.1 Basic NLP Libraries

import itertools
import pandas as pd
import numpy as np
import re
import os
from tqdm import tqdm
#----------------------Drawing the embeddings in Python
import matplotlib.pyplot as plt
#-----------------------Deep learning: ANN to create and train embeddings
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

4.2 User-Defined Function: Words/Tokens from Data to Dictionary

A user-defined function that creates a dictionary where the keys are the unique words and the values are their indices.

def create_unique_word_to_dict(sample):
    """
    User-defined function that creates a dictionary where the keys represent unique words
    and the values are their indices.
    """
    #-------Extracting all the unique words from our sample text and sorting them alphabetically
    #-------set() removes any duplicated words and keeps only the unique ones
    words_list = list(set(sample))

    #-------------Sort applied here
    words_list.sort()

    #------Creating a dictionary entry for each unique word in the sample text
    unique_word_dict_containing_text = {}
    for i, word in enumerate(words_list):   #------For each word in the document
        unique_word_dict_containing_text.update({word: i})

    return unique_word_dict_containing_text
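A quick usage example on a small, hypothetical token list:

print(create_unique_word_to_dict(['messi', 'scored', 'goals', 'messi']))
# {'goals': 0, 'messi': 1, 'scored': 2}   (unique words, sorted alphabetically, mapped to indices)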

4.3 User-Defined Function: Data Cleaning or Text Preprocessing

This is a custom text preprocessing function. One can enhance the list of stop words used here to account for more frequently used unwanted words. We use this function to perform the following:

  • Remove punctuations
  • Remove numbers
  • Remove whitespace
  • Remove stop words
  • Return cleaned text
def text_preprocessing_custom(sample_text):
    """
    A method to preprocess text: removes punctuation, numbers, extra whitespace, and stop words.
    """
    #----------------Punctuation characters to strip out------------------
    punctuations = r''';:'"\,<>./?*_“~'''

    #---------------Stop Words (Custom List)--------------------
    stop_words = ['i', 'dont', 'we', 'had', 'and', 'are', 'of', 'a', 'is', 'the', 'in', 'be',
                  'will', 'this', 'with', 'if', 'be', 'as', 'to', 'is', 'don\'t']

    #-------------Removing punctuation characters---------------
    for x in sample_text.lower():
        if x in punctuations:
            sample_text = sample_text.replace(x, "")

    #-------------Removing words that contain numbers using a regular expression---------
    sample_text = re.sub(r'\w*\d\w*', '', sample_text)
    #-------------Removing extra whitespace---------------------
    sample_text = re.sub(r'\s+', ' ', sample_text).strip()
    # --------Convert to lower case----------------------
    sample_text = sample_text.lower()
    #--------------Converting the text to a list of tokens------------------
    sample_text = sample_text.split(' ')
    #--------------Deleting empty strings---------------------
    sample_text = [x for x in sample_text if x != '']
    # Stop word removal
    sample_text = [x for x in sample_text if x not in stop_words]

    return sample_text
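A quick usage example with a hypothetical sentence (the number is dropped by the digit filter, and "and"/"the" by the custom stop-word list):

print(text_preprocessing_custom("Messi scored 2 goals, and the fans loved it."))
# ['messi', 'scored', 'goals', 'fans', 'loved', 'it']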

4.4 Data Preprocessing and [Center, Context] word Generation

Keeping the concept of a Skip Gram model in mind, we need to create a framework that allows us to identify all combinations of Center and Context words from our text using a window size of 2. Note that each word in a given text (after being pre-processed) can be a center word in one instance and a context word in another. Ideally, we allow each word in the text to be a Center word and then find relevant Context words accordingly.

from scipy import sparse

sample = pd.read_csv('Text Football.csv')      #-----Reading text data
sample = [x for x in sample['Discussion']]
print("--"*50)
print("Text in the File")
print("--"*50)
print(sample)

#----------------------------------------Defining the window for context----------------------------------------
window_size = 2
#----------------------------------------Creating empty lists to store texts---------------
word_lists = []
all_text = []

#----------------Combines preprocessed texts from the Sample Data
for text in sample:
    #----------------------------------------Cleaning the text
    text = text_preprocessing_custom(text)
    all_text += text

    #-------------------------Iterating across each word in a text
    for i, word in enumerate(text):
        for ws in range(window_size):
            if i + 1 + ws < len(text):
                word_lists.append([word] + [text[i + 1 + ws]])
            if i - ws - 1 >= 0:
                word_lists.append([word] + [text[i - ws - 1]])

unique_word_dict = create_unique_word_to_dict(all_text)

print("--"*50)
print("Text to Sequence in post cleaning")
print("--"*50)
print(unique_word_dict)
print("--"*50)
print("Text to WordList [main word, context word]")
print("--"*50)
print(word_lists)
Figure 5: Example of how Center (Main) and Context words are generated. Image created by the Author using Jupyter Notebook.

4.5 Generating Input & Output Data for the Neural Network

We will now use the main (center) words as inputs to our neural network and the context words as outputs. The dense layer between the input and output layers learns the embeddings. The idea is to train a neural network with a single hidden layer of two neurons; the weights learned while training the model are the embeddings learned from the data. For our data to be compatible with a TensorFlow architecture, we perform one-hot encoding (OHE) on the lists of main and context words generated above.

def create_OHE_data(word_dict, word_lists):
    #----------------------------------------Defining the number of unique words----------------------------------------
    n_unique_words = len(word_dict)
    #----------------------------------------Getting all the unique words----------------------------------------
    words = list(word_dict.keys())
    #----------------------------------------Creating the X and Y matrices using one hot encoding---------------------------------
    X = []
    Y = []
    for i, word_list in tqdm(enumerate(word_lists)):
        #----------------------------------------Getting indices----------------------------------------
        index_main_word = word_dict.get(word_list[0])
        index_context_word = word_dict.get(word_list[1])
        #----------------------------------------Creating the placeholders
        Xrow = np.zeros(n_unique_words)
        Yrow = np.zeros(n_unique_words)
        #----------------------------------------One hot encoding the main word, Input Matrix
        Xrow[index_main_word] = 1
        #----------------------------------------One hot encoding the context word, Output Matrix
        Yrow[index_context_word] = 1
        #----------------------------------------Appending to the main matrices
        X.append(Xrow)
        Y.append(Yrow)
    #------------------------Converting the matrices into a sparse format because the vast majority of the data are 0s
    X_Matrix = sparse.csr_matrix(X)
    Y_Matrix = sparse.csr_matrix(Y)

    print("--"*50)
    print("Input Data [Showing the First Record]")
    print("--"*50)
    print(X_Matrix.todense()[0])
    print(X_Matrix.todense().shape)
    print("--"*50)
    print("Output Data [Showing the first record]")
    print("--"*50)
    print(Y_Matrix.todense()[0])
    print(Y_Matrix.todense().shape)

    return X_Matrix, Y_Matrix

X, Y = create_OHE_data(unique_word_dict, word_lists)
Figure 6: Example of OHE to generate the Input and Output Data. Image created by the Author using Jupyter Notebook.

4.6 Training the Neural Network to generate Embeddings

An example of how the sample architecture would look is represented below.

Figure 7: Illustrates a two-neuron neural network to train embeddings. Image developed by the Author using PowerPoint.
from tensorflow.keras.models import Sequential

def embedding_model():
    model = Sequential()
    #-----Hidden layer with 2 neurons: the learned weights of this layer are the 2-D embeddings
    model.add(Dense(2, input_shape=(X.shape[1],), activation='relu'))
    #-----Output layer: a probability distribution over the vocabulary (context words)
    model.add(Dense(units=Y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model_embedding = embedding_model()
model_embedding.summary()

#----------------------------------------Optimizing the network weights----------------------------------------
model_embedding.fit(
    x=X.todense(),
    y=Y.todense(),
    batch_size=64,
    epochs=1000,
    verbose=0
)
#----------------------------------------Obtaining the weights from the neural network----------------------------------------
#----------------------------------------These weights are equivalent to word embeddings--------------------------------------
weights = model_embedding.get_weights()[0]   #---embeddings

#----------------------------------------Creating a dictionary to store the embeddings----------------------------------------
embedding_dict = {}
for word in list(unique_word_dict.keys()):   # words_list is local to the UDF above, so iterate over the vocabulary keys
    embedding_dict.update({
        word: weights[unique_word_dict.get(word)]
    })
embedding_dict
Figure 8: Example of embeddings generated from the model. Image created by the Author using Jupyter Notebook.
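To tie this back to the similarity idea in Section 2.1, we can compare two learned embeddings directly; a minimal sketch, assuming the words 'messi' and 'ronaldo' survive preprocessing and appear in embedding_dict:

v1 = embedding_dict['messi']     # assumed to be in the vocabulary
v2 = embedding_dict['ronaldo']   # assumed to be in the vocabulary
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos_sim)                   # values close to 1 indicate the two words occur in similar contexts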
# !pip install ann_visualizer
from ann_visualizer.visualize import ann_viz
ann_viz(model_embedding, view=True, filename='embed.pdf', title='ANN Architecture for embedding')
Figure 9. Illustrates the ANN architecture used to generate the embeddings. Image developed by the author using Jupyter + ann_visualizer()

4.7 Visually representing the learned Embeddings

#----------------------------------------Plotting the embeddings----------------------------------------
import seaborn as sns
sns.set(color_codes=True)

plt.figure(figsize=(15, 10))
plt.title("Embeddings")
for word in list(unique_word_dict.keys()):
    coord = embedding_dict.get(word)             #------------------------Extracting embeddings
    plt.scatter(coord[0], coord[1])              #-----------------------Plotting embeddings
    plt.annotate(word, (coord[0], coord[1]))     #----------------Annotating tokens

Figure 10: Visual representation of the embeddings generated from the model. Image created by the Author using Jupyter Notebook.

4.8 Observation

In football, Messi and Ronaldo are often compared by their respective fans across the globe; hence, they have a higher tendency to co-occur in the data. Terms like “greatest,” “players,” “scored,” and “outscore” are often used to comment on a player’s performance. Using the 12 randomly generated texts, we captured those variations and co-occurrences in the graph above. Note that a few stop words were not removed and therefore appear in the plot; this can be improved by extending our text_preprocessing_custom() function. As reflected earlier, the embeddings from Figure 8 may not make much sense when we study the numerical values, but when plotted, we see that co-occurring words used in similar contexts appear closer to each other.

TensorFlow also provides an Embedding layer that can learn embeddings from text automatically while we train a classifier model, for example a recurrent neural network. The process involves mapping words or sentences to vectors, i.e., collections of numbers, using text vectorization. Our one-hot vector, which is sparse because most of its entries are zero, is transformed into a dense embedding vector by the Embedding layer (dense because the dimensionality is much smaller and the elements are real numbers). This embedding layer amounts to a single fully connected layer (TensorFlow, n.d.). However, training embeddings while training a classification model is time-consuming. In such scenarios, the flexibility to generate embeddings separately, especially when we have a large volume of data, allows us to reduce the computation time involved in building a classification model. Embeddings are also a better representation of words than traditional approaches like Bag of Words and Term Frequency-Inverse Document Frequency. Pre-trained models like BERT, Bio-BERT, GloVe, Edu-BERT, etc., also give users the flexibility to generate embeddings without training from scratch.
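For reference, a minimal sketch of how an Embedding layer is typically wired into a Keras text classifier; the vocabulary size, embedding dimension, and downstream layers here are illustrative assumptions, not taken from this article:

import tensorflow as tf

vocab_size = 10000       # illustrative vocabulary size
embedding_dim = 64       # illustrative embedding dimension

clf = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),   # learns one dense vector per token id
    tf.keras.layers.GlobalAveragePooling1D(),                # collapses the sequence dimension
    tf.keras.layers.Dense(1, activation='sigmoid')           # binary classifier head
])
clf.compile(loss='binary_crossentropy', optimizer='adam')

# After training on tokenized text, the learned embeddings can be read back with:
# embeddings = clf.layers[0].get_weights()[0]   # shape: (vocab_size, embedding_dim)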

  1. Bujokas, E. (2020, May 30). Creating Word Embeddings: Coding the Word2Vec Algorithm in Python using Deep Learning. Retrieved from the Medium website: https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8
  2. Brownlee, J. (2017, October 10). What Are Word Embeddings for Text? Retrieved from the Machine Learning Mastery website: https://machinelearningmastery.com/what-are-word-embeddings/
  3. TensorFlow. (n.d.). Text classification with an RNN. Retrieved from the TensorFlow website: https://www.tensorflow.org/text/tutorials/text_classification_rnn


