
Sentiment Analysis — Intro and Implementation | by Farzad Mahmoodinobar | Nov, 2022



Sentiment analysis using NLTK, scikit-learn and TextBlob

Analyzing text to understand feelings, by DALL·E 2

Have you ever left an online review for a product, service or maybe a movie? Or maybe you are one of those who just do not leave reviews — then, how about making any textual posts or comments on Twitter, Facebook or Instagram? If the answer is yes, then there is a good chance that algorithms have already reviewed your textual data in order to extract some valuable information from it.

Brands and businesses make decisions based on the information extracted from such textual artifacts. For example, if a movie was not a success on Netflix or Prime Video, scientists at those companies would dive into the movie's reviews to understand the reasons behind the failure and avoid making the same mistakes in the future. Investment companies monitor tweets (and other textual data) as one of the variables in their investment models — Elon Musk has been known to make such financially impactful tweets every once in a while! If you are curious to learn more about how these companies extract information from such textual inputs, then this post is for you.

In this post, we are going to learn more about the Technical Requirements to Become a Data Scientist by taking a closer look at Sentiment Analysis. In the field of Natural Language Processing (NLP), sentiment analysis is a tool to identify, quantify, extract and study subjective information from textual data. For example, “I like watching TV shows.” carries a positive sentiment. But the sentiment could be even more positive if one says “I really like watching TV shows!”. Sentiment analysis attempts to quantify the sentiment conveyed in textual data. One of the most common use cases of sentiment analysis is enabling brands and businesses to review their customers’ feedback and monitor their level of satisfaction. As you can imagine, it would be quite expensive to have humans read every customer review to determine whether customers are happy with the business, service, or product. In such cases, brands and businesses use machine learning techniques such as sentiment analysis to achieve similar results at scale.

Similar to my other posts, learning will be achieved through practice questions and answers. I will include hints and explanations in the questions as needed to make the journey easier. Lastly, the notebook that I used to create this exercise is linked at the bottom of the post, which you can download, run, and follow along with.

Let’s get started!

In order to practice sentiment analysis, we are going to use a data set from the UCI Machine Learning Repository, which is based on the paper “From Group to Individual Labels using Deep Features” (Kotzias et al., 2015) and can be downloaded from this link (CC BY 4.0).

Let’s start with importing the libraries we will be using today, then read the data set into a dataframe and look at the top five rows of the dataframe to familiarize ourselves with the data.

# Import required packages
import numpy as np
import pandas as pd
import nltk
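# Note: the NLTK tokenizers and stopword lists used later may require a
# one-time download if they are not already available locally, e.g.:
# nltk.download('punkt')
# nltk.download('stopwords')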

# Making width of the column viewable
pd.set_option('display.max_colwidth', None)

# Read the data into a dataframe
df = pd.read_csv('movie_reduced.csv')

# look at the top five rows of the dataframe
df.head()

Results:

There are only two columns. “text” contains the review itself and “label” indicates the sentiment of the review. In this dataset a label of 1 indicates a positive sentiment, while a label of 0 indicates a negative sentiment. Since there are only two classes of labels, let’s look at whether these two classes are balanced or imbalanced. Classes are considered balanced when each class accounts for (roughly) the same portion of the total observations. Let’s look at the data, which makes this easier to understand.

df['label'].value_counts()

The data is almost equally divided between positive and negative sentiments, therefore we consider the data to have balanced classes.
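If you prefer proportions to raw counts, a quick sketch using pandas’ normalize option gives the share of each label:

# Share of each label rather than raw counts
df['label'].value_counts(normalize = True)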

Next, we are going to create a sample string, which includes the very first entry in the “text” column of the dataframe. In some of the questions, we will apply various techniques to this one sample to better understand the concepts. Let’s go ahead and create our sample string.

# Take the very first text entry of the dataframe
sample = df.text[0]
sample

Tokens and Bigrams

In order for programs and computers to understand/consume textual data, we start by breaking down larger segments of textual data into smaller pieces. Breaking down a sequence of characters (such as a string) into smaller pieces (or substrings) is called tokenization and the functions that perform tokenization are called tokenizers. A tokenizer can break down a given string into a list of substrings. Let’s look at an example.

Input: “What is a sentence?”

If we apply a tokenizer to the above “Input”, we will get the following “Output”:

Output: [‘What’, ‘is’, ‘a’, ‘sentence’, ‘?’]

As expected, the output is a sequence of the tokenized substrings of the input sentence.

We can implement this concept with nltk’s word_tokenize function. Let’s see how this works in an example.

Question 1:

Tokenize the generated sample and return the first 10 tokens.

Answer:

# Import the package
from nltk import word_tokenize

# Tokenize the sample
sample_tokens = word_tokenize(sample)

# Return the first 10 tokens
sample_tokens[:10]

Results:

A token is also called a unigram. If we combine two adjacent unigrams, we get a bigram (and this process can continue to trigrams and beyond). Formally, a bigram is an n-gram where n equals two. An n-gram is a sequence of n adjacent items from a given sample of text. Therefore, a bigram is a sequence of two adjacent elements from a string of tokens. This is easier to understand with an example:

Original Sentence: “What is a sentence?”

Tokens: [‘What’, ‘is’, ‘a’, ‘sentence’, ‘?’]

Bigrams: [(‘What’, ‘is’), (‘is’, ‘a’), (‘a’, ‘sentence’), (‘sentence’, ‘?’)]

As expected, each two adjacent tokens are now represented in one bigram.
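For intuition, this pairing is simply the token list zipped with a copy of itself shifted by one position; a minimal plain-Python sketch looks like this:

# Pair each token with the one that follows it
tokens = ['What', 'is', 'a', 'sentence', '?']
list(zip(tokens, tokens[1:]))
# [('What', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '?')]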

We can also implement this concept with nltk’s bigrams function.

Question 2:

Create a list of bigrams from the tokenized sample and return the first 10 bigrams.

Answer:

# Import the package
from nltk import bigrams

# Create the bigrams
sample_bitokens = list(bigrams(sample_tokens))

# Return the first 10 bigrams
sample_bitokens[:10]

Results:

Frequency Distribution

Let’s go back to the tokens (unigrams) that we created from our sample. It is good to see what tokens are out there but it might be more informative to know which tokens have a higher representation compared to others in a given textual input. In other words, an occurrence frequency distribution of tokens would be more informative. More formally, a frequency distribution records the number of times each outcome of an experiment has occurred.
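For intuition, a frequency distribution is just a count per token. A minimal sketch with Python’s built-in collections.Counter shows the idea, and nltk’s FreqDist exposes the same most_common interface:

from collections import Counter

tokens = ['What', 'is', 'a', 'sentence', '?', 'What', '?']
Counter(tokens).most_common(3)
# [('What', 2), ('?', 2), ('is', 1)]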

Let’s now implement a frequency distribution on our sample using nltk’s FreqDist class.

Question 3:

What are the top 10 most frequent tokens in our sample?

Answer:

# Import the package
from nltk import FreqDist

# Create the frequency distribution for all tokens
sample_freqdist = FreqDist(sample_tokens)

# Return top ten most frequent tokens
sample_freqdist.most_common(10)

Results:

Some of the results intuitively make sense. For example, a comma, “the”, “a” or periods can be quite common in a given textual input. Now let’s put all of these steps into one Python function to streamline the process. If you need a refresher on Python functions, I have a post with practice questions on Python functions linked here.

Question 4:

Create a function named “top_n” that takes in a text as an input and returns the top n most common tokens in the given text. Use “text” and “n” as the function arguments. Then try it on our sample to reproduce the results from the previous question.

Answer:

# Create a function that accepts a text and n and returns the top n most common tokens
def top_n(text, n):
    # Create tokens
    tokens = word_tokenize(text)

    # Create the frequency distribution
    freqdist = FreqDist(tokens)

    # Return the top n most common ones
    return freqdist.most_common(n)

# Try it on the sample (df.text[0]) to reproduce the results from the previous question
top_n(sample, 10)

We were able to reproduce the same output using the function.

A Document-Term Matrix (DTM) is a matrix that represents the frequency of terms that occur in a collection of documents. Let’s look at two sentences to understand what DTM is.

Let’s say that we have the following two sentences:

sentence_1 = 'He is walking down the street.'

sentence_2 = 'She walked up then walked down the street yesterday.'

The DTM of the above two sentences will be:

In the above DTM, numbers indicate how many times that particular term (or token) was observed in the given sentence. For example, “down” is present once in both sentences, while “walked” appears twice but only in the second sentence.
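To make the matrix above concrete, here is a minimal sketch (assuming scikit-learn’s default CountVectorizer settings, which lowercase the text and drop single-character tokens) that reproduces the DTM for these two sentences; Question 5 below wraps the same idea into a reusable function:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

sentences = ['He is walking down the street.',
             'She walked up then walked down the street yesterday.']

cv = CountVectorizer()
dtm = cv.fit_transform(sentences)
pd.DataFrame(dtm.toarray(), columns = cv.get_feature_names_out(),
             index = ['sentence_1', 'sentence_2'])
#             down  he  is  she  street  the  then  up  walked  walking  yesterday
# sentence_1     1   1   1    0       1    1     0   0       0        1          0
# sentence_2     1   0   0    1       1    1     1   1       2        0          1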

Now let’s turn the DTM concept into a reusable function using scikit-learn’s CountVectorizer. Note that the DTM initially created by scikit-learn is in the form of a sparse matrix/array (i.e. most of the entries are zero and only the non-zero ones are stored). This is done for efficiency reasons, but we will convert the sparse array to a dense array (where every entry, including the zeros, is stored explicitly). Since the distinction between sparse and dense arrays is not the focus of this post, we won’t go deeper into that topic.

Question 5:

Define a function named “create_dtm” that creates a Document-Term Matrix in the form of a dataframe for a given series of strings. Then test it on the top five rows of our data set.

Answer:

# Import the package
from sklearn.feature_extraction.text import CountVectorizer

def create_dtm(series):
    # Create an instance of the class
    cv = CountVectorizer()

    # Create a document-term matrix from the provided series
    dtm = cv.fit_transform(series)

    # Convert the sparse array to a dense array
    dtm = dtm.todense()

    # Get the column names
    features = cv.get_feature_names_out()

    # Create a dataframe
    dtm_df = pd.DataFrame(dtm, columns = features)

    # Return the dataframe
    return dtm_df

# Try the function on the top 5 rows of the df['text']
create_dtm(df.text.head())

Results:

Feature Importance

Now we want to think about sentiment analysis as a machine learning model. Such a model takes in textual input and predicts the sentiment of each textual entry. In other words, the textual input is the independent variable and the sentiment is the dependent variable. We also learned that we can break down the text into smaller pieces named tokens, so we can think of each of the tokens within the textual input as “features” that help in predicting the sentiment as the output of the machine learning model. To summarize, we started with a model that takes in a large body of text and predicts a sentiment; we have now reframed the task as a model that takes in multiple “tokens” (instead of a large body of text) and predicts the sentiment based on the given tokens. The next logical step is to quantify which of the tokens (i.e. features) are most important in predicting the sentiment. This task is called feature importance.

Luckily for us, feature importance can be easily implemented in scikit-learn. Let’s look at an example together.

Question 6:

Define a function named “top_n_tokens” that accepts three arguments: (1) “text”, which is the textual input in the format of a data frame column, (2) “sentiment”, which is the label of the sentiment for the given text in the format of a data frame column, and (3) “n”, which is a positive number. The function will return the top “n” most important tokens (i.e. features) for predicting the “sentiment” of the “text”. Please use LogisticRegression from sklearn.linear_model with the following parameters: solver = 'lbfgs', max_iter = 2500, and random_state = 1234. Finally, use the function to return the top 10 most important tokens in the “text” column of the dataframe.

Note: Since the goal of this post is to explore sentiment analysis, we assume the reader is familiar with Logistic Regression. If you would like to take a deeper look at Logistic Regression, check out this post.

Answer:

# Import logistic regression
from sklearn.linear_model import LogisticRegression

def top_n_tokens(text, sentiment, n):
    # Create an instance of each class
    lgr = LogisticRegression(solver = 'lbfgs', max_iter = 2500, random_state = 1234)
    cv = CountVectorizer()

    # Create the DTM
    dtm = cv.fit_transform(text)

    # Fit the logistic regression model
    lgr.fit(dtm, sentiment)

    # Get the coefficients
    coefs = lgr.coef_[0]

    # Get the features / column names
    features = cv.get_feature_names_out()

    # Create the dataframe
    df = pd.DataFrame({'Tokens' : features, 'Coefficients' : coefs})

    # Return the n largest coefficients
    return df.nlargest(n, 'Coefficients')

# Test it on the df['text']
top_n_tokens(df.text, df.label, 10)

Results:

The results are quite interesting. We were looking for the most important features and, as we know, label 1 indicates a positive sentiment in this dataset. In other words, the features with the largest (most positive) coefficients are the ones that most strongly indicate a positive sentiment. This comes across in the results, which all sound quite positive.

In order to validate this hypothesis, let’s look at the 10 smallest (i.e. most negative) coefficients. We expect those tokens to convey a strong negative sentiment.

# Import logistic regression
from sklearn.linear_model import LogisticRegression

def bottom_n_tokens(text, sentiment, n):
    # Create an instance of each class
    lgr = LogisticRegression(solver = 'lbfgs', max_iter = 2500, random_state = 1234)
    cv = CountVectorizer()

    # Create the DTM
    dtm = cv.fit_transform(text)

    # Fit the logistic regression model
    lgr.fit(dtm, sentiment)

    # Get the coefficients
    coefs = lgr.coef_[0]

    # Get the features / column names
    features = cv.get_feature_names_out()

    # Create the dataframe
    df = pd.DataFrame({'Tokens' : features, 'Coefficients' : coefs})

    # Return the n smallest coefficients
    return df.nsmallest(n, 'Coefficients')

# Test it on the df['text']
bottom_n_tokens(df.text, df.label, 10)

Results:

As expected, these words convey a strong negative sentiment.

In the previous example, we trained a logistic regression model on the existing labeled data. But what if we do not have labeled data and would like to determine the sentiment of a given data set? In such cases, we can leverage pre-trained models, such as TextBlob, which we will discuss next.

Pre-Trained Models — TextBlob

TextBlob is a library for processing textual data, and one of its features returns the sentiment of a given text as a named tuple of the form “(polarity, subjectivity)”. The polarity score is a float within the range [-1.0, 1.0] that aims at capturing whether the text is positive or negative. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. For example, a fact is expected to be objective while one’s opinion is expected to be subjective. Polarity and subjectivity detection are two of the most common tasks within sentiment analysis, which we will explore in the next question.
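As a quick illustration (a minimal sketch reusing the TV-show example from the introduction), the sentiment property can be inspected directly:

from textblob import TextBlob

# Returns a named tuple of the form Sentiment(polarity=..., subjectivity=...);
# a positive polarity is expected for this clearly positive opinion
TextBlob("I really like watching TV shows!").sentiment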

Question 7:

Define a function named “polarity_subjectivity” that accepts two arguments. The function applies “TextBlob” to the provided “text” (defaulting to “sample”) and if `print_results = True`, prints polarity and subjectivity of the “text” using “TextBlob”, otherwise returns a tuple of float values with the first value being polarity and the second value being subjectivity, such as “(polarity, subjectivity)”. Returning the tuple should be the default for the function (i.e. set print_results = False). Lastly, use the function on our sample and print the results.

Hint: If you need to install TextBlob you can do so using the following command: !pip install textblob

Answer:

# Import TextBlob
from textblob import TextBlob

def polarity_subjectivity(text = sample, print_results = False):
    # Create an instance of TextBlob
    tb = TextBlob(text)

    # If the condition is met, print the results, otherwise return the tuple
    if print_results:
        print(f"Polarity is {round(tb.sentiment[0], 2)} and subjectivity is {round(tb.sentiment[1], 2)}.")
    else:
        return (tb.sentiment[0], tb.sentiment[1])

# Test the function on our sample
polarity_subjectivity(sample, print_results = True)

Results:

Let’s look at the sample and try to interpret these values.

sample

Results:

Interpreting these results is more meaningful in comparison to other strings, but in the absence of such a comparison, and purely based on the numbers, let’s try to interpret them. The results indicate that our sample has a slightly positive polarity (remember polarity ranges from -1 to 1, so 0.18 would indicate slightly positive) and is relatively subjective, which makes intuitive sense since this is someone’s review describing their subjective experience of a movie.

Question 8:

First define a function named “token_count” that accepts a string and using `nltk`’s word tokenizer, returns an integer number of tokens in the given string. Then define a second function named “series_tokens” that accepts a Pandas Series object as an argument and applies the previously-defined “token_count” function to the given Series, returning the integer number of tokens for each row of the given Series. Lastly, use the second function on the top 10 rows of our dataframe and return the results.

Answer:

# Import libraries
from nltk import word_tokenize

# Define the first function that counts the number of tokens in a given string
def token_count(string):
    return len(word_tokenize(string))

# Define the second function that applies the token_count function to a given Pandas Series
def series_tokens(series):
    return series.apply(token_count)

# Apply the function to the top 10 rows of the dataframe
series_tokens(df.text.head(10))

Results:

Question 9:

Define a function named “series_polarity_subjectivity” that applies the “polarity_subjectivity” function defined in Question 7 to a Pandas Series (in the form of a dataframe column) and returns the results. Then use the function on the top 10 rows of our dataframe to see the results.

Answer:

# Define the function
def series_polarity_subjectivity(series):
    return series.apply(polarity_subjectivity)

# Apply to the top 10 rows of the df['text']
series_polarity_subjectivity(df['text'].head(10))

Results:

Measure of Complexity — Lexical Diversity

As the name suggests, Lexical Diversity is a measure of how many different lexical words there are in a given text, and it is formally defined as the number of unique tokens divided by the total number of tokens. The idea is that the more diverse the tokens in a text are, the more complex that text is expected to be.
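As a quick worked example on a toy string: the sentence below tokenizes into six tokens, of which only four are unique, giving a lexical diversity of 4/6 ≈ 0.67.

from nltk import word_tokenize

toy = 'the cat chased the cat.'
tokens = word_tokenize(toy)        # ['the', 'cat', 'chased', 'the', 'cat', '.']
len(set(tokens)) / len(tokens)     # 4 unique / 6 total ≈ 0.67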

Question 10:

Define a “complexity” function that accepts a string as an argument and returns the lexical complexity score defined as the number of unique tokens over the total number of tokens. Then apply the function to the top 10 rows of our dataframe.

Answer:

def complexity(string):
    # Create a list of all tokens
    total_tokens = word_tokenize(string)

    # Create a set of all tokens (which only keeps unique values)
    unique_tokens = set(word_tokenize(string))

    # Return the complexity measure
    return len(unique_tokens) / len(total_tokens)

# Apply to the top 10 rows of the dataframe
df.text.head(10).apply(complexity)

Results:

Text Cleanup — Stopwords and Non-Alphabeticals

If you recall, in Question 3 we created a frequency distribution and the resulting 10 most common tokens were as follows: [(‘,’, 4), (‘very’, 3), (‘A’, 1), (‘slow-moving’, 1), (‘aimless’, 1), (‘movie’, 1), (‘about’, 1), (‘a’, 1), (‘distressed’, 1), (‘drifting’, 1)]

Some of these are not very helpful and are considered less significant compared to other tokens. For example, how much information can be gained from knowing that commas are quite common in a given text? Filtering out such less significant words, so that the focus can be directed towards more significant words, is called stopword removal. Note that there is no universal definition of what these stopwords are, and the designation is largely subjective.

Let’s look at some examples of English stopwords, as defined by nltk:

# Import library
from nltk.corpus import stopwords

# Select only English stopwords
english_stop_words = stopwords.words('english')

# Print the first 20
print(english_stop_words[:20])

Results:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

Question 11:

Define a function named “stopword_remover” that accepts a string as argument, tokenizes the input string, removes the English stopwords (as defined by nltk), and returns the tokens without the stopwords. Then apply the function to the top 5 rows of our dataframe.

Answer:

def stopword_remover(string):
    # Tokenize the string
    tokens = word_tokenize(string)

    # Create a list of English stopwords
    english_stopwords = stopwords.words('english')

    # Return non-stopwords
    return [w for w in tokens if w.lower() not in english_stopwords]

# Apply to the top 5 rows of our df['text']
df.text.head(5).apply(stopword_remover)

Results:

Another group of tokens that we can consider filtering out, similar to stopwords, is non-alphabetical tokens. As the name suggests, examples of non-alphabetical characters are: ! % & # * $ (note that a space is also considered non-alphabetical). To help identify what is alphabetical and what is not, we can use isalpha(), a built-in Python string method that checks whether all characters in a given string are alphabetic. Let’s look at a few examples to better understand this concept:

string_1 = "TomAndJerryAreFun"
string_2 = "Tom&JerryAreFun"
string_3 = "TomAndJerryAreFun!"

print(f"String_1: {string_1.isalpha()}\n")
print(f"String_2: {string_2.isalpha()}\n")
print(f"String_3: {string_3.isalpha()}")

Results:

Let’s look at each one to better understand what happened. The first one returned “True”, indicating the string contains only alphabetical characters. The second one returned “False” because of the “&”, and the third one also returned “False”, driven by the “!”.

Now that we are familiar with how isalpha() works, let’s use it in our example to further clean up our data.

Question 12:

Define a function named “stopword_nonalpha_remover” that accepts a string as an argument, removes both stopwords (using the “stopword_remover” function that we defined in the previous question) and non-alphabeticals and then returns the remainder. Apply this function to the top 5 rows of our dataframe and visually compare to the outcome of the previous question (which still included the non-alphabeticals).

Answer:

def stopword_nonalpha_remover(string):
    return [x for x in stopword_remover(string) if x.isalpha()]

df.text.head().apply(stopword_nonalpha_remover)

Results:

As expected, the non-alphabeticals were removed in addition to the stopwords. What remains are the tokens that are expected to carry higher significance compared to the removed ones.

In the next step, we will put together everything that we have learned so far to find out which reviews had the highest complexity score.

Question 13:

Define a function named “complexity_cleaned” that accepts a Series and removes the stopwords and non-alphabeticals (using the function defined in Question 12). Then create a column named “complexity” in our dataframe that uses the “complexity_cleaned” function to calculate the complexity. Finally, return the rows of the dataframe for the 10 largest complexity scores.

Answer:

# Define the complexity_cleaned function
def complexity_cleaned(series):
    return series.apply(lambda x: complexity(' '.join(stopword_nonalpha_remover(x))))

# Add 'complexity' column to the dataframe
df['complexity'] = complexity_cleaned(df.text)

# Return top 10 highest complexity scores
df.sort_values('complexity', ascending = False).head(10)

Results:

