
Quick Text Sentiment Analysis with R
by Gustavo Santos | Mar 2023



Text everywhere! Since the Internet spread around the world, the amount of textual data we generate every day has become enormous. Text messages alone are estimated at around 18 billion sent on a daily basis*.

Now imagine the amount of news generated as well. It is such an overwhelming volume that entire businesses are built around news clipping, curating the best information about a given topic to support companies’ marketing strategies.

How is AI helping with that? NLP certainly plays a huge part, providing good tools and algorithms to analyze textual information. As data scientists, we can take advantage of tidytext, an excellent R library that helps us build quick analytical tools to check the content of a text.

Let’s see that in practice, next.

Prepare your environment

To code along with this article, install and load the libraries listed below.

# Installing libraries
install.packages('tidyverse')
install.packages('tidytext')

# Loading libraries
library(tidyverse)
library(tidytext)

The tidytext library works in the same fashion as the tidyverse, making use of intuitive function names and chaining them with the pipe operator %>%.
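
If piping is new to you, here is a minimal illustration (not part of the analysis itself): the pipe passes the result on its left as the first argument of the function on its right.

# Minimal pipe illustration: lowercase a character vector, then sort it
words <- c("Text", "Mining", "With", "R")
words %>%
  tolower() %>%
  sort()
#> [1] "mining" "r"      "text"   "with"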

Let’s use this text from Wikipedia about the R Language to create our first simple Text Analyzer.

text <- "R is a programming language for statistical computing and graphics supported by
the R Core Team and the R Foundation for Statistical Computing. Created by
statisticians Ross Ihaka and Robert Gentleman, R is used among data miners,
bioinformaticians and statisticians for data analysis and developing
statistical software.[7] Users have created packages to augment the functions
of the R language.
According to user surveys and studies of scholarly literature databases,
R is one of the most commonly used programming languages in data mining.[8]
As of December 2022, R ranks 11th in the TIOBE index, a measure of programming
language popularity, in which the language peaked in 8th place in August 2020.
The official R software environment is an open-source free software environment
within the GNU package, available under the GNU General Public License.
It is written primarily in C, Fortran, and R itself (partially self-hosting).
Precompiled executables are provided for various operating systems. R has a
command line interface.[11] Multiple third-party graphical user interfaces are
also available, such as RStudio, an integrated development environment,
and Jupyter, a notebook interface."

The next step is to transform this text into a tibble object, which can be understood as a data.frame.

# Transform to tibble
df_text <- tibble(text)

It won’t change your object much, but it is required for working with the tidytext functions, since they expect the data to come in a tibble or data.frame. In case you’re curious, here is what it looks like after the transformation.

Text transformed to a tibble object. Image by the author.
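
If you are following along in the console, printing df_text shows the same structure (a sketch of the expected output; the tibble print truncates the long string):

# Printing the tibble: one row, one character column named "text"
df_text
#> # A tibble: 1 × 1
#>   text
#>   <chr>
#> 1 "R is a programming language for statistical computing and graphics su…"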

Frequency count

Moving on, now we will tokenize the text. A token is the smallest meaningful unit of a text. Most projects use 1 word = 1 token, but tokens can be a different size if your project requires it. Tokenization, therefore, is the process of breaking a text into the minimal pieces that carry meaning within the message. To tokenize our text with tidytext, use the following function.

A token is the smallest meaningful unit of a text.

# Tokenizing the text
tokens <- df_text %>%
  unnest_tokens(input = text,   # name of the input column
                output = word)  # name of the output column

The result is as follows.

Tokenized text. Image by the author.

However, we can see that tokens such as is, a, and for won’t add anything to the message. Agree? These are called stop words. We need a way to remove those tokens and keep only the clean data: the tokens that carry real meaning in the text’s message.

tidytext already ships with a dataset of stop words. If we type stop_words and run the code, we can inspect it.

# View stop_words
stop_words

# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>
 1 a           SMART
 2 a's         SMART
 3 able        SMART
 4 about       SMART
 5 above       SMART
 6 according   SMART
 7 accordingly SMART
 8 across      SMART
 9 actually    SMART
10 after       SMART
# … with 1,139 more rows

Notice that the column holding the words is named word. That is why we gave our tokenized column the same name: it makes joining the two datasets easier. So, our job now is to join them, removing the stop words, which we can do with the anti_join() function. In sequence, we simply count the words and sort by the most frequent appearances.

# Removing stopwords and counting frequencies
tokens %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)

# Result
# A tibble: 79 × 2
   word            n
   <chr>       <int>
 1 language        4
 2 data            3
 3 environment     3
 4 programming     3
 5 software        3
 6 statistical     3
 7 computing       2
 8 created         2
 9 gnu             2
10 interface       2
# … with 69 more rows

Amazing, huh? That easily, we get a sense of what this text is about: software, or a programming language, for statistical data analysis.

We could create a function with the preceding code to quickly give us frequency counts of any text.

text_freq_counter <- function(text){

  # Transform to tibble
  df_text <- tibble(text)

  # Tokenizing the text
  tokens <- df_text %>%
    unnest_tokens(input = text,   # name of the input column
                  output = word)  # name of the output column

  # Removing stopwords and counting frequencies
  freq_count <- tokens %>%
    anti_join(stop_words) %>%
    count(word, sort = TRUE)

  # Return
  return(freq_count)

} #close function

Let’s put it to the test. I will go back to the first section of this article, copy it, and let our function count the frequencies.

text <- "Text everywhere! Since the Internet was spread around the world, 
the amount of textual data we generate everyday is ginormous. Only textual
messages sent everyday, it is estimated that there are around 18 Billion of
them circulating on a daily basis*.
Now imagine the amount of news generated as well. It's a so overwhelming
amount that there are whole businesses built around news clipping, separating
the best information about a given topic to help companies in their marketing
strategies.
How is AI helping that? Certainly, NLP plays a huge part on that providing
good tools and algorithms to analyze textual information. As Data Scientists,
we can profit of tidytext, an excellent library from R to help us building
quick analytical tools to check the content of a text.
Let's see that in practice, next."

# Running the function
text_freq_counter(text)

[OUT]
# A tibble: 50 × 2
   word            n
   <chr>       <int>
 1 amount          3
 2 textual         3
 3 data            2
 4 everyday        2
 5 information     2
 6 news            2
 7 text            2
 8 tools           2
 9 18              1
10 ai              1
# … with 40 more rows

Works like a charm.

We could stop here, but this topic is so interesting that I think we should go a little further. Let’s add sentiment analysis to our text analyzer now.

tidytext also comes prepared for sentiment analysis, since it provides access to a few sentiment lexicons. The options are “Bing”, “AFINN”, and “NRC”. Let’s see the differences between them.

The Bing lexicon classifies words as positive or negative. So, one option here is to check how many positive versus negative words your text carries and get an idea of its overall sentiment; a quick sketch of that idea follows the preview below.

# Bing sentiments
get_sentiments('bing')
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>
 1 2-faces     negative
 2 abnormal    negative
 3 abolish     negative
 4 abominable  negative
 5 abominably  negative
 6 abominate   negative
 7 abomination negative
 8 abort       negative
 9 aborted     negative
10 aborts      negative
# … with 6,776 more rows
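
As a minimal sketch of that idea, reusing the tokens table we built earlier from the Wikipedia text, we can join the Bing lexicon and tally positive versus negative tokens:

# A sketch: tally positive vs. negative tokens with the Bing lexicon
tokens %>%
  inner_join(get_sentiments('bing'), by = 'word') %>%
  count(sentiment, sort = TRUE)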

AFINN (named after its author, Finn Årup Nielsen) classifies words with numbers: the more positive the number, the more positive the word, and the inverse is also true. It requires the textdata library to be loaded.

library(textdata)

# Sentiments Afinn
get_sentiments('afinn')

# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# … with 2,467 more rows
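
One simple way to use these values (again a sketch, reusing our earlier tokens table) is to sum them across all matched tokens to get an overall score for the text:

# A sketch: overall AFINN score for the tokenized text
tokens %>%
  inner_join(get_sentiments('afinn'), by = 'word') %>%
  summarise(total_score = sum(value))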

Finally, the NRC lexicon classifies words into named sentiments, such as trust, surprise, and so on.

# Sentiments NRC
get_sentiments('nrc')

# A tibble: 13,875 × 2
   word        sentiment
   <chr>       <chr>
 1 abacus      trust
 2 abandon     fear
 3 abandon     negative
 4 abandon     sadness
 5 abandoned   anger
 6 abandoned   fear
 7 abandoned   negative
 8 abandoned   sadness
 9 abandonment anger
10 abandonment fear
# … with 13,865 more rows
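
A sketch of how this lexicon can be used: since NRC may map one word to several sentiments, joining and counting shows which sentiments dominate the text.

# A sketch: which NRC sentiments appear most often in the text
# (one word can carry several sentiments, so counts may exceed the token count)
tokens %>%
  inner_join(get_sentiments('nrc'), by = 'word') %>%
  count(sentiment, sort = TRUE)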

So, what we will do next is use AFINN to create a score and then plot the positive versus negative words of our text.

I will use a text captured on the internet about the layoffs in tech. The function now also produces a graphic showing each word’s score, computed from the AFINN values and the word frequencies.

# Function for frequency count and sentiment score
text_freq_counter <- function(text){

  # Get sentiments
  sentiments <- get_sentiments('afinn')

  # Transform to tibble
  df_text <- tibble(text)

  # Tokenizing the text
  tokens <- df_text %>%
    unnest_tokens(input = text,   # name of the input column
                  output = word)  # name of the output column

  # Joining the sentiments and counting frequencies
  freq_count <- tokens %>%                   # dataset
    inner_join(sentiments, by = 'word') %>%  # join the sentiment values
    count(word, value, sort = TRUE) %>%      # count the words by sentiment value
    mutate(score = n * value) %>%            # create score = frequency * value
    arrange(desc(score))                     # sort

  # Plot (theme_light() comes first, so it does not override the subtitle styling)
  g <- freq_count %>%
    ggplot(aes(x = score, y = reorder(word, score),
               fill = score > 0)) +
    geom_col(show.legend = FALSE) +
    labs(x = 'Sentiment Score',
         y = 'WORD',
         subtitle = 'Negative versus positive sentiments') +
    ggtitle('Sentiment Score by Word in the Text') +
    theme_light() +
    theme(plot.subtitle = element_text(color = "gray", face = "italic"))

  # Return
  return(list(freq_count, g))

} #close function

# Applying the function (text3 holds the layoffs text mentioned above)
text_freq_counter(text3)

# Resulting table
# A tibble: 16 × 4
   word      value     n score
   <chr>     <dbl> <int> <dbl>
 1 care          2     2     4
 2 best          3     1     3
 3 feeling       1     2     2
 4 hopes         2     1     2
 5 robust        2     1     2
 6 save          2     1     2
 7 true          2     1     2
 8 cool          1     1     1
 9 fitness       1     1     1
10 shared        1     1     1
11 cutting      -1     2    -2
12 recession    -2     1    -2
13 cut          -1     3    -3
14 losing       -3     1    -3
15 lost         -3     1    -3
16 cuts         -1     7    -7

The resulting table is displayed above, and this is the outcome graphic.

Sentiment Analysis for a news text about the layoffs in the Tech industry. Image by the author.

In my GitHub, there’s another function where you can also choose which sentiment lexicon to use. The results are displayed below, followed by the code link.

# Enhanced Function
text_freq_sentiment(text3, .sentiment = 'nrc')
text_freq_sentiment(text3, .sentiment = 'bing')

Sentiments captured by word frequency with the “NRC” lexicon. Image by the author.
Sentiments captured by word frequency using the “Bing” lexicon. Image by the author.

You can see the entire code here: Link to the code on GitHub.

I enjoy studying NLP and text mining tools for data science. There is so much we can extract from text. It’s super interesting.

I recommend checking the links in the References section below, where you can find resources to deepen your knowledge. My book also offers some interesting exercises on wrangling textual data, including text mining.

If you liked this content, don’t forget to follow my blog. Find me on LinkedIn.

Santos, G. (2023). Data Wrangling with R. 1st ed. Packt Publishing.


