
Sentiment Analysis and Structural Breaks in Time-Series Text Data

Petr Korab | March 2023



Photo by Adam Śmigielski on Unsplash

Text data contains a lot of qualitative information that can be quantified with various methods, including sentiment analysis. These models identify, extract, and quantify emotions from text and are widely used in business and academic research. Since text is often recorded on a time-series basis, text datasets may display structural breaks as the quantified information changes due to many possible factors.

For a business analyst, measuring changes in customer perceptions of a particular brand might be one of the key tasks. A researcher, in turn, might be interested in how Vladimir Putin’s public statements have shifted over time. Arabica is a Python library designed specifically for questions like these. It offers these methods for exploratory analysis of time-series text datasets:

  • arabica_freq: descriptive n-gram-based exploratory data analysis (EDA)
  • cappuccino: a visualization module with heatmap, word cloud, and line plot of unigram, bigram, and trigram frequencies
  • coffee_break: sentiment and structural break analysis.

This article introduces coffee_break, the sentiment and structural break analysis module. For the first two methods, see the documentation and the tutorials on arabica_freq and cappuccino.

The coffee_break module has a simple backend architecture. Schematically, it works like this:

Figure 1. Coffee_break architecture. Source: draw.io

Raw text is cleaned with cleantext, which removes punctuation and numbers. Stop words (the most common words in a language, carrying little meaning on their own) are not removed in the pre-processing step because they do not harm sentiment analysis. Arabica also automatically removes empty rows.
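Arabica handles this cleaning internally through cleantext, so you do not have to code it yourself. Purely for intuition, here is a rough, hypothetical stand-in for what punctuation and number removal plus empty-row dropping can look like in plain pandas (column name and function are illustrative, not Arabica's actual implementation):

import pandas as pd

def basic_clean(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Rough stand-in for Arabica's preprocessing (illustrative only)."""
    cleaned = (
        df[text_col]
        .astype(str)
        .str.replace(r"[^\w\s]", " ", regex=True)  # strip punctuation
        .str.replace(r"\d+", " ", regex=True)      # strip numbers
        .str.replace(r"\s+", " ", regex=True)      # collapse whitespace
        .str.strip()
    )
    out = df.assign(**{text_col: cleaned})
    return out[out[text_col] != ""].reset_index(drop=True)  # drop empty rows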

Sentiment analysis uses VADER (Valence Aware Dictionary and sEntiment Reasoner), a general-purpose pre-trained sentiment classifier [1]. It was trained on social media data from Twitter but also works well on other types of text. My previous article offers a more detailed introduction to the model and its use in Python.
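Under the hood, each text receives a VADER compound score between -1 and 1. A minimal standalone example with the vaderSentiment package (the example sentences are made up):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# polarity_scores returns 'neg', 'neu', 'pos' and a normalized 'compound' score in [-1, 1]
print(analyzer.polarity_scores("The vaccine rollout was surprisingly smooth and well organized."))
print(analyzer.polarity_scores("Terrible side effects, I regret getting the shot."))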

The aggregate sentiment indicator for each period is, in essence, the average of the per-text compound scores:

sentiment_t = (1 / N_t) * Σ compound_i over the N_t texts recorded in period t,

where t is the aggregation period. The aggregate indicator ranges from -1 to 1, with values close to 1 indicating positive sentiment and values approaching -1 indicating negative sentiment.
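Following that definition, the aggregated series can be approximated with pandas and vaderSentiment as below. This is a sketch of the idea, not Arabica's exact code; the column names text and date match the dataset used later in this tutorial:

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def aggregate_sentiment(df: pd.DataFrame, time_freq: str = "Y") -> pd.Series:
    """Mean VADER compound score per period (approximation of the aggregate indicator)."""
    analyzer = SentimentIntensityAnalyzer()
    compound = df["text"].astype(str).map(
        lambda t: analyzer.polarity_scores(t)["compound"]
    )
    periods = pd.to_datetime(df["date"]).dt.to_period(time_freq)
    return compound.groupby(periods).mean()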

The aggregated sentiment forms a time series with some degree of variability over time. Structural breaks in this series are identified with the Fisher-Jenks algorithm, also known as the Jenks optimization method, originally proposed by George F. Jenks [2].

It is a clustering-based method designed to find the best arrangement of values into different classes (clusters). The jenks_breaks function from the jenkspy library returns a list of values corresponding to the class limits. These structural breaks are marked in the plot as vertical lines and visually indicate the breakpoints in the text time series.
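A small standalone sketch of what jenks_breaks does on a toy sentiment series (the numbers are invented; note that the keyword argument is n_classes in recent jenkspy releases and nb_class in older ones):

import jenkspy

# Toy yearly sentiment aggregates
sentiment = [0.21, 0.18, 0.05, -0.02, -0.15, -0.35, -0.30]

# Split the series into 3 classes; the returned list holds the class limits,
# including the minimum and maximum of the data.
breaks = jenkspy.jenks_breaks(sentiment, n_classes=3)
print(breaks)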

The libraries used are Matplotlib (visualization), vaderSentiment (sentiment analysis), and jenkspy (structural breaks). Pandas and NumPy handle the data processing.

Let’s illustrate the coding on the Pfizer Vaccine Tweets dataset collected with the Twitter API. The data contains 11,000 tweets about the Pfizer-BioNTech vaccine posted between 2006 and 2021. The dataset is released under the CC0: Public Domain license, in line with the Twitter developer policy.

The data contains a lot of punctuation and numbers and needs cleaning before any further steps:

Figure 2. Pfizer Vaccine Tweets dataset
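Before calling Arabica, the data needs to sit in a pandas DataFrame. A hypothetical loading step (the file name is an assumption; the text and date columns are the ones used throughout this tutorial):

import pandas as pd

# File name is an assumption -- use the CSV downloaded from Kaggle
data = pd.read_csv("vaccination_tweets.csv")

# coffee_break only needs the text column and the timestamp column
data = data[["text", "date"]].dropna()
print(data.head())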

The coffee_break method’s parameters are:

def coffee_break(text: str,                # text column
                 time: str,                # time column
                 date_format: str,         # date format: 'eur' (European) or 'us' (American)
                 preprocess: bool = False, # clean numbers and punctuation from the data
                 time_freq: str = '',      # aggregation period: 'Y' / 'M'
                 n_breaks: int = None      # number of breaks: min. 2, None = no break analysis
                 )

Our data has a 15-year time span covering the Covid-19 crisis. Changes in the public mood about vaccination, fake news about vaccines, and many other factors are expected to lead to significant variations in sentiment over time.

Coding

First, import coffee_break:

from arabica import coffee_break

The data is fairly raw and covers 15 years, so displaying sentiment by month would not be very helpful. Arabica reads dates in US-style (MM/DD/YYYY) and European-style (DD/MM/YYYY) date and datetime formats.
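In pandas terms, the two date_format options roughly correspond to month-first versus day-first parsing; a hypothetical illustration of the difference:

import pandas as pd

# 'us' ~ month first, 'eur' ~ day first
print(pd.to_datetime("03/04/2021"))                 # 2021-03-04 (US: March 4)
print(pd.to_datetime("03/04/2021", dayfirst=True))  # 2021-04-03 (European: 3 April)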

Let’s clean the data and aggregate sentiment by year with this code:

coffee_break(text = data['text'],
             time = data['date'],
             date_format = 'us',    # read dates in US format
             preprocess = True,     # clean the data
             n_breaks = None,       # no structural break analysis
             time_freq = 'Y')       # yearly aggregation

Results

Arabica returns a figure that can be saved manually as a PNG or JPEG.

Figure 3. Sentiment analysis – yearly

At the same time, Arabica returns a dataframe with the underlying data. The table can be saved simply by assigning the function’s output to an object:

# generate a dataframe
df = coffee_break(text = data['text'],
                  time = data['date'],
                  date_format = 'us',
                  preprocess = True,
                  n_breaks = None,
                  time_freq = 'Y')

# save it as a CSV
df.to_csv('sentiment_data.csv')

Results interpretation: sentiment dropped significantly after Pfizer vaccines started to be used against Covid in 2021 (Figure 3). The likely reason is the global pandemic and the generally negative mood in those years.

Next, let’s formalize the structural breaks in sentiment statistically. Coffee_break identifies a minimum of two breakpoints. The following code returns a figure with 3 breakpoints marked by vertical lines, along with the table of the corresponding time series:

coffee_break(text = data['text'],
             time = data['date'],
             date_format = 'us',    # US date format
             preprocess = True,     # clean the data
             n_breaks = 3,          # 3 breakpoints
             time_freq = 'Y')       # yearly aggregation

The figure:

Figure 4. Structural break analysis – yearly

Subsetting the data to the two Covid years (2020–2021), we can observe monthly changes in public sentiment, keeping n_breaks = 3 and setting time_freq = 'M' (a possible subsetting step is sketched below):
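One way to produce such a subset before the monthly run; the date boundaries and filter logic are illustrative, the column names match the earlier calls:

import pandas as pd

# Hypothetical subsetting to the 2020-2021 period
dates = pd.to_datetime(data['date'])
covid = data[(dates >= '2020-01-01') & (dates <= '2021-12-31')]

coffee_break(text = covid['text'],
             time = covid['date'],
             date_format = 'us',
             preprocess = True,
             n_breaks = 3,       # 3 breakpoints
             time_freq = 'M')    # monthly aggregation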

Figure 5. Structural break analysis – monthly

The graph is not very informative. There are only 1,577 rows spread over 24 monthly observations in this subset, and after cleaning the raw data the time series is very volatile. Drawing conclusions from a clustering-based algorithm on such a limited volume of data is not a good idea.

Results interpretation: the structural break analysis at yearly frequency statistically confirmed what we could already see in the sentiment time series in Figure 3. The Fisher-Jenks algorithm identified three structural breaks: in 2009, 2017, and 2021. We can only guess what caused the declines in 2009 and between 2016 and 2018. The 2021 drop is most likely explained by the Covid-19 crisis.

Let’s summarize the recommendations for the most effective use of coffee_break:

  • Don’t use structural break analysis if there are NaN values in the corresponding time series.
  • Identifying more than 3 breakpoints makes sense only in longer time series (at least 12 observations).
  • Breakpoint identification might not work well in highly volatile datasets. Dramatic changes may reflect data quality rather than genuine shifts in sentiment.
  • The analysis is only as good as the underlying sentiment data. Before the actual analysis, briefly explore the raw text dataset to check that (1) the number of rows per period is not too imbalanced and (2) the texts contain enough information for sentiment evaluation (they are not too short and do not consist mostly of digits and special characters). A quick pre-check is sketched below.
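A minimal, hypothetical pre-check along the lines of points (1) and (2), using the same text and date columns as before:

import pandas as pd

dates = pd.to_datetime(data['date'])
texts = data['text'].astype(str)

# (1) balance: number of rows per aggregation period
print(texts.groupby(dates.dt.to_period('Y')).size())

# (2) information content: text length and share of alphabetic characters
print(texts.str.len().describe())
alpha_share = texts.map(lambda t: sum(ch.isalpha() for ch in t) / max(len(t), 1))
print(alpha_share.describe())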

A drawback of coffee_break is that it currently works only with English texts. Because Arabica is mainly a pandas-based package (with NumPy vectorization in some parts), coffee_break is rather slow on large datasets; it is time-efficient for datasets of up to roughly 40,000 rows.

Read these tutorials to find out more about n-gram analysis and visualization of time-series text data:

This article provides a comprehensive overview of sentiment classifiers, including VADER:

Coffee_break has been developed in cooperation with Prof. Jitka Poměnková (Brno University of Technology). The complete code in this tutorial is on my GitHub.

PS: You can subscribe to my email list to get notified every time I write a new article. And if you are not a Medium member yet you can join here.

Photo by Content Pixie on Unsplash

[1] Hutto, C., Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of the International AAAI Conference on Web and Social Media, 8(1), 216–225.

[2] Jenks, G. F. (1977). Optimal Data Classification for Choropleth Maps. Occasional Paper No. 2. University of Kansas, Department of Geography-Meteorology.



