Visualization Module in Arabica Speeds Up Text Data Exploration | by Petr Korab | Jan, 2023
Arabica now offers unigram, bigram, and trigram word cloud, heatmap, and line chart to further accelerate time-series text data analysis
Arabica is a Python library for exploratory text data analysis that focuses on text from a time-series perspective. It reflects the empirical reality that many text datasets are collected as repeated observations over time. Time-series text data include newspaper headlines, research article abstracts and metadata, product reviews, social network communication, and many others. Arabica simplifies exploratory data analysis (EDA) of these datasets by providing two methods:
- arabica_freq: descriptive and time-series n-gram analysis, for n-gram-based EDA of a text dataset
- cappuccino: visual exploration of the data.
This article provides an introduction to Cappuccino, Arabica’s visualization module for exploratory analysis of time-series text data. Read the documentation and a tutorial here for a general introduction to Arabica.
The plots implemented are word cloud (unigram, bigram, and trigram versions), heatmap, and line plot. They help discover (1) the most frequent n-grams across the whole dataset, reflecting its time-series character (word clouds), and (2) n-gram development over time (heatmap, line plot).
The graphs are designed for use in presentations, reports, and empirical studies. They are, therefore, in high resolution (word clouds – 6192 x 3811, heatmap and line plot – 5049 x 2835).
Cappuccino relies on matplotlib, wordcloud, and plotnine to create and display graphs, and on cleantext and the NLTK stopwords corpus for pre-processing. Plotnine brings the popular and widely used ggplot2 library to Python. The requirements are here.
The method’s parameters are:
def cappuccino(text: str,                 # Text
               time: str,                 # Time
               plot: str = '',            # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int = '',           # N-gram size: 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str = '',       # Aggregation period: 'Y'/'M'; if no aggregation: 'ungroup'
               max_words: int = '',       # Max number of most frequent n-grams displayed for each period
               stopwords: list = [],      # Languages for stop words
               skip: list = [],           # Remove additional strings
               numbers: bool = False,     # Remove numbers
               punct: bool = False,       # Remove punctuation
               lower_case: bool = False   # Lowercase text before cleaning and frequency analysis
)
Descriptive analysis in Arabica provides n-gram frequency calculations without aggregation over a specific period. In simple terms: first, n-gram frequencies are calculated for each text record; second, the frequencies are summed over the whole dataset; and finally, the frequencies are visualized in a plot.
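As a rough sketch of that pipeline in plain Python (illustrative only; Arabica's internal implementation may differ):

```python
from collections import Counter

def ngram_frequencies(texts, n=2):
    """Count n-gram frequencies per record, then sum them over the dataset."""
    total = Counter()
    for text in texts:
        tokens = text.lower().split()
        ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total.update(ngrams)  # add this record's counts to the running total
    return total

headlines = ["rain eases drought fears", "drought fears grow", "rain eases floods"]
freq = ngram_frequencies(headlines, n=2)
print(freq.most_common(2))  # [('rain eases', 2), ('drought fears', 2)]
```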
Word cloud
Let’s illustrate the coding on the Million News Headlines dataset, containing news headlines published daily from 2003-02-19 to 2016-09-18. The dataset is provided by the Australian Broadcasting Corporation under the CC0: Public Domain license. We’ll subset the data to the first 50 000 headlines.
First, install Arabica with pip install arabica, then import Cappuccino:
from arabica import cappuccino
After reading the data with pandas, it looks like this:
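For context, loading and subsetting with pandas might look like this (a tiny inline sample stands in for the real CSV here; the actual Kaggle file and its column names may differ):

```python
import io
import pandas as pd

# Small inline sample standing in for abcnews-date-text.csv; with the real
# file you would call pd.read_csv("abcnews-date-text.csv").head(50000) instead.
csv = io.StringIO(
    "publish_date,headline_text\n"
    "20030219,aba decides against community broadcasting licence\n"
    "20030219,act fire witnesses must be aware of defamation\n"
)
data = pd.read_csv(csv, dtype={"publish_date": str})
data = data.rename(columns={"publish_date": "date", "headline_text": "headline"})
data["date"] = pd.to_datetime(data["date"], format="%Y%m%d")
print(data.shape)  # (2, 2)
```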
We lowercase the text, clean the data from punctuation and numbers, remove English stopwords and other unwanted strings (“g”, “br”), and plot a word cloud with the 100 most frequent words:
cappuccino(text = data['headline'],
time = data['date'],
plot = 'wordcloud',
ngram = 1, # n-gram size, 1 = unigram, 2 = bigram, 3 = trigram
time_freq = 'ungroup', # no period aggregation
max_words = 100, # displays 100 most frequent words
stopwords = ['english'], # remove English stopwords
skip = ['g','br'], # remove additional strings
numbers = True, # remove numbers
punct = True, # remove punctuation
lower_case = True # lowercase text before cleaning and frequency analysis
)
It returns the word cloud:
After changing ngram = 2, we receive a word cloud with the 100 most frequent bigrams (see the cover picture). Alternatively, ngram = 3 displays the most frequent trigrams:
Time-series text data typically display variability over time. Political statements before elections and newspaper headlines during the Covid-19 pandemic are good examples. To display n-grams over time, Arabica implements a heatmap and a line plot for monthly and yearly periods.
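Conceptually, this period aggregation resembles grouping records by month and ranking terms within each group, which can be sketched with pandas (toy data; not Arabica's actual code):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2003-02-19", "2003-02-20", "2003-03-01"]),
    "headline": ["iraq war fears", "war fears grow", "rain eases drought"],
})

def top_terms(texts, k=2):
    """Rank the k most frequent words within one period."""
    counts = Counter(word for text in texts for word in text.split())
    return [word for word, _ in counts.most_common(k)]

# 'MS' = month-start bins; one row of top terms per month
monthly = df.groupby(pd.Grouper(key="date", freq="MS"))["headline"].apply(top_terms)
print(monthly.tolist())  # [['war', 'fears'], ['rain', 'eases']]
```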
Heatmap
A heatmap with the ten most frequent unigrams in each month is displayed with the following code:
cappuccino(text = data['headline'],
time = data['date'],
plot = 'heatmap',
ngram = 1, # n-gram size, 1 = unigram, 2 = bigram
time_freq = 'M', # monthly aggregation
max_words = 10, # displays 10 most frequent words for each period
stopwords = ['english'], # remove English stopwords
skip = ['g', 'br'], # remove additional strings
numbers = True, # remove numbers
punct = True, # remove punctuation
lower_case = True # lowercase text before cleaning and frequency analysis
)
The unigram heatmap is the output:
The unigram heatmap gives us the first look at the variability of data over time. We can clearly identify the important patterns in the data:
- most frequent n-grams: “us”, “police”, “new”, “man”.
- outliers (terms appearing only in one period): “war”, “wa”, “rain”, “killed”, “iraqi”, “concerns”, “budget”, “bali”.
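A quick way to flag such single-period outliers, sketched on a hypothetical period-to-terms mapping (not part of Arabica's API):

```python
from collections import Counter

# Hypothetical top terms per monthly period
periods = {
    "2003-02": {"war", "us", "police"},
    "2003-03": {"us", "police", "rain"},
    "2003-04": {"us", "man", "new"},
}

# Count in how many periods each term appears; terms seen once are outliers
appearances = Counter(term for terms in periods.values() for term in terms)
outliers = sorted(term for term, n in appearances.items() if n == 1)
print(outliers)  # ['man', 'new', 'rain', 'war']
```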
We might consider removing these outliers at a later stage of the analysis. Alternatively, we can create a bigram heatmap by changing ngram = 2 and max_words = 5, displaying the five most frequent bigrams in each period.
Line plot
A line plot of n-grams is displayed by changing plot = ‘line’ and setting the ngram parameter to 1 or 2. In this way, we create a line plot for the eight most frequent unigrams, or the four most frequent bigrams, in each period:
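Mirroring the heatmap call above, the unigram line-plot call could look like this (same data and cleaning options; the exact values follow the parameter list shown earlier):

```python
cappuccino(text = data['headline'],
time = data['date'],
plot = 'line',
ngram = 1, # 1 = unigram line plot, 2 = bigram line plot
time_freq = 'M', # monthly aggregation
max_words = 8, # eight most frequent unigrams per period
stopwords = ['english'], # remove English stopwords
skip = ['g', 'br'], # remove additional strings
numbers = True, # remove numbers
punct = True, # remove punctuation
lower_case = True # lowercase text before cleaning and frequency analysis
)
```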
The bigram line plot:
Cappuccino greatly helps with the visual exploration of text data that has a time-series character. With a single line of code, we pre-process the data and get a first exploratory glimpse of the dataset. Here are several tips to follow:
- Match the aggregation frequency to the length of the time dimension in the data. For a long time series, a monthly plot will not display the data clearly, while a yearly graph of a short time series (less than a year) will not show any variability over time.
- Select a suitable form of visualization based on the dataset in your project. A line plot is not a good choice for datasets with high n-gram variability over time (see Fig 8). In that case, a heatmap gives a clearer picture, even with many n-grams in each period.
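One simple way to operationalize the first tip is to measure the span of the time column and pick 'Y' or 'M' accordingly (the three-year threshold below is an arbitrary assumption, not an Arabica rule):

```python
import pandas as pd

# Date range of the headline data used in this tutorial
dates = pd.to_datetime(["2003-02-19", "2016-09-18"])
span_days = (dates.max() - dates.min()).days

# Yearly aggregation for multi-year series, monthly otherwise
time_freq = "Y" if span_days > 3 * 365 else "M"
print(time_freq)  # 'Y' for the 2003-2016 headline data
```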
Some questions we can answer with Arabica are (1) how the concepts in a specific domain (economics, biology, etc.) evolved over time, using research article metadata; (2) which key topics were emphasized during a presidential campaign, using tweets; (3) which aspects of its brand and communication a company should improve, using customer product reviews.
The complete code for this tutorial is on my GitHub. For more examples, read the documentation and a tutorial on the arabica_freq method.
PS: You can subscribe to my email list to get notified every time I write a new article. And if you are not a Medium member yet you can join here.