
Visualization Module in Arabica Speeds Up Text Data Exploration
by Petr Korab, January 2023



Figure 1. Bigram word cloud, image by author.

Arabica is a Python library for exploratory analysis of text data from a time-series perspective. It reflects the empirical reality that many text datasets are collected as repeated observations over time. Time-series text data include newspaper headlines, research article abstracts and metadata, product reviews, social network communication, and many others. Arabica simplifies exploratory data analysis (EDA) of these datasets by providing two methods:

  • arabica_freq: descriptive and time-series n-gram analysis, for n-gram-based EDA of text datasets
  • cappuccino: visual exploration of the data.

This article provides an introduction to Cappuccino, Arabica’s visualization module for exploratory analysis of time-series text data. Read the documentation and a tutorial here for a general introduction to Arabica.

The plots implemented are the word cloud (unigram, bigram, and trigram versions), heatmap, and line plot. They help discover (1) the most frequent n-grams in the whole dataset, reflecting its time-series character (word clouds), and (2) the development of n-grams over time (heatmap, line plot).

The graphs are designed for use in presentations, reports, and empirical studies. They are therefore rendered in high resolution (word clouds: 6192 x 3811 px; heatmap and line plot: 5049 x 2835 px).

Cappuccino relies on matplotlib, wordcloud, and plotnine to create and display graphs, and on cleantext and the NLTK corpus of stopwords for pre-processing. Plotnine brings the popular and widely used ggplot2 graphics library from R to Python. The requirements are here.

The method’s parameters are:

def cappuccino(text: str,                 # text column
               time: str,                 # time column
               plot: str = '',            # chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int = '',           # n-gram size: 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str = '',       # aggregation period: 'Y'/'M', or 'ungroup' for none
               max_words: int = '',       # max number of most frequent n-grams displayed per period
               stopwords: list = [],      # languages for stop words
               skip: list = [],           # additional strings to remove
               numbers: bool = False,     # remove numbers
               punct: bool = False,       # remove punctuation
               lower_case: bool = False   # lowercase text before cleaning and frequency analysis
)

Descriptive analysis in Arabica provides n-gram frequency calculations without aggregation over a specific period. In simple terms: first, n-gram frequencies are calculated for each text record; second, the frequencies are summed over the whole dataset; and finally, the summed frequencies are visualized in a plot.
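To make the mechanics concrete, here is a minimal sketch of this per-record counting and summing logic in plain Python. It illustrates the idea only, not Arabica's actual implementation; the tokenization and cleaning are deliberately simplified:

from collections import Counter

def ngram_frequencies(records, n=1):
    # Count n-grams within each record, then sum the counts over the dataset
    total = Counter()
    for record in records:
        tokens = record.lower().split()
        ngrams = zip(*(tokens[i:] for i in range(n)))  # sliding window of size n
        total.update(' '.join(gram) for gram in ngrams)
    return total

# Example: the three most frequent bigrams in a toy corpus
headlines = ['police probe budget cuts', 'budget cuts hit police']
print(ngram_frequencies(headlines, n=2).most_common(3))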

Word cloud

Let’s illustrate the coding on the Million News Headlines dataset, which contains news headlines published daily from 2003-02-19 to 2016-09-18. The dataset is provided by the Australian Broadcasting Corporation under the CC0: Public Domain license. We’ll subset the data to the first 50,000 headlines.

First, install Arabica with pip install arabica, then import Cappuccino:

from arabica import cappuccino
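Reading and subsetting the data might look like this (a minimal sketch: the file name and the column renaming follow the dataset's usual Kaggle distribution, abcnews-date-text.csv with columns publish_date and headline_text, and may differ on your machine):

import pandas as pd

data = pd.read_csv('abcnews-date-text.csv')
data = data.rename(columns={'publish_date': 'date', 'headline_text': 'headline'})
data = data.head(50000)  # subset to the first 50,000 headlines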

Once loaded, the data looks like this:

Figure 2. Million News Headlines data

We lowercase the text, remove punctuation, numbers, English stopwords, and other unwanted strings (“g”, “br”), and plot a word cloud with the 100 most frequent words:

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'wordcloud',
           ngram = 1,                # n-gram size: 1 = unigram
           time_freq = 'ungroup',    # no period aggregation
           max_words = 100,          # display the 100 most frequent words
           stopwords = ['english'],  # remove English stopwords
           skip = ['g','br'],        # remove additional strings
           numbers = True,           # remove numbers
           punct = True,             # remove punctuation
           lower_case = True         # lowercase text before cleaning and frequency analysis
)

It returns the word cloud:

Figure 3. Word cloud, image by author.

Changing to ngram = 2 returns a word cloud with the 100 most frequent bigrams (see the cover picture). Alternatively, ngram = 3 displays the most frequent trigrams:
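The trigram version keeps every other argument from the unigram call unchanged:

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'wordcloud',
           ngram = 3,                # trigrams
           time_freq = 'ungroup',    # no period aggregation
           max_words = 100,          # display the 100 most frequent trigrams
           stopwords = ['english'],
           skip = ['g','br'],
           numbers = True,
           punct = True,
           lower_case = True
)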

Figure 4. Word cloud — trigram, image by author.

Time-series text data typically display variability over time. Political statements before elections and newspaper headlines during the Covid-19 pandemic are good examples. To display n-grams over time, Arabica implements a heatmap and a line plot with monthly and yearly aggregation.


Heatmap

The following code displays a heatmap with the ten most frequent unigrams in each month:

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'heatmap',
           ngram = 1,                # n-gram size: 1 = unigram
           time_freq = 'M',          # monthly aggregation
           max_words = 10,           # display the 10 most frequent words for each period
           stopwords = ['english'],  # remove English stopwords
           skip = ['g', 'br'],       # remove additional strings
           numbers = True,           # remove numbers
           punct = True,             # remove punctuation
           lower_case = True         # lowercase text before cleaning and frequency analysis
)

The unigram heatmap is the output:

Figure 5. Heatmap — unigram, image by author.

The unigram heatmap gives us a first look at the variability of the data over time. We can clearly identify the important patterns in the data:

  • most frequent n-grams: “us”, “police”, “new”, “man”
  • outliers (terms appearing only in one period): “war”, “wa”, “rain”, “killed”, “iraqi”, “concerns”, “budget”, “bali”

We might consider removing the outliers at a later stage of the analysis. Alternatively, changing ngram = 2 and max_words = 5 creates a heatmap with the five most frequent bigrams in each period:
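The bigram heatmap call, with only these two arguments changed from the unigram version:

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'heatmap',
           ngram = 2,                # bigrams
           time_freq = 'M',          # monthly aggregation
           max_words = 5,            # display the 5 most frequent bigrams for each period
           stopwords = ['english'],
           skip = ['g', 'br'],
           numbers = True,
           punct = True,
           lower_case = True
)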

Figure 6. Heatmap — bigram, image by author.

Line plot

A line plot of n-grams is displayed by changing plot = 'line' and setting the ngram parameter to 1 or 2. In this way, we create line plots with the eight most frequent unigrams and the four most frequent bigrams in each period:
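The unigram line plot call, keeping the monthly aggregation from the previous examples (for the bigram version, change ngram to 2 and max_words to 4):

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'line',
           ngram = 1,                # unigrams
           time_freq = 'M',          # monthly aggregation
           max_words = 8,            # display the 8 most frequent unigrams for each period
           stopwords = ['english'],
           skip = ['g', 'br'],
           numbers = True,
           punct = True,
           lower_case = True
)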

Figure 7. Line plot — unigram, image by author.

The bigram line plot:

Figure 8. Line plot — bigram, image by author.

Cappuccino greatly helps in the visual exploration of text data with a time-series character. With a single line of code, we pre-process the data and get a first exploratory glimpse of the dataset. Here are several tips to follow:

  • The appropriate visualization frequency depends on the length of the time dimension in the data. For long time series, a monthly plot will not display the data clearly, while for short time series (less than a year) a yearly plot will not show any variability over time. A quick check of the data's time span, as sketched after this list, helps choose between the two.
  • Select a form of visualization suited to the dataset in your project. A line plot is not a good choice for datasets with high n-gram variability over time (see Figure 8). In such cases, the heatmap gives a clearer picture even with many n-grams in each period.
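A minimal sketch of such a check, assuming the pandas dataframe from earlier (the one-year threshold is an illustrative rule of thumb, not an Arabica recommendation):

import pandas as pd

dates = pd.to_datetime(data['date'])
span_days = (dates.max() - dates.min()).days
time_freq = 'M' if span_days < 365 else 'Y'   # monthly for short series, yearly for long
print(f"Time span: {span_days} days -> time_freq = '{time_freq}'")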

Some questions we can answer with Arabica: (1) how concepts in a specific domain (economics, biology, etc.) evolved over time, using research article metadata; (2) which key topics were emphasized during a presidential campaign, using tweets; (3) which parts of its brand and communication a company should improve, using customer product reviews.

The complete code for this tutorial is on my GitHub. For more examples, read the documentation and a tutorial on the arabica_freq method.

PS: You can subscribe to my email list to get notified every time I write a new article. And if you are not a Medium member yet you can join here.
