
Arabica: A Python Package for Exploratory Analysis of Text Data | by Petr Korab | Sep, 2022



Arabica provides unigram, bigram, and trigram frequencies by period in a single line of code. Learn more in this tutorial.

Photo by Artem Sapegin on Unsplash

Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Exploratory analysis of such datasets is a non-trivial coding task. Arabica makes it simple in a single line of Python code.

Arabica takes a data frame of text data as the input, enables standard cleaning operations (number, punctuation, and stopword removal), and provides unigram (e.g., dog), bigram (e.g., dog, goes), and trigram (e.g., dog, goes, home) frequencies over a monthly or yearly period.

Figure 1: Scheme of arabica_freq method

It uses cleantext, an excellent Python library for punctuation cleaning, and the NLTK stopwords corpus for pre-processing. The list of languages with available stopwords can be printed with:

Let’s illustrate Arabica’s coding on the example of IMDb 50K Movie Reviews (see the data license). To add a time dimension, the time column contains synthetic dates in the ‘yyyy-mm-dd’ format. Here is what the subset of data looks like:

Figure 2: IMDb 50K Movie Reviews data subset

1. First look at the data

Let’s first look at the raw data in yearly frequency to find out more about the narrative of movie reviewers over time. We’ll read the data:
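The read step is not reproduced in this copy. A minimal sketch with pandas follows; the column names `review` and `time` are assumptions based on Figure 2, and the file name is hypothetical:

```python
import pandas as pd

# In the article the data comes from the IMDb 50K CSV, e.g.:
# df = pd.read_csv("imdb_50k_reviews.csv")  # hypothetical file name
# For a self-contained illustration, build a tiny stand-in frame instead:
df = pd.DataFrame(
    {
        "review": [
            "One of the best movies I have ever seen.",
            "The plot was predictable and the acting flat.",
        ],
        "time": ["2013-05-14", "2014-11-02"],  # synthetic 'yyyy-mm-dd' dates
    }
)
print(df.head())
```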

Then we call arabica_freq, specify yearly aggregation, and keep numbers and punct set to False and stopwords set to None, so that we look at the raw data, including stopwords, digits, and special characters. max_words is set to 2 so that the output table stays easy to read.
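The call itself is missing from this copy. Based on Arabica's interface it would look roughly like `arabica_freq(text=df['review'], time=df['time'], time_freq='Y', max_words=2, stopwords=None, numbers=False, punct=False)`; treat the parameter names as best-effort assumptions and check the package docs. What such a call computes per period can be sketched in plain pandas (a simplified stand-in, not Arabica's actual implementation):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame(
    {
        "review": ["the dog goes home", "the dog sleeps", "a cat goes home"],
        "time": ["2013-05-14", "2013-11-02", "2014-03-08"],
    }
)

def ngrams(tokens, n):
    """All consecutive n-token sequences, joined with commas as in Arabica's output."""
    return [",".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

# Yearly aggregation: group the raw text by year, count n-grams per review
# (so n-grams never span review boundaries), and keep the top max_words.
max_words = 2
year = pd.to_datetime(df["time"]).dt.year
for yr, texts in df.groupby(year)["review"]:
    for n in (1, 2, 3):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text.split(), n))
        print(yr, f"{n}-gram", counts.most_common(max_words))
```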

Here is the table of aggregated n-gram frequencies for the first six years:

Figure 3: arabica_freq output, yearly n-gram frequencies

We can see that the data contains lots of unnecessary prepositions and other stopwords that should be removed.

2. More detailed inspection of clean data

Next, we’ll remove numbers, punctuation, and English stopwords and display monthly n-gram frequencies to dig more into the clean data.
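The corresponding call is also missing here; with Arabica it would roughly mean switching to monthly aggregation and passing an English stopword list (again, parameter names are assumptions). The cleaning step itself, lowercasing, stripping digits and punctuation, and dropping stopwords, can be sketched as:

```python
import re

# A tiny illustrative stopword list; Arabica uses NLTK's full English list
STOPWORDS = {"the", "a", "an", "and", "is", "was", "of", "i", "it"}

def clean(text):
    """Lowercase, drop digits and punctuation, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)    # numbers removal
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation / special characters
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean("The acting was great, 10/10 - a must-see!"))  # → acting great must see
```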

We can see significant variation in the text data over time (note that we are working with a synthetic example dataset). The first five rows of the table are:

Figure 4: arabica_freq output, monthly n-gram frequencies
Photo by River Fx on Unsplash

This package glorifies the best drink in the world, and hopefully, it will save you some time doing exploratory analysis of text data. Arabica's main benefits are:

  • coding efficiency – the EDA is done in one line of code
  • cleaning implementation – no need for prior text data pre-processing
  • solid performance – runs fast even with datasets of (tens of) thousands of rows.

Arabica is available from PyPI. For source files, go to my GitHub. Enjoy it, and please let me know how it worked on your projects!

PS: You can subscribe to my email list to get notified every time I write a new article. And if you are not a Medium member yet, you can join here.


