Arabica: A Python Package for Exploratory Analysis of Text Data | by Petr Korab | Sep, 2022
Arabica provides unigram, bigram, and trigram frequencies by period in a single line of code. Learn more in this tutorial.
Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Exploratory analysis of such datasets is a non-trivial coding task. Arabica makes it simple in a single line of Python code.
Arabica takes a data frame of text data as the input, enables standard cleaning operations (numbers, punctuation, and stopwords removal), and provides unigram (e.g., dog), bigram (e.g., dog, goes), and trigram (e.g., dog, goes, home) frequencies over a monthly or yearly period.
It uses cleantext, an excellent Python library, for punctuation cleaning, and the nltk stopwords corpus for pre-processing. The list of languages with available stopwords is printed with:
Let’s illustrate Arabica’s coding on the example of IMDb 50K Movie Reviews (see the data license). To add a time dimension, the time column contains synthetic dates in the ‘yyyy-mm-dd’ format. Here is what a subset of the data looks like:
1. First look at the data
Let’s first look at the raw data in yearly frequency to find out more about the narrative of movie reviewers over time. We’ll read the data:
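A hedged sketch of the reading step; the file name and column names below are assumptions, with an inline sample standing in for the IMDb CSV:

```python
import io

import pandas as pd

# Inline sample standing in for the IMDb reviews file
# (column names 'review' and 'time' are assumptions).
sample = io.StringIO(
    "review,time\n"
    "A fine film with a strong cast.,2013-05-14\n"
    "Dull and far too long.,2014-11-02\n"
)
df = pd.read_csv(sample, parse_dates=["time"])
print(df.shape)  # (2, 2)
```

Parsing the `time` column as datetimes up front makes the later monthly and yearly aggregations straightforward.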
Then we call arabica_freq, specify yearly aggregation, keep numbers and punct as False and stopwords as None to look at raw data, including stopwords, digits, and special characters. max_words is set to 2 so that the output table is easy to read.
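For intuition about what this call computes, the raw yearly unigram aggregation can be approximated in plain pandas with collections.Counter. This is a simplified stand-in, not Arabica's implementation:

```python
from collections import Counter

import pandas as pd

def yearly_unigram_freq(df, max_words=2):
    """Top `max_words` raw unigrams per year -- a simplified stand-in
    for arabica_freq with no cleaning applied (stopwords kept)."""
    out = {}
    for year, group in df.groupby(df["time"].dt.year):
        # Naive whitespace tokenization; no punctuation or stopword removal
        counts = Counter(" ".join(group["text"]).lower().split())
        out[year] = counts.most_common(max_words)
    return out

df = pd.DataFrame({
    "text": ["the movie was great", "the plot was weak the end"],
    "time": pd.to_datetime(["2013-01-05", "2014-07-19"]),
})
print(yearly_unigram_freq(df))
```

Note how "the" dominates the raw counts, which is exactly the stopword noise the next section removes.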
Here is the table of aggregated n-gram frequencies for the first six years:
We can see that the data contains lots of unnecessary prepositions and other stopwords that should be removed.
2. More detailed inspection of clean data
Next, we’ll remove numbers, punctuation, and English stopwords and display monthly n-gram frequencies to dig more into the clean data.
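The cleaning and monthly aggregation can be sketched the same way; the tiny stopword set below is illustrative only (Arabica uses the full nltk list):

```python
import re
from collections import Counter

import pandas as pd

# Tiny illustrative stopword set; Arabica draws on nltk's full English list
STOPWORDS = {"the", "a", "an", "and", "was", "is", "of", "to", "in"}

def clean_tokens(text):
    # Lower-case, keep alphabetic tokens only (drops digits and
    # punctuation), then filter out stopwords
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

df = pd.DataFrame({
    "text": ["The plot was great!!", "An awful, awful ending in 2014..."],
    "time": pd.to_datetime(["2013-01-05", "2013-02-11"]),
})
monthly = {
    str(period): Counter(
        t for text in group["text"] for t in clean_tokens(text)
    ).most_common(2)
    for period, group in df.groupby(df["time"].dt.to_period("M"))
}
print(monthly)
```

After cleaning, only content words such as "plot" and "awful" remain in the per-month counts.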
We can see significant variation in the text over time (note that we are working with a synthetic example dataset). The first five rows of the table are:
This package glorifies the best drink in the world, and hopefully, it will save you some time doing exploratory analysis of text data. Arabica’s main benefits are:
- coding efficiency – the EDA is done in one line of code
- cleaning implementation – no need for prior text data pre-processing
- solid performance – runs fast even with datasets of (tens of) thousands of rows.
Arabica is available from PyPI. For source files, go to my GitHub. Enjoy it, and please let me know how it worked on your projects!
PS: You can subscribe to my email list to get notified every time I write a new article. And if you are not a Medium member yet, you can join here.