Arabica: A Python Package for Exploratory Analysis of Text Data | by Petr Korab | Sep, 2022
Arabica provides unigram, bigram, and trigram frequencies by period in a single line of code. Learn more in this tutorial.
Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Exploratory analysis of such datasets is a non-trivial coding task. Arabica makes it simple in a single line of Python code.
Arabica takes a data frame of text data as the input, enables standard cleaning operations (numbers, punctuation, and stopwords removal), and provides unigram (e.g., dog), bigram (e.g., dog, goes), and trigram (e.g., dog, goes, home) frequencies over a monthly or yearly period.
It uses cleantext, an excellent Python library, for punctuation cleaning, and the nltk stopwords corpus for pre-processing. The list of languages with available stopwords is printed with:
Let’s illustrate Arabica’s coding on the example of IMDb 50K Movie Reviews (see the data license). To add a time dimension, the time column contains synthetic dates in the ‘yyyy-mm-dd’ format. Here is what a subset of the data looks like:
1. First look at the data
Let’s first look at the raw data in yearly frequency to find out more about the narrative of movie reviewers over time. We’ll read the data:
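A hedged sketch of the reading step; the file name and column names below are assumptions, with an inline sample standing in for the IMDb CSV:

```python
import io

import pandas as pd

# Inline sample standing in for the IMDb reviews file
# (column names 'review' and 'time' are assumptions).
sample = io.StringIO(
    "review,time\n"
    "A fine film with a strong cast.,2013-05-14\n"
    "Dull and far too long.,2014-11-02\n"
)
df = pd.read_csv(sample, parse_dates=["time"])
print(df.shape)  # (2, 2)
```

Parsing the `time` column as datetimes up front makes the later monthly and yearly aggregations straightforward.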
Then we call arabica_freq, specify yearly aggregation, keep numbers and punct as False and stopwords as None to look at raw data, including stopwords, digits, and special characters. max_words is set to 2 so that the output table is easy to read.
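For intuition about what this call computes, the raw yearly unigram aggregation can be approximated in plain pandas with collections.Counter. This is a simplified stand-in, not Arabica's implementation:

```python
from collections import Counter

import pandas as pd

def yearly_unigram_freq(df, max_words=2):
    """Top `max_words` raw unigrams per year -- a simplified stand-in
    for arabica_freq with no cleaning applied (stopwords kept)."""
    out = {}
    for year, group in df.groupby(df["time"].dt.year):
        # Naive whitespace tokenization; no punctuation or stopword removal
        counts = Counter(" ".join(group["text"]).lower().split())
        out[year] = counts.most_common(max_words)
    return out

df = pd.DataFrame({
    "text": ["the movie was great", "the plot was weak the end"],
    "time": pd.to_datetime(["2013-01-05", "2014-07-19"]),
})
print(yearly_unigram_freq(df))
```

Note how "the" dominates the raw counts, which is exactly the stopword noise the next section removes.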
Here is the table of aggregated n-gram frequencies for the first six years:
We can see that the data contains lots of unnecessary prepositions and other stopwords that should be removed.
2. More detailed inspection of clean data
Next, we’ll remove numbers, punctuation, and English stopwords and display monthly n-gram frequencies to dig more into the clean data.
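The cleaning and monthly aggregation can be sketched the same way; the tiny stopword set below is illustrative only (Arabica uses the full nltk list):

```python
import re
from collections import Counter

import pandas as pd

# Tiny illustrative stopword set; Arabica draws on nltk's full English list
STOPWORDS = {"the", "a", "an", "and", "was", "is", "of", "to", "in"}

def clean_tokens(text):
    # Lower-case, keep alphabetic tokens only (drops digits and
    # punctuation), then filter out stopwords
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

df = pd.DataFrame({
    "text": ["The plot was great!!", "An awful, awful ending in 2014..."],
    "time": pd.to_datetime(["2013-01-05", "2013-02-11"]),
})
monthly = {
    str(period): Counter(
        t for text in group["text"] for t in clean_tokens(text)
    ).most_common(2)
    for period, group in df.groupby(df["time"].dt.to_period("M"))
}
print(monthly)
```

After cleaning, only content words such as "plot" and "awful" remain in the per-month counts.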
We can see significant variation in the text over time (note that we are working with a synthetic example dataset). The first five rows of the table are:
This package glorifies the best drink in the world, and hopefully, it will save you some time doing exploratory analysis of text data. Arabica’s main benefits are:
- coding efficiency – the EDA is done in one line of code
- cleaning implementation – no need for prior text data pre-processing
- solid performance – runs fast even with datasets of (tens of) thousands of rows.
Arabica is available from PyPI. For source files, go to my GitHub. Enjoy it, and please let me know how it worked on your projects!
PS: You can subscribe to my email list to get notified every time I write a new article. And if you are not a Medium member yet, you can join here.