
5 Powerful Python Libraries For EDA You Need to Know About | by Andy McDonald | Feb, 2023



Image by Gerd Altmann from Pixabay

Ensuring data is of good quality before running machine learning models is essential. If we feed poor-quality data to these models, we may end up with unexpected or unintended consequences. However, carrying out the prep work on data and trying to understand what you have or don’t have is very time-consuming. Oftentimes, this process can consume up to 90% of a project’s available time.

If you carry out Exploratory Data Analysis (EDA) within Python, you will be aware of the common libraries such as pandas, matplotlib and seaborn. All are great libraries, but each has its own nuances, which can take time to learn or remember.

In recent years, several powerful low-code Python libraries have emerged that make the data exploration and analysis phase of projects much quicker and easier.

In this article, I will introduce you to five of these Python libraries, all of which can be run within a Jupyter notebook environment and will enhance your data analysis workflow.

The YData Profiling library, formerly known as Pandas Profiling, allows you to create detailed reports based on a pandas dataframe. It is very simple to navigate and provides information on the individual variables, missing data analysis, data correlations and interactions.

One slight issue with YData Profiling is how it handles larger datasets, which can slow down report generation.

How to Use The YData Profiling Library

YData Profiling can be installed via a terminal using pip:

pip install ydata-profiling

After the library has been installed in your Python environment, we can simply import the ProfileReport class from the library alongside pandas. Pandas is used to load our data from a CSV file or another format.

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('Data/Xeek_Well_15-9-15.csv')
ProfileReport(df)

Once the data has been read, we can pass our dataframe to ProfileReport, and the report will begin generating.

The length of time it takes to generate the report depends on the size of your dataset: the larger the dataset, the longer it will take.

After the report has been created, you can then begin scrolling through the report as seen below.

The YData Profiling report of the selected dataset. Image by the author.

We can dig into each variable within the dataset and view information on data completeness, statistics and data types.

View key statistics of numeric variables within the dataset. Image by the author.

We can also create visualisations of data completeness. This allows us to understand what data is missing and how missingness is related between the different variables.

Identification of missing values through various views using the YData Profiling report. Image by the author.

You can explore more of the features of Pandas Profiling (as the library was known before its rename) in the article below.

D-Tale takes your Pandas dataframe to a whole new level. This powerful and fast library makes it very easy to interact with your data, carry out basic analysis and even edit it.

I have only recently found this library, but it has become one of my go-to libraries for exploring data.

If you want to give the library a try before downloading it, the library authors have provided a live example.

How to Use D-Tale

D-Tale can be installed via a terminal using pip:

pip install dtale

Then it can be imported alongside pandas, as seen below. Once the data has been read by pandas, the resultant dataframe can be passed to dtale.show().

import pandas as pd
import dtale

df = pd.read_csv('Data/Xeek_Well_15-9-15.csv')

dtale.show(df)

After a little wait, the D-Tale interactive table will appear with all of the data contained within the dataframe.

D-Tale comes with a large number of features that allow you to interrogate the data, visualise its completeness, edit the data and more.

When we look into individual variables, such as the DTC column within this dataset, we can visualise its distribution using histograms:

Interactive histogram within the Describe module of D-Tale. Image by the author.

And view how that data is distributed amongst a categorical variable:

Easily visualise the data by categories such as lithology or geological formation. Image by the author.

If you want to explore more of the features of D-Tale, you can find out more in my article below:

SweetViz is another low-code, interactive data visualisation and exploration library. With a couple of lines of code, we can create an interactive HTML file to explore our data.

How to Use SweetViz

Sweetviz can be installed via the terminal using pip:

pip install sweetviz

Once it has been installed, we can import it into our notebook and load our data using pandas.

import sweetviz as sv
import pandas as pd

df = pd.read_csv('Data/Xeek_Well_15-9-15.csv')

We then need two more lines of code to generate our report:

report = sv.analyze(df)
report.show_html()

This will then open a new browser tab with the following setup.

SweetViz — a fast and powerful EDA Python library. Image by the author.

Once the browser tab has opened, you can go through each of the variables within the dataframe and view the key statistics and the completeness of each variable. When you click on any of the variables, it will open up histograms of the data distribution if it is numeric data or a count of values if it is categorical data.

Additionally, it will show the relationship, in numbers, of that variable with the other variables in the dataset.

If you want to see this visually, you can click on the Associations button at the top of the dashboard to open up a correlation graph. In the image below, we can see a mixture of squares and circles, which represent categorical and numerical variables, respectively.

The size of the square/circle represents the strength of the relationship, and the colour represents the Pearson’s correlation coefficient value. This has to be one of the best visualisations of relationships between variables I have seen so far within Python.

Associations between variables generated using SweetViz. Image by the author.
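The Pearson coefficients behind this associations view are not unique to SweetViz; if you ever need the raw numbers for the numeric columns, pandas can compute the same matrix. A minimal sketch, using a made-up dataframe in place of the well-log data:

```python
import pandas as pd

# Hypothetical numeric data standing in for the well-log measurements.
df = pd.DataFrame({
    'DEPTH_MD': [100.0, 200.0, 300.0, 400.0],
    'RHOB': [2.1, 2.3, 2.4, 2.6],
    'GR': [80.0, 60.0, 55.0, 40.0],
})

# Pairwise Pearson correlation coefficients for the numeric columns,
# the same statistic SweetViz colours its associations plot by.
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

Here RHOB increases with depth (strong positive correlation) while GR decreases with depth (strong negative correlation), which is exactly the kind of relationship the Associations view surfaces at a glance.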

One of the minor issues I have found with this library is that you need a wide screen to view all of the horizontal content without scrolling. However, don’t let that deter you from the power this library can bring to your EDA.

If you are interested in using a lightweight library to explore the completeness of your data, then missingno is one you should definitely consider for your EDA toolbox.

It is a Python library that provides a series of visualisations to understand the presence and distribution of missing data within a pandas dataframe. The library provides you with a small number of plots (barplot, matrix plot, heatmap or dendrogram) to visualise what columns in your dataframe contain missing values and how the degree of missingness is related between the variables.

How to Use MissingNo

Missingno can be installed via the terminal using pip:

pip install missingno

Once the library has been installed, we can import it alongside pandas and load our data into a dataframe.

import pandas as pd
import missingno as msno
df = pd.read_csv('xeek_train_subset.csv')

We can then call whichever of the available plots we need:

msno.bar(df)
msno.matrix(df)
msno.dendrogram(df)
msno.heatmap(df)

The four main plots within the missingno library. Image by the author.

The above four plots provide us insight into:

  • How complete each column within the dataframe is — msno.bar()
  • Where the missing data occurs — msno.matrix()
  • How correlated the missing values are — msno.heatmap() and msno.dendrogram()

The nice thing about this library is that the plots are clean, easy to understand and can be quickly incorporated into a report as they are.
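When you need the completeness figures as numbers rather than a plot, the counts that msno.bar() visualises can be pulled straight out of pandas. A minimal sketch on a small, made-up dataframe with some missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe containing missing values.
df = pd.DataFrame({
    'GR': [80.0, np.nan, 55.0, 40.0],
    'RHOB': [2.1, 2.3, np.nan, np.nan],
    'DEPTH_MD': [100.0, 200.0, 300.0, 400.0],
})

# Count of missing values per column (what msno.bar visualises).
missing_counts = df.isna().sum()

# Fraction of each column that is complete.
completeness = 1 - df.isna().mean()
print(completeness)
```

This pairs well with the plots: use missingno to spot the pattern, then pandas to quantify it in a pipeline or report.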

To understand more about each of these plots, I recommend diving into the article below.

Sketch is a very new (as of February 2023) library that leverages the power of AI to help you understand your pandas dataframes by asking natural language questions directly within Jupyter. You can also use it to generate sample code, for example, how to plot x against y within the dataframe, and then run that code to generate the required plot.

The library is mostly self-contained: it uses machine learning algorithms to understand the context of your question in relation to your dataset. One function does rely on OpenAI’s API, but that does not detract from how the rest of the library can be used.

Sketch has a lot of potential to be powerful, especially if you are looking to provide an interface to customers with very limited knowledge of coding in Python.

How to Use Sketch

Sketch can be installed via the terminal using pip:

pip install sketch

We then import pandas and sketch into our notebook, followed by loading the data from our CSV file.

import sketch
import pandas as pd

df = pd.read_csv('Data/Xeek_Well_15-9-15.csv')

Once sketch has been imported, three new methods will be available for our dataframe.

The first is the .ask method, which allows you to ask questions — using natural language — about the contents of the dataframe.

df.sketch.ask('What are the max values of each numerical column?')

This returns the following line with the max values of each of the numerical columns within the dataframe.

Response from Sketch when asked to return the max values of each column. Image by the author.
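Because the answer comes back from a language model, it is worth sanity-checking it against pandas itself; the per-column maxima that Sketch reports should match a plain aggregation. A minimal sketch on a made-up dataframe standing in for the well-log data:

```python
import pandas as pd

# Hypothetical stand-in for the well-log dataframe.
df = pd.DataFrame({
    'GR': [80.0, 60.0, 55.0, 40.0],
    'RHOB': [2.1, 2.3, 2.4, 2.6],
    'WELL': ['15/9-15', '15/9-15', '15/9-15', '15/9-15'],
})

# Max of each numeric column, for comparison with Sketch's answer.
max_values = df.max(numeric_only=True)
print(max_values)
```

If the two disagree, trust pandas and rephrase the question.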

We can also ask it how complete the dataframe is:

df.sketch.ask('How complete is the data?')

And it will return the following response in human-readable prose rather than tables or graphs.

Response from sketch when asked about the completeness of the dataframe. Image by the author.

Very impressive. But that is not all.

We can even ask the library how to plot the data contained within the dataframe using .sketch.howto().

df.sketch.howto("""How do I plot RHOB against DEPTH_MD 
using a line plot and have the line coloured in red?""")

And it will return a code snippet of how to do it:

Returned code snippet from the sketch.howto function. Image by the author.

Which, when run, will return the following plot:

Plot generated from code returned by the Sketch Python library. Image by the author.

The third option available with Sketch is the .apply method, which requires an OpenAI API key in order to run. This function is handy when we want to create new features from existing ones or generate entirely new ones. As of this moment, I have not explored this option, but I hope to in the near future.

Within this article, we have seen five powerful Python libraries that can be used to speed up and enhance the exploratory data analysis phase of a project. These range from simple graphics to interacting with the data using the power of natural language processing.

I highly recommend that you check these libraries out and explore their capabilities. You never know, you may just find your new favourite Python library.


