Things you can do in one line using Pandas
Training data-driven machine learning models has never been as easy as today. For instance, assume you are training a vanilla neural network. Here, adjusting the architecture for the number of hidden layers and their dimension, tweaking the hyperparameters, or changing the loss function can all be done with a slight modification in the model definition or its optimizer.
On the one hand, this is advantageous, as it reduces the heavy lifting of designing architectures from scratch. On the other hand, it has often led machine learning practitioners and researchers to neglect the importance of data visualization and analysis, training deep models directly without establishing a clear understanding of their data.
Therefore, in this post, I would like to introduce you to a handful of essential and powerful one-liners specifically for tabular data using Pandas that will help you better understand your data and consequently (and hopefully) help you design and build better machine learning models.
For this post, I will experiment with a dummy dataset of one thousand employees that I created myself in Python. The image below gives an overview of the dataset we are experimenting with.
The code block below demonstrates my implementation:
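The original code block isn't reproduced here, so below is a minimal, hypothetical sketch of how such a dummy dataset could be generated. The column names, company names, job titles, and salary range are assumptions for illustration, not the author's actual schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducibility
n = 1000

# Hypothetical categories; the original dataset's values are not shown.
companies = ["TechCorp", "DataWorks", "CloudNine", "ByteBridge"]
titles = ["Data Scientist", "Software Engineer", "Analyst", "Manager"]
locations = ["New York", "London", "Berlin", "Remote"]

data = pd.DataFrame({
    "Name": [f"Employee_{i}" for i in range(n)],
    "Company Name": rng.choice(companies, size=n),
    "Employee Job Title": rng.choice(titles, size=n),
    "Employee Salary": rng.integers(40_000, 200_000, size=n),
    "Location": rng.choice(locations, size=n),
})
print(data.shape)  # (1000, 5)
```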
Next, let’s discuss some popular functions available in Pandas for building a meaningful understanding of the available data.
#1 n-largest values in a series
Say we want to start off by finding the top-n paid roles in this dataset. You can do this using the nlargest() method in Pandas, which returns the first n rows with the largest values in the specified column(s), ordered in descending order.
Note that nlargest() returns entire rows of the DataFrame, i.e., it also returns the columns not specified for ordering; they simply aren't used to order the result. The code snippet below depicts the use of the nlargest() method on our DataFrame.
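As a sketch with a small, made-up DataFrame (the names, roles, and salaries are illustrative, not from the original dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan", "Eve"],
    "Role": ["Manager", "Engineer", "Analyst", "Engineer", "Manager"],
    "Salary": [120, 95, 70, 95, 150],
})

# Top-3 rows by Salary, in descending order; whole rows come back,
# including the Name and Role columns that were not used for ordering.
top3 = df.nlargest(3, "Salary")
print(top3)
```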
When duplicate values exist, we need to specify which particular row(s) we want in the final output. This is done using the keep argument, which can take the following values:
keep = "first": prioritizes the first occurrence.
keep = "last": prioritizes the last occurrence.
keep = "all": does not drop any duplicates, even if it means selecting more than n items (like in the image above).
It is often mistaken that nlargest() is precisely equivalent to using the sort_values() method as follows:
However, the keep argument used in nlargest() makes all the difference. In the example above, nlargest() with keep="all" returns potential duplicates as well; this cannot be done with the sort_values() method.
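To illustrate the difference with toy data (the values here are assumed for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan"],
    "Salary": [100, 90, 90, 80],
})

# keep="all" may return more than n rows when the n-th value is tied:
with_ties = df.nlargest(2, "Salary", keep="all")  # 3 rows: 100, 90, 90

# sort_values(...).head(n) always truncates to exactly n rows:
head_only = df.sort_values("Salary", ascending=False).head(2)  # 2 rows
```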
#2 n-smallest values in a series
Similar to the nlargest() method discussed above, you can find the rows corresponding to the lowest-n values using the nsmallest() method in Pandas, which returns the first n rows with the smallest values in the specified column(s), arranged in ascending order. The arguments are the same as those of the nlargest() method. The code snippet below depicts the use of the nsmallest() method on our DataFrame.
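Again as a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat", "Dan"],
    "Salary": [100, 60, 75, 60],
})

# The two lowest-paid rows, in ascending order of Salary.
bottom2 = df.nsmallest(2, "Salary")
print(bottom2)
```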
#3 CrossTabs
Crosstab allows you to compute a cross-tabulation of two (or more) columns/series, returning the frequency of each combination by default. In other words, crosstab() takes one column/list and displays its unique values as the index, then takes another column/list and displays its unique values as the column headers. The values in the individual cells are computed using an aggregation function; by default, they indicate the co-occurrence frequency.
Say, for instance, we wish to compute the number of employees working from each location within every company. This can be done as follows:
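A minimal sketch with a tiny, hypothetical company/location table:

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["A", "A", "B", "B", "B"],
    "Location": ["NY", "NY", "NY", "LDN", "LDN"],
})

# Rows = companies, columns = locations, cells = co-occurrence counts.
counts = pd.crosstab(df["Company Name"], df["Location"])
print(counts)
```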
As it can be hard to interpret raw numerical values in a crosstab (and to make it more visually appealing), we can generate a heatmap from the crosstab as follows:
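One way to do this, assuming seaborn and matplotlib are available (the data and styling choices here are assumptions, not the author's exact code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Company Name": ["A", "A", "B", "B", "B"],
    "Location": ["NY", "NY", "NY", "LDN", "LDN"],
})
counts = pd.crosstab(df["Company Name"], df["Location"])

# Each cell of the heatmap is annotated with its raw count.
ax = sns.heatmap(counts, annot=True, fmt="d", cmap="Blues")
plt.savefig("crosstab_heatmap.png")
```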
If you wish to compute an aggregation on some column other than the ones that make up the index and the column headers, you can do so by passing the aggregation column to the values argument of crosstab() as shown below:
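For instance, averaging a (made-up) salary column over each company/location pair:

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["A", "A", "B", "B"],
    "Location": ["NY", "NY", "NY", "LDN"],
    "Salary": [100, 80, 120, 90],
})

# Mean Salary per (Company, Location) pair instead of raw counts;
# an aggfunc is required whenever values is passed.
avg = pd.crosstab(df["Company Name"], df["Location"],
                  values=df["Salary"], aggfunc="mean")
print(avg)
```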
#4 Pivot Table
Pivot tables are a commonly used data analysis tool in Excel. Similar to crosstabs discussed above, pivot tables in Pandas provide a way to cross-tabulate your data.
Although they share numerous similarities and are conceptually the same in the context of Pandas, a few implementation differences set them apart (further reading here). The code snippet below demonstrates the use of the pivot_table() method to find the frequency of co-occurrence between the “Company Name” and “Location” columns:
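A sketch of the same co-occurrence count via pivot_table(), on a small hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({
    "Company Name": ["A", "A", "B", "B", "B"],
    "Location": ["NY", "NY", "NY", "LDN", "LDN"],
})

# aggfunc="size" counts rows per (Company, Location) combination,
# mirroring the default behaviour of pd.crosstab.
freq = df.pivot_table(index="Company Name", columns="Location",
                      aggfunc="size", fill_value=0)
print(freq)
```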
Similar to what we did with the crosstab, we can create a heatmap from the pivot table to make it more visually appealing and interpretable:
#5 Handling Duplicated Data
In addition to regular data analysis, appropriately handling duplicate values in your data also plays a vital role in building your data pipeline. One major caveat of having duplicates in your data is that they take up unnecessary storage space and slow down computation by consuming resources. Furthermore, duplicate data can skew analysis results, leading us to draw wrong insights. Therefore, removing or handling duplicates in your data is extremely important.
First, let’s look at how you can mark duplicate values in your DataFrame. For this, we’ll use the duplicated() method in Pandas, which returns a boolean Series indicating duplicate rows. For demonstration purposes, I’ll only use a random sample of 10 rows of the original salary dataset, of which the last two rows have been intentionally duplicated. The sampled rows are shown in the image below.
Pandas allows you to assign boolean labels to rows that are duplicates based on all columns (or a subset of columns). This can be done using the duplicated() method as shown below:
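As a tiny, made-up example (three rows, of which the last repeats the first):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Salary": [100, 90, 100],
})

# The third row repeats the first, so it is flagged as a duplicate;
# the first occurrence is not flagged (keep="first" is the default).
flags = df.duplicated()
print(flags.tolist())  # [False, False, True]
```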
When there are duplicate values, keep is used to indicate which specific duplicates to mark:
keep = "first": (default) marks all duplicates as True except for the first occurrence.
keep = "last": marks all duplicates as True except for the last occurrence.
keep = False: marks all duplicates as True.
You can keep only the rows that appear exactly once by using the (negated) boolean series as a mask for filtering the Pandas DataFrame as follows:
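Sketched with the same kind of toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Salary": [100, 90, 100],
})

# keep=False marks every member of a duplicate group as True, so
# negating the mask keeps only the rows that occur exactly once.
unique_rows = df[~df.duplicated(keep=False)]
print(unique_rows)
```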
To check for duplicates on a subset of columns, pass the list of columns as the subset argument of the duplicated() method as shown below:
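For example, checking duplicates on a single (hypothetical) Role column while the full rows differ:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat"],
    "Role": ["Engineer", "Engineer", "Manager"],
    "Salary": [100, 90, 80],
})

# Only the Role column is considered, so Bob is flagged as a duplicate
# of Ann's role even though their full rows are different.
flags = df.duplicated(subset=["Role"])
print(flags.tolist())  # [False, True, False]
```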
Filtering the DataFrame with the above boolean series produces the following output:
In addition to marking potential duplicates using boolean labels discussed above, one might also need to get rid of duplicates. To reiterate, the data I am referring to specifically for the “Handling Duplicated Data” section comprises just ten rows. This is shown below:
You can remove duplicate rows based on the values in either all columns or a subset of columns using the drop_duplicates() method as shown below:
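A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Salary": [100, 90, 100],
})

# keep="first" (the default) retains the first occurrence of each
# duplicated row and drops the rest.
deduped = df.drop_duplicates()
print(deduped)
```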
Similar to duplicated(), the keep argument is used to indicate which specific duplicates you want to keep:
keep = "first": (default) drops all duplicates except for the first occurrence.
keep = "last": drops all duplicates except for the last occurrence.
keep = False: drops all duplicates.
To drop duplicates based on the values in a subset of columns, pass the list of columns as the subset argument to the drop_duplicates() method:
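Sketched on a hypothetical Role column:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cat"],
    "Role": ["Engineer", "Engineer", "Manager"],
})

# Deduplicate on Role only: Bob is dropped because Ann already
# holds the Engineer role (keep="first").
one_per_role = df.drop_duplicates(subset=["Role"], keep="first")
print(one_per_role)
```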