Powerful One-liners in Pandas Every Data Scientist Should Know


Things you can do in one line using Pandas

Photo by KirstenMarie on Unsplash

Training data-driven machine learning models has never been easier than it is today. For instance, suppose you are training a vanilla neural network. Adjusting the number of hidden layers and their dimensions, tweaking the hyperparameters, or changing the loss function can all be done with a slight modification to the model definition or its optimizer.

On one hand, this is advantageous, as it reduces the heavy lifting of designing architectures from scratch. On the other hand, it has often led machine learning practitioners and researchers to neglect the importance of data visualization and analysis, training deep models directly without establishing a clear understanding of their data.

Therefore, in this post, I would like to introduce you to a handful of essential and powerful Pandas one-liners for tabular data that will help you better understand your data and, consequently (and hopefully), design and build better machine learning models.

For this post, I will experiment with a dummy dataset of one thousand employees that I created myself in Python. The image below gives an overview of the dataset we are working with.

First five rows of the DataFrame (Image by author)

The code block below demonstrates my implementation:
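
As a minimal sketch, the dataset could be created along the lines below. All column names except “Company Name” and “Location” (which appear later in the post), as well as the specific companies, roles, and salary range, are assumptions for illustration:

    # Minimal sketch: build a dummy employee dataset with 1,000
    # random rows. All values, and all column names other than
    # "Company Name" and "Location", are assumed for illustration.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed=42)
    n = 1000

    df = pd.DataFrame({
        "Name": [f"Employee {i}" for i in range(n)],
        "Company Name": rng.choice(["Amazon", "Google", "Infosys", "Microsoft"], n),
        "Job Title": rng.choice(["Analyst", "Consultant", "Data Scientist", "Engineer"], n),
        "Salary": rng.integers(30_000, 300_000, size=n),
        "Location": rng.choice(["Bangalore", "Gurgaon", "London", "New York"], n),
    })

    df.head()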

Next, let’s discuss some popular functions available in Pandas for building a meaningful understanding of the available data.

#1 n-largest values in a series

Say we want to start off by finding the top-n highest-paid roles in this dataset. You can do this using the nlargest() method in Pandas, which returns the first n rows with the largest values in the given column(s), sorted in descending order.

Note that nlargest() returns all the columns of the DataFrame, not just the column(s) specified for ordering; the remaining columns simply play no part in the ordering. The code snippet below depicts the use of the nlargest() method on our DataFrame.
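
A sketch of the call, assuming the “Salary” column from the dummy dataset above:

    # 5 rows with the largest salaries; all columns come back,
    # but only "Salary" is used for the ordering.
    df.nlargest(5, columns="Salary")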

The output of the nlargest method (Image by Author)

When duplicate values exist, we need to specify which particular row(s) we want in the final output. This is done using the keep argument that can take the following values:

  1. keep = "first": prioritizes the first occurrence.
  2. keep = "last": prioritizes the last occurrence.
  3. keep = "all": Does not drop any duplicates, even if it means selecting more than n items (like in the image above).

It is often assumed that nlargest() is precisely equivalent to using the sort_values() method as follows:
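
The seemingly equivalent sort_values() version would look something like this (again assuming the “Salary” column):

    # Sort by salary in descending order and take the first 5 rows.
    # Unlike nlargest(keep="all"), this can never return more than
    # 5 rows, even when values tie at the cutoff.
    df.sort_values("Salary", ascending=False).head(5)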

Output of sort_values method (Image by Author)

However, the keep argument used in nlargest() makes all the difference. Considering the example above, nlargest() with keep="all" returns potential duplicates as well. This cannot be done with the sort_values() method.

#2 n-smallest values in a series

Similar to the nlargest() method discussed above, you can find the rows corresponding to the lowest-n values using the nsmallest() method in Pandas. This method returns the first n rows with the smallest values in the given column(s), arranged in ascending order. The arguments are the same as those of the nlargest() method. The code snippet below depicts the use of the nsmallest() method on our DataFrame.
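
A sketch of the call, mirroring the nlargest() example (the “Salary” column is assumed):

    # 5 rows with the smallest salaries, in ascending order.
    df.nsmallest(5, columns="Salary")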

The output of the nsmallest method (Image by Author)

#3 CrossTabs

Crosstab allows you to compute a cross-tabulation of two (or more) columns/series and, by default, returns the frequency of each combination. In other words, crosstab() takes one column/list and displays its unique values as the row index, then takes another column/list and displays its unique values as the column headers. The values in the individual cells are computed using an aggregation function; by default, they indicate the co-occurrence frequency.

Say, for instance, we wish to compute the number of employees working from each location within every company. This can be done as follows:
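
A sketch of the call (“Company Name” and “Location” are the two columns used in this post):

    # Count how many employees each company has at each location.
    pd.crosstab(df["Company Name"], df["Location"])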

The output of Crosstab to compute the frequency of co-occurrence (Image by Author)

Since raw numbers in a crosstab can be hard to interpret (and to make the result more visually appealing), we can generate a heatmap from the crosstab, as shown below:
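
One common way to do this is with seaborn, which is an assumption on my part rather than necessarily the plotting library used in the original snippet:

    # Render the crosstab as an annotated heatmap (seaborn assumed).
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(pd.crosstab(df["Company Name"], df["Location"]),
                annot=True, fmt="d", cmap="Blues")
    plt.show()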

Heatmap depicting the co-occurrence dataframe (Image by author)

If you wish to aggregate over some column other than the ones that make up the index and the column headers, you can do so by passing that column to the values argument of crosstab(), along with an aggregation function via the aggfunc argument, as shown below:
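
For example, to compute and visualize the average salary per company–location pair (the “Salary” column and the seaborn plotting are my assumptions):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Average salary per (company, location) pair; when values is
    # passed, crosstab() requires an explicit aggfunc.
    avg_salary = pd.crosstab(df["Company Name"], df["Location"],
                             values=df["Salary"], aggfunc="mean")

    # Visualized as a heatmap, as in the figure below (seaborn assumed).
    sns.heatmap(avg_salary, annot=True, fmt=".0f", cmap="Blues")
    plt.show()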

Heatmap depicting the average salary (Image by author)

#4 Pivot Table

Pivot tables are a commonly used data analysis tool in Excel. Similar to crosstabs discussed above, pivot tables in Pandas provide a way to cross-tabulate your data.

Although the two share numerous similarities and are conceptually the same in the context of Pandas, a few implementation differences set them apart. The code snippet below demonstrates the use of the pivot_table() method to find the frequency of co-occurrence between “Company Name” and “Location”:
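
A sketch of the call; counting co-occurrences via aggfunc="size" is my choice here, not necessarily the one in the original snippet:

    # Count rows per (company, location) pair with a pivot table;
    # aggfunc="size" counts rows, so no values column is needed.
    df.pivot_table(index="Company Name", columns="Location",
                   aggfunc="size", fill_value=0)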

The output of the pivot table to compute the frequency of co-occurrence (Image by Author)

Similar to what we did with the crosstab, we can create a heatmap to make the result more visually appealing and more interpretable, as shown in the code snippet below:
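
As before, a seaborn-based sketch:

    # Heatmap of the pivot table above (seaborn assumed).
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(df.pivot_table(index="Company Name", columns="Location",
                               aggfunc="size", fill_value=0),
                annot=True, fmt="d", cmap="Blues")
    plt.show()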

Heatmap depicting the co-occurrence dataframe (Image by author)

#5 Handling Duplicated Data

In addition to regular data analysis, appropriately handling duplicate values in your data plays a vital role in building your data pipeline. One major caveat of having duplicates is that they take up unnecessary storage space and slow down computation by consuming resources. Furthermore, duplicate data can skew analysis results, leading us to draw wrong insights. Therefore, removing or otherwise handling duplicates in your data is extremely important.

First, let’s look at how you can mark duplicate values in your DataFrame. For this, we’ll use the duplicated() method in Pandas, which returns a boolean Series indicating duplicate rows. For demonstration purposes, I’ll use a random sample of only 10 rows of the original salary dataset, of which the last two rows have been intentionally duplicated. The sampled rows are shown in the image below.

A Dataframe with duplicates (Image by author)

Pandas allows you to assign boolean labels to rows that are duplicates based on all columns (or a subset of columns). This can be done using the duplicated() method, as shown below:
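
A sketch of both steps; the construction of the 10-row sample is my guess at what was done, not the original code:

    # Hypothetical reconstruction of the 10-row sample: 8 random
    # rows plus the last 2 repeated, so the final two rows are
    # exact duplicates.
    sample = df.sample(8, random_state=0)
    df_sample = pd.concat([sample, sample.tail(2)])

    # Boolean Series: True for rows that duplicate an earlier row
    # (keep="first" is the default).
    df_sample.duplicated()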

When there are duplicate values, keep is used to indicate which specific duplicates to mark.

  1. keep = "first": (Default) Marks all duplicates as True except for the first occurrence.
  2. keep = "last": Marks all duplicates as True except for the last occurrence.
  3. keep = False: Marks all duplicates as True.

You can filter out the duplicate rows by using the negated boolean series as a mask on the Pandas DataFrame, as follows:
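
A sketch of the filter:

    # Keep only the rows not flagged as duplicates.
    df_sample[~df_sample.duplicated()]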

A filtered Dataframe with no duplicates (Image by author)

To check for duplicates on a subset of columns, pass the list of columns as the subset argument of the duplicated() method, as shown below:
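
The particular pair of columns used here is my assumption:

    # Mark duplicates considering only two columns (assumed pair).
    df_sample.duplicated(subset=["Company Name", "Location"])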

Filtering the DataFrame with this boolean series produces the output shown below:

A filtered Dataframe with duplicates considering two columns (Image by author)

In addition to marking potential duplicates with the boolean labels discussed above, one might also need to get rid of duplicates altogether. To reiterate, the data I am referring to throughout this “Handling Duplicated Data” section comprises just ten rows, shown below:

A Dataframe with duplicates (Image by author)

You can remove duplicate rows based on the values in either all columns or a subset of columns using the drop_duplicates() method, as shown below:
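
A sketch of the call:

    # Drop duplicate rows across all columns, keeping the first
    # occurrence by default.
    df_sample.drop_duplicates()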

A DataFrame after dropping duplicate rows (Image by author)

Similar to duplicated(), the keep argument is used to indicate which specific duplicates you want to keep.

  1. keep = "first": (Default) Drops all duplicates except for the first occurrence.
  2. keep = "last": Drops all duplicates except for the last occurrence.
  3. keep = False: Drops all duplicates.

To drop duplicates based on the values in a subset of columns, pass the list of columns as the subset argument to the drop_duplicates() method:
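
Again assuming the same pair of columns as in the duplicated() example:

    # Drop rows that duplicate an earlier row on the two columns.
    df_sample.drop_duplicates(subset=["Company Name", "Location"])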

A DataFrame after dropping duplicate rows considering two columns (Image by author)


