Pandas Exercise for Data Scientists — Part 2 | by Avi Chawla | Jun, 2022


A set of challenging Pandas Questions

Photo by ALAN DE LA CRUZ on Unsplash

The Pandas library has long inspired Data Scientists to do amazing things with it. It is arguably the go-to tool for tabular data handling, manipulation, and processing.

Therefore, to scale your expertise, challenge your existing knowledge, and introduce you to numerous Pandas functions that are popular among Data Scientists, I am presenting Part 2 of the Pandas Exercise. You can find Part 1 of the Pandas Exercise here:

The objective is to strengthen your logical muscle and help internalize data manipulation with one of the best Python packages for data analysis.

Find the notebook with all questions for this quiz here: GitHub.

Table of Contents:

1. The cumulative sum of a column in DataFrame
2. Assign Unique IDs to every Group
3. Check if a column has NaN values
4. Append a list as a row to a DataFrame
5. Get the first row of every unique value in a column
6. Identify the source of each row in Pandas Merge
7. Filter n-largest and n-smallest values from a DataFrame
8. Map categorical data to unique integral values
9. Add prefix to every column name
10. Convert categorical columns to one hot values

As an exercise, I recommend you attempt the questions yourself and then look at the solution I have provided.

Note that the solutions I have provided here may not be the only way to solve each problem. You may come up with something different and still be correct. If that happens, do drop a comment; I’d be interested to know your approach.

Let’s begin!

Prompt: You are given a DataFrame. Your task is to generate a new column that holds the cumulative sum of the integer column.

Input and Expected Output:

Solution:

Here, we can use the cumsum() method on the given series and obtain the cumulative sum as shown below:
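A minimal sketch of this solution, assuming a small made-up DataFrame (the data and the column names col_A and cum_sum are illustrative, not taken from the original notebook):

```python
import pandas as pd

# Hypothetical input; the values and column names are assumptions.
df = pd.DataFrame({"col_A": [1, 2, 3, 4, 5]})

# Series.cumsum() returns the running total of the column.
df["cum_sum"] = df["col_A"].cumsum()

print(df)
```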

P.S. Can you also try the Cumulative Product, Cumulative Maximum, and Cumulative Minimum?

Prompt: Next, you have a DataFrame in which one column has repeating values. Your task is to generate a new series so that every group gets a unique number.

Input and Expected Output:

Below, the value “A” in col_A has been assigned the value 1 in the new series. Further, for every occurrence of “A”, the value in the group_num column is always 1.

Solution:

Here, after groupby, you can use the grouper.group_info attribute as shown below:
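The sketch below swaps in a different technique: instead of grouper.group_info, which is an internal attribute whose availability may vary across pandas versions, it uses the public ngroup() method of the GroupBy object. The sample data is made up, and adding 1 is an assumption made so that “A” maps to 1, as in the description above.

```python
import pandas as pd

# Hypothetical input with repeating values in col_A.
df = pd.DataFrame({"col_A": ["A", "B", "A", "C", "B", "A"]})

# ngroup() numbers the groups from 0 in sorted key order (the default
# for groupby); adding 1 makes the numbering start at 1.
df["group_num"] = df.groupby("col_A").ngroup() + 1

print(df)
```

Every row that shares a value in col_A receives the same number in group_num.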

Prompt: As the next problem, your task is to determine whether there is a NaN value present in a column or not. You don’t need to find the number of NaN values or anything, just True or False whether there are one or more NaN values in the column.

Input and Expected Output:

Solution:

Here, we can use the hasnans attribute of the series to get the desired result, as demonstrated below:
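A minimal sketch, assuming a made-up DataFrame in which only col_A contains a missing value (the data and variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical input: col_A has one NaN, col_B has none.
df = pd.DataFrame({
    "col_A": [1.0, 2.0, np.nan, 4.0],
    "col_B": [1.0, 2.0, 3.0, 4.0],
})

# hasnans is an attribute, so it is read without parentheses.
has_nan_a = df["col_A"].hasnans
has_nan_b = df["col_B"].hasnans

print(has_nan_a, has_nan_b)
```

An equivalent check that should give the same answer is df["col_A"].isna().any().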

Prompt: Everyone knows how to push elements to a Python list (using the append method on the list). However, have you ever appended a new row to a DataFrame? For the next task, you are given a DataFrame and a list that should be appended as a new row in the DataFrame.

Input and Expected Output:

Solution:

Here, we can use loc and assign the new row to a new index of the DataFrame as shown below:
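A minimal sketch, assuming a made-up DataFrame with the default integer index (the data and the name new_row are illustrative):

```python
import pandas as pd

# Hypothetical input with a default RangeIndex (0, 1, ...).
df = pd.DataFrame({"col_A": [1, 2], "col_B": ["a", "b"]})
new_row = [3, "c"]

# Assumption: the index is 0..n-1, so len(df) is the next unused label.
# Assigning to a label that does not exist yet adds a new row.
df.loc[len(df)] = new_row

print(df)
```

If the index is not the default one, len(df) may already be in use as a label, in which case this assignment would overwrite that row instead of adding one.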

Prompt: Given a DataFrame, your task is to get the entire row of the first occurrence of every unique element in the column col_A.

Input and Expected Output:

Solution:

Here, we will use GroupBy on the given column and get the first row as shown below:
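A minimal sketch, assuming a made-up DataFrame (the data is illustrative; as_index=False is a choice made here so that col_A stays a regular column):

```python
import pandas as pd

# Hypothetical input with repeating values in col_A.
df = pd.DataFrame({
    "col_A": ["A", "B", "A", "C", "B"],
    "col_B": [1, 2, 3, 4, 5],
})

# first() keeps, per group, the first non-null entry of each column.
first_rows = df.groupby("col_A", as_index=False).first()

print(first_rows)
```

Because first() skips missing values column by column, the result can mix values from different rows when the data contains NaNs. In that situation, df.drop_duplicates("col_A") may be closer to a literal "first row" of each group.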

Prompt: Next, consider that you have two DataFrames. Your task is to join them so that the output contains a column that denotes the source of the row from the original DataFrame.

Input and Expected Output:

Solution:

We can use the merge method and pass the indicator argument as True, as shown below:
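A minimal sketch, assuming two made-up DataFrames that share the key col_A. The outer join is an assumption made here so that rows from either side survive; the original may use a different join type.

```python
import pandas as pd

# Hypothetical inputs sharing the key column col_A.
df1 = pd.DataFrame({"col_A": [1, 2, 3], "col_B": ["a", "b", "c"]})
df2 = pd.DataFrame({"col_A": [2, 3, 4], "col_C": ["x", "y", "z"]})

# indicator=True adds a "_merge" column whose values are
# "left_only", "right_only", or "both".
merged = df1.merge(df2, on="col_A", how="outer", indicator=True)

print(merged)
```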

Prompt: In this exercise, you are given a DataFrame. Your task is to get the entire rows whose values in col_B are among the top-k entries of the column.

Input and Expected Output:

Solution:

We can use the nlargest method and pass the number of top values we need from the specified column:
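A minimal sketch, assuming a made-up DataFrame and k = 2 (the data and the names k and top_k are illustrative):

```python
import pandas as pd

# Hypothetical input; col_B holds the values to rank.
df = pd.DataFrame({
    "col_A": ["a", "b", "c", "d", "e"],
    "col_B": [10, 50, 20, 40, 30],
})

k = 2

# nlargest returns the k complete rows with the largest col_B values,
# ordered from largest to smallest.
top_k = df.nlargest(k, "col_B")

print(top_k)
```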

Similarly, you can use the nsmallest method to get the rows with the k smallest values in the column.

Prompt: Next, given a DataFrame, you need to map every unique entry of a column to a unique integer identifier.

Input and Expected Output:

Solution:

Using the pd.factorize function, you can generate a new series that holds the integer-based encodings of the given column.
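A minimal sketch, assuming a made-up DataFrame (the data and the column name col_A_encoded are illustrative). Note that pd.factorize returns two objects: the integer codes and the unique values they stand for.

```python
import pandas as pd

# Hypothetical input with a categorical column.
df = pd.DataFrame({"col_A": ["A", "B", "A", "C", "B"]})

# codes holds one integer per row, numbered in order of first appearance;
# uniques holds the distinct values, so uniques[code] recovers the label.
codes, uniques = pd.factorize(df["col_A"])
df["col_A_encoded"] = codes

print(df)
print(uniques)
```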

Prompt: Similar to earlier tasks, you are given the same DataFrame. Your job is to rename all the columns and add “pre_” as a prefix to all of them.

Input and Expected Output:

Solution:

Here, we can use the add_prefix method and pass the string we want as a prefix in all column names as shown below:
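A minimal sketch, assuming a made-up DataFrame (the data and the name df_prefixed are illustrative):

```python
import pandas as pd

# Hypothetical input.
df = pd.DataFrame({"col_A": [1, 2], "col_B": [3, 4]})

# add_prefix returns a new DataFrame; the original column names
# in df are left unchanged.
df_prefixed = df.add_prefix("pre_")

print(df_prefixed.columns.tolist())
```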

Prompt: Lastly, you are given a categorical column in a DataFrame. You need to convert it to one-hot values.

Input and Expected Output:

Solution:

Here, we can use the pd.get_dummies function and pass the series as an argument, as shown below:
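A minimal sketch, assuming a made-up DataFrame with one categorical column (the data and the name one_hot are illustrative):

```python
import pandas as pd

# Hypothetical input with a categorical column.
df = pd.DataFrame({"col_A": ["A", "B", "A", "C"]})

# One indicator column is created per distinct value in col_A.
one_hot = pd.get_dummies(df["col_A"])

print(one_hot)
```

Depending on the pandas version, the indicator columns may hold booleans or 0/1 integers; passing dtype=int to pd.get_dummies should give integers if that is what you need.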

