The Only 30 Methods You Should Master To Become A Pandas Pro | by Avi Chawla | Oct, 2022

By Jessie Hobb On Oct 11, 2022

After using pandas for over three years, here are the 30 methods I have used almost all the time

Photo by Glenn Carstens-Peters on Unsplash

Pandas is undoubtedly one of the best libraries ever built in Python for tabular data-wrangling and processing tasks.

Because Pandas is open source, numerous developers from different parts of the world have contributed to its development and brought it to where it is today, with hundreds of methods for various tasks.

However, if you are a newbie trying to get a firm hold on the Pandas library, things can appear daunting and overwhelming at first if you start with Pandas’ Official Documentation.

The list of topics is shown below:

List of Topics in Official Pandas API Documentation (Image by Author) (Source: here)

Having been there myself, I wrote this blog to help you get started with Pandas.

In other words, I will reflect on my 3+ years of experience using Pandas and share the 30 specific methods that I have used almost all the time.

You can find the code for this article here.

Let’s begin 🚀!

Of course, if you want to use the Pandas library, you first have to import it. The widely adopted convention is to import pandas under the alias pd.
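The import with the conventional alias looks like this (printing the version is just an optional sanity check, not something the rest of the article depends on):

```python
import pandas as pd  # "pd" is the conventional alias

# Optional: confirm the import worked by printing the installed version.
print(pd.__version__)
```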

CSVs are typically the most prevalent file format to read Pandas DataFrames from.

You can use the pd.read_csv() method to create a Pandas DataFrame:

We can verify the type of the object created using Python's built-in type() function.
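Here is a minimal sketch of reading a CSV. To keep it self-contained, I read from an in-memory string instead of a file on disk; the data and column names are made up for illustration. With a real file you would pass its path instead.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; pd.read_csv also accepts a file path.
csv_text = "col1,col2,col3\n1,2,A\n3,4,B\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(type(df))  # the object is a pandas DataFrame
```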

Just as CSVs are a common source to read a DataFrame from, they are also widely used to store a DataFrame.

Use the df.to_csv() method as shown below:

The separator (sep) indicates the column delimiter and index=False instructs Pandas to NOT write the index of the DataFrame in the CSV file.
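A small sketch of both arguments in action. The pipe separator and the data are my own choices for illustration. I also skip the file path here, in which case to_csv() returns the CSV text rather than writing a file.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 3], "col2": [2, 4]})

# With no path given, to_csv returns the CSV text instead of writing a file.
# Pass a path (e.g. a hypothetical "file.csv") as the first argument to write to disk.
csv_text = df.to_csv(sep="|", index=False)

print(csv_text)
```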

To create a Pandas DataFrame, use the pd.DataFrame() constructor:

From a list of lists

One popular way is to convert a given list of lists to a DataFrame:
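For example, assuming some made-up data and column names, each inner list becomes one row:

```python
import pandas as pd

# Each inner list is one row of the DataFrame.
data = [[1, 2, "A"],
        [3, 4, "B"]]

df = pd.DataFrame(data, columns=["col1", "col2", "col3"])
print(df)
```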

From a Dictionary

Another popular way is to convert a Python dictionary to a DataFrame:
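A sketch with illustrative data: here the dictionary keys become the column headers and the values become the column contents.

```python
import pandas as pd

# Keys are column names; each value is the list of entries in that column.
data = {"col1": [1, 3],
        "col2": [2, 4],
        "col3": ["A", "B"]}

df = pd.DataFrame(data)
print(df)
```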

You can read more about creating a DataFrame here.

A DataFrame is essentially a matrix with column headers. Therefore, it has a specific number of rows and columns.

You can print the dimensions with the shape attribute as follows:

Here, the first element of the tuple (2) is the number of rows and the second element (3) is the number of columns.
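To make that concrete, here is a made-up DataFrame with two rows and three columns:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"],
                   [3, 4, "B"]],
                  columns=["col1", "col2", "col3"])

# shape is an attribute, so there are no parentheses after it.
print(df.shape)  # (2, 3)
```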

Typically, in real-world datasets, you would have many rows.

In such situations, one is usually interested in viewing just the first n rows of the DataFrame.

You can use the df.head(n) method to print the first n rows:
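A quick sketch on a ten-row DataFrame I made up for the purpose:

```python
import pandas as pd

df = pd.DataFrame({"col1": range(10),
                   "col2": list("abcdefghij")})

# Keep only the first 3 rows; calling head() with no argument returns 5.
first_three = df.head(3)
print(first_three)
```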

Pandas assigns an appropriate data type to every column in the DataFrame.

You can print the datatype of all columns using the dtypes attribute:
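For instance, with one integer, one float and one string column (illustrative data). The exact names printed for integer and string columns can vary with the platform and Pandas version, so I do not show the output here.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2],
                   "col2": [1.5, 2.5],
                   "col3": ["A", "B"]})

# One dtype per column, returned as a Series indexed by column name.
print(df.dtypes)
```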

If you want to change the datatype of a column, you can use the astype() method as follows:
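A minimal sketch that converts an integer column to float (the column is made up). Note that astype() returns a new object, so I assign the result back.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

# astype returns a converted copy; assign it back to change the column.
df["col1"] = df["col1"].astype("float64")

print(df.dtypes)
```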

Method 1

The first method (df.info()) is used to print the missing-value stats and the datatypes.
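A sketch on a small made-up DataFrame with one missing value per column. The printed summary lists the non-null count and dtype of each column.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, None, 3.0],
                   "col2": ["A", "B", None]})

# info() prints its summary directly and returns None.
df.info()
```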

Method 2

This is relatively more descriptive and prints standard statistics like mean, standard deviation, maximum etc. of every numeric-valued column.

The method is df.describe().
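For example, on two numeric columns with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4],
                   "col2": [10.0, 20.0, 30.0, 40.0]})

# The result is itself a DataFrame: one row per statistic, one column per numeric column.
stats = df.describe()
print(stats)
```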

Missing data is almost inevitable in real-world datasets.

Here, you can use the df.fillna() method to replace the missing values with a specific value.
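A minimal sketch, assuming we want to replace missing values with 0 (the fill value and data are my choice for illustration):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.0, None, 3.0]})

# fillna returns a new DataFrame; the original keeps its missing value.
filled = df.fillna(0)
print(filled)
```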

Read more about handling missing data in my previous blog:

If you want to merge two DataFrames with a joining key, use the pd.merge() method:
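Here is a sketch with two made-up DataFrames that share a column named key. I use an inner join, which keeps only the keys present in both; the how argument also accepts "left", "right" and "outer".

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["A", "B", "D"], "y": [10, 20, 40]})

# Inner join on "key": only keys found in both DataFrames survive.
merged = pd.merge(left, right, on="key", how="inner")
print(merged)
```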

Sorting is another typical operation that Data Scientists use to order a DataFrame.

You can use the df.sort_values() method to sort a DataFrame.
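For example, sorting an illustrative DataFrame on col1 in both directions:

```python
import pandas as pd

df = pd.DataFrame({"col1": [3, 1, 2],
                   "col2": ["c", "a", "b"]})

ascending = df.sort_values("col1")                    # smallest first (default)
descending = df.sort_values("col1", ascending=False)  # largest first

print(ascending)
print(descending)
```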

To group a DataFrame and perform aggregations, use the groupby() method in Pandas, as shown below:
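A sketch that groups made-up data by one column and computes two aggregations per group. The output column names total and smallest are my own.

```python
import pandas as pd

df = pd.DataFrame({"group": ["A", "B", "A", "B"],
                   "value": [1, 2, 3, 4]})

# Named aggregation: new_column=(source_column, aggregation)
result = df.groupby("group").agg(
    total=("value", "sum"),
    smallest=("value", "min"),
)
print(result)
```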

If you want to rename the column headers, use the df.rename() method, as demonstrated below:
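A minimal sketch: pass a dictionary that maps old names to new ones (the names here are illustrative).

```python
import pandas as pd

df = pd.DataFrame({"col1": [1], "col2": [2]})

# Columns not mentioned in the mapping keep their names.
renamed = df.rename(columns={"col1": "col_A"})
print(renamed)
```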

If you want to delete a column, use the df.drop() method:
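For example, dropping col2 from a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

# drop returns a new DataFrame without the listed columns.
dropped = df.drop(columns=["col2"])
print(dropped)
```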

The two widely used approaches to add new columns are:

Method 1

You can use the assignment operator to add a new column:
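A sketch where the new column is the sum of two existing ones (data and names are illustrative). This modifies the DataFrame in place.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Assigning to a new column name adds the column to df itself.
df["col3"] = df["col1"] + df["col2"]
print(df)
```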

Method 2

Alternatively, you can also use the df.assign() method as follows:
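The same column as a sketch with assign(). The difference worth knowing is that assign() returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# The keyword name becomes the new column's name.
new_df = df.assign(col3=df["col1"] + df["col2"])
print(new_df)
```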

There are various ways to filter a DataFrame based on conditions.

Method 1: Boolean Filtering

Here, a row is selected if the condition on that row evaluates to True.

For instance, with the condition col2 > 5, a row is selected only if its value in col2 is greater than 5.
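A sketch of that condition on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "C"],
                   "col2": [2, 8, 6]})

# The comparison yields one True/False per row...
mask = df["col2"] > 5

# ...and indexing with it keeps only the rows marked True.
selected = df[mask]
print(selected)
```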

The isin() method is used to select rows whose value belongs to a list of values.
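For example, keeping the rows whose col1 is either "A" or "C" (illustrative values):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "C"],
                   "col2": [2, 8, 6]})

# isin builds the boolean mask; indexing with it selects the rows.
selected = df[df["col1"].isin(["A", "C"])]
print(selected)
```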

You can read about string-based filtering in my previous blog:

Method 2: Getting a Column

You can also select an entire column as follows:
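A quick sketch. A single column name gives back a Series, while a list of names gives back a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "C"],
                   "col2": [2, 8, 6]})

column = df["col1"]      # a Series
sub_df = df[["col1"]]    # a one-column DataFrame

print(column)
```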

Method 3: Selecting by Label

In label-based selection, every label asked for must be in the index of the DataFrame.

Integers are valid labels too, but they refer to the label and not the position.

Consider the following DataFrame.

We use the df.loc[] accessor for label-based selection.

However, in df.loc[], you are not allowed to use position to filter the DataFrame, as shown below:
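Here is a sketch, assuming a DataFrame whose index labels are the strings "a", "b" and "c" (my own example, not necessarily the one the text refers to). Labels work with df.loc[], but the position 0 is not a label in this index, so it raises a KeyError.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]},
                  index=["a", "b", "c"])

row_b = df.loc["b"]                  # one row, selected by its label
subset = df.loc[["a", "c"], "col2"]  # two rows of one column

# 0 is a position here, not a label, so label-based lookup fails.
position_error = None
try:
    df.loc[0]
except KeyError as err:
    position_error = err

print(row_b)
print(subset)
print("KeyError raised:", position_error is not None)
```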

To achieve the above, you should use position-based selection using df.iloc[].

Method 4: Selecting by Position
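A sketch of df.iloc[] on a made-up DataFrame with string index labels. Positions start at 0, and slices exclude the end position, just like Python lists.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]},
                  index=["a", "b", "c"])

first_row = df.iloc[0]     # row at position 0, whatever its label
block = df.iloc[0:2, 1]    # rows at positions 0 and 1, column at position 1

print(first_row)
print(block)
```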

To print all the distinct values in a column, use the unique() method.

If you want to print the number of unique values, use nunique() instead.
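Both methods side by side on an illustrative column:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "A", "C", "B"]})

values = df["col1"].unique()   # distinct values, in order of first appearance
count = df["col1"].nunique()   # how many distinct values there are

print(values)
print(count)  # 3
```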

If you want to apply a function to every column (or, with axis=1, every row) of a DataFrame, use the apply() method as demonstrated below:
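A sketch of apply() called on a whole DataFrame. By default the function receives one column at a time; with axis=1 it receives one row at a time. The helper add_cols and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Default (axis=0): the function is called once per column.
col_max = df.apply(max)

# axis=1: the function is called once per row.
def add_cols(row):
    return row["col1"] + row["col2"]

df["col3"] = df.apply(add_cols, axis=1)
print(df)
```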

You can also apply a function to a single column as follows:
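On a single column (a Series), apply() calls the function once per value. A minimal sketch with a hypothetical square helper:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

def square(x):
    return x ** 2

# The function is applied to each value in the column.
df["col1_squared"] = df["col1"].apply(square)
print(df)
```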

You can mark all the repeated rows using the df.duplicated() method:

With keep=False, every row that has a duplicate is marked as True, including its first occurrence.
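A sketch on made-up data where the first and third rows are identical. I also show the default (keep="first") for comparison, which leaves the first occurrence unmarked.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 1],
                   "col2": ["A", "B", "A"]})

all_flags = df.duplicated(keep=False)  # marks every copy of a repeated row
default_flags = df.duplicated()        # marks only the later copies

print(all_flags)
print(default_flags)
```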

Further, you can drop the duplicated rows using the df.drop_duplicates() method as follows:

By default, the first occurrence of each duplicated row is preserved.
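For example, with the same kind of illustrative data, the repeated third row is dropped and the first one stays:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 1],
                   "col2": ["A", "B", "A"]})

# Default keep="first": later copies are removed, original index labels are kept.
deduped = df.drop_duplicates()
print(deduped)
```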

To find the frequency of each unique value in a column, use the value_counts() method:
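A sketch on an illustrative column. The result is a Series sorted from the most to the least frequent value.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "A", "C", "A", "B"]})

counts = df["col1"].value_counts()
print(counts)
```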

To reset the index of the DataFrame, use the df.reset_index() method:

To drop the old index, pass drop=True as an argument to the above method:
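Both variants as a sketch, assuming a DataFrame whose index is 5, 7, 9 (my own example). Without drop=True, the old index is kept as a new column named index.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30]}, index=[5, 7, 9])

with_old = df.reset_index()              # old index becomes a column called "index"
without_old = df.reset_index(drop=True)  # old index is discarded

print(with_old)
print(without_old)
```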

To return the frequency of each combination of values across two columns, use the pd.crosstab() method:
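A sketch with two made-up categorical columns. Each cell of the result counts how many rows have that combination of values.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "A", "B", "A"],
                   "col2": ["X", "X", "Y", "Y", "X"]})

# Rows come from col1, columns from col2, cells hold the counts.
table = pd.crosstab(df["col1"], df["col2"])
print(table)
```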

Pivot tables are a commonly used data analysis tool in Excel. Similar to crosstabs discussed above, pivot tables in Pandas provide a way to cross-tabulate your data.

Consider the DataFrame below:

With the pd.pivot_table() method, you can convert the column entries to column headers:
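Here is a sketch on a DataFrame I made up for illustration (the column names Name, Subject and Marks are assumptions, not necessarily the original example). The entries of Subject become the column headers. I pass aggfunc="sum" explicitly; if you leave it out, Pandas uses the mean.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "Subject": ["Maths", "Science", "Maths", "Science"],
    "Marks": [80, 90, 70, 85],
})

# One row per Name, one column per Subject, cells filled from Marks.
pivot = pd.pivot_table(df, index="Name", columns="Subject",
                       values="Marks", aggfunc="sum")
print(pivot)
```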