5 String-Based Filtering Methods Every Pandas User Should Know | by Avi Chawla | Aug, 2022

By Jessie Hobb On Aug 5, 2022

Before I proceed with the popular methods in Pandas to filter data on string values, let’s understand how you can identify a column with a string data type.

In Pandas, the data type of a string column is represented as object. To determine the data type, you can use the dtype attribute of a series as follows:

Here, you should note that even if a single value in a series is a string, the whole column will be interpreted as a string-type column. For instance, let’s change the first value in col2 from 1 to “1".

This time, the data type of col2 is object rather than int64— depicting a string data type.

Next, let’s proceed with understanding methods that you can use to filter DataFrames on a column with object data type.

#1 Filter based on a single categorical value

First, say you want to filter all the rows whose value in the string column belongs to a single categorical value in the column. This is demonstrated in the image below:

The above filtering is implemented below:

The above approach filters all the rows where the value in col1 is “A”.

This can also be implemented using the query() method as shown below:

Note: While filtering using the query() method on a string column, you should enclose the filter value in single quotes as demonstrated above.

#2 Filter based on multiple categorical values

Similar to the above filtering, if you want to filter multiple values in a single go, you can do so in three ways.

The above condition states that the value in col1 should either be “A” or “B”.

The second way is to use the isin() method as demonstrated below:

The isin() method used above accepts a list of values to filter.

Lastly, we can use the query() method as shown below:

The isin() method used above accepts a list of filter values. On the other hand, the query() method evaluates a string expression to filter rows from a DataFrame.

#3 Filter based on the length of string

Here, say you want to filter all the rows from a DataFrame where the length of the strings in a column is greater/less than a threshold.

Invoking the len() method on a series lets you compute the length of individual entries, which can then be used to filter the rows according to a threshold.

Below, we filter all the strings from col1 whose length is greater than 4.

Before executing a method on an object column, values should be retrieved as string type using the str attribute, over which you can run a range of string methods available in python, such as strip(), isupper(), upper(), len() etc.

#4 Filter based on the presence of a substring

Next, say you want to extract rows for which the values in the string column contain a particular substring.

There are three widely used methods for this.

Match at the beginning of the string

As the name suggests, this method will return a row only if the substring matches the beginning of the string-value column.

Say you want to find all strings which begin with the substring “Jo”. We will use the startswith() method demonstrated below. Also, recall from the previous filtering method (#3), we should first convert the object column to a string using the str attribute.

If your column has NaN values, you should specify nan=False in the startswith() method, otherwise, it will raise an error

The error block is shown below:

Specifying nan=False ignores NaN values:

Match at the end of the string

Matching at the end of the string has a similar syntax to startswith(). Here, we use the endswith() method as shown below:

Note: Both startswith() and endswith() are case-sensitive methods.

Match anywhere in the string

In contrast to the startswith() and endswith() method that only match a substring at the start and the end of the string, respectively, the contains() method can find potential matches anywhere within the string-valued column.

By default, the contains() method performs case-sensitive matches. However, it can perform case-insensitive matches as well by passing the case=False argument as shown below:

#5 Filter based on the type of characters in a string

This type of filtering is based on the type of characters present in the string, such as:

- Filter if all characters are upper-case   : isupper()
- Filter if all characters are lower-case   : islower()
- Filter if all characters are alphabetic   : isalpha()
- Filter if all characters are numeric      : isnumeric()
- Filter if all characters are digits       : isdigit()
- Filter if all characters are decimal      : isdecimal()
- Filter if all characters are whitespace   : isspace()
- Filter if all characters are titlecase    : istitle()
- Filter if all characters are alphanumeric : isalnum()

I have demonstrated a couple of these methods below.

Filter alphanumeric strings from the DataFrame:

Filter numeric strings from the DataFrame: