5 String-Based Filtering Methods Every Pandas User Should Know | by Avi Chawla | Aug, 2022
Before I proceed with the popular methods in Pandas to filter data on string values, let’s understand how you can identify a column with a string data type.
In Pandas, a string column is by default reported with the object data type. To determine the data type, you can use the dtype attribute of a series.
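As a minimal sketch, assume a small illustrative DataFrame with a string column col1 and an integer column col2 (the values are made up for demonstration and are not taken from the original article):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col1": ["A", "B", "C"], "col2": [1, 2, 3]})

# dtype reports the data type of a single column (a Series)
col1_dtype = df["col1"].dtype  # object on the pandas versions current when this
                               # article was written; newer releases may report
                               # a dedicated string dtype instead
col2_dtype = df["col2"].dtype  # int64

print(col1_dtype, col2_dtype)
```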
Here, you should note that even if a single value in a series is a string, the whole column is stored with the object data type, because Pandas falls back to generic Python objects when the values are mixed. For instance, let’s change the first value in col2 from 1 to “1”.
This time, the data type of col2 is object rather than int64, which is how Pandas reports a column that holds strings.
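The change in data type can be checked with a short sketch like the one below. The DataFrame is again a made-up example, and col2 is rebuilt from a list rather than edited in place to keep the illustration simple:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col1": ["A", "B", "C"], "col2": [1, 2, 3]})
before = df["col2"].dtype  # int64

# Rebuild col2 with its first value replaced by the string "1"
df["col2"] = ["1", 2, 3]
after = df["col2"].dtype  # object, since the column now mixes str and int

print(before, after)
```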
Next, let’s proceed with understanding methods that you can use to filter DataFrames on a column with object data type.
#1 Filter based on a single categorical value
First, say you want to keep only the rows whose value in the string column equals a single categorical value. You can do this by comparing the column against that value and indexing the DataFrame with the resulting boolean mask.
For instance, comparing col1 against “A” filters all the rows where the value in col1 is “A”.
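A minimal sketch of this boolean-mask filter, assuming a made-up DataFrame with columns col1 and col2:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col1": ["A", "B", "A", "C"], "col2": [1, 2, 3, 4]})

# The comparison yields a boolean Series; indexing with it keeps the True rows
mask = df["col1"] == "A"
filtered = df[mask]

print(filtered)
```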
This can also be implemented using the query() method, which takes the filter condition as a string expression.
Note: While filtering using the query() method on a string column, you should enclose the filter value in quotes inside the expression, for example single quotes inside a double-quoted expression.
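The same filter written with query() might look like the sketch below (the DataFrame is an assumed example):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col1": ["A", "B", "A", "C"], "col2": [1, 2, 3, 4]})

# The whole condition is a string; the filter value 'A' is quoted inside it
filtered = df.query("col1 == 'A'")

print(filtered)
```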
#2 Filter based on multiple categorical values
Similar to the above filtering, if you want to filter multiple values in a single go, you can do so in three ways.
- The first way is to combine one equality condition per value with the OR operator (|), so that the combined condition states that the value in col1 should either be “A” or “B”.
- The second way is to use the isin() method, which accepts a list of values to filter.
- Lastly, we can use the query() method. Whereas isin() accepts a list of filter values, query() evaluates a string expression to filter rows from a DataFrame.
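The three approaches can be sketched side by side as follows. The DataFrame and the variable names (by_or, by_isin, by_query) are illustrative assumptions:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col1": ["A", "B", "A", "C"], "col2": [1, 2, 3, 4]})

# 1) Combine two conditions with the element-wise OR operator (|);
#    each condition needs its own parentheses
by_or = df[(df["col1"] == "A") | (df["col1"] == "B")]

# 2) isin() takes a list of accepted values
by_isin = df[df["col1"].isin(["A", "B"])]

# 3) query() evaluates the condition from a string expression
by_query = df.query("col1 in ['A', 'B']")

# All three selections contain the same rows
print(by_or.equals(by_isin), by_isin.equals(by_query))
```

For more than two or three values, isin() tends to stay the most readable, since the list of accepted values can be built separately and passed in.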
#3 Filter based on the length of string
Here, say you want to filter all the rows from a DataFrame where the length of the strings in a column is greater/less than a threshold.
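Invoking the str.len() method on a series lets you compute the length of individual entries, which can then be used to filter the rows according to a threshold. Note that the built-in len() applied to a series returns the number of rows instead.
For example, the condition str.len() > 4 on col1 filters all the strings from col1 whose length is greater than 4.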
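A minimal sketch of a length-based filter, assuming a made-up col1 of fruit names:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col1": ["Apple", "Kiwi", "Banana", "Fig"]})

# str.len() computes the length of every entry; compare it with the threshold
long_rows = df[df["col1"].str.len() > 4]

print(long_rows)
```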
Before executing a string method on an object column, you should access its values through the str accessor, which exposes vectorized counterparts of many of Python’s string operations, such as strip(), isupper(), upper(), len(), etc.
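To illustrate, here is a short sketch that applies a few of these methods through the str accessor. The series values are assumptions chosen to make the effect of each method visible:

```python
import pandas as pd

# Hypothetical example data
s = pd.Series(["  apple ", "KIWI", "Fig"])

stripped = s.str.strip()    # removes leading and trailing whitespace
uppered = s.str.upper()     # converts every entry to upper case
is_upper = s.str.isupper()  # True where all cased characters are upper case
lengths = s.str.len()       # number of characters in every entry

print(stripped.tolist(), uppered.tolist(), is_upper.tolist(), lengths.tolist())
```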
#4 Filter based on the presence of a substring
Next, say you want to extract rows for which the values in the string column contain a particular substring.
There are three widely used methods for this.
- Match at the beginning of the string
As the name suggests, this method will return a row only if the substring matches the beginning of the value in the string column.
Say you want to find all strings which begin with the substring “Jo”. We will use the startswith() method for this. Also, recall from the previous filtering method (#3) that we should first access the object column through the str accessor.
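A minimal sketch of a prefix filter, assuming a made-up column called name:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"name": ["John", "Joanna", "Mark", "joseph"]})

# Keep the rows whose name begins with "Jo" (the match is case-sensitive)
starts_with_jo = df[df["name"].str.startswith("Jo")]

print(starts_with_jo)
```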
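If your column has NaN values, you should specify na=False in the startswith() method. Otherwise, on an object column the resulting mask contains NaN for those rows, and using it to index the DataFrame raises an error.
Specifying na=False treats NaN values as non-matches, so those rows are simply left out of the result.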
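A minimal sketch of the NaN-safe version, assuming a made-up name column with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical example data with a missing value
df = pd.DataFrame({"name": ["John", np.nan, "Mark", "Joanna"]})

# na=False makes the missing entry count as "no match" in the mask,
# so the mask is purely boolean and safe to index with
mask = df["name"].str.startswith("Jo", na=False)
filtered = df[mask]

print(filtered)
```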
- Match at the end of the string
Matching at the end of the string has a similar syntax to startswith(). Here, we use the endswith() method instead.
Note: Both startswith() and endswith() are case-sensitive methods.
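A minimal sketch of a suffix filter. The names are made up, and one of them is upper case to show the case sensitivity:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"name": ["John", "Ethan", "Susan", "JOAN"]})

# Keep the rows whose name ends with "an"; "JOAN" is skipped because
# the comparison is case-sensitive
ends_with_an = df[df["name"].str.endswith("an")]

print(ends_with_an)
```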
- Match anywhere in the string
In contrast to the startswith() and endswith() methods, which only match a substring at the start and the end of the string, respectively, the contains() method can find matches anywhere within the values of the string column.
By default, the contains() method performs case-sensitive matches. However, it can perform case-insensitive matches as well by passing the case=False argument. Note that contains() treats the pattern as a regular expression by default; pass regex=False to match it literally.
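The difference between the two modes can be sketched as follows, assuming a made-up name column:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"name": ["John", "Joanna", "Dejon", "Mark"]})

# Default: case-sensitive, so only a lower-case "jo" matches
sensitive = df[df["name"].str.contains("jo")]

# case=False: "Jo", "jo", "JO", ... all match
insensitive = df[df["name"].str.contains("jo", case=False)]

print(sensitive["name"].tolist(), insensitive["name"].tolist())
```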
#5 Filter based on the type of characters in a string
This type of filtering is based on the type of characters present in the string, such as:
- Filter if all characters are upper-case: isupper()
- Filter if all characters are lower-case: islower()
- Filter if all characters are alphabetic: isalpha()
- Filter if all characters are numeric: isnumeric()
- Filter if all characters are digits: isdigit()
- Filter if all characters are decimal: isdecimal()
- Filter if all characters are whitespace: isspace()
- Filter if the string is title-cased: istitle()
- Filter if all characters are alphanumeric: isalnum()
I have demonstrated a couple of these methods below.
- Filter alphanumeric strings from the DataFrame:
- Filter numeric strings from the DataFrame:
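Both filters can be sketched on one made-up column as follows (the values in col1 are assumptions chosen so that each method keeps a different subset):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col1": ["abc123", "12345", "hello world", "A-1"]})

# isalnum(): every character is a letter or a digit (no spaces or hyphens)
alphanumeric = df[df["col1"].str.isalnum()]

# isnumeric(): every character is numeric
numeric = df[df["col1"].str.isnumeric()]

print(alphanumeric["col1"].tolist(), numeric["col1"].tolist())
```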