
What If .apply() Is Too Slow? | by Yufeng | Dec 2022



Sometimes we need to apply a function to a Pandas DataFrame in Python, using its column(s) as the function’s input. However, .apply(), one of the most commonly used methods for this, can take far longer than expected when run over an entire data frame. What should we do?

Photo by Chris Liverani on Unsplash

If you are working with data in Python, Pandas is likely one of your most-used libraries, thanks to its convenient and powerful data-processing features.

If we want to apply the same function to every value in a column of a Pandas data frame, we can simply use .apply(). Both a Pandas DataFrame and a Pandas Series (a single column of the data frame) support .apply().

However, have you noticed that .apply() can be very slow on a large dataset?

In this article, I’m going to share some tricks to speed up data manipulations when you want to apply a function to one or more columns.

Apply a function to a single column

For example, here’s our toy dataset.

import pandas as pd
import numpy as np
import timeit

d = {'category': ['apple', 'pear', 'peach'], 'radius': [3, 4, 2], 'sweetness': [1, 2, 3]}
df = pd.DataFrame(data=d)
df

If we want to add another column, ‘diameter’, computed from the radius column (diameter = radius * 2), we can use .apply() here.

df['diameter'] = df['radius'].apply(lambda x: x*2)
df

Then we time 10,000 executions of that line,

# Timing
setup_code = """
import pandas as pd
d = {'category': ['apple', 'pear', 'peach'], 'radius': [3, 4, 2], 'sweetness': [1, 2, 3]}
df = pd.DataFrame(data=d)
"""

mycode = '''
df['radius'].apply(lambda x: x*2)
'''

# timeit statement
t1 = timeit.timeit(setup=setup_code,
                   stmt=mycode,
                   number=10000)
print(f"10000 runs of mycode took {t1} seconds")

which yields 0.55 secs. Not bad, right? But remember, this is only a toy dataset with 3 rows. What if we had millions of rows?
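To make the gap visible at scale, here is a quick benchmark sketch (not one of the original timing runs; the row count and timings are my own and will vary by machine) comparing .apply() with the plain vectorized expression on a larger frame:

```python
import time

import numpy as np
import pandas as pd

# A larger frame (100k rows) makes the per-row overhead of .apply() visible.
big = pd.DataFrame({'radius': np.random.randint(1, 10, size=100_000)})

t0 = time.perf_counter()
via_apply = big['radius'].apply(lambda x: x * 2)
t_apply = time.perf_counter() - t0

t0 = time.perf_counter()
via_vector = big['radius'] * 2
t_vector = time.perf_counter() - t0

# Both approaches produce identical values; only the speed differs.
assert via_apply.equals(via_vector)
print(f"apply: {t_apply:.4f}s, vectorized: {t_vector:.4f}s")
```

On typical hardware the vectorized version wins by an order of magnitude or more, and the gap grows with row count.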

You may have noticed that we do not actually need .apply() here; we can simply do the following,

df['diameter'] = df['radius']*2
df

We can see the output is the same as with .apply(). Timing 10k runs again,

# Timing
setup_code = """
import pandas as pd
d = {'category': ['apple', 'pear', 'peach'], 'radius': [3, 4, 2], 'sweetness': [1, 2, 3]}
df = pd.DataFrame(data=d)
"""

mycode = '''
df['radius']*2
'''

# timeit statement
t1 = timeit.timeit(setup=setup_code,
                   stmt=mycode,
                   number=10000)
print(f"10000 runs of mycode took {t1} seconds")

which gave us 0.32 secs in total, faster than the .apply() version.

Note that we can avoid .apply() here only because the function is so simple (multiply a value by 2). In most cases, we need to apply a more complex function to the column.

For example, suppose we want a column holding the larger of the radius and a constant, say 3, for each row. If you simply write the following,

max(df['radius'],3)

it will raise a ValueError (“The truth value of a Series is ambiguous”), because the built-in max() does not operate element-wise on a Series.

So we need to move the comparison inside .apply().

df['radius_or_3'] = df['radius'].apply(lambda x: max(x,3))
df

Let’s calculate the execution time,

# Timing
setup_code = """
import pandas as pd
d = {'category': ['apple', 'pear', 'peach'], 'radius': [3, 4, 2], 'sweetness': [1, 2, 3]}
df = pd.DataFrame(data=d)
"""

mycode = '''
df['radius'].apply(lambda x: max(x,3))
'''

# timeit statement
t1 = timeit.timeit(setup=setup_code,
                   stmt=mycode,
                   number=10000)
print(f"10000 runs of mycode took {t1} seconds")

which gave us 0.56 secs. But how long would it take if the data had millions of rows? I didn’t show it here, but it would take far longer. That is not acceptable for such a simple manipulation, right?

How should we speed it up?

Here’s the trick: use NumPy instead of the .apply() function.

df['radius_or_3'] = np.maximum(df['radius'],3)

NumPy’s maximum is a vectorized function: it operates on the whole array at once instead of calling Python code for every row, unlike .apply(). Let’s time it.

# Timing
setup_code = """
import pandas as pd
import numpy as np
d = {'category': ['apple', 'pear', 'peach'], 'radius': [3, 4, 2], 'sweetness': [1, 2, 3]}
df = pd.DataFrame(data=d)
"""

mycode = '''
np.maximum(df['radius'],3)
'''

# timeit statement
t1 = timeit.timeit(setup=setup_code,
                   stmt=mycode,
                   number=10000)
print(f"10000 runs of mycode took {t1} seconds")

which yields 0.31 secs, roughly twice as fast as the .apply() version.

So the takeaway is: look for a corresponding NumPy function for your task before reaching for .apply() for everything.
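As an illustration (these are standard NumPy and pandas calls, not from the original article), the same element-wise maximum can be written in several vectorized ways, all equivalent:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'radius': [3, 4, 2]})

a = np.maximum(df['radius'], 3)                   # NumPy ufunc
b = df['radius'].clip(lower=3)                    # pandas built-in
c = np.where(df['radius'] > 3, df['radius'], 3)   # conditional select

# All three agree: [3, 4, 3]
assert list(a) == list(b) == list(c) == [3, 4, 3]
```

Any of these avoids the per-row Python call that makes .apply() slow.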

Apply a function to multiple columns

Sometimes we need to use multiple columns of the data as inputs to a function. For example, suppose we want a column of lists, each recording the integer sizes from ‘radius_or_3’ up to (but not including) ‘diameter’.

We can use .apply() to the entire data frame,

df['sizes'] = df.apply(lambda x: list(range(x.radius_or_3,x.diameter)), axis=1)
df

This step is very time-consuming, because .apply(..., axis=1) constructs a Series for every single row and runs the Python lambda on each one. The execution time is,

# Timing
setup_code = """
import pandas as pd
import numpy as np
d = {'category': ['apple', 'pear', 'peach'], 'radius': [3, 4, 2], 'sweetness': [1, 2, 3]}
df = pd.DataFrame(data=d)
df['diameter'] = df['radius']*2
df['radius_or_3'] = np.maximum(df['radius'],3)
"""

mycode = '''
df.apply(lambda x: list(range(x.radius_or_3,x.diameter)), axis=1)
'''

# timeit statement
t1 = timeit.timeit(setup=setup_code,
                   stmt=mycode,
                   number=10000)
print(f"10000 runs of mycode took {t1} seconds")

which gave us 1.84 seconds. On a data frame with millions of rows, this pattern takes minutes.

Are we able to find a more efficient way to do the task?

The answer is yes. All we need is a function that takes NumPy arrays (the .values of the Pandas Series) as its inputs.

def create_range(a, b):
    # One Python list per row, stored in a 1-D object array
    range_l = np.empty(len(a), dtype=object)
    for i, val in enumerate(a):
        range_l[i] = list(range(val, b[i]))
    return range_l

df['sizes'] = create_range(df['radius_or_3'].values, df['diameter'].values)
df

This code defines a function, create_range, that takes two NumPy arrays and builds one NumPy object array with a simple for loop. When assigned to df['sizes'], the returned array is converted automatically to a Pandas Series.
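For comparison, a plain list comprehension over the raw arrays achieves the same result and follows the same principle (a sketch using the same toy columns, with the ‘radius_or_3’ and ‘diameter’ values precomputed):

```python
import pandas as pd

df = pd.DataFrame({'radius_or_3': [3, 4, 3], 'diameter': [6, 8, 4]})

# zip over .values avoids the per-row Series construction of .apply(axis=1)
df['sizes'] = [list(range(a, b))
               for a, b in zip(df['radius_or_3'].values, df['diameter'].values)]

assert df['sizes'].tolist() == [[3, 4, 5], [4, 5, 6, 7], [3]]
```

Either way, the key is to iterate over plain NumPy arrays rather than row-by-row Series objects.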

Let’s check how much time we saved.

# Timing
setup_code = """
import pandas as pd
import numpy as np
d = {'category': ['apple', 'pear', 'peach'], 'radius': [3, 4, 2], 'sweetness': [1, 2, 3]}
df = pd.DataFrame(data=d)
df['diameter'] = df['radius']*2
df['radius_or_3'] = np.maximum(df['radius'], 3)

def create_range(a, b):
    range_l = np.empty(len(a), dtype=object)
    for i, val in enumerate(a):
        range_l[i] = list(range(val, b[i]))
    return range_l
"""

mycode = '''
create_range(df['radius_or_3'].values, df['diameter'].values)
'''

# timeit statement (the function definition sits in setup so that only the call is timed)
t1 = timeit.timeit(setup=setup_code,
                   stmt=mycode,
                   number=10000)
print(f"10000 runs of mycode took {t1} seconds")

It gave us 0.07 secs!

See? That is about 26 times faster than using .apply() on the entire data frame!

Takeaways

  1. If you are about to use .apply() on a single column of a Pandas data frame, first look for a simpler vectorized expression, e.g. df['radius']*2, or an existing NumPy function for the task.
  2. If you are about to use .apply() across multiple columns, avoid the .apply(..., axis=1) pattern. Instead, write a standalone function that takes NumPy arrays as inputs and call it directly on the .values of the relevant columns.
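Both takeaways, condensed into one short recap on the same toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'radius': [3, 4, 2]})

# Takeaway 1: vectorized expressions and NumPy functions instead of .apply()
df['diameter'] = df['radius'] * 2
df['radius_or_3'] = np.maximum(df['radius'], 3)

# Takeaway 2: a standalone function over .values instead of .apply(axis=1)
def create_range(a, b):
    out = np.empty(len(a), dtype=object)
    for i, val in enumerate(a):
        out[i] = list(range(val, b[i]))
    return out

df['sizes'] = create_range(df['radius_or_3'].values, df['diameter'].values)

assert df['sizes'].tolist() == [[3, 4, 5], [4, 5, 6, 7], [3]]
```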

That’s all I want to share! Cheers!

If you like my article, don’t forget to subscribe to my email list or become a referred member of Medium!!


