Techno Blender

Tips and Tricks for Loading Large CSV Files into Pandas DataFrames — Part 1 | by Wei-Meng Lee | Dec, 2022



Photo by Jordan Harrison on Unsplash

Most real-world datasets are humongous: they come in gigabytes and contain millions of rows. In this article, I will discuss some techniques that you can employ when dealing with large CSV datasets.

When dealing with large CSV files, there are two main concerns:

  • The amount of memory used in loading large CSV files.
  • The amount of time spent in loading large CSV files.

Ideally, you want to minimize both the memory footprint of your DataFrame and the time taken to load it. In this article, I will walk you through these techniques using a sample dataset.

For this article, I will use the Japanese Trade Statistics data available from https://www.kaggle.com/datasets/4e614ec846ab778f6a2ff166232d5a65f5e6786b4f5781690588bd2cccd71cb6?resource=download.

License type: CC BY-SA 4.0

This dataset contains trade data from 1988 to 2020. It has more than 100 million rows, and the CSV file takes up a whopping 4.5 GB, making it an ideal dataset to illustrate the concepts in this article.

Let’s start by loading the entire CSV file with more than 100 million rows. I am interested in seeing how long it takes to load the DataFrame, as well as its memory footprint:

import time
import pandas as pd

start = time.time()

df = pd.read_csv("custom_1988_2020.csv")

print(time.time() - start, ' seconds')
display(df)
df.info()

The output is as shown:

The total memory footprint is a whopping 6.8 GB (!) and it took me 30 seconds to load it into a Pandas DataFrame.

For reference, I am using a Mac Studio with 32 GB of memory.

Let’s examine the columns in the DataFrame:

df.columns

You should now realize that this CSV file has no header row. Because Pandas assumes by default that the first line of a CSV file contains the header, the first row of data has been used as the column names:

Index(['198801', '1', '103', '100', '000000190', '0', '35843', '34353'], dtype='object')

Since the CSV file has no header, the first thing you should do is use the header parameter to tell Pandas that there is none:

# tell Pandas the CSV file has no header row
df = pd.read_csv("custom_1988_2020.csv", header=None)
display(df)

Pandas will now automatically name the columns starting with 0, followed by 1, and so on.
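To see this naming in action without touching the 4.5 GB file, here is a minimal sketch on a one-line in-memory CSV (the sample row is made up to mirror the file’s layout):

```python
import io

import pandas as pd

# A single made-up row mirroring the layout of the trade data file
csv_text = "198801,1,103,100,000000190,0,35843,34353\n"

# With header=None, Pandas assigns integer labels 0, 1, 2, ... to the columns
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(df.columns))   # [0, 1, 2, 3, 4, 5, 6, 7]
```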

From the description of the dataset on its Kaggle page, we know the meaning of the various columns:

  • ym (year + month)
  • exp_imp (export: 1, import: 2)
  • hs9 (HS code)
  • Customs
  • Country
  • Q1
  • Q2 (quantity)
  • Value (in thousands of yen)

Let’s name the columns using the names parameter:

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'])
display(df)

The DataFrame now has the following column names — ‘YearMonth’, ‘ExportImport’, ‘HSCode’, ‘Customs’, ‘Country’, ‘Q1’, ‘Q2_Quantity’, ‘Value’.

Since the CSV file is so large, the next question you want to ask yourself is — do you really need all the columns? To load specific columns, you can use the usecols parameter to specify the columns that you want to load:

start = time.time()

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 usecols=["YearMonth", "Value"])

print(time.time() - start, ' seconds')
display(df)
df.info()

As you can see from the output above, the memory footprint has been reduced to 1.7 GB, and the loading time is now down to 17 seconds.

The usecols parameter also accepts column positions. The call above can be rewritten using column numbers 0 and 7:

start = time.time()

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 usecols=[0, 7])

print(time.time() - start, ' seconds')
display(df)

Note that you cannot use -1 to indicate the last column, like this:

usecols=[0, -1]   # ValueError: -1 is not supported
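If you do want the last column by position, one workaround is to peek at a single row first to count the columns, then pass the last position explicitly. A sketch on a small in-memory CSV (the sample row is made up; with the real file you would pass the filename instead):

```python
import io

import pandas as pd

csv_text = "198801,1,103,100,000000190,0,35843,34353\n"

# Read just one row to find out how many columns the file has
n_cols = pd.read_csv(io.StringIO(csv_text), header=None, nrows=1).shape[1]

# Now the first and last columns can be selected by position
df = pd.read_csv(io.StringIO(csv_text), header=None, usecols=[0, n_cols - 1])
print(df.shape)   # (1, 2)
```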

The usecols parameter also accepts a lambda function. For example, to retrieve all columns except the Country column, you can use the following lambda expression:

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 usecols=lambda column: column not in ['Country'])
display(df)

The Country column would now be excluded from the result.

Using a lambda function with the usecols parameter allows you to do some interesting things, such as loading only the columns whose names contain “Q”:

usecols = lambda column: "Q" in column

Or only the columns whose names are longer than, say, seven characters:

usecols = lambda column: len(column) > 7
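To check these lambda filters without loading the full file, here is a quick sketch against a tiny in-memory CSV that reuses the column names above (the data row is made up):

```python
import io

import pandas as pd

csv_text = ("YearMonth,ExportImport,HSCode,Customs,Country,Q1,Q2_Quantity,Value\n"
            "198801,1,103,100,190,0,35843,34353\n")

# Keep only the columns whose name contains "Q"
df_q = pd.read_csv(io.StringIO(csv_text), usecols=lambda column: "Q" in column)
print(list(df_q.columns))     # ['Q1', 'Q2_Quantity']

# Keep only the columns whose name is longer than seven characters
df_long = pd.read_csv(io.StringIO(csv_text), usecols=lambda column: len(column) > 7)
print(list(df_long.columns))  # ['YearMonth', 'ExportImport', 'Q2_Quantity']
```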

In many situations, you do not need all the rows in the CSV file. Perhaps the first 100 rows are sufficient. For this, you can use the nrows parameter to specify the first n rows that you want to load:

start = time.time()

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 nrows=100)

print(time.time() - start, ' seconds')
display(df[:15])
df.info()

From the result above, you can see that it now takes only 0.1 seconds to load the first 100 rows, and the resulting DataFrame occupies only 6.4 KB.

There are also times when you might want to skip certain rows in your CSV file. To do so, use the skiprows parameter:

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 skiprows=2,
                 nrows=100)
display(df[:15])

The result above shows that the first two lines of the CSV file are skipped.

You can also skip specific rows:

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 skiprows=[0, 2, 4],
                 nrows=100)
display(df[:15])

The above result shows that rows 0, 2, and 4 were skipped.

You can also use a range object to specify the range of rows to skip:

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 skiprows=range(5, 10),
                 nrows=100)
display(df[:15])

The above result shows rows 5 to 9 were skipped. The value of the skiprows parameter can also be written using a lambda function, like this:

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 skiprows=lambda x: 5 <= x < 10,
                 nrows=100)

Using a lambda function, you can also skip all the even-numbered rows:

start = time.time()

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 skiprows=lambda x: x % 2 == 0,
                 nrows=100)

print(time.time() - start, ' seconds')
display(df[:15])

The above result shows that all the even-numbered rows were skipped.

Up to this point, you have learned how to load the first n rows, as well as how to skip specific rows in a CSV file. How about loading specific rows? There is no dedicated parameter for that, but you can make use of the skiprows parameter to get what you want.

Using the lambda function in the skiprows parameter, you can specify which rows not to skip (which essentially means which rows you want to load):

start = time.time()

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 skiprows=lambda x: x not in [1, 3],
                 nrows=100)

print(time.time() - start, ' seconds')
display(df[:15])
df.info()

The above result shows that rows 1 and 3 were retained.

The downside to this approach is that the entire CSV file must still be scanned, hence the 20 seconds it took to load just two rows.
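Another way to pick scattered rows without holding the whole file in memory is to stream it with the chunksize parameter and filter each chunk as it arrives; this still scans the entire file, but memory stays bounded. A sketch (the helper name and the tiny demo data are my own, not from the dataset):

```python
import io

import pandas as pd

def rows_by_position(path_or_buf, wanted, names, chunksize=1_000_000):
    """Stream a headerless CSV and keep only the zero-based row positions in `wanted`."""
    pieces = []
    reader = pd.read_csv(path_or_buf, header=None, names=names, chunksize=chunksize)
    for chunk in reader:
        # With chunksize, the default RangeIndex keeps counting across chunks,
        # so chunk.index holds global row positions.
        hits = chunk[chunk.index.isin(wanted)]
        if not hits.empty:
            pieces.append(hits)
    return pd.concat(pieces)

# Tiny in-memory demo; the real call would pass "custom_1988_2020.csv"
demo = io.StringIO("\n".join(f"{i},{i * 10}" for i in range(10)))
df = rows_by_position(demo, wanted={1, 3}, names=["a", "b"], chunksize=4)
print(list(df["a"]))   # [1, 3]
```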

The final challenge I want to discuss is how to load the last n rows of a CSV file. While loading the first n rows is easy, loading the last n rows is not so straightforward. But you can make use of what you have learned up to this point to solve this problem.

First, count how many rows there are in the CSV file:

# count the number of rows
start = time.time()

with open('custom_1988_2020.csv') as f:
    row_count = sum(1 for line in f)

print(time.time() - start, ' seconds')
row_count

Since there are more than 100 million lines in the CSV file, it took about 10 seconds to count the rows. Also, remember that this CSV file has no header, so all 113,607,322 lines are data rows.
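If the counting pass itself becomes a bottleneck, scanning for newline bytes in large binary blocks is usually faster than building a Python line iterator. A sketch (the helper is my own; it assumes the file ends with a trailing newline, as this dataset does):

```python
def count_rows(path, block_size=1 << 20):
    """Count lines by scanning 1 MB binary blocks for newline bytes."""
    count = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            count += block.count(b"\n")
    return count

# count_rows("custom_1988_2020.csv") gives the same row_count as the one-liner
```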

Then, to load the last 20 rows, use the skiprows parameter and pass it a lambda function to skip all the rows except the last 20:

# read the last n rows
start = time.time()

df = pd.read_csv("custom_1988_2020.csv",
                 header=None,
                 names=['YearMonth', 'ExportImport', 'HSCode', 'Customs',
                        'Country', 'Q1', 'Q2_Quantity', 'Value'],
                 skiprows=lambda x: 0 <= x < row_count - 20,
                 nrows=100)

print(time.time() - start, ' seconds')
display(df)
df.info()

The result shows the last 20 rows loaded into the Pandas DataFrame.

Like the previous section, the downside is that the entire CSV file must be scanned during the loading process (hence the 22 seconds to load the DataFrame).
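A single-pass alternative that avoids the separate counting step is to stream the file once, keeping only the most recent n lines in a bounded collections.deque, and then hand those lines to Pandas. A sketch with a 100-line in-memory stand-in for the real file (the helper is mine, not from the dataset):

```python
import io
from collections import deque

import pandas as pd

def read_last_n_rows(lines, n, names):
    # A deque with maxlen=n discards older lines as new ones arrive,
    # so only the last n lines survive the single pass over the file.
    last = deque(lines, maxlen=n)
    return pd.read_csv(io.StringIO("".join(last)), header=None, names=names)

# 100-line stand-in; with the real file you would pass an open file object:
#   with open("custom_1988_2020.csv") as f: df = read_last_n_rows(f, 20, names)
demo = io.StringIO("".join(f"{i},{i * 2}\n" for i in range(100)))
df = read_last_n_rows(demo, 20, names=["a", "b"])
print(df["a"].iloc[0], df["a"].iloc[-1])   # 80 99
```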

In this article, I have gone through quite a number of techniques for loading a Pandas DataFrame from a CSV file. Often, there is no need to load the entire CSV file into a DataFrame: by loading only what you need, you save both loading time and the memory required to hold the DataFrame. In the next article, I will show you techniques to reduce the memory footprint of your DataFrame. Stay tuned!

