Few things can seem as intimidating as working in the command line terminal. Many data scientists don’t have a computer science background, and so are not used to this. It might even seem like working in the command line terminal is a relic of the past 😧
I am here to tell you that this is false. In fact, with the rise of cloud-based shells such as the Azure Cloud Shell, there is more value than ever in learning the command line. The truth is that the command line is an efficiency booster. Using a graphical user interface (GUI) is easiest for simple tasks, but it gets increasingly tricky for harder ones.
There are loads of terminal tools that let a data scientist or data engineer work with data quickly. In this blog post, I will tell you about a favourite of mine called csvkit, a tool for handling CSV files in the terminal. It is super useful if you are working with monthly reports that come in CSV or Excel files. With csvkit, you can easily:
- Convert Excel files to CSV files
- Print CSV files out nicely formatted
- Cut out specific columns
- Get statistical information about columns
- Search within columns with regular expressions
and much more! You might be able to do the same in Excel. However, the real value of doing things with csvkit is that you can then automate the process (e.g. with cron jobs).
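As a sketch of what that automation could look like, here is a hypothetical crontab entry (the paths and schedule are made up for illustration) that converts a monthly Excel report to CSV:

```shell
# Hypothetical crontab entry (added via `crontab -e`).
# Runs at 07:00 on the 1st of every month; the paths are placeholders.
0 7 1 * * in2csv /data/reports/monthly_report.xlsx > /data/reports/monthly_report.csv
```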
I will give you an introduction to working with csvkit so that you can handle CSV files in the terminal. Specifically, I will teach you the five csvkit commands that I think are the most important ones. If you are more of a visual learner, then I have also made a free YouTube video series on csvkit that you can check out 😃
Prerequisites: You should have a command line terminal running Bash. If you are using Linux then this should not be a problem, while Windows users might want to take a look at cmder or enable the Windows Subsystem for Linux (WSL 2).
To get started with the csvkit package, you need to install it. The csvkit package is a Python package, so all you need to do (given that you have Python’s package installer pip installed) is to run:
pip install csvkit
If you don’t want to install the csvkit package globally, then consider using virtualenv to set up a virtual environment. To check that the installation worked, you can run the command:
in2csv --help
You should now get the help page for the command in2csv if everything was installed correctly. The command in2csv is one of several commands that csvkit installs for us.
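If you went the virtualenv route mentioned above, the setup might look roughly like this (the environment name csvkit-env is just an example):

```shell
# Create an isolated environment, activate it, and install csvkit into it.
python3 -m venv csvkit-env
source csvkit-env/bin/activate
pip install csvkit
```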
Let’s create a new folder called csvkit_sandbox where you can play around with your new toy:
mkdir csvkit_sandbox
cd csvkit_sandbox
Finally, you need some data to actually work on. Let us download the data that the csvkit documentation also uses:
curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/realdata/ne_1033_data.xlsx
After running the above command, you should have the file ne_1033_data.xlsx in your folder. You can run ls to check the contents of your csvkit_sandbox folder if you don’t trust me 😉
The first command we will look at is in2csv. This command takes an Excel file and turns it into a CSV file (duh!). This is super useful because data scientists often get reports in Excel format and have to convert them to CSV files to work with them efficiently.
To use the in2csv command, you simply write:
in2csv ne_1033_data.xlsx
If you use the command like this, then all the output is simply dumped to the terminal. For files of even moderate size, this is awful. A better option is to redirect the output to a file as follows:
in2csv ne_1033_data.xlsx > new_file.csv
Now you should have a new CSV file called new_file.csv in your csvkit_sandbox folder. A trick I sometimes use is to delete the original file in the same command by appending an rm command as follows:
in2csv ne_1033_data.xlsx > new_file.csv && rm ne_1033_data.xlsx
If you run the above command, then you should only have the file new_file.csv in your csvkit_sandbox folder. If you want to check out the file new_file.csv, then try out the command:
head new_file.csv
This command outputs the first lines of the file new_file.csv to the terminal.
In the last section, you retrieved all the columns from the Excel file and turned them into a CSV file. But often you don’t need all the columns for further analysis, and carrying useless columns along is a hassle. Fear not! With the command csvcut you can select specific columns of a CSV file:
csvcut -c county,total_cost,ship_date new_file.csv
Output:
county,total_cost,ship_date
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BURT,499.0,2013-06-20
...
The above command selects only the columns county, total_cost, and ship_date from the CSV file new_file.csv.
But what if you don’t remember the column names? Then a quick trick is to run the command:
csvcut -n new_file.csv
Output:
1: state
2: county
3: fips
4: nsn
5: item_name
6: quantity
7: ui
8: acquisition_cost
9: total_cost
10: ship_date
11: federal_supply_category
12: federal_supply_category_name
13: federal_supply_class
14: federal_supply_class_name
The optional argument -n gives you the names of the columns. Neat, right? The command csvcut is super useful for selecting the columns you want to proceed with.
When you printed the CSV file to the terminal in the last section, it did not look very good. Raw CSV files are usually awful to read when printed like this:
csvcut -c county,total_cost,ship_date new_file.csv
Output:
county,total_cost,ship_date
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
ADAMS,138.0,2008-07-11
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BUFFALO,499.0,2008-09-24
BURT,499.0,2013-06-20
...
The command csvlook comes to the rescue. You can pass the output of csvcut as input to csvlook by using the pipe operator | as follows:
csvcut -c county,total_cost,ship_date new_file.csv | csvlook
Output:
| county | total_cost | ship_date |
| ---------- | ---------- | ---------- |
| ADAMS | 138.00 | 2008-07-11 |
| ADAMS | 138.00 | 2008-07-11 |
| ADAMS | 138.00 | 2008-07-11 |
| ADAMS | 138.00 | 2008-07-11 |
| ADAMS | 138.00 | 2008-07-11 |
| ADAMS | 138.00 | 2008-07-11 |
| BUFFALO | 499.00 | 2008-09-24 |
| BUFFALO | 499.00 | 2008-09-24 |
| BUFFALO | 499.00 | 2008-09-24 |
| BUFFALO | 499.00 | 2008-09-24 |
| BUFFALO | 499.00 | 2008-09-24 |
| BUFFALO | 499.00 | 2008-09-24 |
| BUFFALO | 499.00 | 2008-09-24 |
| BUFFALO | 499.00 | 2008-09-24 |
| BURT | 499.00 | 2013-06-20 |
...
That looks a lot better! The output is now much easier to parse with your eyes.
Remember the trick of passing the output of csvcut to the input of csvlook with the pipe operator |. This is very common in csvkit and is the key to creating more complex pipelines 🔥
When dealing with CSV files as data analysts or data scientists, we often want to extract summary statistics. There are hundreds of ways to extract the median and mean of columns in a CSV file. Which one should you use? Import the data to Excel? Use Pandas for quick data analysis?
I think you know where I am going with this. Of course, csvkit has a command for this called csvstat. The command csvstat extracts basic statistics from each column in a CSV file:
csvstat new_file.csv
Output:
1. "state"
Type of data: Text
Contains null values: False
Unique values: 1
Longest value: 2 characters
Most common values: NE (1036x)
2. "county"
Type of data: Text
Contains null values: False
Unique values: 35
Longest value: 10 characters
Most common values: DOUGLAS (760x)
DAKOTA (42x)
CASS (37x)
HALL (23x)
LANCASTER (18x)
3. "fips"
Type of data: Number
Contains null values: False
Unique values: 35
Smallest value: 31001
Largest value: 31185
Sum: 32176888
Mean: 31058.772
Median: 31055
StDev: 23.881
Most common values: 31055 (760x)
31043 (42x)
31025 (37x)
31079 (23x)
31109 (18x)
...
If you read the above output carefully, you will notice that csvstat gives different statistics depending on the column type. If one of your columns contains text, then it makes little sense to talk about the standard deviation of that column.
The command csvstat does type inference: it tries to deduce the type of each column, and then finds summary statistics that are appropriate. This, in my experience, works wonderfully.
Sometimes you don’t want statistics on all the columns, but only a few. To do this, use the command csvcut to select the necessary columns and then pipe the result to csvstat to get statistics:
csvcut -c total_cost,ship_date new_file.csv | csvstat
Output:
1. "total_cost"
Type of data: Number
Contains null values: False
Unique values: 92
Smallest value: 0
Largest value: 412000
Sum: 5553135.17
Mean: 5360.169
Median: 6000
StDev: 13352.139
Most common values: 6800 (304x)
10747 (195x)
6000 (105x)
499 (98x)
0 (81x)
2. "ship_date"
Type of data: Date
Contains null values: False
Unique values: 84
Smallest value: 2006-03-07
Largest value: 2014-01-30
Most common values: 2013-04-25 (495x)
2013-04-26 (160x)
2008-05-20 (28x)
2012-04-16 (26x)
2006-11-17 (20x)
Row count: 1036
A final command that I cannot recommend enough is csvgrep. The command csvgrep allows you to search through CSV files to find specific records. If you are used to the terminal command grep, then csvgrep will feel familiar.
To see it in action, let us search for the county HOLT in our CSV file:
csvgrep -c county -m "HOLT" new_file.csv
Output:
NE,HOLT,31089.0,1005-00-073-9421,"RIFLE,5.56 MILLIMETER",1.0,Each,499.0,499.0,2008-05-19,10.0,WEAPONS,1005.0,"Guns, through 30 mm"
NE,HOLT,31089.0,1005-00-589-1271,"RIFLE,7.62 MILLIMETER",1.0,Each,138.0,138.0,2006-03-08,10.0,WEAPONS,1005.0,"Guns, through 30 mm"
NE,HOLT,31089.0,1005-00-589-1271,"RIFLE,7.62 MILLIMETER",1.0,Each,138.0,138.0,2006-03-08,10.0,WEAPONS,1005.0,"Guns, through 30 mm"
...
It worked! Now you only get records with the county HOLT.
- The parameter -c is used to select the column.
- The parameter -m is used to find a string match.
Pro tip: The csvgrep command also allows for regex search with the optional argument -r. Check out the csvgrep documentation for more on this!
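As a quick illustration of the -r flag, the following sketch matches every county whose name starts with a "B". I use a tiny hand-made sample file here so the output stays short; with the real data you would pass new_file.csv instead:

```shell
# Build a tiny sample CSV so the demo is self-contained.
printf 'county,total_cost\nADAMS,138.0\nBUFFALO,499.0\nBURT,499.0\n' > sample.csv

# -r takes a regular expression instead of a literal match:
csvgrep -c county -r '^B' sample.csv
```

You should get back only the header plus the BUFFALO and BURT rows.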
My final advice to you would be to play around with combining several csvkit commands to get cool results. Check this out!
csvcut -c county,total_cost new_file.csv | csvgrep -c county -m "HOLT" | csvlook
Output:
| county | total_cost |
| ------ | ---------- |
| HOLT | 499.00 |
| HOLT | 138.00 |
| HOLT | 138.00 |
| HOLT | 138.00 |
| HOLT | 138.00 |
| HOLT | 138.00 |
| HOLT | 138.00 |
| HOLT | 58.71 |
First, I selected only the two columns county and total_cost with csvcut. Then I searched for the county HOLT with the command csvgrep. Finally, I displayed the result nicely with the command csvlook. Now it’s your turn to play around with csvkit ✌️
Hopefully, you are starting to love csvkit as much as I do. Again, if you want to learn some more cool tips and tricks, then check out the csvkit documentation or my YouTube videos on csvkit.
Like my writing? Check out some of my other posts for more Python content:
If you are interested in data science, programming, or anything in between, then feel free to add me on LinkedIn and say hi ✋