Pandas and Python Tips and Tricks for Data Science and Data Analysis | by Zoumana Keita | Dec, 2022
Take your efficiency to the next level with these Pandas and Python Tricks!
This blog gathers all the Pandas and Python tricks & tips that I regularly share on my LinkedIn page. I have decided to centralize them into a single blog to help you make the most out of your learning process by easily finding what you are looking for.
The content is divided into two main sections:
- Pandas tricks & tips, related only to Pandas.
- Python tricks & tips, related to Python.
This section provides a list of all the tricks.
1. Create a new column from multiple columns in your dataframe.
Performing simple arithmetic tasks such as creating a new column as the sum of two other columns can be straightforward.
But what if you want to implement a more complex function and use it as the logic behind column creation? Here is where things can get a bit challenging.
Guess what...
`apply` and `lambda` can help you easily apply whatever logic to your columns using the following format:
`df[new_col] = df.apply(lambda row: func(row), axis=1)`
where:
- `df` is your dataframe.
- `row` will correspond to each row in your data frame.
- `func` is the function you want to apply to your data frame.
- `axis=1` to apply the function to each row in your data frame.
Below is an illustration.
The `candidate_info` function combines each candidate's information to create a single description column about that candidate.
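A minimal sketch of the pattern above. The column names and values are invented for illustration, and this `candidate_info` is only one possible way to build the description.

```python
import pandas as pd

# Hypothetical candidate data, invented for illustration.
candidates = pd.DataFrame({
    "name": ["Aida", "Mamadou"],
    "degree": ["Master", "Bachelor"],
    "city": ["Dakar", "Bamako"],
})


def candidate_info(row):
    """Combine the fields of one row into a single description string."""
    return f"{row['name']} holds a {row['degree']} degree and lives in {row['city']}."


# axis=1 passes each row (as a Series) to the function.
candidates["description"] = candidates.apply(lambda row: candidate_info(row), axis=1)
print(candidates["description"].tolist())
```

Since `candidate_info` already takes a row, `candidates.apply(candidate_info, axis=1)` would give the same result without the lambda.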
2. Convert numerical data into categorical ones
This process mainly occurs in the feature engineering phase. Some of its benefits are:
- the identification of outliers, invalid, and missing values in the data.
- reduction of the chance of overfitting by creating more robust models.
Use these two functions from Pandas, depending on your need. Examples are provided in the image below.
(1) `pd.cut()` to specifically define your bin edges.
Scenario: categorize candidates by expertise with respect to their number of years of experience, where:
- Entry level: 0–1 year
- Mid-level: 2–3 years
- Senior level: 4–5 years
(2) `pd.qcut()` to divide your data into bins that each hold roughly the same number of values.
It uses the underlying percentiles of the distribution of the data, rather than fixed bin edges.
Scenario: categorize the commute time of the candidates into good, acceptable, or too long.
Keep in mind:
- When using `pd.cut()`: number of bin edges = number of labels + 1.
- When using `pd.qcut()`: number of bins = number of labels.
- With `pd.cut()`: set `include_lowest=True`; otherwise, a value equal to the lowest bin edge will be converted to NaN.
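The two scenarios can be sketched as follows. The column names and values are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical values, invented for illustration.
candidates = pd.DataFrame({
    "years_experience": [0, 1, 2, 3, 4, 5],
    "commute_minutes": [10, 20, 30, 40, 50, 60],
})

# pd.cut: explicit bin edges. Four edges produce three bins, hence three labels.
# include_lowest=True keeps the value 0, which sits exactly on the lowest edge.
candidates["level"] = pd.cut(
    candidates["years_experience"],
    bins=[0, 1, 3, 5],
    labels=["Entry level", "Mid-level", "Senior level"],
    include_lowest=True,
)

# pd.qcut: quantile-based bins. q=3 splits the values into three groups
# of similar size, so three labels are needed.
candidates["commute"] = pd.qcut(
    candidates["commute_minutes"],
    q=3,
    labels=["good", "acceptable", "too long"],
)
print(candidates)
```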
3. Select rows from a Pandas DataFrame based on column(s) values
- use the `.query()` function by specifying the filter condition.
- the filter expression can contain any operators (<, >, ==, !=, etc.)
- use the @ sign to use a variable in the expression.
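A short sketch of these three points, using made-up data. The variable `min_years` is a hypothetical name chosen for the example.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
candidates = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "years_experience": [1, 3, 5, 2],
    "degree": ["Master", "Bachelor", "PhD", "Master"],
})

min_years = 2  # a local Python variable, referenced with @ inside the expression

# Single condition using a variable.
experienced = candidates.query("years_experience >= @min_years")

# Several conditions combined in one expression.
experienced_masters = candidates.query(
    "degree == 'Master' and years_experience >= @min_years"
)
print(experienced)
print(experienced_masters)
```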
4. Deal with zip files
Sometimes it can be efficient to read and write .zip files without extracting them from your local disk. Below is an illustration.
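As a minimal sketch, pandas can write a CSV straight into a zip archive and read it back without a manual extraction step. The file names are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["A", "B"], "score": [5, 9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "candidates.zip")  # hypothetical file name

    # Write the CSV directly into a zip archive.
    df.to_csv(
        zip_path,
        index=False,
        compression={"method": "zip", "archive_name": "candidates.csv"},
    )

    # Read it back without extracting; the compression is inferred
    # from the .zip suffix. The archive must contain a single file.
    restored = pd.read_csv(zip_path)

print(restored)
```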
5. Select a subset of your Pandas dataframe with specific column types
You can use the `select_dtypes` function. It takes two main parameters: `include` and `exclude`.
- `df.select_dtypes(include = ['type_1', 'type_2', … 'type_n'])` means I want the subset of my data frame WITH columns of type_1, type_2, …, type_n.
- `df.select_dtypes(exclude = ['type_1', 'type_2', … 'type_n'])` means I want the subset of my data frame WITHOUT columns of type_1, type_2, …, type_n.
Below is an illustration.
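A small sketch with invented columns. `"number"` is the generic selector pandas accepts for all numeric dtypes.

```python
import pandas as pd

# Hypothetical data with mixed column types.
df = pd.DataFrame({
    "name": ["A", "B"],
    "age": [25, 31],
    "score": [0.5, 0.9],
    "hired": [True, False],
})

# Keep only the numeric columns.
numeric_only = df.select_dtypes(include=["number"])

# Keep everything except the numeric columns.
without_numbers = df.select_dtypes(exclude=["number"])

print(list(numeric_only.columns))
print(list(without_numbers.columns))
```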
6. Remove comments from Pandas dataframe column
Imagine that I want to clean this data (candidates.csv) by removing comments from the application date column. This can be done on the fly while loading your pandas dataframe using the `comment` parameter as follows:
`clean_data = pd.read_csv(path_to_data, comment='symbol')`
In my case, `comment='#'`, but it could be any other character (|, /, etc.) depending on your case. The first scenario illustrates this.
Wait, what if I want to create a new column for those comments and still remove them from the application date column? The second scenario illustrates this.
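Both scenarios can be sketched as below. The CSV content is an invented stand-in for candidates.csv, and splitting on `#` in the second scenario is one possible approach, not necessarily the one used in the original illustration.

```python
import io

import pandas as pd

# Hypothetical file content standing in for candidates.csv.
raw_csv = (
    "name,application_date\n"
    "A,2022-01-10 # applied online\n"
    "B,2022-02-03 # referred by a colleague\n"
)

# Scenario 1: drop everything from '#' to the end of each line while loading.
clean_data = pd.read_csv(io.StringIO(raw_csv), comment="#")
# Remove the whitespace left in front of the dropped comment.
clean_data["application_date"] = clean_data["application_date"].str.strip()

# Scenario 2: load the raw text, then split the comment into its own column.
with_comments = pd.read_csv(io.StringIO(raw_csv))
parts = with_comments["application_date"].str.split("#", n=1, expand=True)
with_comments["application_date"] = parts[0].str.strip()
with_comments["comment"] = parts[1].str.strip()

print(clean_data)
print(with_comments)
```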
7. Print a Pandas dataframe in tabular format from the console
Applying the `print()` function to a pandas data frame does not always render an output that is easy to read, especially for data frames with many columns.
If you want to get a nice console-friendly tabular output, use the `.to_string()` function as illustrated below.
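A minimal sketch with invented data. Passing `index=False` is an optional choice made here to hide the row labels.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "name": ["A", "B"],
    "degree": ["Master", "Bachelor"],
    "years_experience": [3, 5],
})

# to_string() returns the whole table as one string, without truncating columns.
table = df.to_string(index=False)
print(table)
```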
8. Highlight data points in Pandas
Applying colors to a pandas data frame can be a good way to emphasize certain data points for quick analysis.
This is where the `df.style` accessor of pandas comes in handy. It has many features, including but not limited to the following:
- `df.style.highlight_max()` to assign a color to the maximum value of each column.
- `df.style.highlight_min()` to assign a color to the minimum value of each column.
- `df.style.apply(my_custom_function)` to apply your custom function to your data frame.
9. Reduce decimal points in your data
Sometimes, very long decimal values in your data set do not provide significant information and can be painful to look at.
So, you might want to round your data to about 2 or 3 decimal places to facilitate your analysis.
This is something you can perform using the `pandas.DataFrame.round()` function as illustrated below.
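A short sketch with made-up measurements. `round()` accepts either a single precision or a dictionary of per-column precisions.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "height": [1.734567, 1.812345],
    "weight": [70.45678, 82.98765],
})

# Same precision for every column.
rounded_all = df.round(2)

# Different precision per column.
rounded_mixed = df.round({"height": 2, "weight": 1})

print(rounded_all)
print(rounded_mixed)
```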
10. Replace some values in your data frame
You might want to replace some information in your data frame to keep it as up-to-date as possible.
This can be achieved using the Pandas `DataFrame.replace()` function as illustrated below.
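A minimal sketch with invented values, showing a frame-wide replacement and a per-column replacement.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "degree": ["Bachelor", "Master", "Bachelor"],
    "city": ["NYC", "Paris", "NYC"],
})

# Replace one value wherever it appears in the frame.
updated = df.replace("NYC", "New York")

# Replace values in a specific column with a nested dictionary: {column: {old: new}}.
updated = updated.replace({"degree": {"Bachelor": "BSc", "Master": "MSc"}})

print(updated)
```

`replace()` returns a new data frame, so `df` itself is left unchanged.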
11. Compare two data frames and get their differences
Sometimes, when comparing two pandas data frames, not only do you want to know if they are equivalent, but also where the difference lies if they are not equivalent.
This is where the `.compare()` function comes in handy.
- It generates a data frame showing columns with differences side by side. Its shape is (0, 0) only if the two data frames being compared are the same.
- If you want to show values that are equal, set the `keep_equal` parameter to `True`. Otherwise, they are shown as NaN.
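A small sketch with two invented data frames that differ in two cells. It assumes both frames share the same index and columns, which `compare()` requires.

```python
import pandas as pd

# Two hypothetical versions of the same table.
df1 = pd.DataFrame({"name": ["A", "B", "C"], "score": [10, 20, 30]})
df2 = pd.DataFrame({"name": ["A", "B", "X"], "score": [10, 25, 30]})

# Only rows and columns with a difference are kept; each column is split
# into 'self' (df1) and 'other' (df2). Cells that match are masked as NaN.
diff = df1.compare(df2)

# Same layout, but matching cells keep their value instead of NaN.
diff_equal = df1.compare(df2, keep_equal=True)

# Comparing a frame with itself gives an empty result.
same = df1.compare(df1)

print(diff)
print(diff_equal)
print(same.shape)
```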
12. Get a subset of a very large dataset for quick analysis
Sometimes, we just need a subset of a very large dataset for quick analysis. One of the approaches could be to read the whole data in memory before getting your sample.
This can require a lot of memory depending on how big your data is. Also, it can take significant time to read your data.
You can use the `nrows` parameter in the pandas `read_csv()` function by specifying the number of rows you want.
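A minimal sketch, where an in-memory string of 1,000 rows stands in for a very large CSV file.

```python
import io

import pandas as pd

# Stand-in for a large file: 1,000 data rows built in memory.
big_csv = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

# Only the first 5 data rows are parsed.
sample = pd.read_csv(io.StringIO(big_csv), nrows=5)
print(sample)
```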
13. Transform your data frame from a wide to a long format
Sometimes it can be useful to transform your dataframe from a wide to a long format, which is more flexible for better analysis, especially when dealing with time series data.
What do you mean by wide & long?
- Wide format is when you have a lot of columns.
- Long format, on the other hand, is when you have a lot of rows.
`pandas.melt()` is a perfect candidate for this task.
Below is an illustration.
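As a minimal sketch with invented grades, each subject column of the wide table becomes a set of rows in the long table. The names `subject` and `grade` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical wide table: one column per subject.
wide = pd.DataFrame({
    "student": ["A", "B"],
    "math": [90, 75],
    "physics": [85, 60],
})

# id_vars stay as they are; every other column is unpivoted into rows.
long = pd.melt(wide, id_vars="student", var_name="subject", value_name="grade")
print(long)
```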
14. Reduce the size of your Pandas data frame by ignoring the index
Did you know that you can reduce the size of the file your Pandas data frame is saved to by ignoring the index when saving it?
For example, set `index=False` when saving the file.
Below is an illustration.
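A small sketch that compares the CSV text produced with and without the index. The output is kept in memory here, which is an assumption made to keep the example self-contained; the same option applies when writing to a path.

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# With no path, to_csv() returns the CSV content as a string.
with_index = df.to_csv()                # the index is written as an extra first column
without_index = df.to_csv(index=False)  # the index is left out

print(len(with_index), len(without_index))
```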
15. Parquet instead of CSV
Very often, I don't manually look at the content of a CSV or Excel file that will be used by Pandas for further analysis.
If that's your case, maybe you should not use .CSV anymore and think of a better option.
Especially if you are only concerned about:
- Processing speed
- Speed in saving and loading
- Disk space occupied by the data frame
In that case, the .parquet format is your best option as illustrated below.
16. Transform your data frame into a markdown
It is always better to print your data frame in a way that makes it easier to understand.
One way of doing that is to render it in a markdown format using the `.to_markdown()` function, which relies on the optional tabulate package.
Below is an illustration.
17. Format Date Time column
When loading Pandas dataframes, date columns are represented as `object` by default, which is not the correct date format.
You can specify the target column in the `parse_dates` argument to get the correct column type.
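A minimal sketch, assuming the data is loaded with `read_csv()`. The CSV content and column names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content.
raw_csv = "name,application_date\nA,2022-01-10\nB,2022-02-03\n"

# Without parse_dates, the date column is loaded as plain text.
default = pd.read_csv(io.StringIO(raw_csv))

# With parse_dates, the listed columns are converted to datetime values.
parsed = pd.read_csv(io.StringIO(raw_csv), parse_dates=["application_date"])

print(default["application_date"].dtype)
print(parsed["application_date"].dtype)
```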
1. Create a progress bar with tqdm and rich
Using the progress bar is beneficial when you want to have a visual status of a given task.
#!pip -q install rich
from rich.progress import track
from tqdm import tqdm
import time

# Implement the callback function
def compute_double(x):
    return 2*x

# Create the progress bars
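One way the progress bar could be created is sketched below with tqdm only. The loop body, the `desc` label, and the short sleep are assumptions added to make the bar visible; `track` from rich wraps an iterable in a similar way.

```python
import time

from tqdm import tqdm


def compute_double(x):
    return 2 * x


results = []
# Wrapping the iterable in tqdm() displays a progress bar while looping.
for value in tqdm(range(5), desc="Doubling"):
    results.append(compute_double(value))
    time.sleep(0.01)  # simulate some work so the bar has time to update

print(results)
```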
2. Get day, month, year, day of the week, the month of the year
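A minimal sketch, assuming the dates live in a pandas datetime column, where the `.dt` accessor exposes each part. The column names and dates are chosen for illustration.

```python
import pandas as pd

# Hypothetical dates.
dates = pd.DataFrame({
    "application_date": pd.to_datetime(["2022-12-25", "2023-01-02"]),
})

col = dates["application_date"]
dates["day"] = col.dt.day
dates["month"] = col.dt.month
dates["year"] = col.dt.year
dates["day_of_week"] = col.dt.day_name()
dates["month_name"] = col.dt.month_name()

print(dates)
```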
3. Smallest and largest values of a column
If you want to get the rows with the largest or lowest values for a given column, you can use the following functions:
- `df.nlargest(N, 'Col_Name')`: top N rows based on Col_Name
- `df.nsmallest(N, 'Col_Name')`: N smallest rows based on Col_Name
- Col_Name is the name of the column you are interested in.
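A short sketch with invented salaries, using N = 2.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "salary": [50, 80, 65, 40],
})

top_2 = df.nlargest(2, "salary")      # rows with the two highest salaries
bottom_2 = df.nsmallest(2, "salary")  # rows with the two lowest salaries

print(top_2)
print(bottom_2)
```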
4. Ignore the log output of the pip install command
Sometimes when installing a library from your Jupyter notebook, you might not want to have all the details about the installation process generated by the default `pip install` command.
You can specify the -q or --quiet option to get rid of that information.
Below is an illustration.
5. Run multiple commands in a single notebook cell
The exclamation mark "!" is essential to successfully run a shell command from your Jupyter notebook.
However, this approach can be quite repetitive when dealing with multiple commands or a very long and complicated one.
A better way to tackle this issue is to use the `%%bash` cell magic at the beginning of your notebook cell.
Below is an illustration.
6. Virtual environment.
A Data Science project can involve multiple dependencies, and dealing with all of them can be a bit annoying.
A good practice is to organize your project in a way that it can be easily shared with your team members and reproduced with the least amount of effort.
One way of doing this is to use virtual environments.
Create a virtual environment and install libraries.
- Install the virtual environment module.
`pip install virtualenv`
- Create your environment by giving it a meaningful name.
`virtualenv [your_environment_name]`
- Activate your environment.
`source [your_environment_name]/bin/activate`
- Start installing the dependencies for your project.
`pip install pandas`
...
All this is great, BUT... the virtual environment you just created is local to your machine.
What to do?
You need to permanently save those dependencies in order to share them with others using this command:
`pip freeze > requirements.txt`
This will create a requirements.txt file containing your project dependencies.
Finally, anyone can install the exact same dependencies by running this command:
`pip install -r requirements.txt`
7. Run multiple metrics at once
Scikit learn metrics
8. Chain multiple lists as a single sequence
You can use a single for loop to iterate through multiple lists as a single sequence.
This can be achieved using the `chain()` function from the Python `itertools` module.
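A minimal sketch with two invented lists.

```python
from itertools import chain

# Hypothetical lists.
pandas_tips = ["apply", "query"]
python_tips = ["tqdm", "json"]

merged = []
# chain() yields the items of each list in turn, without building a combined list first.
for tip in chain(pandas_tips, python_tips):
    merged.append(tip.upper())

print(merged)
```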
9. Pretty print of JSON data
Have you ever wanted to print your JSON data in a properly indented format for better visualization?
The `indent` parameter of the `json.dumps()` function can be used to specify the indentation level of your formatted string output.
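A short sketch with an invented dictionary, using four spaces per indentation level.

```python
import json

# Hypothetical data.
candidate = {
    "name": "A",
    "skills": ["Python", "Pandas"],
    "experience": {"years": 3},
}

# indent=4 puts each key on its own line, nested by four spaces per level.
pretty = json.dumps(candidate, indent=4)
print(pretty)
```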
Thank you for reading!
I hope you found this list of Python and Pandas tricks helpful! Keep an eye on this page, because it will be updated with more tricks on a daily basis.
Also, if you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.
Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!