
Pandas and Python Tips and Tricks for Data Science and Data Analysis | by Zoumana Keita | Dec, 2022



Take your efficiency to the next level with these Pandas and Python Tricks!

Photo by Andrew Neel on Unsplash

This blog post gathers all the Pandas and Python tricks & tips I share on a daily basis on my LinkedIn page. I have decided to centralize them in a single post to help you make the most out of your learning process by easily finding what you are looking for.

The content is divided into two main sections:

  • Pandas tricks & tips, which relate only to Pandas.
  • Python tricks & tips, which relate to Python in general.

This section provides a list of all the Pandas tricks.

1. Create a new column from multiple columns in your dataframe.

Performing simple arithmetic tasks such as creating a new column as the sum of two other columns can be straightforward.

🤔 But what if you want to implement a more complex function and use it as the logic behind column creation? Here is where things can get a bit challenging.

Guess what…

✅ `apply` and `lambda` can help you easily apply whatever logic to your columns using the following format:

`df[new_col] = df.apply(lambda row: func(row), axis=1)`

where:

➡ `df` is your dataframe.

➡ `row` will correspond to each row in your data frame.

➡ `func` is the function you want to apply to your data frame.

➡ `axis=1` to apply the function to each row in your data frame.

💡 Below is an illustration.

The `candidate_info` function combines each candidate’s information to create a single description column about that candidate.

Result of Pandas apply and lambda (Image by Author)
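
A minimal sketch of this pattern is given below. The column names (`name`, `degree`, `city`) and the body of `candidate_info` are assumptions made for the example; they are not necessarily the ones used in the original illustration.

```python
import pandas as pd

# Hypothetical candidate data; the column names are assumptions for this sketch.
candidates = pd.DataFrame({
    "name": ["Ada", "Linus"],
    "degree": ["Master", "Bachelor"],
    "city": ["Paris", "Helsinki"],
})

def candidate_info(row):
    # Combine several columns of one row into a single description.
    return f"{row['name']} holds a {row['degree']} degree and lives in {row['city']}."

# axis=1 passes each row (as a Series) to the function.
candidates["description"] = candidates.apply(lambda row: candidate_info(row), axis=1)
print(candidates["description"].tolist())
```

Passing the function directly, as in `candidates.apply(candidate_info, axis=1)`, should give the same result; the lambda is mostly useful when the function needs extra arguments.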

2. Convert numerical data into categorical ones

This process typically occurs in the feature engineering phase. Some of its benefits are:

  • the identification of outliers, invalid, and missing values in the data.
  • reduction of the chance of overfitting by creating more robust models.

➡ Use these two functions from Pandas, depending on your need. Examples are provided in the images below.

1️⃣ `.cut()` to specifically define your bin edges.

Scenario
Categorize candidates by expertise with respect to their years of experience, where:

  • Entry level: 0–1 year
  • Mid-level: 2–3 years
  • Senior level: 4–5 years
Result of the .cut function (Image by Author)
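
The `.cut()` scenario above can be sketched as follows. The `experience` column and its values are assumptions chosen to match the three levels; the bin edges are one possible way to express them.

```python
import pandas as pd

# Hypothetical years of experience for six candidates.
candidates = pd.DataFrame({"experience": [0, 1, 2, 3, 4, 5]})

# Three labels need four bin edges: [0, 1], (1, 3], (3, 5].
candidates["level"] = pd.cut(
    candidates["experience"],
    bins=[0, 1, 3, 5],
    labels=["Entry level", "Mid-level", "Senior level"],
    include_lowest=True,  # keeps the value 0 inside the first bin
)
print(candidates)
```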

2️⃣ `.qcut()` to divide your data into equal-sized bins.
It uses the underlying percentiles of the distribution of the data, rather than the edges of the bins.

Scenario: categorize the commute time of the candidates into good, acceptable, or too long.

Result of the .qcut function (Image by Author)

Keep in mind 💡

  • When using `.cut()`: number of bin edges = number of labels + 1.
  • When using `.qcut()`: number of bins = number of labels.
  • With `.cut()`: set `include_lowest=True`; otherwise, a value equal to the lowest bin edge will be converted to NaN.
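
The `.qcut()` scenario and the last note can be sketched as follows. The commute times are made-up values, and the column names are assumptions for this example.

```python
import pandas as pd

# Hypothetical commute times in minutes.
commute = pd.DataFrame({"commute_minutes": [10, 20, 30, 40, 50, 60]})

# q=3 asks for three quantile-based bins, so three labels are needed.
commute["commute_quality"] = pd.qcut(
    commute["commute_minutes"],
    q=3,
    labels=["good", "acceptable", "too long"],
)
print(commute)

# Without include_lowest=True, a value equal to the lowest edge falls outside
# the first bin (0, 1] and becomes NaN.
no_lowest = pd.cut(pd.Series([0, 1, 2]), bins=[0, 1, 3])
print(no_lowest.isna().tolist())
```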

3. Select rows from a Pandas Dataframe based on column(s) values

➡ Use the `.query()` function by specifying the filter condition.

➡ The filter expression can contain any operators (<, >, ==, !=, etc.).

➡ Use the @ sign to use a variable in the expression.

Select rows from a Pandas Dataframe based on column(s) values (Image by Author)
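
Here is a small sketch of `.query()` with a comparison, an equality test, and a Python variable referenced through `@`. The data and the variable name `min_experience` are assumptions for the example.

```python
import pandas as pd

# Hypothetical candidate data.
candidates = pd.DataFrame({
    "name": ["Ada", "Linus", "Grace"],
    "experience": [5, 2, 4],
    "city": ["Paris", "Helsinki", "Paris"],
})

min_experience = 3  # local variable, referenced with @ inside the expression

seniors_in_paris = candidates.query(
    "experience >= @min_experience and city == 'Paris'"
)
print(seniors_in_paris)
```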

4. Deal with zip files

Sometimes it can be efficient to read and write .zip files without extracting them to your local disk. Below is an illustration.
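
The original illustration is not available here, so the following is only a sketch of one way to do it with pandas: `to_csv` can write a zipped CSV, and `read_csv` can read it back directly. The file name and the temporary directory are assumptions for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Linus"], "score": [91, 85]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "candidates.zip")

    # Write the CSV straight into a zip archive.
    df.to_csv(zip_path, index=False, compression="zip")

    # Read it back without extracting it first; the compression is
    # inferred from the .zip extension (the archive holds a single file).
    restored = pd.read_csv(zip_path)

print(restored)
```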

5. Select a subset of your Pandas dataframe with specific column types

You can use the `select_dtypes` function. It takes two main parameters: `include` and `exclude`.

  • `df.select_dtypes(include=['type_1', 'type_2', …, 'type_n'])` means I want the subset of my data frame WITH columns of type_1, type_2, …, type_n.
  • `df.select_dtypes(exclude=['type_1', 'type_2', …, 'type_n'])` means I want the subset of my data frame WITHOUT columns of type_1, type_2, …, type_n.

✨ Below is an illustration

Columns subset selection (Image by Author)
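
As a minimal sketch, the two parameters can be used like this. The data frame and its columns are assumptions for the example, and `"number"` is used as the type so that all numeric columns are selected at once.

```python
import pandas as pd

# Hypothetical data frame mixing text, numeric, and boolean columns.
df = pd.DataFrame({
    "name": ["Ada", "Linus"],
    "experience": [5, 2],
    "salary": [72000.0, 55000.5],
    "remote": [True, False],
})

# Keep only the numeric columns (integers and floats).
numeric = df.select_dtypes(include=["number"])

# Keep everything except the numeric columns.
non_numeric = df.select_dtypes(exclude=["number"])

print(list(numeric.columns))
print(list(non_numeric.columns))
```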

6. Remove comments from Pandas dataframe column

Imagine that I want to clean this data (candidates.csv) by removing comments from the application date column. This can be done on the fly while loading your pandas dataframe using the `comment` parameter as follows:

➡ `clean_data = pd.read_csv(path_to_data, comment='symbol')`

In my case, `comment='#'`, but it could be any other single character (|, /, etc.) depending on your case. This is illustrated in the first scenario.

✋🏽 Wait, what if I want to create a new column for those comments and still remove them from the application date column? This is illustrated in the second scenario.

Remove comments from pandas dataframe (Image by Author)
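
Both scenarios can be sketched as below. The CSV content is made up for the example, and the second scenario uses a string split, which is only one possible approach and may differ from the original illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical content of candidates.csv, with comments after '#'.
raw_csv = (
    "name,application_date\n"
    "Ada,2022-10-01 # applied online\n"
    "Linus,2022-10-03 # referred by a colleague\n"
)

# First scenario: drop the comments while loading.
clean_data = pd.read_csv(StringIO(raw_csv), comment="#")
clean_data["application_date"] = clean_data["application_date"].str.strip()

# Second scenario: load everything, then split the comments into their own column.
with_comments = pd.read_csv(StringIO(raw_csv))
parts = with_comments["application_date"].str.split("#", n=1, expand=True)
with_comments["application_date"] = parts[0].str.strip()
with_comments["comment"] = parts[1].str.strip()

print(clean_data)
print(with_comments)
```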

7. Print Pandas dataframe in tabular format from the console

❌ No, applying the `print()` function to a pandas data frame does not always render an output that is easy to read, especially for data frames with multiple columns.

✅ If you want to get a nice console-friendly tabular output, use the `.to_string()` function as illustrated below.
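
A minimal sketch, assuming a small made-up data frame; `index=False` is optional and only hides the row index.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Linus"], "experience": [5, 2]})

# Render the whole data frame as a plain-text table.
table = df.to_string(index=False)
print(table)
```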

8. Highlight data points in Pandas

Applying colors to a pandas data frame can be a good way to emphasize certain data points for quick analysis.

✅ This is where the pandas `Styler`, available through `df.style`, comes in handy. It has many features, including but not limited to the following:

✨ `df.style.highlight_max()` to assign a color to the maximum value of each column.

✨ `df.style.highlight_min()` to assign a color to the minimum value of each column.

✨ `df.style.apply(my_custom_function)` to apply your custom function to your data frame.

Highlight data points in Pandas (Image by Author)

9. Reduce decimal points in your data

Sometimes, very long decimal values in your data set do not provide significant information and can be painful 🤯 to look at.

So, you might want to round your data to about 2 to 3 decimal places to facilitate your analysis.

✅ This is something you can perform using the `pandas.DataFrame.round()` function as illustrated below.

Reduce decimal points in your data (Image by Author)
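
A short sketch with made-up values: `round()` accepts either a single number of decimals or a dictionary giving the number of decimals per column.

```python
import pandas as pd

# Hypothetical metrics with long decimal values.
df = pd.DataFrame({
    "accuracy": [0.912345, 0.87654],
    "loss": [0.123456, 0.234567],
})

# Same number of decimals for every column.
rounded = df.round(2)

# Different number of decimals per column.
rounded_per_column = df.round({"accuracy": 1, "loss": 3})

print(rounded)
print(rounded_per_column)
```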

10. Replace some values in your data frame

You might want to replace some information in your data frame to keep it as up-to-date as possible.

✅ This can be achieved using the Pandas `DataFrame.replace()` function as illustrated below.

Replace some values in your data frame (Image by Author)
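
As a sketch, a dictionary of old-to-new values can be passed to `replace()`. The data and the replaced values are assumptions for the example.

```python
import pandas as pd

# Hypothetical candidate data.
df = pd.DataFrame({
    "name": ["Ada", "Linus"],
    "title": ["Data Analyst", "ML Engineer"],
    "city": ["Paris", "Helsinki"],
})

# Each key is replaced by its value wherever it appears in the data frame.
updated = df.replace({"Data Analyst": "Data Scientist", "Helsinki": "Oslo"})
print(updated)
```

`replace()` returns a new data frame here, so `df` itself is left unchanged.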

11. Compare two data frames and get their differences

Sometimes, when comparing two pandas data frames, not only do you want to know if they are equivalent, but also where the difference lies if they are not equivalent.

✅ This is where the `.compare()` function comes in handy.

✨ It generates a data frame showing columns with differences side by side. Its shape is (0, 0) only if the two data frames being compared are the same.

✨ If you want to show values that are equal, set the `keep_equal` parameter to `True`. Otherwise, they are shown as `NaN`.

Compare two data frames and get their differences (Image by Author)
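
A minimal sketch with two made-up data frames that differ in a single cell. Note that `compare()` expects both data frames to have the same row and column labels.

```python
import pandas as pd

first = pd.DataFrame({"name": ["Ada", "Linus"], "score": [91, 85]})
second = pd.DataFrame({"name": ["Ada", "Linus"], "score": [91, 88]})

# Only the differing rows and columns are kept; 'self' is first, 'other' is second.
diff = first.compare(second)
print(diff)

# Comparing a data frame with an identical one yields an empty result.
same = first.compare(first.copy())
print(same.shape)
```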

12. Get a subset of a very large dataset for quick analysis

Sometimes, we just need a subset of a very large dataset for quick analysis. One of the approaches could be to read the whole data in memory before getting your sample.

This can require a lot of memory depending on how big your data is. Also, it can take significant time to read your data.

✅ You can use the `nrows` parameter of the pandas `read_csv()` function to specify the number of rows you want.

Get a subset of a very large dataset for quick analysis (Image by Author)
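
The idea can be sketched as follows. An in-memory CSV of 1,000 rows stands in for the very large file, which is an assumption made to keep the example self-contained.

```python
from io import StringIO

import pandas as pd

# Stand-in for a very large CSV file: one header line plus 1,000 data rows.
big_csv = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

# Only the first three data rows are parsed.
sample = pd.read_csv(StringIO(big_csv), nrows=3)
print(sample)
```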

13. Transform your data frame from a wide to a long format

Sometimes it can be useful to transform your dataframe from a wide to a long format, which is more flexible for analysis, especially when dealing with time series data.

  • What do you mean by wide & long?

✨ Wide format is when you have a lot of columns.
✨ Long format, on the other hand, is when you have a lot of rows.

✅ `pandas.melt()` is a perfect candidate for this task.

Below is an illustration.

Transform your data frame from a wide to a long format (Image by Author)
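
A small sketch, assuming a made-up wide data frame with one score column per year. The names passed to `var_name` and `value_name` are choices made for this example.

```python
import pandas as pd

# Wide format: one column per year.
wide = pd.DataFrame({
    "candidate": ["Ada", "Linus"],
    "2021": [70, 65],
    "2022": [80, 75],
})

# Long format: one row per (candidate, year) pair.
long = pd.melt(wide, id_vars="candidate", var_name="year", value_name="score")
print(long)
```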

14. Reduce the size of your Pandas data frame by ignoring the index

Do you know that you can reduce the size of your saved Pandas data frame by ignoring the index when saving it?

✅ Something like `index=False` when saving the file.

Below is an illustration.

Reduce the size of your Pandas data frame by ignoring the index (Image by Author)
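
The effect can be checked with a sketch like the one below, which saves the same made-up data frame twice and compares the file sizes. The file names and the temporary directory are assumptions for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"value": range(1000)})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    df.to_csv(with_index_path)                   # index written as an extra column
    df.to_csv(without_index_path, index=False)   # index left out

    size_with_index = os.path.getsize(with_index_path)
    size_without_index = os.path.getsize(without_index_path)

print(size_with_index, size_without_index)
```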

15. Parquet instead of CSV

Very often, I don’t manually look 👀 at the content of a CSV or Excel file that will be used by Pandas for further analysis.

If that’s your case, maybe you should not use .CSV anymore and think of a better option.

Especially if you are only concerned about

✨ Processing speed

✨ Speed in saving and loading

✨ Disk space occupied by the data frame

✅ In that case, the `.parquet` format is often a better option, as illustrated below.

Parquet instead of CSV (Image by Author)

16. Transform your data frame into a markdown

It is always better to print your data frame in a way that makes it easier to understand.

✅ One way of doing that is to render it in a markdown format using the `.to_markdown()` function.

💡 Below is an illustration

17. Format Date Time column

When loading Pandas dataframes from a file such as a CSV, date columns are represented as `object` by default, which is not ❌ the correct date format.

✅ You can specify the target column in the `parse_dates` argument to get the correct column type.

DateTime Formatting
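
A minimal sketch of the difference, using a made-up in-memory CSV. The exact name of the default text dtype can vary between pandas versions, so the example only checks whether the column is a datetime column.

```python
from io import StringIO

import pandas as pd

raw_csv = "name,application_date\nAda,2022-10-01\nLinus,2022-10-03\n"

# Default loading: the date column is read as text.
default_df = pd.read_csv(StringIO(raw_csv))

# With parse_dates, the column is converted to a datetime type.
parsed_df = pd.read_csv(StringIO(raw_csv), parse_dates=["application_date"])

print(pd.api.types.is_datetime64_any_dtype(default_df["application_date"]))
print(pd.api.types.is_datetime64_any_dtype(parsed_df["application_date"]))
```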

Python tricks & tips

1. Create a progress bar with tqdm and rich

Using the progress bar is beneficial when you want to have a visual status of a given task.

#!pip -q install rich
from rich.progress import track
from tqdm import tqdm
import time

Implement the callback function

def compute_double(x):
    return 2*x

Create the progress bars

rich progress bar implementation
tqdm progress bar implementation

2. Get day, month, year, day of the week, the month of the year

Get day, month, year, day of the week, the month of the year (Image by author)
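
The original illustration is an image, so the following is only a sketch of one common way to extract these parts in pandas, through the `.dt` accessor of a datetime column. The column names and dates are assumptions for the example.

```python
import pandas as pd

# Hypothetical datetime column.
dates = pd.DataFrame({
    "application_date": pd.to_datetime(["2022-12-25", "2023-01-02"]),
})

accessor = dates["application_date"].dt
dates["day"] = accessor.day
dates["month"] = accessor.month
dates["year"] = accessor.year
dates["day_of_week"] = accessor.day_name()   # e.g. "Sunday"
dates["month_name"] = accessor.month_name()  # e.g. "December"

print(dates)
```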

3. Smallest and largest values of a column

If you want to get the rows with the largest or smallest values for a given column, you can use the following functions:

✨ `df.nlargest(N, "Col_Name")` → top N rows based on Col_Name

✨ `df.nsmallest(N, "Col_Name")` → N smallest rows based on Col_Name

✨ Col_Name is the name of the column you are interested in.

Smallest and largest values illustration (Image by Author)
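
A short sketch with made-up scores; here N is 2 and the column of interest is `score`.

```python
import pandas as pd

# Hypothetical candidate scores.
df = pd.DataFrame({
    "name": ["Ada", "Linus", "Grace", "Alan"],
    "score": [91, 85, 78, 96],
})

top_two = df.nlargest(2, "score")      # rows with the two highest scores
bottom_two = df.nsmallest(2, "score")  # rows with the two lowest scores

print(top_two)
print(bottom_two)
```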

4. Ignore the log output of the pip install command

Sometimes when installing a library from your Jupyter notebook, you might not want to have all the details about the installation process generated by the default `pip install` command.

✅ You can specify the `-q` or `--quiet` option to get rid of that information.

Below is an illustration 💡

pip install illustration (Animation by Author)

5. Run multiple commands in a single notebook cell

The exclamation mark ‘!’ is essential to successfully run a shell command from your Jupyter notebook.

However, this approach can be quite repetitive 🔂 when dealing with multiple commands or a very long and complicated one.

✅ A better way to tackle this issue is to use the `%%bash` cell magic at the beginning of your notebook cell.

💡 Below is an illustration

Illustration of %%bash statement (Animation by Author)

6. Virtual environment.

A Data Science project can involve multiple dependencies, and dealing with all of them can be a bit annoying. 🤯

✨ A good practice is to organize your project in a way that it can be easily shared with your team members and reproduced with the least amount of effort.

✅ One way of doing this is to use virtual environments.

⚙️ Create a virtual environment and install libraries.

→ Install the virtual environment module.
`pip install virtualenv`

→ Create your environment by giving it a meaningful name.
`virtualenv [your_environment_name]`

→ Activate your environment.
`source [your_environment_name]/bin/activate`

→ Start installing the dependencies for your project.
`pip install pandas`
…

All this is great 👍🏼, BUT… the virtual environment you just created is local to your machine.

What to do? 🤷🏻‍♂️

💡 You need to permanently save those dependencies in order to share them with others using this command:

→ `pip freeze > requirements.txt`

This will create a `requirements.txt` file containing your project dependencies.

🔚 Finally, anyone can install the exact same dependencies by running this command:
→ `pip install -r requirements.txt`

7. Run multiple metrics at once

Scikit learn metrics

8. Chain multiple lists as a single sequence

You can use a single for loop to iterate through multiple lists as a single sequence 🔂.

✅ This can be achieved using the `chain()` ⛓ function from the Python `itertools` module.

List chaining
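
A minimal sketch with three made-up lists; `chain()` yields the items of each list in turn without building a combined list first.

```python
from itertools import chain

pandas_tricks = ["apply", "query"]
python_tricks = ["chain", "json.dumps"]
shell_tricks = ["pip -q"]

all_tricks = []
# A single loop walks through the three lists as if they were one sequence.
for trick in chain(pandas_tricks, python_tricks, shell_tricks):
    all_tricks.append(trick)

print(all_tricks)
```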

9. Pretty print of JSON data

❓ Have you ever wanted to print your JSON data in a correctly indented format for better visualization?

✅ The `indent` parameter of the `dumps()` method can be used to specify the indentation level of your formatted string output.

Pretty print your JSON data
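
As a sketch, with a made-up dictionary and an indentation of four spaces:

```python
import json

# Hypothetical JSON-like data.
candidate = {"name": "Ada", "skills": ["Python", "Pandas"], "experience": 5}

# indent=4 puts each key and list item on its own line, indented by four spaces.
pretty = json.dumps(candidate, indent=4)
print(pretty)
```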

Thank you for reading! 🎉 🍾

I hope you found this list of Python and Pandas tricks helpful! Keep an eye on this page, because the content will be updated with more tricks on a daily basis.

Also, if you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!




