
Pandas & Python Tricks for Data Science & Data Analysis — Part 2 | by Zoumana Keita | Jan, 2023



Photo by Andrew Neel on Unsplash

A couple of days ago, I shared some Python and Pandas tricks to help Data Analysts and Data Scientists quickly learn new valuable concepts that they might not be aware of. This is also part of the collection of tricks I share daily on LinkedIn.

Remove duplicates from a list

When trying to remove duplicates from a list, you might attempt to use the 𝗳𝗼𝗿 loop approach.

This works, but a loop that checks whether each element is already in the result list rescans that list every time, so it becomes slow ❌ on very large data.

Instead, use 𝘀𝗲𝘁() ✅, which by design never stores duplicates.

Below is an illustration 💡

Remove duplicates (Image by Author)
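
Since the original image is not reproduced here, the comparison can be sketched roughly as follows. The list `numbers` and the variable names are illustrative assumptions, not the author's exact example.

```python
numbers = [3, 1, 3, 2, 1, 5, 2]  # hypothetical sample data

# Loop approach: every `not in` check scans the result list again
unique_loop = []
for n in numbers:
    if n not in unique_loop:
        unique_loop.append(n)

# set() approach: duplicates are dropped while the set is built
unique_set = list(set(numbers))

print(unique_loop)         # [3, 1, 2, 5]
print(sorted(unique_set))  # [1, 2, 3, 5]
```

The set result is sorted before printing only to make the output predictable, because a set does not promise any particular order.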

Preserve the order of the original list

Using 𝘀𝗲𝘁() to remove duplicates from lists is a great approach.

🚨 But be careful: a set is unordered, so it is NOT ❌ guaranteed to preserve the original order. Only use it when you don’t care about the order of the elements in your list.

Instead, use 𝗱𝗶𝗰𝘁.𝗳𝗿𝗼𝗺𝗸𝗲𝘆𝘀() ✅ to preserve the original order.

Below is an illustration 💡

Original order kept by dict.fromkeys (Image by Author)
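
A minimal sketch of the order-preserving version, assuming a made-up list called `cities`. It relies on the fact that dictionaries keep insertion order in Python 3.7 and later.

```python
cities = ["Paris", "Dakar", "Paris", "Tokyo", "Dakar"]  # hypothetical sample data

# dict.fromkeys() keeps the first occurrence of each key, in insertion order
ordered_unique = list(dict.fromkeys(cities))

print(ordered_unique)  # ['Paris', 'Dakar', 'Tokyo']
```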

Check if an element exists in a list


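When trying to 𝗰𝗵𝗲𝗰𝗸 𝗶𝗳 𝗮𝗻 𝗶𝘁𝗲𝗺 𝗲𝘅𝗶𝘀𝘁𝘀 𝗶𝗻 𝗮 𝗹𝗶𝘀𝘁, you might attempt to use a 𝗳𝗼𝗿 loop with an 𝗶𝗳 condition.

This works but is verbose and slower ❌ when dealing with very large data.

Instead, use the 𝗶𝗻 ✅ operator, which returns a boolean directly. On a list it still scans the elements one by one, but the scan runs inside the interpreter rather than in your own Python loop. If you need many lookups, convert the list to a set first.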

Below is an illustration 💡

Check if an element exists in a list (Image by Author)
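
The two approaches might look like the sketch below. The helper `contains_with_loop` and the list `languages` are hypothetical names used only for illustration.

```python
def contains_with_loop(items, target):
    """Manual membership check with a for loop and an if condition."""
    for item in items:
        if item == target:
            return True
    return False


languages = ["Python", "R", "Julia", "Scala"]  # hypothetical sample data

print(contains_with_loop(languages, "Julia"))  # True
print("Julia" in languages)                    # True
print("Cobol" in languages)                    # False

# For many repeated lookups, a set avoids rescanning the whole list each time
language_set = set(languages)
print("R" in language_set)                     # True
```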

Get the N largest and smallest values in a Python list

The maximum and minimum values of a list in Python can be found using the 𝗺𝗮𝘅() and 𝗺𝗶𝗻() functions respectively.

However, when it comes to getting the 𝗡 𝗹𝗮𝗿𝗴𝗲𝘀𝘁 or 𝘀𝗺𝗮𝗹𝗹𝗲𝘀𝘁 values of a list, you might think of a two-step approach:

1️⃣ Sort the list in decreasing or increasing order.

2️⃣ Retrieve the N largest or smallest values.

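Good strategy, BUT not efficient ❌ when dealing with large data, because the whole list is sorted even though only N values are needed.

✅ Instead, you can use the 𝗻𝗹𝗮𝗿𝗴𝗲𝘀𝘁 and 𝗻𝘀𝗺𝗮𝗹𝗹𝗲𝘀𝘁 functions from the standard-library module 𝗵𝗲𝗮𝗽𝗾, which are fast 🚀 and memory efficient 👍 when N is small compared to the size of the list.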

𝗻𝗹𝗮𝗿𝗴𝗲𝘀𝘁 and 𝗻𝘀𝗺𝗮𝗹𝗹𝗲𝘀𝘁 functions illustration (Image by Author)
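
As a rough sketch of both approaches, assuming a made-up list `scores` and N = 3:

```python
import heapq

scores = [42, 7, 19, 88, 3, 64, 51]  # hypothetical sample data
n = 3

# Two-step approach: sort the whole list, then slice
top_sorted = sorted(scores, reverse=True)[:n]
bottom_sorted = sorted(scores)[:n]

# heapq approach: only the N best candidates are tracked while scanning
top_heap = heapq.nlargest(n, scores)
bottom_heap = heapq.nsmallest(n, scores)

print(top_heap)     # [88, 64, 51]
print(bottom_heap)  # [3, 7, 19]
```

Both approaches return the same values; the difference only starts to matter on large lists.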

Display multiple dataframes using the same cell

Most of the time, we tend to use different notebook cells to display different dataframes such as the head() and tail() of the same data.

This is because when using them in the same cell, only the result of the last expression is displayed. The earlier instructions still run, but their output is not shown ❌

✅ To solve this issue, you can use the 𝗱𝗶𝘀𝗽𝗹𝗮𝘆() function.

Below is an illustration 💡

Multiple dataframes from the same notebook cell (Image by Author)
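
A minimal sketch of the idea, assuming a small made-up dataframe `df`. In a Jupyter notebook 𝗱𝗶𝘀𝗽𝗹𝗮𝘆() is already available; the import and the fallback to print are only there so the snippet also runs as a plain script.

```python
import pandas as pd

try:
    from IPython.display import display  # rich rendering inside notebooks
except ImportError:
    display = print  # assumption: plain-text fallback when IPython is missing

# Hypothetical sample data
df = pd.DataFrame({"id": range(1, 11), "score": [x * 10 for x in range(1, 11)]})

# Both dataframes are shown, even though they are in the same cell
display(df.head(3))
display(df.tail(3))
```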

Describe both numerical & categorical columns

Calling the 𝗱𝗲𝘀𝗰𝗿𝗶𝗯𝗲() method without any parameter returns, by default, statistics for numerical columns only.

This restricts 🚫 our understanding of the data set since most of the time we deal with categorical columns as well.

✅ To solve this issue, you can proceed with a two-step approach:

1️⃣ Use 𝗱𝗲𝘀𝗰𝗿𝗶𝗯𝗲() for numerical columns.

2️⃣ Set the parameter 𝗶𝗻𝗰𝗹𝘂𝗱𝗲=[𝗼𝗯𝗷𝗲𝗰𝘁] to get the count, number of unique values, most frequent value, and its frequency for categorical ones.

Below is an illustration 💡

describe including categorical columns as well (Image by Author)
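
The two steps above can be sketched as follows. The dataframe and its column names are made up for illustration, and the `city` column is given the object dtype explicitly so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "salary": [40000, 52000, 61000, 58000],
    "city": pd.Series(["Paris", "Dakar", "Paris", "Tokyo"], dtype="object"),
})

# Step 1: numerical columns only (the default behaviour)
numeric_stats = df.describe()

# Step 2: categorical (object) columns only
categorical_stats = df.describe(include=["object"])

# Alternative: every column in one table
all_stats = df.describe(include="all")

print(numeric_stats)
print(categorical_stats)
```

Passing include="all" is a convenient shortcut, at the cost of NaN cells where a statistic does not apply to a column type.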

Avoid for loops when creating new columns

When working with Pandas dataframes, creating new columns from existing ones is a common part of the process.

The way these columns are created can significantly affect the overall computation time ⏰.

Some may use loops to generate those derived columns.

However, this might not be the right approach ❌ because every row is processed by the Python interpreter one at a time, and that overhead grows 📈 with the size of the data.

✅ Adopting the vectorization approach is much better.

illustration of vectorization vs. for loop
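
A rough sketch of the difference, assuming a made-up dataframe with `price` and `quantity` columns. The loop version uses iterrows() as one common way to loop over rows; the original illustration may have used a different loop.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [2, 5, 1]})

# Loop approach: one Python-level multiplication per row
totals = []
for _, row in df.iterrows():
    totals.append(row["price"] * row["quantity"])
df["total_loop"] = totals

# Vectorized approach: one operation applied to whole columns at once
df["total_vec"] = df["price"] * df["quantity"]

print(df)
```

On three rows the two versions are indistinguishable; the gap typically appears once the dataframe has many thousands of rows.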

Save a subset of Pandas columns

Sometimes we are interested in saving only a subset of columns from the original data frame rather than the whole data.

One way of doing that is to create a new data frame with the columns of interest.

But, this approach adds another layer of complexity ❌.

✅ This issue can be solved by specifying the columns argument at the moment the data is saved, so no intermediate data frame is needed.

Get a subset of Pandas columns (Image by Author)
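
The original image is not reproduced here, so the sketch below assumes the save is done with to_csv(), whose `columns` parameter selects the columns to write. The dataframe is made up, and an in-memory buffer stands in for a file path.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "name": ["Ada", "Linus"],
    "score": [90, 85],
    "notes": ["internal", "internal"],
})

# Extra-step approach: build a second dataframe, then save it
subset = df[["name", "score"]]

# Direct approach: keep df untouched and pick the columns while saving
buffer = StringIO()  # stand-in for a path such as "subset.csv"
df.to_csv(buffer, columns=["name", "score"], index=False)

print(buffer.getvalue())
```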

Convert tabular data from a webpage into a Pandas Dataframe

If you want to extract tables from a webpage 🌐 as Pandas Dataframes, you can use the 𝗿𝗲𝗮𝗱_𝗵𝘁𝗺𝗹() function of Pandas.

✅ It returns a list of Dataframes, one for each table found on the webpage.

Convert the webpage table into Pandas Dataframe (Image by Author)
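
A minimal sketch of the call. To keep it runnable offline, it parses a small made-up HTML string instead of a real URL; with a live page you would pass the URL instead. Note that read_html() needs an HTML parser such as lxml, or BeautifulSoup with html5lib, to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content; a real call would pass a webpage URL instead
html = """
<table>
  <thead>
    <tr><th>country</th><th>capital</th></tr>
  </thead>
  <tbody>
    <tr><td>France</td><td>Paris</td></tr>
    <tr><td>Senegal</td><td>Dakar</td></tr>
  </tbody>
</table>
"""

# read_html() returns a list with one dataframe per table it finds
tables = pd.read_html(StringIO(html))
first_table = tables[0]

print(len(tables))
print(first_table)
```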

Thank you for reading! 🎉 🍾

I hope you found this list of Python and Pandas tricks helpful! Keep an eye on this page, because it will be updated with more tricks on a daily basis.

Also, if you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $5-a-month commitment, you unlock unlimited access to stories on Medium.

Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!




