
5 Signs You’ve Become an Advanced Pandas User Without Even Realizing It



3. Friends with Pandas

If there is one thing that makes Pandas the king of data analysis libraries, it’s got to be its integration with the rest of the data ecosystem.

For example, by now you must have realized that you can switch the plotting backend of Pandas from Matplotlib to Plotly, HVPlot, HoloViews, Bokeh, or Altair.

Yes, Matplotlib is best friends with Pandas, but every once in a while you fancy something interactive like Plotly or Altair.

import pandas as pd
import plotly.express as px

# Set the default plotting backend to Plotly
pd.options.plotting.backend = 'plotly'
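As a quick illustration of what that buys you (a minimal sketch with made-up data, assuming Plotly is installed), any .plot call now returns an interactive Plotly figure instead of a Matplotlib Axes:

import numpy as np
import pandas as pd

pd.options.plotting.backend = "plotly"

# With the backend switched, .plot() returns a plotly.graph_objects.Figure
df = pd.DataFrame({"day": np.arange(100), "price": np.random.randn(100).cumsum()})
fig = df.plot(x="day", y="price")
fig.show()  # opens an interactive, zoomable chart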

Speaking of backends, you’ve also noticed that Pandas added a fully supported PyArrow engine to its read_* functions for loading data files in the brand-new 2.0.0 release.

import pandas as pd

pd.read_csv(file_name, engine='pyarrow')

When Pandas was NumPy-only, there were many limitations: little support for non-numeric data types, near-total disregard for missing values, and limited support for complex data types such as dates, timestamps, and categoricals.
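To make the difference concrete, here is a minimal sketch (made-up values) of the classic NumPy-backend quirk: a single missing value silently turns an integer column into floats, whereas a PyArrow-backed dtype keeps the integers and stores a proper null:

import pandas as pd

# NumPy backend: the missing value becomes NaN and the column is upcast to float64
s_numpy = pd.Series([1, 2, None])
print(s_numpy.dtype)   # float64

# PyArrow-backed dtype: integers stay integers, the gap is a real null
s_arrow = pd.Series([1, 2, None], dtype="int64[pyarrow]")
print(s_arrow.dtype)   # int64[pyarrow]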

Before 2.0.0, Pandas had been cooking up in-house solutions to these problems, but they were not as good as some heavy users had hoped. With the PyArrow backend, loading data is considerably faster, and it brings a suite of data types that Apache Arrow users are already familiar with:

import pandas as pd

pd.read_csv(file_name, engine='pyarrow', dtype_backend='pyarrow')

Another cool feature of Pandas I am sure you use all the time in JupyterLab is styling DataFrames.

Since Project Jupyter is so awesome, the Pandas developers added a bit of HTML/CSS magic under the .style attribute so you can spice up plain old DataFrames in a way that reveals additional insights:

df.sample(20, axis=1).describe().T.style.bar(
    subset=["mean"], color="#205ff2"
).background_gradient(
    subset=["std"], cmap="Reds"
).background_gradient(
    subset=["50%"], cmap="coolwarm"
)
Image by author.
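A couple of other Styler methods worth knowing (a small sketch assuming a numeric DataFrame; the file name is just an example): you can format numbers for display, highlight extremes, and export the styled table as HTML:

styled = (
    df.describe().T.style
    .format("{:.2f}")                                 # show every cell with two decimals
    .highlight_max(subset=["max"], color="#fbe9a7")   # flag the largest maximum
)

# Styler objects render as HTML in Jupyter and can be saved as-is
styled.to_html("summary.html")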

4. The data sculptor

Since Pandas is a data analysis and manipulation library, the truest sign that you are a pro is how flexibly you can shape and transform datasets to suit your purposes.

While most online courses provide ready-made, cleaned data in columnar format, datasets in the wild come in many shapes and forms. For example, one of the most annoying data formats is row-based (very common with financial data):

import pandas as pd

# create example DataFrame
df = pd.DataFrame(
    {
        "Date": [
            "2022-01-01",
            "2022-01-02",
            "2022-01-01",
            "2022-01-02",
        ],
        "Country": ["USA", "USA", "Canada", "Canada"],
        "Value": [10, 15, 5, 8],
    }
)

df

Image by author

You must be able to convert this row-based format into a more useful shape, as in the example below, using the pivot function:

pivot_df = df.pivot(
    index="Date",
    columns="Country",
    values="Value",
)

pivot_df


You may also have to perform the opposite of this operation, called a melt.

Here is an example with the melt function of Pandas, which turns columnar data into a row-based format:

df = pd.DataFrame(
    {
        "Date": ["2022-01-01", "2022-01-02", "2022-01-03"],
        "AAPL": [100.0, 101.0, 99.0],
        "GOOG": [200.0, 205.0, 195.0],
        "MSFT": [50.0, 52.0, 48.0],
    }
)

df

Image by author

melted_df = pd.melt(
    df, id_vars=["Date"], var_name="Stock", value_name="Price"
)

melted_df

Image by author

Such functions can be quite challenging to understand and even harder to apply.

There are other, similar functions, like pivot_table, which builds a pivot table that can apply different types of aggregations to each cell, as sketched below.
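Here is a minimal pivot_table sketch reusing the same kind of country/value data as above (hypothetical numbers); unlike pivot, it tolerates duplicate index/column pairs because it aggregates them:

import pandas as pd

sales = pd.DataFrame(
    {
        "Date": ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02"],
        "Country": ["USA", "USA", "Canada", "Canada"],
        "Value": [10, 4, 5, 8],
    }
)

# pivot would raise on the duplicated (Date, Country) pair; pivot_table aggregates it
table = pd.pivot_table(
    sales, index="Date", columns="Country", values="Value", aggfunc="sum"
)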

Another function is stack/unstack, which can collapse/explode DataFrame indices. crosstab computes a cross-tabulation of two or more factors, and by default, computes a frequency table of the factors but can also compute other summary statistics.
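Continuing with the hypothetical sales DataFrame and table from the pivot_table sketch above, stack, unstack, and crosstab look like this:

# stack moves the column labels into the innermost row index level;
# unstack is the inverse operation
stacked = table.stack()         # long Series indexed by (Date, Country)
wide_again = stacked.unstack()  # back to the Date x Country table

# crosstab builds a frequency table straight from raw factors
freq = pd.crosstab(sales["Country"], sales["Date"])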

Then there’s groupby. Even though the basics of this function are simple, its more advanced use-cases are very hard to master. If the contents of the Pandas groupby function were made into a separate library, it would be larger than most in the Python ecosystem.

# Group on a date column at a monthly frequency
# and find the total revenue per category

grouped = df.groupby(['category', pd.Grouper(key='date', freq='M')])
monthly_revenue = grouped['revenue'].sum()
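And once you move past plain sums, named aggregation is where groupby starts to shine. A sketch, assuming the same hypothetical df with category, date, and revenue columns:

# Several summary statistics per category in one pass, with readable output columns
summary = df.groupby("category").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    n_rows=("revenue", "size"),
)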

Skillfully choosing the right function for a particular situation is a sign that you are a true data sculptor.

Read parts two and three to learn the ins and outs of the functions mentioned in this section.

