
Grab and Use These Four Useful Seaborn Visualization Templates | by Mintao Wei | Jun, 2022



Introducing four types of plotting functions and relevant tricks for exploratory data analysis based on Seaborn

  1. Introduction
  2. Viz 1: Double-axis Time Series Plot with Auxiliary Line/Band
  3. Viz 2: Scatter Plot with Fitted Trendline
  4. Viz 3: Distribution Plot with KDE Line (Kernel Density Estimation)
  5. Viz 4: Categorical Bar Plot Series
  6. Summary

Initiatives

“Matplotlib and seaborn are ugly, I only use ggplot2 in R”;
“The seaborn API is a pain and very rigid to work with”;
“The default plots for seaborn and matplotlib are so poor and I have to search for the right parameters every time”;
“What are some other plotting libraries that work well with Jupyter Notebook?”

These are comments that I have heard from fellow computational social scientists and data scientist friends, and I am sure every data person can more or less relate to them. Admittedly, matplotlib and seaborn are not perfect, but they have a major advantage: they are easier to use than more complex visualization toolkits such as Plotly/Dash, and they are based on Python. That makes them hard to replace for exploratory data analysis (EDA), and we should embrace them.

To make life easier for EDA, I would like to share some seaborn visualization templates and relevant tricks I personally use, so that you can spend more time on the analysis. This article focuses on resolving two pain points: (1) aesthetics; (2) functionality.

How can these functions help you?

You are more than welcome to grab and use the visualization templates by simply changing the input to your datasets. You could also modify the function body to serve your own purpose. To help with that, I summarized a few key tricks (i.e. seaborn parameters/methods) that I personally use a lot in my own work.

Datasets

I use the taxis and tips datasets from seaborn, together with publicly available weather data from the National Oceanic and Atmospheric Administration (NOAA), as illustrations for the following templates. Please feel free to replicate the visualizations using the embedded code chunks.

The taxis and tips datasets are licensed as open source through seaborn. They can be freely accessed in Python via the seaborn.load_dataset method, and their permissive license allows commercial use. The use of the NOAA weather data is governed by World Meteorological Organization Resolution 40. More details can be found in the bottom section as well as in the references of this article.

Viz 1: Double-axis Time Series Plot with Auxiliary Line/Band

The double-axis time series plot is a useful way to see how the trend of a key variable of interest correlates with other exogenous factors. It is commonly used in time series analysis.

Figure by Author: This figure describes the time series fluctuations of the taxi ride count in NYC. It also specifies the rainy days as a reference to help interpret the trend. (Data Source: Seaborn Taxis Dataset, National Oceanic and Atmospheric Administration)
Code by Author

Key Notes for the Double-axis Time Series Plot

  • I recommend always adding sns.set(style="...", context="...") to get a clean layout and the Arial font. Note that there is no such shortcut in Google Colab, because Arial is not properly installed in its VMs. The only way to change fonts in Colab is to install them explicitly and then add them to matplotlib's local font folder, which can be troublesome.
  • ax_y = ax.twinx() is the critical piece in creating a double-axis plot.
  • ax.axvspan/ax.axhspan/ax.axvline/ax.axhline are the methods for adding auxiliary lines and shaded bands.
  • I prefer having gridlines as an anchor, ax.grid(linestyle="--", alpha=0.5), and I keep the grid from being distracting by setting a small alpha.
  • I find the two lines below (one to reduce the x-tick density, one to rotate the labels) very helpful when the x-axis is a date variable, because they keep the date labels from overlapping each other.
    ax_y.xaxis.set_major_locator(ticker.MultipleLocator(...))
    plt.setp(ax_y.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
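
The notes above can be combined into one minimal sketch. It is not the original template code: the data is synthetic (made-up ride counts and rainfall standing in for the taxis and NOAA data), the style and context values are only examples, and the explicit DateFormatter is an addition so that the tick labels stay readable once the locator is replaced. Because twinx() hides the x-axis of the twin, this sketch applies the tick settings to the primary axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd
import seaborn as sns

sns.set(style="whitegrid", context="notebook")  # example values only

# Synthetic stand-in data (assumption): daily ride counts and rainfall
rng = np.random.default_rng(0)
dates = pd.date_range("2019-03-01", periods=31, freq="D")
rides = 200 + rng.integers(-30, 30, size=31)
rain_mm = np.where(rng.random(31) < 0.2, rng.uniform(1, 20, size=31), 0.0)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(dates, rides, color="tab:blue")      # primary (left) axis
ax.set_ylabel("Ride count")

ax_y = ax.twinx()                            # secondary (right) axis
ax_y.bar(dates, rain_mm, color="tab:gray", alpha=0.4, width=0.8)
ax_y.set_ylabel("Rainfall (mm)")
ax_y.grid(False)                             # avoid a second, misaligned grid

# Auxiliary band for every rainy day, auxiliary line for the mean ride count
for day in dates[rain_mm > 0]:
    ax.axvspan(day - pd.Timedelta(hours=12), day + pd.Timedelta(hours=12),
               color="tab:blue", alpha=0.1)
ax.axhline(rides.mean(), color="tab:red", linestyle=":")

ax.grid(linestyle="--", alpha=0.5)

# One tick every 7 days, explicit date format, rotated labels
ax.xaxis.set_major_locator(ticker.MultipleLocator(7))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %d"))
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
fig.tight_layout()
```

To adapt it, replace the three synthetic arrays with your own date, metric, and reference columns.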

Viz 2: Scatter Plot with Fitted Trendline

Scatter plots are an effective way to capture the relationship between two continuous variables. The nice thing about a scatter plot is that it gives an unfiltered view of the correlation, because it projects each individual data point directly onto the canvas. The downside is that the overall trend can be hard to see. As a result, it is often considered good practice to fit a line manually and add it to the scatter plot.

Figure by Author: This figure projects 244 transaction data points onto a two-dimensional space by total bill amount ($) and tip percentage. A fitted polynomial trendline is also presented. (Data Source: Seaborn Tips Dataset)
Code by Author

Key Notes for the Scatter Plot with Fitted Trendline

  • The LaTeX text style from the sympy package can make the displayed formula look cleaner.
  • curve_fit from scipy.optimize is the usual way to estimate the parameters of the fitted line, and it is worth trying several model forms (e.g. polynomial, logarithmic, exponential).
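
As a rough illustration of these notes, the sketch below fits a second-degree polynomial with curve_fit and overlays it on a seaborn scatter plot. It is not the original template: the data is synthetic, the function name quadratic is made up for this example, and the formula label is written by hand as matplotlib mathtext rather than generated with sympy.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.optimize import curve_fit


def quadratic(x, a, b, c):
    """Candidate model: second-degree polynomial (hypothetical helper)."""
    return a * x**2 + b * x + c


# Synthetic stand-in data (assumption): bill amount vs. tip percentage
rng = np.random.default_rng(1)
total_bill = rng.uniform(5, 50, size=244)
tip_pct = 0.004 * (total_bill - 30) ** 2 + 14 + rng.normal(0, 2, size=244)

# Least-squares estimate of the model parameters (a, b, c)
params, _ = curve_fit(quadratic, total_bill, tip_pct)
a, b, c = params

fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(x=total_bill, y=tip_pct, alpha=0.6, ax=ax)

# Evaluate the fitted curve on a dense grid and label it with its formula
x_grid = np.linspace(total_bill.min(), total_bill.max(), 200)
label = rf"$y = {a:.3f}x^2 {b:+.3f}x {c:+.3f}$"
ax.plot(x_grid, quadratic(x_grid, *params), color="tab:red", label=label)
ax.set_xlabel("Total bill (USD)")
ax.set_ylabel("Tip percentage (%)")
ax.legend()
```

Swapping in a different model only requires changing the body and signature of quadratic; curve_fit infers the number of parameters from the function signature.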

Viz 3: Distribution Plot with KDE Line (Kernel Density Estimation)

The distribution is one of the most important statistical aspects to explore in univariate data analysis. While seaborn provides several functions, such as displot, kdeplot, and distplot (deprecated since seaborn 0.11), to estimate the distribution using KDE (kernel density estimation), there are some specific parameters that I would like to highlight and caution you about in this article.

Figure by Author: This figure shows the distribution of the tipping percentages. It indicates that people most commonly tip about 16% of the total bill for the meal. (Data Source: Seaborn Tips Dataset)
Code by Author

Key Notes for the Distribution Plot with KDE Line (Kernel Density Estimation)

  • The smoothed KDE line can be a great reference, but it can be misleading when its parameters are poorly chosen, especially for multi-modal or skewed distributions with extreme values. Therefore, it is always good practice to check the histogram at the same time and tune cut, clip and bw_adjust accordingly.
  • cut, clip and bw_adjust are three very important parameters for the KDE line. I usually set cut=0 and set clip to the min and max of the data, which truncates the KDE curve at the data limits and keeps it from extending beyond the actual samples. It is worth noting that cut is not 0 by default (kdeplot uses cut=3), so the KDE curve can extend past the observed range.
  • bw_adjust, in short, controls the smoothness of the KDE curve, and thus should be tuned carefully to reflect the real distribution. I highly suggest reading the notes in the official kdeplot documentation.

Viz 4: Categorical Bar Plot Series

The last visualization that I find useful and handy is this series of bar plots, because we care not only about the breakdown of the metric but also about the sample composition for each value of the dimension. Otherwise, we risk falling into Simpson's paradox and drawing false conclusions. To include categorical variables, I use the popular tips dataset from seaborn for the illustration.

Figure by Author: This graph shows the percentage of tip amount over the total bill broken down by gender. It also reveals that the sample is unbalanced with fewer female observations. (Data Source: Seaborn Tips Dataset)
Code by Author
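
The idea of pairing a metric breakdown with the sample composition could be sketched as follows. This is not the original template: the data frame is synthetic, the column names sex and tip_pct and the group sizes are assumptions chosen for illustration, and the two-panel layout is just one way to arrange the series.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (assumption): two unbalanced groups
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "sex": ["Male"] * 160 + ["Female"] * 84,
    "tip_pct": rng.normal(16, 4, size=244),
})

fig, (ax_metric, ax_count) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: the metric broken down by category (group mean with error bars)
sns.barplot(data=df, x="sex", y="tip_pct", ax=ax_metric)
ax_metric.set_title("Mean tip percentage")

# Right panel: how many observations sit behind each bar
sns.countplot(data=df, x="sex", ax=ax_count)
ax_count.set_title("Sample size")

counts = df["sex"].value_counts()
fig.tight_layout()
```

Reading the two panels side by side shows whether a difference in the metric rests on a comparable number of observations in each group.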

Summary

This article presents four visualization templates of different types, which correspond to four use cases: (1) analyzing a time series trend; (2) analyzing the relationship between two continuous variables; (3) analyzing a distribution; (4) analyzing a metric across different breakdowns of a dimension.

For each type of visualization, I provided the code as well as the data to replicate the visualizations, and added key notes to highlight the critical seaborn methods. The purpose is to facilitate more pragmatic and aesthetic plotting in Python.

Hope the above content helps a bit in your data journey!

This article relies on the datasets included in the seaborn package. The seaborn license is listed below:

Copyright © 2012–2021, Michael L. Waskom
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the project nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


