Grab and Use These Four Useful Seaborn Visualization Templates | by Mintao Wei | Jun, 2022
Introducing four types of plotting functions and relevant tricks for exploratory data analysis based on Seaborn
- Introduction
- Viz 1: Double-axis Time Series Plot with Auxiliary Line/Band
- Viz 2: Scatter Plot with Fitted Trendline
- Viz 3: Distribution Plot with KDE Line (Kernel Density Estimation)
- Viz 4: Categorical Bar Plot Series
- Summary
Initiatives
“Matplotlib and seaborn are ugly, I only use ggplot2 in R”;
“The seaborn API is a pain and very rigid to work with”;
“The default plots for seaborn and matplotlib are so poor and I have to search for the right parameters every time”;
“What are some other plotting libraries that works well with Jupyter Notebook?”
These are comments that I have heard from fellow computational social scientists and data scientist friends, and I am sure every data person can more or less relate to them. Admittedly, matplotlib and seaborn are not perfect, but they have a major advantage: they are easier to use than more complex visualization toolkits such as Plotly/Dash, and they are based on Python. That makes them hard to replace for exploratory data analysis (EDA), and we should embrace them.
To make life easier for EDA, I would like to share some seaborn visualization templates and relevant tricks I personally use so that you could spend more time on the analysis. This article will emphasize resolving two pain points: (1) aesthetics; (2) functionality.
How can these functions help you?
You are more than welcome to grab and use the visualization templates by simply changing the input to your datasets. You could also modify the function body to serve your own purpose. To help with that, I summarized a few key tricks (i.e. seaborn parameters/methods) that I personally use a lot in my own work.
Datasets
I use the taxis and tips datasets from seaborn together with publicly available weather data from the National Oceanic and Atmospheric Administration (NOAA) as illustrations for the following templates. Please feel free to replicate the visualizations using the embedded code chunks.
The taxis and tips datasets are distributed as open source through seaborn. They can be freely accessed in Python via the seaborn.load_dataset method and may be used commercially under their permissive license. Use of the NOAA weather data is governed by World Meteorological Organization Resolution 40. More details can be found in the bottom section as well as in the references of this article.
Viz 1: Double-axis Time Series Plot with Auxiliary Line/Band
The double-axis time series plot can be a great visualization to understand how the trend of our key variable of interest correlates to other exogenous factors. It is commonly used in time series analysis.
Key Notes for the Double-axis Time Series Plot
- I recommend always adding sns.set(style="...", context="...") to get a nice layout and the Arial font type. Note that there are no 'short-cuts' like this in Google Colab because the Arial font type is not properly installed in their VMs. The only way to change font types in Colab is to explicitly install them and then add them to the matplotlib local folder, which can be troublesome. See the Stack Overflow answer by korakot in the references.
- ax_y = ax.twinx() is the critical piece in creating a double-axis plot.
- ax.axvspan/ax.axhspan/ax.axvline/ax.axhline are the methods for adding auxiliary lines and shaded bands.
- I prefer having gridlines as an anchor, ax.grid(linestyle="--", alpha=0.5), and I make sure the grid is not distracting by setting a small alpha.
- I personally found the two code lines below (one for reducing the x-tick density, one for rotating the labels) very helpful when using a date variable as the x-axis, because they prevent the messy look that results when all the date labels overlap with each other.
  ax_y.xaxis.set_major_locator(ticker.MultipleLocator(...))
  plt.setp(ax_y.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
Viz 2: Scatter Plot with Fitted Trendline
Scatter plots are an effective way to capture the relationship between two continuous variables. Their strength is that they show the correlation in its raw form, because each individual data point is projected directly onto the canvas. The flip side is that the overall trend can be hard to see. It is therefore often considered good practice to fit a line and add it to the scatter plot.
Key Notes for the Scatter Plot with Fitted Trendline
- The LaTeX text style from the sympy package can make the formula more aesthetic.
- curve_fit from scipy is usually used to determine the parameters of the fitted line, but trying different models (e.g. polynomials, logarithms, exponentials) won't go wrong.
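As a rough sketch of the fitting step, the snippet below fits a straight line with scipy.optimize.curve_fit and overlays it on a seaborn scatter plot. The data is synthetic, the model function linear and the variable names are hypothetical, and the formula label is assembled by hand as a matplotlib mathtext string rather than with sympy, simply to keep the dependencies small.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.optimize import curve_fit


def linear(x, a, b):
    """Candidate model; replace with a polynomial, log or exponential form to compare."""
    return a * x + b


# Synthetic data (an assumption): fare grows roughly linearly with distance
rng = np.random.default_rng(1)
distance = rng.uniform(0.5, 10, 200)
fare = 2.5 + 3.0 * distance + rng.normal(0, 1.5, 200)

# Least-squares estimate of the model parameters
params, _ = curve_fit(linear, distance, fare)
a, b = params

sns.set(style="whitegrid", context="notebook")
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(x=distance, y=fare, ax=ax, alpha=0.5)

# Draw the fitted line on an evenly spaced grid and show its formula in the legend
x_grid = np.linspace(distance.min(), distance.max(), 100)
ax.plot(x_grid, linear(x_grid, a, b), color="red",
        label=rf"$y = {a:.2f}x {b:+.2f}$")
ax.legend()
fig.tight_layout()
```

Trying another functional form only requires changing the model function and its parameter list; curve_fit infers the number of parameters from the function signature.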
Viz 3: Distribution Plot with KDE Line (Kernel Density Estimation)
The distribution is one of the most critical statistical aspects to explore when doing univariate data analysis. While seaborn provides multiple functions such as displot, kdeplot and distplot (deprecated since seaborn 0.11) to estimate the distribution directly using KDE (kernel density estimation), there are some specific parameters that I would like to highlight and caution you about in this article.
Key Notes for the Distribution Plot with KDE Line (Kernel Density Estimation)
- The smoothed KDE line can be a great reference, but it can be misleading when the parameters are specified incorrectly for multi-modal or skewed distributions with extreme values. Therefore, it is always good practice to check the histogram at the same time and tune cut, clip and bw_adjust accordingly.
- cut, clip and bw_adjust are three very important parameters for the KDE line. I usually set cut=0 and clip equal to the min and max of the data to truncate the KDE curve at the limits and force it not to extend beyond the actual samples. It is worth noting that the cut parameter is not 0 by default (the seaborn default is cut=3), so the KDE curve can extend past the extreme data points and thus differ from the sample distribution. bw_adjust, in short, controls the 'smoothness' of the KDE curve, and thus should be carefully tuned to reflect the real distribution. I highly suggest reading the notes in the official documentation of kdeplot.
Viz 4: Categorical Bar Plot Series
The last visualization that I have found fruitful and handy is this series of bar plots, because we care not only about the breakdown of the metric but also about the sample composition for each value of the dimension. Otherwise, we risk being caught by Simpson's paradox and drawing false conclusions. To include categorical variables, I use the popular tips dataset from seaborn for the illustration.
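A minimal sketch of the idea, pairing the metric with the sample size per category, could look like the following. It uses a synthetic stand-in for the tips data so that it runs offline; the column names day and tip_pct, the category shares, and the two-panel layout are assumptions for illustration, not the author's template.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (an assumption): uneven group sizes on purpose
rng = np.random.default_rng(3)
order = ["Thur", "Fri", "Sat", "Sun"]
days = rng.choice(order, size=300, p=[0.25, 0.08, 0.37, 0.30])
demo = pd.DataFrame({"day": days, "tip_pct": rng.normal(16, 5, 300).clip(0)})

sns.set(style="whitegrid", context="notebook")
fig, (ax_metric, ax_count) = plt.subplots(1, 2, figsize=(10, 4))

# Left: the metric per category (mean with seaborn's default error bars)
sns.barplot(data=demo, x="day", y="tip_pct", order=order, ax=ax_metric, color="tab:blue")
ax_metric.set_title("Mean tip percentage by day")

# Right: how many samples stand behind each bar
sns.countplot(data=demo, x="day", order=order, ax=ax_count, color="tab:grey")
ax_count.set_title("Sample size by day")
fig.tight_layout()

# The same numbers as a table, handy for a quick sanity check
summary = demo.groupby("day")["tip_pct"].agg(["mean", "count"]).reindex(order)
```

Reading the two panels together makes it harder to over-interpret a bar that rests on only a handful of observations.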
Summary
This article presents four visualization templates of different types, which correspond to four different use cases: (1) analyzing a time series trend; (2) analyzing the relationship between two continuous variables; (3) analyzing a distribution; (4) analyzing how a metric performs across the different values of a dimension.
For each type of visualization, I provided the code as well as the data to replicate the visualizations and added key notes to highlight the critical seaborn methods. The purpose is to facilitate more pragmatic and aesthetic plotting in Python.
Hope the above content helps a bit in your data journey!
This article relies on the datasets included in the seaborn package. The seaborn license is listed below:
Copyright © 2012–2021, Michael L. Waskom
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the project nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- Waskom, M., Botvinnik, Olga, O'Kane, Drew, Hobson, Paul, Lukauskas, Saulius, Gemperline, David C, … Qalieh, Adel. (2017). mwaskom/seaborn: v0.8.1 (September 2017). Zenodo. https://doi.org/10.5281/zenodo.883859
- United States. (2006). NOAA online weather data (NOWData): Interactive data query system : public fact sheet. Washington, D.C.: National Oceanic and Atmospheric Administration.
- Zolzaya Luvsandorj, 6 simple tips for prettier and customized plots in Seaborn (Python), https://towardsdatascience.com/6-simple-tips-for-prettier-and-customised-plots-in-seaborn-python-22f02ecc2393
- Anirudh Kashyap, Top 5 tricks to make plots look better, https://medium.com/@andykashyap/top-5-tricks-to-make-plots-look-better-9f6e687c1e08
- IonicSolutions, “What is y axis in seaborn distplot?”, https://stackoverflow.com/a/51667318/18719482. Aug 2018. Last accessed: 2022–06–13
- korakot, “Custom fonts in Google Colaboratory matplotlib charts”, https://stackoverflow.com/questions/51810908/custom-fonts-in-google-colaboratory-matplotlib-charts/72351664#72351664. Aug 2018. Last accessed: 2022–06–13
- apdnu, “Increase distance between title and plot in matplolib?”, https://stackoverflow.com/questions/16419670/increase-distance-between-title-and-plot-in-matplolib. Mar 2018. Last accessed: 2022–06–13
- DavidG, “How to plot a dashed line on seaborn lineplot?”, https://stackoverflow.com/questions/51963725/how-to-plot-a-dashed-line-on-seaborn-lineplot. Aug 2018. Last accessed: 2022–06–13
- Codes to replicate the visualizations, Mintao Wei’s Github Gist, https://gist.github.com/mintaow
- Datasets to replicate the visualizations, Mintao Wei’s Github folder, https://github.com/mintaow/MyMediumWork/tree/main/data