Techno Blender
Digitally Yours.

A Quick Guide to Beautiful Scatter Plots in Python | by Hair Parra | Jan, 2023



Image by author via Python Matplotlib

So you already know some Python and matplotlib. Perhaps you are like me and really like sophisticated, beautiful and insightful plots. However, when you encounter some basic examples to replicate them yourself, as seen in this documentation page, you might see something like this:

which generates the following plot:

src: https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_masked.html#sphx-glr-gallery-lines-bars-and-markers-scatter-masked-py

While very colourful, this plot is rather simple and not very insightful, and the code does not explain its purpose. In this article, I would like to show you how to create beautiful, insightful scatter plots like the one you saw at the beginning of this article.

The notebook with the code for this tutorial can be found here, and the dataset we will be using can be found at this link. Note that in this tutorial I have mounted the data in Google Drive, so you can either do the same, or download the data locally and run a Jupyter Notebook instead. For convenience, you can also download the data from my GitHub.

As I will be using Google Colab in this example, you will see Colab-specific drive imports. If you are working locally, you can skip these. Since we will also be doing some basic data cleaning and linear regression, I have also imported some Scikit-learn classes.

For this example, we are going to use the Life Expectancy vs GDP per capita dataset, which is available at https://ourworldindata.org/. Once you’ve grabbed the data and you’re able to read the file, let’s examine the steps we will take to analyze it:

  1. Load the data
  2. Examine the data and rename the columns for convenience
  3. Extract rows/datapoints that have a value for GDP
  4. Visualize the population distribution
  5. Construct the plotting function
  6. Plot the data, beautifully!

Load the data

First, let’s load and examine our data. We read the file with Pandas from wherever you placed it, and rename some columns for convenience.
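As a minimal sketch of this step, the snippet below reads a CSV with Pandas and renames a few columns. The inline CSV, the original header names and the short names ("Country", "Expectancy", "GDP") are assumptions for illustration; check the headers of the file you actually downloaded.

```python
import io

import pandas as pd

# Tiny inline stand-in for the downloaded file. The header names below are
# assumptions; the real file may label its columns differently.
csv_text = """Entity,Code,Year,Life expectancy,GDP per capita,Population
Canada,CAN,2018,82.0,44869.0,37000000
World,OWID_WRL,2018,72.4,15212.0,7600000000
Chad,TCD,2018,53.0,,15500000
"""

# With a real file you would pass its path instead of the StringIO object,
# e.g. pd.read_csv("path/to/your/data.csv") (hypothetical path).
df = pd.read_csv(io.StringIO(csv_text))

# Shorter column names are easier to type later on.
df = df.rename(
    columns={
        "Entity": "Country",
        "Life expectancy": "Expectancy",
        "GDP per capita": "GDP",
    }
)

print(df.head())
```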

The resulting data should look like this:

Plot by author — generated using Python Pandas on Colab

Cleaning the data

Notice that many data points have NaN in the GDP column, and also in the Population column. Since we cannot analyze missing values, we are going to remove the rows that have no GDP value from our data:

After cleaning the GDP column and selecting the year 2018 (you can verify for yourself that there is no GDP data for later years!), our data will look like this:

Plot by author — generated using Python Pandas on Colab

Note that we also excluded the World data point from our analysis, since it is an aggregate of all countries (feel free to include it and see how the final plot changes!).
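The cleaning steps described above can be sketched as three boolean filters. The toy dataframe and the column names ("Country", "Year", "GDP") are assumptions that follow the renaming used in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy data with the kinds of rows we want to get rid of.
df = pd.DataFrame(
    {
        "Country": ["Canada", "Canada", "Chad", "World", "Fiji"],
        "Year": [2017, 2018, 2018, 2018, 2018],
        "Expectancy": [81.9, 82.0, 53.0, 72.4, 67.3],
        "GDP": [44000.0, 44869.0, np.nan, 15212.0, 11000.0],
        "Population": [36.5e6, 37.0e6, 15.5e6, 7.6e9, np.nan],
    }
)

clean = df[df["GDP"].notna()]            # keep rows that have a GDP value
clean = clean[clean["Year"] == 2018]     # keep the 2018 snapshot only
clean = clean[clean["Country"] != "World"]  # drop the world-wide aggregate
clean = clean.reset_index(drop=True)

print(clean)
```

Note that a row with a missing Population (Fiji in this toy data) survives the filter, since only GDP is required at this stage.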

Notice that the population values run into the millions. Later, in the plot, we will give different colours to different population thresholds, so we would like to see what the distribution looks like:

Image by author via Python Matplotlib

Oops! This tells us that most countries have a population between 0.0 and 0.2 × 1e9 = 200,000,000 (200 million), and some countries have more than ~1,200 million. Since the density is higher on the left (more countries with fewer people), we can ignore the high-population countries and produce a histogram focused on the countries on the left:

Image by author via Python Matplotlib

Much better! We will use this to pick approximate thresholds for colour-coding countries by population size.
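The two histograms can be sketched roughly as follows. The population values are made up, and the 200 million cutoff and bin counts are assumptions chosen to mirror the description above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up populations: many small countries and a couple of very large ones.
population = np.array(
    [0.3e6, 5e6, 11e6, 37e6, 83e6, 210e6, 330e6, 1.35e9, 1.4e9]
)

fig, (ax_all, ax_left) = plt.subplots(1, 2, figsize=(10, 4))

# Full distribution: the large countries squash everything else to the left.
ax_all.hist(population, bins=20)
ax_all.set_title("All countries")

# Zoom in on the left part by ignoring the high-population countries.
cutoff = 200e6  # assumed cutoff
small = population[population < cutoff]
counts, bin_edges, _ = ax_left.hist(small, bins=10)
ax_left.set_title("Population below 200 million")

fig.tight_layout()

# The bin edges are handy candidates for colour-coding thresholds.
print(bin_edges)
```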

For this part, we will perform a couple of sub-steps, as shown below.

Function Definition

We will define our function as follows:

Note the three parameters: df, which is our data after the initial pre-processing; apply_color, which colour-codes the points by population size; and regression, which adds a regression line to the plot.

Filling missing data

Our data still contains a couple of NaN values in the columns of interest, namely ["Expectancy", "GDP", "Population"]. Although we could drop the rows with NaNs, a better idea here is to fill them with an “educated guess” instead. Popular options are the mean or median of the column, but here we will use a machine learning algorithm called K Nearest Neighbours (KNN). If you are not familiar with it, you can read up on it here. For our purposes, it fills each missing value by finding the N most similar country entries based on the available attributes (say, Expectancy and GDP) and using the average of their values (say, for Population).
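A minimal sketch of KNN imputation with Scikit-learn's KNNImputer is shown below. The toy data and the choice of n_neighbors=2 are assumptions; the tutorial's own settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: country D has no Population value.
df = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        "Expectancy": [80.0, 81.0, 60.0, 80.5],
        "GDP": [40000.0, 42000.0, 2000.0, 41000.0],
        "Population": [10e6, 12e6, 30e6, np.nan],
    }
)

cols = ["Expectancy", "GDP", "Population"]

# n_neighbors=2 is an assumed value: each gap is filled with the average
# of the two most similar rows, measured on the columns that are present.
imputer = KNNImputer(n_neighbors=2)
df[cols] = imputer.fit_transform(df[cols])

print(df)
```

In this toy example, D is closest to A and B, so its Population becomes the average of theirs (11 million). Since the features are not rescaled here, GDP dominates the distance computation.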

Aggregating the data

Although the data here is only for 2018, given a better dataset you could extend this analysis to more years. That’s why we also aggregate the data, so that all years are taken into account. This is done using Pandas groupby() with the "Country" column as the argument. We also round the data beforehand to reduce the number of decimals.
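As a rough illustration, rounding followed by a per-country aggregation could look like this. Using the mean as the aggregation function and rounding to two decimals are assumptions; the text above does not say which were used.

```python
import pandas as pd

# Toy data with two years for country A and one for country B.
df = pd.DataFrame(
    {
        "Country": ["A", "A", "B"],
        "Year": [2017, 2018, 2018],
        "Expectancy": [79.954, 80.046, 60.123],
        "GDP": [40000.4, 41000.6, 2000.2],
        "Population": [10e6, 10.2e6, 30e6],
    }
)

cols = ["Expectancy", "GDP", "Population"]

# Round first to cut down the number of decimals.
df[cols] = df[cols].round(2)

# One row per country; the mean is an assumed choice of aggregation.
agg = df.groupby("Country")[cols].mean().reset_index()

print(agg)
```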

Extracting Plot Variables

Once again for the sake of convenience, we assign the columns to individual variables (you could also simply query the dataframe directly). An important step here, however, is that we scale the Population column down by a factor of a million and then multiply by 2. This controls the size of the dots in the scatter plot.
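A small sketch of this step, assuming that "scaling by a factor of a million" means dividing the raw head count by 1e6 (the variable names are illustrative):

```python
import pandas as pd

agg = pd.DataFrame(
    {
        "Country": ["A", "B"],
        "Expectancy": [80.0, 60.0],
        "GDP": [40000.0, 2000.0],
        "Population": [10e6, 1.4e9],
    }
)

# Individual variables for convenience.
gdp = agg["GDP"]
expectancy = agg["Expectancy"]

# Marker sizes: population in millions, times 2 (assumed reading of the text).
sizes = agg["Population"] / 1e6 * 2

print(sizes.tolist())
```

With this scaling, a country of 10 million people gets a marker size of 20 and a country of 1.4 billion gets 2800, which keeps the dots in a range Matplotlib can draw sensibly.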

Perform Regression

An optional part of our plot is a line that follows the general trend of the points, each of which represents a country’s GDP versus its life expectancy. This gives us a way to evaluate at a glance where individual countries stand with respect to that trend. For this, you could use Scikit-learn’s ElasticNetCV, which is essentially linear regression that mixes Lasso and Ridge penalties, adding regularization for a more robust fit. You can read about it in the Wikipedia article and also see the corresponding Scikit-learn documentation. The CV stands for “Cross Validation”, a common technique in machine learning for evaluating a model on held-out data; here it is used to pick the regularization strength. You can read more about it here. The code is as follows (notice the regression parameter):

After initializing the algorithm, we put the data into the right shape (here, X holds the predictive features and y is the target), and then fit the regression model. Then we produce predictions and, finally, arrange them in a convenient format. Note that we also keep only the datapoints whose predictions are below 90, so that the line doesn’t go out of bounds.
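The regression step can be sketched as follows, using synthetic data in place of the real dataset. The hyperparameters (cv=5, random_state=0) and the "y_pred" column name are assumptions; only the reg_data["X"] column is named in the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in data: life expectancy rising with GDP, plus noise.
rng = np.random.default_rng(0)
gdp = np.linspace(1000, 60000, 40)
expectancy = 55 + 0.0005 * gdp + rng.normal(0, 1.0, size=gdp.size)

# Scikit-learn expects a 2-D feature matrix and a 1-D target.
X = gdp.reshape(-1, 1)
y = expectancy

# Regularized linear regression; the strength is chosen by cross-validation.
reg = ElasticNetCV(cv=5, random_state=0)
reg.fit(X, y)
preds = reg.predict(X)

# Collect inputs and predictions, keeping only predictions below 90
# so the line stays inside the plotting range.
reg_data = pd.DataFrame({"X": gdp, "y_pred": preds})
reg_data = reg_data[reg_data["y_pred"] < 90].sort_values("X")

print(reg_data.head())
```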

Start the plot, colour-code and add point density

We are now ready to start plotting. The first step is to colour-code the points by the population of each country in our data. We assign a different colour depending on the number of people, and only apply this if the argument apply_color=True. Remember the histogram we plotted before? If we take a look at the boundary values of the bins, we can come up with some nice bounds for the population and give each range an appropriate colour (of course, these are the colours that make sense to me, but feel free to choose whatever colours you want!).
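One way to sketch the colour assignment is with pd.cut, which maps each population to the label of the bin it falls in. The thresholds and colour names below are assumptions for illustration, not the values used for the plots in this article.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        "Population": [5e6, 40e6, 150e6, 1.4e9],
    }
)

# Assumed thresholds (in people) and one colour per population range.
bins = [0, 10e6, 50e6, 200e6, np.inf]
colors = ["tab:green", "tab:blue", "tab:orange", "tab:red"]

apply_color = True
if apply_color:
    # Each population is labelled with the colour of its bin.
    df["color"] = pd.cut(df["Population"], bins=bins, labels=colors).astype(str)

print(df)
```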

Next, we start the plot with plt.figure() and call the plt.scatter() function, with or without colours depending on whether we want to colour-code:

Here:

  • The s argument takes a vector of real values and sets the size of each dot accordingly. We therefore pass the population variable, which has already been scaled so that each country’s dot is proportional to its number of people. The higher the population, the bigger the dot.
  • The c argument will apply the corresponding colour to each point, which is what we assigned to the dataframe before.
  • The alpha argument will change the transparency of the dots.
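The three arguments described above can be seen together in a short sketch. The data, figure size and alpha value are assumptions chosen for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values, not taken from the real dataset.
df = pd.DataFrame(
    {
        "GDP": [40000.0, 12000.0, 6800.0],
        "Expectancy": [80.0, 75.0, 69.0],
        "Population": [10e6, 37e6, 1.4e9],
        "color": ["tab:green", "tab:blue", "tab:red"],
    }
)
sizes = df["Population"] / 1e6 * 2  # dot size grows with population

apply_color = True
plt.figure(figsize=(10, 6))
if apply_color:
    # s: per-point sizes, c: per-point colours, alpha: transparency
    points = plt.scatter(
        df["GDP"], df["Expectancy"], s=sizes, c=df["color"].tolist(), alpha=0.5
    )
else:
    points = plt.scatter(df["GDP"], df["Expectancy"], s=sizes, alpha=0.5)
```

A fairly low alpha helps here because large dots tend to overlap smaller ones.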

Additionally, we can annotate the points of countries with large populations, which we can do using the plt.annotate() function. To obtain the corresponding country names, we subset those records, retrieve their coordinates, and then pass them to the function.

Note the +0.3 in the y-coordinate; this is to move the text a bit away from the point so it doesn’t overlap.
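A minimal sketch of the annotation step is shown below. The 200 million threshold and the data values are assumptions for illustration; the +0.3 offset is the one mentioned above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values, not taken from the real dataset.
df = pd.DataFrame(
    {
        "Country": ["India", "Canada", "United States"],
        "GDP": [6800.0, 44800.0, 55300.0],
        "Expectancy": [69.4, 82.0, 78.9],
        "Population": [1.35e9, 37e6, 327e6],
    }
)

plt.figure(figsize=(10, 6))
plt.scatter(df["GDP"], df["Expectancy"], s=df["Population"] / 1e6 * 2, alpha=0.5)

# Subset the highly populated countries (assumed threshold).
threshold = 200e6
big = df[df["Population"] > threshold]

# Write each name slightly above its point so the text does not overlap it.
annotations = [
    plt.annotate(name, (x, y + 0.3))
    for name, x, y in zip(big["Country"], big["GDP"], big["Expectancy"])
]
```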

Regression Line

Next, we add the regression line with plt.plot(), using the reg_data that we created before. This line represents a sort of average life expectancy given a country’s GDP per capita, which we can use as a baseline to compare the relative status of other countries.

The ls='--' argument sets the line style to dashed. Note also that the plotting arguments are the GDP data (reg_data["X"]) and the corresponding predictions.
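Drawing the line could look roughly like this. The reg_data values, the "y_pred" column name, the colour and the label are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the regression output: GDP values and predicted expectancy.
reg_data = pd.DataFrame(
    {"X": [1000.0, 10000.0, 50000.0], "y_pred": [58.0, 64.0, 82.0]}
)

plt.figure(figsize=(10, 6))

# ls="--" draws a dashed line; plt.plot returns a list of Line2D objects.
(line,) = plt.plot(
    reg_data["X"], reg_data["y_pred"], ls="--", color="black",
    label="Regression line",
)
```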

Finish the plot

Let’s finish up the code for our plot:

In the first part we add a label for the x-axis (GDP per capita), for the y-axis (Life Expectancy), and give our plot an appropriate title. Additionally, we display the x values in log scale, as it provides a prettier and more insightful plot (try removing it and see what happens!).

Next, we create the labels for the different colours. As Matplotlib creates the labels all at once, in order to create a label with the appropriate color and name, we create four corresponding “ghost” lines, which will not do anything on the final plot, but will provide a colour-coded label instead.

Finally, we create the legend and display our plot.
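The finishing touches can be sketched as below: axis labels, a title, a log-scaled x-axis, and four empty "ghost" lines that exist only to feed the legend. The title, legend labels and colours are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter([1000, 10000, 50000], [58, 70, 82], s=[20, 80, 400], alpha=0.5)

# Labels, title and a logarithmic x-axis.
plt.xlabel("GDP per capita")
plt.ylabel("Life Expectancy")
plt.title("Life Expectancy vs GDP per capita (2018)")
plt.xscale("log")

# Ghost lines: no data, so nothing is drawn, but each one adds a
# colour-coded entry to the legend. Labels and colours are assumed.
legend_entries = [
    ("tab:green", "< 10 million"),
    ("tab:blue", "10 to 50 million"),
    ("tab:orange", "50 to 200 million"),
    ("tab:red", "> 200 million"),
]
for color, label in legend_entries:
    plt.plot([], [], color=color, marker="o", ls="", label=label)

legend = plt.legend(title="Population")
# In a notebook, plt.show() would display the figure at this point.
```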

All together

We’ve come a long way. This is what the function looks like altogether:

Let’s test it out!

Image by author using Python

Not bad at all! At a glance, we can see the life expectancy of the world’s countries with respect to their GDP per capita. We can also see the names of some countries with relatively large populations.

We can also set apply_color=True to distinguish between population sizes and produce the following:

Plot by author using Python

Just look how beautiful that is! Thanks to the colour-coding, we can now clearly see the different magnitudes of population.

Finally, let’s add the regression line:

Plot by author using Python on Google Colab

And that’s it! We have now successfully generated the plot that you saw at the beginning of the article. Just by looking at the plot, we can tell at a glance which countries had a higher life expectancy in 2018 than the trend would predict for their GDP, by checking which countries lie above the regression line. Note that the regression line really is a straight line (duh!) and not a curve; it only looks curved because the x-axis is in log scale.

Beautiful, informative plots are an art, and while some libraries will definitely facilitate your learning, it’s never a bad idea to improve on the basics and learn a couple of new tricks.

I hope that this article was interesting to you, and if so, make sure to check some of my other popular articles & series!

Data Science Basics

GIT Basics

Functional Programming in Python

Intro to Time Series Analysis in R

The data used for this analysis, along with the original analysis, comes from Our World in Data and can be found at https://ourworldindata.org/grapher/life-expectancy-vs-gdp-per-capita, where the following sources are cited:

Maddison Project Database (2020); UN WPP (2022); Zijdeman et al. (2015)

From their page:

“Licenses: All visualizations, data, and articles produced by Our World in Data are open access under the Creative Commons BY license. You have permission to use, distribute, and reproduce these in any medium, provided the source and authors are credited.”

See https://ourworldindata.org/about#legal

  1. https://jairparraml.com/
  2. https://blog.jairparraml.com/
  3. https://www.linkedin.com/in/hair-parra-526ba19b/
  4. https://github.com/JairParra
  5. https://medium.com/@hair.parra


