8 Tips for Effective Data Visualisation | by David Farrugia | Apr, 2023

By Jessie Hobb On Apr 25, 2023

DATA | VISUALISATION | ANALYTICS

A guide to correctly presenting your insights and observations to your audience

When we discuss data science, we tend to focus heavily on data cleaning and the machine learning aspect of the process.

The main points of discussion seem to be on how to best prepare our dataset for modelling, which features do we need to engineer and include in our training, which machine learning technique will we try first, and how are we going to evaluate it?

Whilst these are all valid and important questions to ask and plan for, as data scientists we often forget to prioritise one of the biggest selling points of any project: visualisations.

Every single data science project involves at least two parties: the technical (i.e., the data scientists) and the non-technical (the stakeholders, such as a manager or a C-level executive).

We need to remember that the primary purpose of data science is to increase business value. Most people do not understand data. We must show them.

When done effectively, data visualisations can help us uncover insights, identify trends, and communicate complex ideas.

Throughout my years of experience, this is the area where I have seen many professionals fall short, especially those in more junior roles (myself included!).

Creating great data visualisations is a whole other skill on its own. It’s easy to have a data visualisation that causes more confusion than clarity.

In this post, we will discuss 8 tips on how we can generate beautiful, interpretable, and effective data visualisations.

By far, the most difficult skill to master is the intuition to pick the right visualisation type to use.

We have bar charts, line charts, pie charts, scatter charts, heat maps, and violin charts — to name a few. It’s super easy to get lost and feel overwhelmed. Head over to the seaborn gallery, and you’d immediately start to comprehend how vast this decision can become.

As expected, this is probably the most common mistake that I see on a regular basis: using the wrong chart type.

Picking the correct chart type is vital, and directly related to the type of data that we are presenting and the message that we want to communicate.

Suppose we have a small dataset that shows how many apple, banana, and orange sales a shop made this month.

# Example data
data = {'apples': 10, 'bananas': 5, 'oranges': 7}

Let us investigate how different chart types translate the message.

For all cases, we would need to import the following packages:

import matplotlib.pyplot as plt
import pandas as pd

Bar Chart

# Bar chart
plt.bar(data.keys(), data.values())
plt.title('Fruit Sales')
plt.xlabel('Fruit')
plt.ylabel('Number of Sales')
plt.show()

The bar chart does an excellent job of showing the values per category (in our case, the type of fruit). This chart clearly shows that apples were the best-selling fruit and bananas the worst-selling.

Line Chart

# Line chart
df = pd.DataFrame(data, index=[0])
df.plot.line()
plt.title('Fruit Sales')
plt.xlabel('Fruit')
plt.ylabel('Number of Sales')
plt.show()

If we try to visualise the same data as a line chart, we get an empty chart: each fruit has only one data point, so there is nothing for a line to connect. A line chart is typically used to show a trend over time, so we would need some kind of ‘moving’ variable. In this case, it could be the sales per month, for a number of different months.
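To make the contrast concrete, a line chart starts to make sense once a fruit has several observations over time. The sketch below uses made-up monthly apple sales; the `months` and `apple_sales` values are illustrative assumptions, not part of the example dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly apple sales (illustrative numbers only)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
apple_sales = [10, 12, 9, 14, 15, 13]

# One point per month, connected in order, so the trend becomes visible
fig, ax = plt.subplots()
ax.plot(months, apple_sales, marker='o')
ax.set_title('Apple Sales per Month')
ax.set_xlabel('Month')
ax.set_ylabel('Number of Sales')
```

Calling `plt.show()` afterwards displays the figure, as in the other examples.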

Scatter Chart

We can also map the same fruit categories to a number and visualise them as a scatter chart. Suppose we have 5 categories and their respective values.

# Scatter plot
x = [1, 2, 3, 4, 5]
y = [10, 5, 8, 3, 6]
plt.scatter(x, y)
plt.title('Data Points')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

As we can see, although the scatter plot does show some kind of variance between the different categories and also helps to indicate their performance, the message is still not presented clearly.

For this particular insight delivery and use-case, I think that we can all agree that the bar chart is perhaps the most suitable.
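The comparison above can be condensed into a rough rule of thumb. The sketch below is only an illustration: `suggest_chart` and its `x_kind`/`y_kind` labels are hypothetical names, and the mapping covers just the three chart types compared in this section.

```python
def suggest_chart(x_kind, y_kind):
    """Suggest a chart type from the kind of data on each axis.

    A deliberately small heuristic (an assumption, not an exhaustive rule):
    it only knows the three chart types compared in this section.
    """
    if y_kind != 'numeric':
        return 'no obvious default'
    suggestions = {
        'categorical': 'bar chart',   # values per category, e.g. sales per fruit
        'time': 'line chart',         # a trend over time, e.g. sales per month
        'numeric': 'scatter chart',   # relationship between two numeric variables
    }
    return suggestions.get(x_kind, 'no obvious default')


# Fruit sales: a categorical x-axis with numeric values
print(suggest_chart('categorical', 'numeric'))  # prints: bar chart
```

A lookup like this is no substitute for judgement; the message we want to communicate still has the final say.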

Infogram have a good article on this topic.

I cannot stress this enough — colour is your best friend in visualisations.

Use colours to highlight the main (or interesting to note) insights.

Use colours to separate groups.

Use colours to shift the audience’s focus on an area that you want them to see.

Use colour to take control of your audience’s attention.

And for the love of God, pick an aesthetically pleasing colour palette that complements the data and the audience. For example, if we are showing fruits, it probably makes sense to have oranges be the colour orange and the banana be the colour yellow. These little details are the difference between a good plot and a great plot. Your viewer should not have to work to understand the plot; rather, it should speak to them and tell them everything that they need to know!

Bonus tip — once you’ve picked your colour palette, stay consistent. Use the same colour palette throughout all of your charts. Especially during a presentation. Do not confuse the viewer. If apples were colour coded red in the first plot, do not colour code them yellow in the one after.
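One way to stay consistent is to define the category-to-colour mapping once and reuse it for every chart. The following is a minimal sketch; `FRUIT_COLOURS`, `DEFAULT_COLOUR`, and `colours_for` are hypothetical names, and the hex codes are just example values.

```python
# Hypothetical fixed mapping from category to colour, defined once and reused
FRUIT_COLOURS = {
    'apples': '#C5283D',   # red
    'oranges': '#E9724C',  # orange
    'bananas': '#FFC857',  # yellow
}
DEFAULT_COLOUR = '#888888'  # neutral grey for categories without an assigned colour


def colours_for(categories):
    """Return one colour per category, in the same order as the input."""
    return [FRUIT_COLOURS.get(name, DEFAULT_COLOUR) for name in categories]


data = {'apples': 10, 'bananas': 5, 'oranges': 7}
bar_colours = colours_for(data)  # iterating over a dict yields its keys
print(bar_colours)  # prints: ['#C5283D', '#FFC857', '#E9724C']
```

The resulting list can be passed as the `color` argument of `plt.bar` in each chart, so apples stay red no matter how the data is ordered.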

Recall the bar chart example that we displayed earlier. Let us spice it up with some colours.

import matplotlib.pyplot as plt
import pandas as pd
data = {'apples': 10, 'bananas': 5, 'oranges': 7}
# create a figure with two subplots
fig, axs = plt.subplots(ncols=2, figsize=(10, 4))
# plot the first chart on the left subplot
axs[0].bar(data.keys(), data.values())
axs[0].set_title('Fruit Sales')
axs[0].set_xlabel('Fruit')
axs[0].set_ylabel('Number of Sales')
# Custom colour palette: red for apples, yellow for bananas, orange for oranges
colors = ['#C5283D', '#FFC857', '#E9724C']
# plot the second chart on the right subplot
axs[1].bar(data.keys(), data.values(), color=colors)
axs[1].set_title('Fruit Sales')
axs[1].set_xlabel('Fruit')
axs[1].set_ylabel('Number of Sales')
# adjust the spacing between the subplots
fig.tight_layout()
# show the plot
plt.show()

As with most things in life, the simpler the better.

Do not include unnecessary elements or styling to your plot if it doesn’t add any value.

Also remember that your objective is to present the findings to your audience as clearly and efficiently as possible. No one cares about your fancy graphics.

Extra stuff will only serve one purpose: distracting your audience.

Suppose we have a dataset containing the total sales of three different products: A, B, and C. We want to create a chart to show the sales trends over time.

import matplotlib.pyplot as plt
import numpy as np
# Generate some fake data
months = np.arange(1, 7)
sales_a = np.array([100, 120, 90, 110, 130, 95])
sales_b = np.array([80, 90, 100, 110, 120, 130])
sales_c = np.array([70, 80, 90, 100, 110, 120])
# Create the chart
fig, axs = plt.subplots(3, sharex=True, sharey=True)
axs[0].plot(months, sales_a, color='red')
axs[0].set_title('Product A')
axs[1].plot(months, sales_b, color='green')
axs[1].set_title('Product B')
axs[2].plot(months, sales_c, color='blue')
axs[2].set_title('Product C')
fig.suptitle('Sales by Product')
plt.show()

The above gets the job done, but rather poorly. The data is spread over three separate charts, each with its own title and colour, which makes the trends difficult to follow and compare.

Let’s simplify this a bit, shall we?

import matplotlib.pyplot as plt
import numpy as np
# Generate some fake data
months = np.arange(1, 7)
sales_a = np.array([100, 120, 90, 110, 130, 95])
sales_b = np.array([80, 90, 100, 110, 120, 130])
sales_c = np.array([70, 80, 90, 100, 110, 120])
# Create the chart
plt.plot(months, sales_a, color='red', label='Product A')
plt.plot(months, sales_b, color='green', label='Product B')
plt.plot(months, sales_c, color='blue', label='Product C')
plt.title('Sales by Product')
plt.legend()
plt.show()

Much better, no?

We can now easily compare the trends with each other.

Of course, this is just a single example. When generating plots, just keep in mind that anything added to the plot must contribute value.
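The same principle applies to the chart frame itself. As a small sketch (assuming, for the sake of the example, that the top and right borders add nothing to this particular plot), elements that carry no information can be switched off explicitly:

```python
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(1, 7)
sales_a = np.array([100, 120, 90, 110, 130, 95])

fig, ax = plt.subplots()
ax.plot(months, sales_a, label='Product A')

# Hide the top and right borders, which carry no information here
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Keep only what the reader needs: a title, axis labels, and a legend
ax.set_title('Sales by Product')
ax.set_xlabel('Month')
ax.set_ylabel('Sales')
ax.legend()
```

Whether a given element counts as clutter depends on the chart; the test is simply whether removing it loses any information.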

I cannot stress this enough — provide context!

It’s astonishing the number of plots that I regularly see that do not have a title or labelled axes. Your audience is not a mind reader. Let them know what they are seeing!

Add labels, titles, legends, your data sources, and annotations when necessary.

Here’s an example of a bad plot without any context (left side) and a great plot with context (right side).
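A side-by-side comparison along these lines can be sketched as follows, reusing the fruit data from earlier. The title wording and the value labels above the bars are illustrative choices rather than a prescribed recipe.

```python
import matplotlib.pyplot as plt

data = {'apples': 10, 'bananas': 5, 'oranges': 7}
fruits = list(data.keys())
sales = list(data.values())

fig, (ax_bare, ax_context) = plt.subplots(ncols=2, figsize=(10, 4))

# Left: no title and no axis labels, so the reader has to guess what is shown
ax_bare.bar(fruits, sales)

# Right: the same bars with a title, axis labels, and value annotations
bars = ax_context.bar(fruits, sales)
ax_context.set_title('Fruit Sales This Month')
ax_context.set_xlabel('Fruit')
ax_context.set_ylabel('Number of Sales')
for bar in bars:
    # Write each bar's value just above its top edge, centred horizontally
    ax_context.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height(),
        f'{bar.get_height():.0f}',
        ha='center',
        va='bottom',
    )

fig.tight_layout()
```

Calling `plt.show()` afterwards displays both panels next to each other.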

We sometimes need to visualise multiple variables with different scales or ranges. It is incredibly important to make sure that we represent all variables using the same scale and focus on the interesting data ranges.

Be careful of misrepresenting your data.

For example, consider the below chart:

import matplotlib.pyplot as plt
import pandas as pd
# Example data
data = {'apples': 10, 'bananas': 5, 'oranges': 7}
# First chart: bar chart with proportional representation and inconsistent y-axis
plt.subplot(1, 2, 1)
plt.bar(data.keys(), data.values())
plt.ylim(0, 500)
plt.title('Fruit Sales')
plt.xlabel('Fruit')
plt.ylabel('Quantity Sold')
# Second chart: bar chart with proportional representation and consistent y-axis
plt.subplot(1, 2, 2)
plt.bar(data.keys(), data.values())
plt.ylim(0, 12)
plt.title('Fruit Sales')
plt.xlabel('Fruit')
plt.ylabel('Quantity Sold')
# Adjust the spacing between the charts
plt.subplots_adjust(wspace=0.3)
# Display the charts
plt.show()

We can clearly appreciate the difference between the two charts. The left one is shown completely out of scale, making it very hard to assess and compare the bars.

The right chart, on the other hand, clearly shows the differences.
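Rather than hand-picking limits, the y-axis range can be derived from the data. The helper below is a minimal sketch under stated assumptions: `bar_ylim` and its `headroom` parameter are hypothetical names, the values are assumed to be non-negative, and the axis is anchored at zero so that bar heights stay proportional.

```python
def bar_ylim(values, headroom=0.2):
    """Return (lower, upper) y-limits for a bar chart of non-negative values.

    The lower limit stays at 0 so bar heights remain proportional; the upper
    limit leaves `headroom` (a fraction of the largest value) above the
    tallest bar.
    """
    values = list(values)
    if not values:
        raise ValueError('values must not be empty')
    return 0, max(values) * (1 + headroom)


data = {'apples': 10, 'bananas': 5, 'oranges': 7}
lower, upper = bar_ylim(data.values())
print(lower, upper)
```

The result can be passed straight to `plt.ylim(lower, upper)`. When several charts are meant to be compared, computing the limits once from all of their values keeps them on the same scale.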

We must try to make the chart as easily understandable as possible.

We must also try to make the chart as interesting as possible.

Great charts convey a direct message. They pick an interesting observation or insight and tell its story. The chart should be used as a means to support the claim.

import matplotlib.pyplot as plt
import pandas as pd
# Example data
year = [2015, 2016, 2017, 2018, 2019, 2020]
sales = [100, 150, 200, 180, 250, 300]
# Line chart with a narrative
plt.plot(year, sales)
plt.title('Sales Growth')
plt.xlabel('Year')
plt.ylabel('Number of Sales')
plt.text(2016.5, 160, 'First year of rapid growth')
plt.text(2018.2, 195, 'Sales decline due to recession')
plt.text(2019.7, 265, 'Sales pick up after the recession')
plt.show()
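Picking the observation to highlight does not have to be done by eye. As a rough sketch, the largest year-over-year increase in the same example data can be found programmatically; `largest_increase` is a hypothetical helper, not part of any library.

```python
def largest_increase(years, sales):
    """Return (year, increase) for the biggest year-over-year rise in sales."""
    changes = [
        (year, current - previous)
        for year, previous, current in zip(years[1:], sales, sales[1:])
    ]
    return max(changes, key=lambda item: item[1])


year = [2015, 2016, 2017, 2018, 2019, 2020]
sales = [100, 150, 200, 180, 250, 300]

peak_year, peak_increase = largest_increase(year, sales)
print(peak_year, peak_increase)  # prints: 2019 70
```

The result can then drive an annotation, so the chart's story follows the data rather than a hard-coded position.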

The plot should serve the needs of the audience.

The plot should help your audience reach their own conclusions faster.

Different audiences have different appetites.

For example, if we just trained our forecasting model and we are presenting the results to some executives, we probably want to focus on the financial aspect. We would likely want to highlight the different KPIs and how this model will improve revenue.

If we are presenting the model to a more technical audience, such as other data scientists or engineers, then we would likely want to focus on the model performance aspect. We would want to highlight the learning curves or focus on evaluation metrics.

The audience will influence our plot. We need to use concepts and the language that our audience knows and understands.

Sometimes, our data is inherently complex and the only way to make this easily understandable by our audience is to make it interactive.

This would allow our audience to manually explore the data and derive their own insights.

We can help them out by adding other interactive components such as tooltips, filters, and zoom to make it as engaging as possible.

Plotly is an awesome tool for generating interactive plots.

import plotly.graph_objs as go
import numpy as np
# Generate random data
x = np.random.rand(100)
y = np.random.rand(100)
# Create a Plotly trace object
trace = go.Scatter(
    x=x,
    y=y,
    mode='markers'
)
# Create a Plotly layout object
layout = go.Layout(
    title='Interactive Scatter Plot',
    xaxis=dict(title='X Axis'),
    yaxis=dict(title='Y Axis'),
    hovermode='closest'
)
# Create a Plotly figure object that combines the trace and layout
fig = go.Figure(data=[trace], layout=layout)
# Display the interactive plot in the Jupyter Notebook
fig.show()
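Tooltips can be tailored as well. The sketch below describes a figure as a plain dictionary, which Plotly also accepts, and sets a custom `hovertemplate` for each point. The product names and the random coordinates are made-up placeholders.

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

# Hypothetical product names used as tooltip text
labels = [f'Product {i}' for i in range(1, 11)]
x = [random.random() for _ in labels]
y = [random.random() for _ in labels]

figure_spec = {
    'data': [{
        'type': 'scatter',
        'mode': 'markers',
        'x': x,
        'y': y,
        'text': labels,
        # %{text}, %{x} and %{y} are filled in per point; <extra></extra>
        # hides the secondary box that would otherwise show the trace name
        'hovertemplate': '%{text}<br>x = %{x:.2f}<br>y = %{y:.2f}<extra></extra>',
    }],
    'layout': {
        'title': {'text': 'Scatter Plot with Custom Tooltips'},
        'xaxis': {'title': {'text': 'X Axis'}},
        'yaxis': {'title': {'text': 'Y Axis'}},
        'hovermode': 'closest',
    },
}
```

Passing the dictionary to `go.Figure(figure_spec)` and calling `.show()` on the result should render it in the same way as the figure above, with zooming and panning available from the toolbar.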