
9 Actionable Ways to Improve Your Data Visualization Game | by Ygor Serpa | May, 2022



Photo by Alexander Sinn on Unsplash

There is an excellent quote by Patrick Winston that says:

“Your success in life will be determined largely by your ability to speak, your ability to write, and the quality of your ideas, in that order.”

While he didn’t specifically say data visualization, I humbly believe he would agree that our ability to plot data falls within the writing category. And, as in writing, there are many pitfalls out there that we all can easily trip into unless we are well aware of them beforehand.

In this article, I summarize some of the main insights I acquired over the years on how to improve my data visualization skills. These broadly fall into two categories: things we should always do and how to communicate ideas better. As in most of my articles, I try to give you as much actionable advice as possible. By the end of this article, you should have a couple of new tricks up your sleeve and some new resources to get inspired.

The most important aspect of a good plot is being self-contained. It should indicate what it shows (title), what its axes represent (axis labels), the units of measure used, the value range it covers, and what series it has. In other words, your plot should be able to stand on its own without the need for anything else. Your only fallback should be the plot’s caption.

One of the worst reading experiences is to find an excellent-looking plot while skimming through a paper, only to find out that you cannot understand what it shows without reading the full article because it is full of abbreviations and custom notation. While you may think this might get people into reading your work, it is more likely to make them close it and move on to the next.

Quick Checklist: your plot has a title, each axis has a label, the numerical ranges are clear, there is a grid (or a hint of one), and the series are properly named. Any other addition should also be clear. For instance, if you go for a logarithmic plot, make sure this is easily noticeable.
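As a rough illustration of the checklist, here is a minimal Matplotlib sketch. The data, series names, and output file name are made up for this example and do not come from the plots in this article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical measurements, made up for illustration only.
object_counts = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000]
timings_ms = {
    "Baseline": [1.2, 2.9, 7.1, 18.0, 45.0, 110.0],
    "Optimized": [0.9, 1.8, 3.7, 7.6, 15.5, 31.0],
}

fig, ax = plt.subplots(figsize=(6, 4))
for name, values in timings_ms.items():
    ax.plot(object_counts, values, marker="o", label=name)  # properly named series

ax.set_title("Update time per frame vs. number of objects")  # what the plot shows
ax.set_xlabel("Number of objects")                           # axis label
ax.set_ylabel("Time per frame (ms, log scale)")              # label with units
ax.set_yscale("log")                                         # log scale, also stated in the label
ax.grid(True, which="both", alpha=0.3)                       # a hint of a grid
ax.legend(title="Method")
fig.savefig("self_contained_plot.png", dpi=150)
```

Stating "log scale" in the axis label is one simple way to make the logarithmic axis easy to notice.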

Here is a plot that could have been better:

Image by the Author

The main issue here is not the plot itself but the amount of information it leaves unexplained. Note how much info had to be put in the caption to fill all the gaps. For instance, a color ramp could have indicated that the shading correlates with sampling density.

Image by the Author

From the same work, here is a much better plot. The X-axis shows three settings (Free Fall, Brownian, and Gravity), each with three sub-settings (SAP Only, KD-Only, and KD + SAP), and three series (Single, SSE, and AVX). The Y-axis represents time, and one can see it is plotted logarithmically due to our familiarity with powers of ten.

Go the Extra Mile: The cherry on the cake is the numbers on top of each bar. It is hard for most people to make numerical sense of logarithmic plots. By showing the actual numbers, we can convey the exponential speed-ups while still presenting an intuitive view of the results. It is often tough to show the numbers behind each data point without leading to visual pollution. However, whenever you can, please do it, especially when dealing with things like logarithmic scales.
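One possible way to put the numbers on top of each bar in Matplotlib is sketched below. The series names follow the plot described above, but the timing values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical timings in milliseconds, made up for illustration.
labels = ["Single", "SSE", "AVX"]
times_ms = [420.0, 95.0, 12.5]

fig, ax = plt.subplots(figsize=(5, 4))
bars = ax.bar(labels, times_ms, color="tab:blue")
ax.set_yscale("log")
ax.set_ylabel("Time (ms, log scale)")
ax.set_title("Run time per instruction set")

# Write the actual value just above each bar.
for bar, value in zip(bars, times_ms):
    ax.annotate(
        f"{value:g}",
        xy=(bar.get_x() + bar.get_width() / 2, value),
        xytext=(0, 3),               # offset the text 3 points above the bar top
        textcoords="offset points",
        ha="center",
        va="bottom",
    )

ax.set_ylim(top=max(times_ms) * 2)  # head room so the labels stay inside the frame
fig.savefig("bars_with_values.png", dpi=150)
```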

Visually, every plot should be complete enough that viewers can read off the numerical values it represents. This means there is some hint of a grid and a clear frame around the data. Always remind yourself that plots are visual tables, not pictures. You wouldn’t blur a table, would you? Don’t make the numbers hard to see.

Let’s start with a bad example:

Image by the Author

All we know here is we have a bunch of curves. Some go beyond 100, and the plot ends at 32 thousand objects. We can compare series with each other, but we have no real numerical sense of it all — unless you grab a ruler and a pencil to sketch some reference values. Now consider the following:

Image by the Author

This time around, we do not need a ruler to see that the black curve ends below 20, the green one ends almost at 40, etc. The added markings and the subtle grid make all the difference. One could write a relatively accurate table out of this visualization without the need for a ruler or some other device.
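A subtle grid with explicit tick spacing could be set up along these lines. This is only a sketch: the data points and the tick intervals are assumptions chosen for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

# Hypothetical data, made up for illustration.
objects_k = [0, 8, 16, 24, 32]   # thousands of objects
time_ms = [0, 4, 9, 14, 19]      # time in milliseconds

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(objects_k, time_ms, color="black", marker="s", label="Proposed")
ax.set_xlabel("Objects (thousands)")
ax.set_ylabel("Time (ms)")

# Explicit tick spacing gives the reader reference values to read against.
ax.xaxis.set_major_locator(MultipleLocator(8))
ax.yaxis.set_major_locator(MultipleLocator(5))
ax.yaxis.set_minor_locator(MultipleLocator(1))

# Light gray lines: visible enough to guide the eye, quiet enough not to compete with the data.
ax.grid(True, which="major", color="0.8", linewidth=0.8)
ax.grid(True, which="minor", color="0.9", linewidth=0.4)
ax.set_axisbelow(True)  # draw the grid behind the curves
ax.legend()
fig.savefig("subtle_grid.png", dpi=150)
```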

Go the Extra Mile: It can be pretty effective to literally add frames and arrows to a plot. The idea is to guide your viewer toward what is essential. These additions can be significant when the visualization is complex or noisy. Here is an example:

Image by the Author
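If the frame and arrow need to stay inside the plotting script, one option is Matplotlib's `annotate` together with a `Rectangle` patch. The sketch below uses made-up data, and the coordinates were picked by hand for this particular series.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Hypothetical series with a short spike, made up for illustration.
x = list(range(10))
y = [5, 5, 6, 5, 6, 14, 15, 6, 5, 5]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y, marker="o")
ax.set_xlabel("Step")
ax.set_ylabel("Value")
ax.set_ylim(0, 18)

# Frame around the region of interest, given in data coordinates.
frame = Rectangle((4.5, 12.5), 2.0, 4.0, fill=False, edgecolor="red", linewidth=1.5)
ax.add_patch(frame)

# Text with an arrow pointing at the bottom edge of the frame.
ax.annotate(
    "Spike worth explaining",
    xy=(5.5, 12.5),        # arrow tip
    xytext=(1.0, 10.0),    # text position
    arrowprops=dict(arrowstyle="->", color="red"),
)
fig.savefig("annotated_plot.png", dpi=150)
```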

Look at the above plot. Using tools like Matplotlib can be daunting when we need to add boxes, text, and arrows to plots. Getting the positioning right is a pain, and you will always need to do some trial-and-error work whenever you need to reposition stuff. The same goes when we are trying to find better font sizes or colors — an endless cycle of guesswork.

The better approach is to export stuff to another app. Most of the plots I am most proud of were made through several tools and a small dose of image editing. For example, it is far easier to add arrows and boxes using PowerPoint or Slides than within plotting libraries or through Excel. You can also try on colors using free tools like Paint.Net and its filters, such as hue-saturation and brightness-contrast.

Go the Extra Mile: most plotting packages export to SVG, so you can easily alter anything with vector editors such as Inkscape or Adobe Illustrator. Plus, SVG is an XML-based file format, so you can open it in VS Code and do things such as mass color changes using find and replace.
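The same find-and-replace idea can also be scripted. Below is a small sketch using only the Python standard library; the function name and the inline sample SVG are hypothetical, and a real file would be read from and written back to disk.

```python
import re

def replace_svg_color(svg_text, old_hex, new_hex):
    """Replace one hex color with another in SVG source text, ignoring letter case."""
    # The lookahead avoids matching a short code that is only a prefix of a longer one.
    pattern = re.compile(re.escape(old_hex) + r"(?![0-9a-fA-F])", flags=re.IGNORECASE)
    return pattern.sub(new_hex, svg_text)

# Tiny stand-in for an exported plot: two rectangles with different fills.
sample_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="40" height="20">'
    '<rect width="20" height="20" style="fill:#1f77b4"/>'
    '<rect x="20" width="20" height="20" style="fill:#FF7F0E"/>'
    "</svg>"
)

recolored = replace_svg_color(sample_svg, "#ff7f0e", "#2ca02c")
print(recolored)
```

Only the orange fill changes here; the blue rectangle is left untouched.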

LaTeX tip: LaTeX does not support SVG natively. However, you can convert your SVG files to PDF and include them as images for vector-quality visualizations.

Before I go on, look at these three plots. They show the evolution of two processes. The lower the curve goes, the better. Within this setting, which of the following achieved the best results?

Image by the Author.

If you said the middle blue curve, you are… wrong. Or right?

The catch is that these are all the same curves. What changes from plot to plot is the Y-axis scale. The first goes from 0 to 1, the second from 0.085 to 1.285, and the third from 0 to 0.5. In a sense, these are just different zooms of the same curves. Below is how they look when we set the same range for all three plots:

Image by the Author. Each plot uses the same Y-axis settings.

Especially if you are comparing several approaches to the same problem using multiple graphs, always make sure they use the same viewing scales. Moreover, use reasonable scales. For instance, if your data is a percentage, stick to the 0–100 range. If a variable can only range from 1 to 2, use that.
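In Matplotlib, one way to keep the scales consistent is to create the panels with a shared Y-axis. A minimal sketch, with hypothetical curves and approach names:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

steps = list(range(1, 11))
# Hypothetical loss curves, made up for illustration.
runs = {
    "Approach A": [1.0 / s for s in steps],
    "Approach B": [0.5 / s + 0.1 for s in steps],
    "Approach C": [0.8 / s ** 0.5 for s in steps],
}

# sharey=True links the Y-axis of all panels, so they cannot drift apart.
fig, axes = plt.subplots(1, 3, figsize=(10, 3), sharey=True)
for ax, (name, values) in zip(axes, runs.items()):
    ax.plot(steps, values)
    ax.set_title(name)
    ax.set_xlabel("Step")
    ax.grid(True, alpha=0.3)

axes[0].set_ylabel("Loss (lower is better)")
axes[0].set_ylim(0.0, 1.0)  # applies to every panel because the axis is shared
fig.savefig("shared_scale.png", dpi=150)
```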

In this context, malicious plotters like to use this “minor mistake” to alter our perceptions of data intentionally. For instance, consider the three plots below. On the left, we have the original data; at the center position, we have a carefully selected scale to make the blue approach look like it is surprisingly better than the orange curve; on the right, the axis range makes it look like both approaches converge to the same values. Never do this.

Image by the Author. Wildly malicious Y-axis settings.

Here is another example of how scales can be used to mislead people intentionally. In the following chart, the fictitious company grew 1.1%, then 1.2%, and then 1.5%. The growth rate is relatively modest, but the visualization makes it look like the company is growing exponentially (there is even an arrow!).
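To see how much an axis choice can distort such numbers, here is a small arithmetic sketch. It assumes the three growth rates are drawn as bars and that the misleading version starts its Y-axis at 1.0; that baseline is my assumption for the example, not a detail taken from the chart.

```python
def apparent_ratio(first, last, axis_min):
    """Ratio of the drawn bar heights when the axis starts at axis_min."""
    return (last - axis_min) / (first - axis_min)

growth = [1.1, 1.2, 1.5]  # percent per period, as quoted in the text

honest = apparent_ratio(growth[0], growth[-1], axis_min=0.0)
truncated = apparent_ratio(growth[0], growth[-1], axis_min=1.0)  # assumed baseline

print(f"Axis from 0.0: last bar looks {honest:.2f}x the first")     # 1.36x
print(f"Axis from 1.0: last bar looks {truncated:.2f}x the first")  # 5.00x

# Compounded growth over the three periods.
total = 1.0
for rate in growth:
    total *= 1.0 + rate / 100.0
print(f"Total growth: {(total - 1.0) * 100:.2f}%")  # 3.85%
```

With the truncated axis, the last bar is drawn about five times as tall as the first, although the underlying rate is only about 1.36 times larger.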

Go the Extra Mile: Here are two great articles showing how malicious actors lie using data visualization: “How To Lie with Charts” and “Misleading Statistics.” Both pieces feature real-world plots that show how these tactics are used in practice. Make sure you never make any of these errors (nor fall for them).

Unless you are a designer or have a great sense of aesthetics, stick to tried-and-tested color schemes. We hardly have any reason not to, and it saves time. Matplotlib’s documentation has a great article on its color maps and the idea behind them. Excel users can just copy these colors by hand.

Here is a quick overview of some of the available presets:

Generated using matplotlib

The available colormaps are neatly grouped into sequential, diverging, cyclic, qualitative, and miscellaneous. Additionally, Matplotlib provides perceptually uniform schemes, which are great when considering how your plots will look when printed to grayscale.
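To use one of these presets for line colors, you can sample it at evenly spaced positions. The helper below is a hypothetical convenience function written for this example; `viridis` is one of Matplotlib's perceptually uniform colormaps.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

def sample_colormap(name, count):
    """Return `count` evenly spaced RGBA colors from a named Matplotlib colormap."""
    cmap = plt.get_cmap(name)
    if count == 1:
        return [cmap(0.5)]
    return [cmap(i / (count - 1)) for i in range(count)]

series_colors = sample_colormap("viridis", 4)

fig, ax = plt.subplots(figsize=(6, 4))
for index, color in enumerate(series_colors):
    # Hypothetical series, made up only to show the colors in use.
    ax.plot([0, 1, 2], [index, index + 1, index + 0.5], color=color, label=f"Series {index + 1}")
ax.legend()
fig.savefig("sampled_colormap.png", dpi=150)
```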

Out of the miscellaneous category, two are noteworthy. First, cubehelix is an excellent scheme. Second, Jet is the color scheme we associate with flames and night vision — it works nearly always.

Generated using matplotlib

Go the Extra Mile: If you feel like you need your own sauce, you can always get some of these color maps on Paint.net or Photoshop and play with the Hue and Saturation values to develop novel ideas. Then, you need to copy the colors by hand or script the changes directly. Here are two samples playing with the hue and saturation of the Jet and Cube Helix color maps:
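Scripting the change could look roughly like this. The sketch shifts hue and scales saturation in HSV space using the standard `colorsys` module, which only approximates what an image editor's hue-saturation filter does. The function name and the shift values are assumptions for the example, and a shifted map is not guaranteed to keep the perceptual properties of the original.

```python
import colorsys
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def shift_colormap(name, hue_shift=0.0, saturation_scale=1.0, steps=256):
    """Build a new colormap by rotating hue and scaling saturation of an existing one."""
    base = plt.get_cmap(name)
    colors = []
    for i in range(steps):
        r, g, b, a = base(i / (steps - 1))
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + hue_shift) % 1.0                     # hue wraps around the color wheel
        s = min(1.0, max(0.0, s * saturation_scale))  # keep saturation within [0, 1]
        colors.append((*colorsys.hsv_to_rgb(h, s, v), a))
    return ListedColormap(colors, name=f"{name}_shifted")

custom = shift_colormap("cubehelix", hue_shift=0.25, saturation_scale=0.8)

# Preview the result as a horizontal color strip.
fig, ax = plt.subplots(figsize=(6, 1))
ax.imshow([list(range(256))], aspect="auto", cmap=custom)
ax.set_axis_off()
fig.savefig("shifted_colormap.png", dpi=150)
```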

Generated using matplotlib. Edited by the Author.

Imagine comparing the US, Brazil, and Canada. Which color would you use? How would you choose if it were Facebook, Twitter, YouTube, and Instagram? We naturally associate countries with their flags and brands with their logos. This association can imply certain choices. For instance, consider the following:

Image by the author. Data from the IMF

We have the USA in green, Brazil in red, and Canada in blue on the left. It is entirely arbitrary. On the right, I remapped the USA to light blue, Brazil to green, and Canada to red. If you are familiar with these countries, you will associate these colors with the respective flags. This is an explicit use of semantics. You might say the blue here is lighter than the one used on the US flag, but that is better than painting it as green (or Brazil in red).
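In code, a simple way to keep such semantic choices consistent across plots is a fixed name-to-color mapping. The color names below are my picks to echo the remapping described above, and the values are placeholders, not the IMF data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

# One mapping reused by every plot keeps the association stable.
SEMANTIC_COLORS = {"USA": "deepskyblue", "Brazil": "green", "Canada": "red"}

# Placeholder values in arbitrary units, made up for illustration.
placeholder_values = {"USA": 3.0, "Brazil": 1.0, "Canada": 2.0}

names = list(placeholder_values)
fig, ax = plt.subplots(figsize=(5, 4))
bars = ax.bar(
    names,
    [placeholder_values[n] for n in names],
    color=[SEMANTIC_COLORS[n] for n in names],
)
ax.set_title("Placeholder metric by country")
ax.set_ylabel("Value (arbitrary units)")
fig.savefig("semantic_colors.png", dpi=150)
```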

By the way, the plot shows that the US GDP is around 20… 20 what? Billions? Trillions? You might know the answer, but not everyone does. Also, what does GDP stand for? This is an intentional example of bad labels (check tip #1). The title ought to present “Gross Domestic Product (GDP),” while the axis label should be “Trillions of Dollars.”

As a second example, here is a problematic rendition of the 2020 US presidential election results:

Source: Wikipedia. Edited by the author.

No media outlet would ever paint the Democratic and Republican parties in shades other than blue and red (just see the original version). The whole idea here is to leverage the pre-existing relationships viewers have with what we intend to present so they can intuitively understand the data.

Go the Extra Mile: Sometimes, the semantics can be pretty subtle and limited to a specific audience. For example, the colors indicate the algorithm’s source in the following plot. Green solutions are baselines, blue solutions come from the Bullet library, and the remaining approaches come from different sources. This relationship is not immediately apparent to those unfamiliar with the subject matter but is of great aid to those who are familiar with it.

Image by the Author

Consider the above plot once again. Some series use an open marker (e.g., unfilled square), while others use closed markers (e.g., filled square). This is a handy way to show solutions are related. For instance, the two DBVT series share the square marker, one open and the other closed. The same goes for BF/SAP and Grid BF/SAP.

There are many subtle ways to show relationships. For example, carefully assigning markers can work wonders. On top of that, there are other tools we can use, such as line style and the filling shade. For example, if we had Parallel DBVT F and Parallel DBVT D solutions, we could use dashed lines for the serial approaches and solid lines for the parallel ones. Likewise, we could append a P to the parallel counterparts.
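Here is a sketch of that idea in Matplotlib: the marker shape ties the DBVT family together, the fill separates the F and D variants, and the line style separates serial from parallel. The numbers are hypothetical, and the parallel variants are the imagined ones from the paragraph above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
# (label, hypothetical values, filled marker?, parallel?)
series = [
    ("DBVT F", [4, 6, 9, 13], True, False),
    ("DBVT D", [5, 7, 10, 14], False, False),
    ("Parallel DBVT F", [2, 3, 5, 7], True, True),
    ("Parallel DBVT D", [3, 4, 6, 8], False, True),
]

fig, ax = plt.subplots(figsize=(6, 4))
for label, values, filled, parallel in series:
    ax.plot(
        x,
        values,
        marker="s",                             # same shape: same family
        fillstyle="full" if filled else "none", # filled vs. open marker
        linestyle="-" if parallel else "--",    # solid for parallel, dashed for serial
        color="tab:blue",
        label=label,
    )
ax.set_xlabel("Setting")
ax.set_ylabel("Time (ms)")
ax.legend()
fig.savefig("related_series.png", dpi=150)
```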

Here are some ideas on how to relate series employing custom markers:

Simple ideas to convey relationships within markers. Image by the Author

The first option allows us to present two variables (filled/unfilled and serial/parallel). The second row could represent a baseline approach, optimization A, optimization B, and A + B. The third row could showcase increased levels of some properties, such as optimization aggressiveness. Finally, the fourth row uses letters, allowing for more than 4 variants.

Go the Extra Mile: Actually, don’t. Go for a table or multiple plots instead. Sometimes it is best to keep things simpler and more focused than trying to jam everything into a single master plot.

Although it may sound old-fashioned, many people still print things, and most of the time, they won’t do it in color. So always make sure your plots work under limited color settings. If you think this is too much, consider this as a contrast check. If your plot still works in grayscale, the color scheme you chose has sufficient contrast to please a broad audience.
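A quick, scriptable version of this contrast check is to convert each series color to an approximate gray level and look for neighbors that end up too close. The sketch below uses the common Rec. 601 luma weights; the palette and the 0.1 threshold are arbitrary assumptions for the example.

```python
def gray_level(hex_color):
    """Approximate gray level (0 = black, 1 = white) using Rec. 601 luma weights."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b

def too_close(colors, min_gap=0.1):
    """Return pairs of series whose gray levels differ by less than min_gap."""
    names = sorted(colors, key=lambda name: gray_level(colors[name]))
    return [
        (a, b)
        for a, b in zip(names, names[1:])
        if gray_level(colors[b]) - gray_level(colors[a]) < min_gap
    ]

# Hypothetical palette, made up for illustration.
palette = {
    "Proposed": "#000000",
    "Baseline": "#ff0000",
    "Variant": "#0000ff",
    "Reference": "#00aa00",
}

for name, color in palette.items():
    print(f"{name}: {gray_level(color):.3f}")
print("Hard to tell apart in grayscale:", too_close(palette))
```

In this palette, the red and green series land at similar gray levels, so they would need different markers or line styles to stay distinguishable on paper.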

Consider the following grayscale plot:

Image by the author.

Despite not having colors, it is relatively easy to identify each curve by following the markers and overall shade of gray. Also, pay attention to how much weight the only black curve has in relation to the others. I recommend using black for what you want to draw the most attention to, for instance, your own proposed approach.

Go the Extra Mile: On a more extreme note, you can also force everything to black to simulate a faulty printer. Here is the above plot in full black. We can no longer fully differentiate DBVT F from AxisSweep or CGAL from KD-Tree, but we can still understand what is going on. You can replicate this using the Color Curves function of Paint.Net or messing with the contrast.

Image by the author.

Most of the time, what you want to plot is not a simple 2D dataset that could easily fit within a scatterplot. Instead, it is a multi-dimensional or hierarchical problem with no clear visualization. In such cases, your best bet is to browse galleries for inspiration.

My three go-to sources of inspiration are:

The idea here is to glance at each sample and think, “how would my data probably look if I plotted it this way?” Alternatively, “does this plot show the number of dimensions I need?” For instance, this can show several variables in a condensed space, while a plot matrix can be a great way to visualize several dimensions at once. From Plotly, there are some cool visualizations, like the Wind Rose and Ternary plots.

You might be wondering what the Papers with Code newsletter is doing here. If you check it out right now, you will see that, along with the reading suggestions, they cherry-pick the most informative diagrams and plots from the works they feature. It is a visualization inspiration gold mine if you ask me. Be my guest and have a walk around the last few issues.

Go the Extra Mile: In a way, related work is also a gallery of sorts. For example, say you are writing an academic paper on a novel classification architecture, unsure of what to plot and how. To your rescue, there are plenty of papers on this topic from which you can draw some inspiration. It is a wise investment to save a picture of every incredible plot you see for later reference. Here is a great place to start.


