Data Visualizations with a Halloween Candy Dataset | by David Hundley | Oct, 2022


No tricks, just treats! Learn how to create cool, useful data visualizations with a Halloween candy dataset

Title card created by the author

Happy Halloween, friends! In the spirit of the season, I thought it would be fun to take this time to share how we can create some really neat and useful data visualizations using a Halloween candy dataset. Throughout this post, we’re going to cover a breadth of different types of visualizations making use of Python with supporting libraries, including Pandas, Matplotlib, and Seaborn. If you would like to see all the code in one seamless location, I invite you to check out this Jupyter notebook on GitHub. This is the same Jupyter notebook I used to produce every visualization you’ll see in this post, so you can be sure you’re getting every bit of code there!

The dataset we will be using in this post was curated by the good folks at FiveThirtyEight and is housed on Kaggle. If you would like to learn more about that dataset or download it for your own use, this link to Kaggle will get you everything you need. (This dataset is covered by FiveThirtyEight’s MIT license linked here.) A huge thank you to FiveThirtyEight for putting together this dataset; it is much appreciated! 😃

We have a lot of ground to cover in this post, so let’s keep this introduction brief. In the next brief section, we’ll talk a little bit about how we’ll be theming the data visualizations with custom colors, and then we’ll move into the bulk of the post covering all sorts of data visualizations. Let’s get going! 🦇

In addition to being a machine learning engineer and data science enthusiast, I also dabble a little on the side with graphic design. In the graphic design community, we hone in on a specific color hue with a value called a hex value. Yes, this is just like the hexadecimal system you may already be familiar with; however, you may not have known how valuable these hex values are for graphic designers! Graphic designers seek out full color palettes of hex values, which are essentially groupings of colors that look aesthetically pleasing together. In fact, there are many websites purely dedicated to sharing these curated color palettes. I used this site in particular to get my Halloween hex color palette. (And yes, it just now dawned on me the irony of calling it a “Halloween hex color.” 😂) Here is a little image I created that shows the colors I will be using in my data visualizations along with their corresponding hex values.

Graphic created by the author

In addition to adjusting the colors like this, we can also use Seaborn to set the theme of each data visualization to feature a darker background. By default, all our forthcoming visualizations will inherit this design scheme. The way we enable all of this in code is with the following snippet:

Code written by author and generated as visualization with Carbon
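Here is a sketch of what that snippet might look like. The hex values below are placeholders standing in for the palette from the image above; substitute your own palette's values.

```python
import seaborn as sns

# Placeholder Halloween hex palette -- substitute the hex values
# from your own chosen color palette
halloween_palette = ["#FF7518", "#7B2D8B", "#96C457", "#F1E3B3", "#3A3A3A"]

# Dark background theme that all forthcoming plots will inherit
sns.set_theme(style="darkgrid",
              palette=halloween_palette,
              rc={"figure.facecolor": "#272727",
                  "axes.facecolor": "#3A3A3A"})
```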

Now that we have our Halloween theming in place, let’s lay the groundwork for what you can expect in the remainder of the post. It is broken into four primary sections, each representing a respective grouping of visualization types:

  • Categorical visualizations
  • Distribution visualizations
  • Relational visualizations
  • Clustering visualizations

Now, to be perfectly clear, this is NOT an all-inclusive list of every single data visualization grouping. I just wanted to give you a sampling of some of the most common visualizations, as this post is already going to be long enough as is!

Within each of these groupings, I will cover some specific data visualizations, and with each of these visualizations, I will do the following:

  • We will examine what the visualization is and why it’s valuable at a high level.
  • We will then move into hypothesizing a real life example in which we might use that specific visualization with the Halloween candy dataset.
  • Next, we shall create the visualization in code, using Python and libraries like Pandas, Matplotlib, and Seaborn.
  • Returning to our hypothesized scenario, we will conduct a post-analysis using the insights the visualization helps us to intuit.

Simple enough, right? Let’s jump into our first grouping of data visualizations: the categorical visualizations!

Okay, if I’m being completely honest… when I selected this dataset, I didn’t pay that close attention to the nature of the data, particularly the fact that there are a LOT of features with binary values. No matter! This gives us some nice practice displaying these categorical features across the visualizations below.

Count Plots

Just as the name sounds, a count plot is a visualization that counts the occurrences of each value in a feature. Seaborn’s countplot functionality is relatively simple, so I thought I’d shake things up a bit by showing you how to display multiple count plots in a single, holistic visualization.

Real Life Example: We have a whole bunch of these binary features, nine to be precise. These binary features are represented by a 0 for “no” and 1 for “yes”. Wouldn’t it be interesting to see how all these binary features compare in a single, easy-to-read visualization? We’re going to have to do a tiny bit of data preprocessing first, but then we’ll analyze our findings as part of a single visual with these 9 count plots.

Here is the code that enables creating this visualization:

Code written by author and generated as visualization with Carbon
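Here is a sketch of the general approach, using a tiny synthetic stand-in for the candy DataFrame (in practice you would load the real data with something like `pd.read_csv("candy-data.csv")`):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The nine binary features from the candy dataset
binary_cols = ["chocolate", "fruity", "caramel", "peanutyalmondy", "nougat",
               "crispedricewafer", "hard", "bar", "pluribus"]

# Tiny synthetic stand-in; the real values come from candy-data.csv
df = pd.DataFrame({col: [0, 1, 1, 0, 1, 0, 0, 1] for col in binary_cols})

# Display all nine count plots in one holistic 3x3 figure
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for ax, col in zip(axes.flatten(), binary_cols):
    sns.countplot(x=col, data=df, ax=ax)
    ax.set_title(col)
fig.tight_layout()
```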

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: Well, if I’m being completely honest, this visualization isn’t super informative. 😂 I suppose it’s interesting that the fruity and chocolate features seem to have some sort of inverse correlation, which makes sense given that I can’t think of any candy that is both fruity and chocolatey. It’s also interesting to note that pluribus is the only category with more positive candies than negative. I had to go remind myself what pluribus means in this context: it indicates whether a candy comes as multiple pieces in a bag or box versus one holistic item. A pluribus-positive candy would be something like M&Ms, whereas a pluribus-negative candy would be something like a Snickers bar. So even though we didn’t get a whole lot of value out of this particular visualization, I hope you can see the value in visualizing multiple count plots like this!

Pie Charts

Of all the visualizations we’ll be covering in this post, this is probably the one you are most familiar with. A pie chart in this scenario shows us how much each value of a binary feature accounts for the full feature. From a visual perspective, it tells us whether one value is more dominant than the other. This is also the one visualization that doesn’t actually have a Seaborn counterpart, so in order to produce these pie charts, we’re going to have to use bare-bones Matplotlib.

Real Life Example: In the example with the count plot, we took a look at the binary features to understand how they were distributed from a pure count perspective. We’re going to do the same again here, except this time we’ll be visualizing these features each with their own respective pie chart.

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon
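A sketch of the Matplotlib-only approach, again with a synthetic stand-in for the candy DataFrame:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# The nine binary features; synthetic values stand in for candy-data.csv
binary_cols = ["chocolate", "fruity", "caramel", "peanutyalmondy", "nougat",
               "crispedricewafer", "hard", "bar", "pluribus"]
df = pd.DataFrame({col: [0, 1, 1, 0, 1, 0, 0, 1] for col in binary_cols})

# One pie chart per binary feature, laid out in a 3x3 grid
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for ax, col in zip(axes.flatten(), binary_cols):
    counts = df[col].value_counts()
    ax.pie(counts, labels=counts.index, autopct="%1.1f%%")
    ax.set_title(col)
fig.tight_layout()
```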

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: As with the count plot analysis above, these visualizations honestly don’t provide us a whole lot of value. It is interesting how we can intuit something like the bar feature to have more or less a 1:3 ratio, but otherwise, there’s transparently not a whole lot of value we can get out of these specific visualizations. Still, I hope you can see the value in how we might use pie charts and count plots to our advantage in other contexts!

While most of this dataset consists of binary features, there are a handful of features with continuous values. Naturally, it is super common for data scientists to want to understand the distribution of these sorts of features, so this section will cover a few types of distribution visualizations.

Box Plots / Violin Plots

A box plot is a visualization used to understand, at a high level, the primary quantiles of a feature. More specifically, it highlights the median, the interquartile range (IQR), and the minimum and maximum values. The IQR is represented by the center box, whereas the “whiskers” on either side of the box extend toward the minimum and maximum values (by default, the whiskers are capped at 1.5× the IQR, with anything beyond drawn as outlier points). A violin plot is similar to the box plot, except it seeks to visualize the precise distribution of the data. Depending on what exactly you want to know about the data, you may choose one over the other. Or you could do what we’re about to do and show both side by side!

Real Life Example: We have three different features with percentile-based data: sugar percentile, price percentile, and the percentile of how often each candy won when matched up against another candy. The question we’re essentially looking to answer for each of these is: is there a wide distribution in the data, or did all the candies fare about the same for each respective feature? Let’s find out!

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon
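A sketch of the side-by-side approach, with synthetic stand-ins for the dataset's three percentile features:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-ins for sugarpercent, pricepercent, and winpercent
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "sugarpercent": rng.uniform(0, 1, 85),
    "pricepercent": rng.uniform(0, 1, 85),
    "winpercent": rng.normal(50, 12, 85),
})

# Top row: box plots; bottom row: violin plots, one column per feature
pct_cols = ["sugarpercent", "pricepercent", "winpercent"]
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for i, col in enumerate(pct_cols):
    sns.boxplot(y=df[col], ax=axes[0][i])
    sns.violinplot(y=df[col], ax=axes[1][i])
    axes[0][i].set_title(col)
fig.tight_layout()
```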

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: It’s kind of interesting how the green color of the violin plot makes each of those visualizations look like leaves from a tree, but otherwise, the violin plots don’t seem to offer a whole lot of value, as they are more difficult to interpret. The box plots, on the other hand, are much easier to read. In the case of the sugar and price percentiles, it is not surprising to see that the whiskers span the full range while the IQR lies more in the middle. The win percentile, interestingly, has whiskers that are tighter in scope along with a tighter IQR. In my opinion, the box plot for the win percentile is the most informative of the bunch, as it essentially tells us that most candies lie in the middle in terms of favorability. Recall that this feature represents the percentage of wins a candy received when directly matched against another candy, and the IQR here is basically telling us that it was more common than not for a candy to be evenly matched against another. Of course, the whiskers show that there were indeed outliers, but at a quick glance, this box plot tells us a story about the data we wouldn’t quickly glean by analyzing the numbers directly.

Histograms / KDE Plots

I noted above that the violin plot didn’t offer us a whole lot of value since it was difficult to interpret, and the box plot stays at a pretty high level itself, too. What if we wanted to understand the distribution of the data at a more granular level? This is where a histogram / KDE plot can be more helpful. A histogram visualizes the frequency of data within certain bins, sort of similar to a count plot. On top of the histogram, we can overlay a kernel density estimate (KDE), which plots a curve over the histogram to represent the continuous probability density across the bins. (In case you’re curious, the violin plot itself is indeed a KDE curve essentially turned on its side and mirrored across an axis of symmetry.)

Real Life Example: In the box plot / violin plot example, we wanted to see at a high level how the percentile-based features fared in terms of breadth of distribution. What if we wanted a more fine-grained look into how evenly the data is distributed? That is what we will assess with a histogram and an overlaid KDE curve!

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: Now this is actually pretty interesting! These visualizations tell a dramatically different story compared to the box and violin plots we previously analyzed. With the sugar and price percentile features in particular, we noted above that the IQR seemed to indicate that the data is centrally clustered. These visualizations show that this is not at all the case, and with the price percentile feature in particular, we actually see that there are more candies at either end of the spectrum. This is why it’s important to create multiple visualizations of the same data: you can get different, important bits of information from each visualization type!

When starting off work in a new realm of data, one of the first things a data scientist looks for is correlations between various data features. In this section, we’re going to cover a number of these relational visualizations and build an intuition for how they tell a story about the data!

Regression / Scatter Plots

A regression plot is a kind of visualization that compares two numerical features along an x- and y-axis, with each data point represented by a dot and a fitted regression line overlaid. We can also enhance a regression plot with an additional hue feature that visually indicates how a third feature affects the two features we are comparing. In our example below, we are specifically going to use Seaborn’s lmplot functionality, which essentially combines a standard regression plot with the ability to add that hue for a deeper analysis of the data.

Real Life Example: Our sugarpercent feature tells us the percentile of sugar a particular candy has compared to the other candies, while the pricepercent feature tells us the percentile of price that a candy costs compared to the other candies. We also have a third feature, chocolate, which tells us whether or not the candy is chocolate. An lmplot can help us answer the question: do chocolate candies have a stronger and/or higher correlation of sugar-to-price than their non-chocolate brethren?

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon
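A sketch of the lmplot call with synthetic stand-in data; the hue argument draws a separate regression line for chocolate versus non-chocolate candies:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the candy dataset's sugar, price, and
# chocolate columns
rng = np.random.default_rng(42)
n = 85
df = pd.DataFrame({
    "sugarpercent": rng.uniform(0, 1, n),
    "pricepercent": rng.uniform(0, 1, n),
    "chocolate": rng.integers(0, 2, n),
})

# Regression plot of sugar vs. price, with a separate fit line per
# chocolate value via hue
grid = sns.lmplot(x="sugarpercent", y="pricepercent",
                  hue="chocolate", data=df)
```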

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: This visualization offers us a lot of information in one little plot! For both chocolate and non-chocolate candies, there is a very slight positive correlation between sugar percentile and price percentile, but there is so little data in this dataset that I would be very leery to make a broad-sweeping generalization like that. On the flip side, it appears pretty clear that chocolate candy has a higher sugar percentile AND a higher price percentile. This isn’t super surprising, since in my anecdotal experience, chocolate candy does seem to be higher in price than non-chocolate candy. There is more information we could glean from this plot, but I hope the primary takeaway is how much information you can squeeze out of a single plot!

Joint Plots

Up to this point, we have covered histograms, KDE curves, and scatter plots. You might be wondering: what if there were a way to combine these things in a single visualization? We’re going to do just that with Seaborn’s joint plot visualization. A joint plot analyzes the relationship between two features in a way that goes above and beyond a simple scatter plot. You can actually do quite a few things with Seaborn’s jointplot, so in the visualization below, I’m going to create a special type of visualization that almost looks like a topographical elevation map using the KDE curve.

Real Life Example: In the previous example, we examined the relationship of sugar-to-price for chocolate and non-chocolate candies. Now, I want to explore the price-to-win ratio for chocolate and non-chocolate candies. We could produce the same visualization as we did with the lmplot above, but this time, I want to show you the value of using a KDE-based jointplot. Let’s see the results!

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: The interesting thing about this kind of visualization is that it helps us hone in on where the data clusters most densely. For example, we can see that for chocolate candy, the densest concentration is where the candy sits at roughly the 65th price percentile and 70th win percentile. Additionally, we can see the pure KDE curves for each feature represented along the outer side of each respective axis. We can see, for example, that non-chocolate candies tend to fall on the lower end of the price scale, whereas the chocolate candies tend to be a little higher in price. As I noted in a previous analysis, this is not all too surprising, as my anecdotal experience has been that those big bags of chocolate candy in the Target aisle are indeed more expensive!

In this next section, we’ll use some unsupervised clustering to group the data into clusters the algorithm deems fit together. Then, we’ll create some visualizations to help us intuit why the clustering algorithm grouped the data as it did. A bit of a *disclaimer* as we proceed: I’m not sure this is the best dataset to be doing this on. Yes, we’re going to see some results, but the dataset is so small and its features so specific that I’m leery to call these results reliable. That said, the focus of this post is on how to create these visualizations for use in your own context, so I’m not terribly concerned about the reliability of these results. If you would like to see a full use case where unsupervised clustering is legitimately used, I would advise you to check out my Starbucks customer clustering Medium post.

Line Charts

When determining how many clusters we should try organizing the data into, it is common to plot the sum of squared errors (SSE) of the clustering algorithm for each candidate cluster count and look for an “elbow” in the visualization, which marks the ideal number of clusters to maintain. We will demonstrate below how to both calculate the SSE scores and visualize them as a line chart.

Real Life Example: The example in our case is pretty simple: we want to cluster our dataset to perform some deeper analyses onto those clusters to see if we can get any insight that way. We’re not sure how many clusters we should be aiming for, so this visualization will allow us to use the elbow method of determining how many clusters we should target. Let’s test out 1 to 14 clusters to see how our results fare.

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: As expected, a single cluster produces the highest SSE score, followed by a steep drop-off over the first few clusters. After 5–6 clusters, you can see that the SSE more or less hits diminishing returns, so you can now probably see why we call this the elbow method. The ideal number of clusters is found at the “elbow” of this visualization, and we could argue whether that is 2 or 3 clusters.

Scatter Plots

While we already sort of touched on scatter plots during the analysis of the lmplot, I thought it’d be worth revisiting them in this new context. Instead of using the lmplot, we’re going to use Seaborn’s more vanilla scatterplot to create these visualizations. We will also be adding each individual candy’s label directly on top of the chart in these final data visuals.

Real Life Example: When we determined how many clusters we should create via the elbow method, we noted that we could either create 2 or 3 clusters for our dataset. There’s not really a right or wrong answer, but I am personally interested in answering the following question: How is the K-Means algorithm differentiating between data when clustering between either 2 or 3 clusters in terms of the win-to-price ratio?

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon
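A sketch of the labeled cluster scatter plot. The candy names here are hypothetical placeholders for the dataset's competitorname column, and the data itself is a synthetic stand-in:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

# Hypothetical candy names; the real dataset's competitorname
# column holds the actual labels
rng = np.random.default_rng(42)
n = 30
df = pd.DataFrame({
    "competitorname": [f"Candy {i}" for i in range(n)],
    "pricepercent": rng.uniform(0, 1, n),
    "winpercent": rng.normal(50, 12, n),
})

# Cluster on the price/win features and attach the cluster labels
features = df[["pricepercent", "winpercent"]]
df["cluster"] = KMeans(n_clusters=3, n_init=10,
                       random_state=42).fit_predict(features)

# Scatter plot colored by cluster, with each candy's name annotated
fig, ax = plt.subplots(figsize=(10, 7))
sns.scatterplot(x="pricepercent", y="winpercent",
                hue="cluster", data=df, ax=ax)
for _, row in df.iterrows():
    ax.annotate(row["competitorname"],
                (row["pricepercent"], row["winpercent"]), fontsize=7)
```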

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: I personally find these visuals to be the most interesting ones we’ve produced yet. By adding the candy labels directly on top of the visualization, we can more precisely analyze why the clusters were derived as they were. Looking at the first visualization with only two clusters, it seems you can draw a pretty clear line between them: the first cluster contains non-chocolate candies that fared worse in terms of price and win percentiles, while the other cluster seems to consist of more “premium” chocolate candies. Things get even more interesting when we look at the three-cluster visualization. The original “non-chocolate” cluster remains largely intact, but the chocolate cluster gets broken up further into two groups. Interestingly, the points in the green cluster do all seem to have a similarity to them. If it is challenging to read this visual, some of the candies in the green cluster include Twix, Kit Kat, Nestle Crunch, and Snickers Crisper. In my experience, all of those candies have a bit of a crispy, crunchy nature about them. Very interesting indeed!

That brings us to the conclusion of this post! I hope you found all these visualizations to be both fun and interesting. Again, this post doesn’t represent every single kind of visualization we can create, but I hope it provides you with a good sampling of ideas that you can produce in your own context. Thank you for reading this post, and hope you all have a great, safe Halloween! 🎃


No tricks, just treats! Learn how to create cool, useful data visualizations with a Halloween candy dataset

Title card created by the author

Happy Halloween, friends! In the spirit of the season, I thought it would be fun to take this time to share how we can create some really neat and useful data visualizations using a Halloween candy dataset. Throughout this post, we’re going to cover a breadth of different types of visualizations making use of Python with supporting libraries, including Pandas, Matplotlib, and Seaborn. If you would like to see all the code in one seamless location, I invite you to check out this Jupyter notebook on GitHub. This is the same Jupyter notebook I used to produce every visualization you’ll see in this post, so you can be sure you’re getting every bit of code there!

The dataset we will be using in this post was curated by the good folks at FiveThirtyEight and is housed on Kaggle. If you would like to learn more about that dataset or download it for your own use, this link to Kaggle will get you everything you need. (This dataset is covered by FiveThirtyEight’s MIT license linked here.) A huge thank you to FiveThirtyEight for putting together this dataset; it is much appreciated! 😃

We have a lot of ground to cover in this post, so let’s keep this introduction brief. In the next brief section, we’ll talk a little bit about how we’ll be theming the data visualizations with custom colors, and then we’ll move into the bulk of the post covering all sorts of data visualizations. Let’s get going! 🦇

In addition to being a machine learning engineer and data science enthusiast, I also dabble a little on the side with graphic design. In the graphic design community, we really hone in a specific color hue with a computer value called a hex value. Yes, this is just like the hexidecimal system you may already familiar with; however, you may not have known how valuable these hex values are for graphic designers! Graphic designers seek to get a full color palette of hex values, which are essentially groupings of colors that look aesthetically pleasing together. In fact, there are many websites that are purely dedicated to sharing these curated color palettes. I used this site in particular to get my Halloween hex color palette. (And yes, it just now dawned on me the irony of calling it a “Halloween hex color.” 😂) Here is a little image I created that shows the colors that I will be using in my data visualizations along with their correlated hex values.

Graphic created by the author

In addition to adjusting the colors like this, we can also use Seaborn to set the theme of each data visualization to feature a darker background. By default, all our forthcoming visualizations will inherit this design scheme. The way we enable all of this in code is with the following snippet:

Code written by author and generated as visualization with Carbon

Now that we have our Halloween theming in place, let’s lay a groundwork of what you can expect for the remainder the post. The remainder of the post is broken into four primary sections, each representing a respective grouping of visualization types:

  • Categorical visualizations
  • Distribution visualizations
  • Relational visualizations
  • Clustering visualizations

Now, to be perfectly clear, this is NOT an all inclusive list of every single data visualization grouping. I just wanted to give you a sampling of some of the most common visualizations as this post is already going to be long enough as is!

Within each of these groupings, I will cover some specific data visualizations, and with each of these visualizations, I will do the following:

  • We will examine what the visualization is and why it’s valuable at a high level.
  • We will then move into hypothesizing a real life example in which we might use that specific visualization with the Halloween candy dataset.
  • Next, we shall create the visualization in code, using Python and libraries like Pandas, Matplotlib, and Seaborn.
  • Returning to our hypothesized scenario, we will conduct a post-analysis on the data using what the visualization can now help us to intuit better.

Simple enough, right? Let’s jump into our first grouping of data visualizations: the categorical visualizations!

Okay, if I’m being completely honest… when I selected this dataset, I transparently did not pay that close of attention to the nature of the data, particularly given that there are a LOT of features with binary values. No matter! This can give us some nice practice into displaying these categorical features across the visualizations below.

Count Plots

Just as the name sounds, a count plot is a visualization that counts the number of a specific value. Seaborn’s countplot functionality is relatively simple, so I thought I’d shake things up a bit by showing you how to display multiple count plots in a singular, holistic, visualizations.

Real Life Example: We have a whole bunch of these binary features, nine to be precise. These binary features are represented by a 0 for “no” and 1 for “yes”. Wouldn’t it be interesting to see how all these binary features compare in a single, easy-to-read visualization? We’re going to have to do a tiny bit of data preprocessing first, but then we’ll analyze our findings as part of a single visual with these 9 count plots.

Here is the code that enables creating this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: Well, if I’m being completely honest, this visualization isn’t super informative. 😂 I suppose it’s interesting that fruity candy vs. chocolate candy seems to have some sort of correlation, which would make sense in the fact that I can’t think of any candy that is necessarily fruity and chocolatey. It’s also interesting to note that pluribus is the only category that more candies have than not. I had to go remind myself what pluribus means in this context. For context, pluribus means if a candy is essentially multiples in a box or one holistic thing. That said, a pluribus-positive candy would be like M&Ms whereas a pluribus-negative candy would be like a Snickers candy bar. So even though we didn’t get a whole lot of value out of this particular visualization, I hope you can find the value in visualizing multiple count plots like this!

Pie Charts

Of all the visualizations we’ll be covering in this post, this is probably the one you are most familiar with. A pie chart in this scenario shows us how much of each binary feature accounts for the full feature. From a visual perspective, it tells us if one value is more dominant over the other. This is also the one visualization that doesn’t actually have a seaborn counterpart, so in order to produce these pie charts, we’re going to have to use barebones matplotlib.

Real Life Example: In the example with the count plot, we took a look at the binary features to understand how they were distributed from a pure count perspective. We’re going to do the same again here, except this time we’ll be visualizing these features each with their own respective pie chart.

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: As with the count plot analysis above, these visualizations honestly don’t provide us a whole lot of value. It is interesting how we can intuit something like the bar feature to have more or less a 1:3 ratio, but otherwise, there’s transparently not a whole lot of value we can get out of these specific visualizations. Still, I hope you can see the value in how we might use pie charts and count plots to our advantage in other contexts!

While most of this dataset consists of binary features, there are a handful of features with continuous values. Naturally, it is super common for data scientists to want to understand the distribution of these sorts of features, so this section will cover a few types of distribution visualizations.

Box Plots / Violin Plots

A box plot is a visualization used to understand at a high level the primary quantiles of a feature. More specifically, it will highlight the median, interquartile range (IQR), and the outlying minimum and maximum values. The IQR is represented by the center box whereas the “whiskers” on either side of the box represent the outlying minimum and maximum values. A violin plot is similiar to the box plot, except it seeks to provide more of a visualization on the precise distribution of the data. Depending on what you precisely want to know about the data, you may choose one over the other. Or you could do what we’re about to do and show both side by side!

Real Life Example: We have three different features with percentage-based data: sugar percentile, price percentile, and the percentile of how often each candy won when matched up against another candy. The question we’re essentially looking to answer for each of these is, is there a wide distribution in the data, or did they all fare about the same for reach respective feature? Let’s find out!

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: It’s kind of interesting how the green color of the violin plot makes each of those visualizations look like leaves from a tree, but otherwise, the violin plots don’t seem to offer a whole lot of value as they are more difficult to interpret. The box plots, on the other hand, are much easier to interpret. In the case of the sugar and price percentiles, it is not surprising to see that the whiskers span the full range while the IQR lies more in the middle. The win percentile interestingly has whiskers that are tighter in scope with a tighter IQR. In my opinion, the box plot for the win percentile is the most informative of the bunch, as it essentially tells us that most candies lay in the middle in terms of favorability. Recall that this feature represents the percentage of wins it received when directly matched against another candy, and the IQR here is basically telling us that it was more common than not for a candy to be evenly matched against another candy. Of course, the whiskers represent that there were indeed outliers, but at a quick glance, this box plot tells us a very quick story about the data we wouldn’t quickly glean by just analyzing the numbers directly.

Histograms / KDE Plots

I noted above that the violin plot didn’t offer us a whole lot of value because it was difficult to interpret, and the box plot stays at a pretty high level itself, too. What if we wanted to understand the distribution of the data at a more granular level? This is where a histogram / KDE plot can be more helpful. A histogram visualizes the frequency of data within certain bins, somewhat similar to a count plot. On top of the histogram, we can overlay a kernel density estimate (KDE), which plots a curve over the histogram to represent the continuous probability density across those bins. (In case you’re curious, the violin plot itself is essentially a KDE curve turned on its side and mirrored across an axis of symmetry.)

Real Life Example: In the box plot / violin plot example, we wanted to see at a high level how the percentile-based features fared in terms of breadth of distribution. What if we wanted to get a more fine-grained look into how evenly the data is distributed? That is what we will assess with a histogram and an overlaid KDE curve!

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: Now this is actually pretty interesting! These visualizations tell a dramatically different story compared to the box and violin plots we previously analyzed. With the sugar and price percentile features in particular, we noted above that the IQR seems to indicate that the data is centrally clustered. These visualizations show that this is not at all the case, and with the price percentile feature specifically, we actually see that there are more candies at either end of the spectrum. This is why it’s important to create multiple visualizations of the same data: you can get different, important bits of information from each visualization type!

When starting off work in a new realm of data, one of the first things a data scientist looks for is correlations between various data features. In this section, we’re going to cover a number of these correlation visualizations and how we can build an intuition for how these sorts of visualizations better tell a story about the data!

Regression / Scatter Plots

A regression plot is a kind of visualization that generally compares two numerical features along the x- and y-axes, with each data point represented by a dot. We can also enhance a regression plot with an additional hue feature that visually indicates how a third feature affects the two features we are comparing. In our example below, we are specifically going to use Seaborn’s lmplot functionality, which essentially combines a standard regression plot with the ability to add that hue for a deeper analysis of the data.

Real Life Example: Our sugarpercent feature tells us the percentile of sugar a particular candy has compared to the other candies, while the pricepercent feature tells us the percentile of price a candy costs compared to the other candies. We also have a third feature, chocolate, which tells us whether or not the candy is chocolate. An lmplot can help us answer the question: do chocolate candies have a stronger and/or higher correlation of sugar-to-price than their non-chocolate brethren?

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon
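In sketch form, the lmplot is one call once the data is loaded. (The rows below are stand-ins; in the real dataset the chocolate column is likewise 0/1 encoded.)

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import pandas as pd
import seaborn as sns

# Stand-in rows for candy-data.csv
df = pd.DataFrame({
    "sugarpercent": [0.73, 0.60, 0.91, 0.84, 0.22, 0.31, 0.45, 0.10],
    "pricepercent": [0.65, 0.51, 0.88, 0.90, 0.12, 0.27, 0.32, 0.08],
    "chocolate":    [1, 1, 1, 1, 0, 0, 0, 0],
})

# lmplot draws the scatter plot plus a fitted regression line per hue group
g = sns.lmplot(data=df, x="sugarpercent", y="pricepercent", hue="chocolate")
```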

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: This visualization offers us a lot of information in one little plot! For both chocolate and non-chocolate candies, there is a very slight positive correlation between sugar percentile and price percentile, but there is so little data in this dataset that I would be very leery to make a broad-sweeping generalization like that. On the flip side, it appears pretty clear that chocolate candy has both a higher sugar percentile AND a higher price. This isn’t super surprising since, in my anecdotal experience, chocolate candy does seem to be higher in price than non-chocolate candy. There is more information we could glean from this plot, but I hope the primary takeaway is that you can see how much information you can squeeze out of a singular plot!

Joint Plots

Up to this point, we have covered histograms, KDE curves, and scatter plots. You might be wondering: what if there were a way we could combine these things in a single visualization? We’re going to do just that with Seaborn’s joint plot visualization. A joint plot seeks to analyze the relationship between two features in a way that goes beyond a simple scatter plot. You can actually do quite a few things with Seaborn’s jointplot, so in the visualization below, I’m going to create a special type of visualization that almost looks like a topographical elevation map using the KDE curve.

Real Life Example: In the previous example, we examined the relationship of sugar-to-price for chocolate and non-chocolate candies. Now, I want to explore the price-to-win ratio for chocolate and non-chocolate candies. We could produce the same visualization as we did with the lmplot above, but this time, I want to show you the value of using a KDE-based jointplot. Let’s see the results!

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: The interesting thing about this kind of visualization is that it helps to really hone in on where the data clusters most often. For example, we can see that for chocolate candy, the densest concentration is where the candy sits at roughly the 65th price percentile and the 70th win percentile. Additionally, we can see the pure KDE curves for each feature represented on the outer side of each respective axis. We can see, for example, that non-chocolate candies tend to fall on the lower end of the price scale, whereas the chocolate candies tend to be a little higher in price. As I noted in a previous analysis, this is not all too surprising, as my anecdotal experience has been that those big bags of chocolate candy in the Target aisle are indeed more expensive!

In this next section, we’ll use unsupervised clustering to group the data into clusters the algorithm deems fit appropriately together. Then, we’ll create some visualizations to help us intuit why the clustering algorithm grouped the data as it did. A bit of a *disclaimer* as we proceed: I’m not sure this is exactly the best dataset for this exercise. Yes, we’re going to see some results, but the dataset is so small and its features so specific that I’m leery to say these results are reliable. That said, the focus of this post is how to create these visualizations for use in your own context, so I’m not terribly concerned about the reliability of these results. If you would like to see a full use case where unsupervised clustering is legitimately used, I would advise you to check out my Starbucks customer clustering Medium post.

Line Charts

When determining how many clusters we should try organizing the data into, it is common to plot the sum of squared errors (SSE) of the clustering algorithm on a chart and to look for an “elbow” in the visualization as the ideal number of clusters to maintain. We will demonstrate below how to both calculate the SSE scores and visualize them as a line chart.

Real Life Example: The example in our case is pretty simple: we want to cluster our dataset to perform some deeper analyses on those clusters and see if we can gain any insight that way. We’re not sure how many clusters we should be aiming for, so this visualization will allow us to use the elbow method to determine how many clusters to target. Let’s test out 1 to 14 clusters to see how our results fare.

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: As expected, a single cluster produces the highest SSE score, followed by a steep drop-off over the first few clusters. After 5–6 clusters, you can see that the SSE shows more or less diminishing returns, so you can now probably see why we call this the elbow method. The ideal number of clusters is found at the “elbow” of this visualization, and we could argue whether that would be 2 or 3 clusters.

Scatter Plots

While we have already sort of touched on scatter plots during the analysis of the lmplot, I thought it’d be worth revisiting them in this new context. Instead of using the lmplot, we’re going to use Seaborn’s more vanilla scatterplot to create these visualizations. We will also be adding each individual candy’s label directly on top of the chart in these final data visuals.

Real Life Example: When we determined how many clusters we should create via the elbow method, we noted that we could either create 2 or 3 clusters for our dataset. There’s not really a right or wrong answer, but I am personally interested in answering the following question: How is the K-Means algorithm differentiating between data when clustering between either 2 or 3 clusters in terms of the win-to-price ratio?

Here is the code to enable this visualization:

Code written by author and generated as visualization with Carbon
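In sketch form, the labeled cluster scatter plot combines scatterplot with a loop of ax.text calls. The mini table below is purely illustrative (a few real candy names with made-up numbers); the notebook runs this against the full dataset with both 2 and 3 clusters.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

# Hypothetical mini candy table; the numbers are illustrative only
df = pd.DataFrame({
    "competitorname": ["Twix", "Kit Kat", "Snickers", "Smarties",
                       "Dum Dums", "Skittles"],
    "winpercent":   [81.6, 76.8, 76.7, 46.0, 39.5, 63.1],
    "pricepercent": [0.91, 0.77, 0.65, 0.12, 0.03, 0.22],
})

# Cluster on the win-to-price relationship, then color the dots by cluster
km = KMeans(n_clusters=2, n_init=10, random_state=42)
df["cluster"] = km.fit_predict(df[["winpercent", "pricepercent"]])

fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(data=df, x="winpercent", y="pricepercent",
                hue="cluster", ax=ax)
# Annotate each point with the candy's name
for _, row in df.iterrows():
    ax.text(row["winpercent"] + 0.4, row["pricepercent"],
            row["competitorname"], fontsize=8)
```

Changing `n_clusters=2` to `n_clusters=3` produces the second, three-cluster version of the chart.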

And here is what the visualization itself looks like:

Data visualization created by the author with Seaborn / Matplotlib

Post-Analysis: I personally find these visuals to be the most interesting ones we’ve produced yet. By adding the candy labels directly on top of the visualization, we can more precisely analyze why the clusters were derived as they were. Looking at the first visualization with only two clusters, it seems that you can draw a pretty clear line between them: the first cluster contains non-chocolate candies that fared worse than the other cluster in terms of price and win percentiles, while the other cluster seems to consist of more “premium” chocolate candies. Things get even more interesting when we look at the three-cluster visualization. The original “non-chocolate” cluster remains largely intact, but the chocolate cluster gets broken up further into two groups. Interestingly, the points in the green cluster do all seem to share a similarity. If it is challenging to read this visual, some of the candies in the green cluster include Twix, Kit Kat, Nestle Crunch, and Snickers Crisper. In my experience, all of those candies have a bit of a crispy, crunchy nature about them. Very interesting indeed!

That brings us to the conclusion of this post! I hope you found all these visualizations to be both fun and interesting. Again, this post doesn’t represent every single kind of visualization we can create, but I hope it provides you with a good sampling of ideas that you can produce in your own context. Thank you for reading this post, and hope you all have a great, safe Halloween! 🎃
