
Simple Explanations of Basic Statistics Concepts (Part 2) | by Chi Nguyen | Mar, 2023



Simple explanations of different statistics concepts

Photo by Icons8 Team on Unsplash

In Part 1: Simple Explanations of Basic Statistics Concepts, I explained the fundamental ideas behind some statistical concepts, including definitions relating to populations and samples, sampling methods, and confidence intervals. Today, I will provide additional explanations of several frequently encountered statistical topics. Hopefully, this is a simple tutorial even for those who are not familiar with statistics.

Now, let’s dive in!

What is variability & Why does it matter?

When talking about variability, you are talking about how spread out the data is. The mean and median are not designed for this; they only describe the center of the data, not how far the values stray from it.

Do you still remember the example of my family’s lemon farm from Part 1 ^^? In this section, I’ll bring my lemon farm up once again. Each year, my family harvests lemons twice, in January and September. The distribution of the lemons’ weights for each harvest season is shown below. At first glance, the average weights of lemons from both harvest seasons are about the same. However, the weight distribution of the January lemons is more spread out than that of the lemons picked in September. In other words, although the lemons harvested in both months have similar average weights, those harvested in January show more variability. So having the same central tendency doesn’t imply similar degrees of variability, or vice versa.

Fig 1 — Pic by Author

Clearly, knowing the variability is essential, as it helps my family evaluate the quality of lemons across the two harvest seasons and adjust the cultivation methods in January to yield more consistent produce.

Overall, low variability is preferable, as it allows you to make more accurate predictions about the population from the sample data.

So, how do we describe the variability (or spread) in statistics? Let’s look at 4 indicators: range, standard deviation, variance, and interquartile range.

Range

This is the simplest measure of variability and is computed as the difference between the largest and smallest values.

For example, in the January harvest, the heaviest lemon weighs 13g, and the lightest one weighs only 2g. This means the weight varies from 2g to 13g, and the weight range of those lemons is 11g.

Despite its simplicity, the range is rarely used as the sole measure of variability, because it does not take all data points into account. Looking at Figure 2 below, you can see that the range in both cases is 11g, yet the two weight distributions are completely different. That’s why knowing only the range tells you very little about how the data are dispersed, as the quick sketch after Figure 2 illustrates.

Fig 2: Pic by Author
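To make this concrete, here is a minimal NumPy sketch with two made-up weight lists (illustrative values, not the actual data behind Figure 2): both have a range of 11g, yet their standard deviations differ noticeably.

```python
import numpy as np

# Two hypothetical weight lists in grams with the same range but different spread.
# These values are illustrative, not the data plotted in Figure 2.
weights_case_1 = np.array([2, 3, 3, 4, 5, 6, 7, 8, 9, 12, 13])
weights_case_2 = np.array([2, 7, 7, 7, 7, 8, 8, 8, 8, 8, 13])

for name, w in [("case 1", weights_case_1), ("case 2", weights_case_2)]:
    data_range = w.max() - w.min()            # range: largest minus smallest value
    print(f"{name}: range = {data_range}g, std = {w.std(ddof=1):.2f}g")
```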

For further insights, variance and standard deviation are what you might need.

Variance vs. Standard Deviation

Both indicators describe how values are distributed.

When comparing the dispersion of two data sets with roughly the same average, the standard deviation is helpful since it tells us how far, on average, each data point is from the mean. The data set with the smaller standard deviation is more closely concentrated around the mean.

However, there are some caveats when using standard deviation.

First, the standard deviation needs to be evaluated with reference to the mean. For instance, when comparing the weights of pigs, 500g is not a big difference. The story is different with lemons: since the average lemon weighs only about 10g, a difference of 500g would be huge.

Fig 3: Pic by Author

Second, extreme values can affect the interpretation of the standard deviation. A few outliers can inflate it and make the dispersion appear larger than it really is. This leads to my third point: the standard deviation works best when the data follow a roughly normal distribution.

Variance is the square of the standard deviation. A higher variance means the data are more dispersed. Variance can be harder to interpret intuitively, but it is without a doubt an essential metric in statistical tests, such as ANOVA, that test for differences between data sets.
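As a quick illustration, the sketch below computes the sample mean, standard deviation, and variance with NumPy for two hypothetical harvests (the numbers are made up, not my farm’s actual weights); the variance is simply the standard deviation squared.

```python
import numpy as np

# Hypothetical lemon weights in grams for the two harvests (illustrative only).
january = np.array([2, 4, 5, 6, 7, 9, 10, 11, 13, 13])
september = np.array([7, 7, 8, 8, 9, 9, 10, 10, 11, 11])

for name, w in [("January", january), ("September", september)]:
    mean = w.mean()
    std = w.std(ddof=1)   # sample standard deviation
    var = w.var(ddof=1)   # sample variance, equal to std ** 2
    print(f"{name}: mean = {mean:.1f}g, std = {std:.2f}g, variance = {var:.2f}")
```

Similar means, but the January weights are far more spread out.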

Interquartile Range

When the data have an asymmetric distribution, contain extreme values, or are measured at the ordinal level, the interquartile range is the preferable measure.

When the data values are sorted in ascending order, the first quartile (Q1) is the value below which 25% of the data fall. Similarly, the value below which 75% of the data fall is the third quartile (Q3). The IQR is the difference between Q3 and Q1.

Fig 4: IQR — By Author
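Here is a minimal sketch of that calculation using NumPy’s percentile function on made-up lemon weights; note that different quartile conventions exist, so other tools may return slightly different values.

```python
import numpy as np

# Hypothetical lemon weights in grams (illustrative only).
weights = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13])

q1, q3 = np.percentile(weights, [25, 75])   # first and third quartiles
iqr = q3 - q1                               # interquartile range
print(f"Q1 = {q1}g, Q3 = {q3}g, IQR = {iqr}g")
```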

Suppose the September lemons and the January lemons my farm produced have the same median weight. However, as you can see in Fig 5 below, the weight IQR of the January harvest is greater than September’s. This suggests that the lemons picked in January vary considerably in weight, while those harvested in September are more uniform.

Fig 5 — By Author

In general,…

  • If your data is ordinal, use the range and interquartile range to assess the variability of the data
  • If your data has a normal distribution, consider the standard deviation and variance, but be careful of outliers
  • If your data is asymmetric or contains outliers, the interquartile range is the appropriate measure

The standard error is defined as the standard deviation of the sample means. What does that mean?

Assume that my family wants to check the weights of the September lemons. However, we cannot weigh all of them because it would take too much time, so we decide to randomly select 4 batches as samples. For each batch, we carefully weigh every lemon and compute the mean lemon weight of that batch. As a result, we have a total of 4 means, one per batch. Then we calculate the average of these 4 means, and the standard deviation of the batch means around that average is the standard error. To put it another way, the standard error hints at how much the sample means would vary if we kept drawing random batches of lemons to evaluate.
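The sketch below mirrors that procedure with hypothetical batches (illustrative numbers only): compute one mean per batch, then take the standard deviation of those means as the estimated standard error.

```python
import numpy as np

# Four hypothetical batches of September lemon weights in grams (illustrative only).
batches = [
    np.array([8, 9, 10, 11, 9, 10]),
    np.array([7, 10, 11, 9, 10, 8]),
    np.array([9, 9, 10, 12, 8, 10]),
    np.array([10, 8, 9, 11, 10, 9]),
]

batch_means = np.array([b.mean() for b in batches])   # one mean per batch
standard_error = batch_means.std(ddof=1)              # std of the batch means
print(f"Batch means: {batch_means}")
print(f"Estimated standard error: {standard_error:.3f}g")
```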

As you may know from my previous post, sampling error always exists. Therefore, by knowing the standard error, you can assess how well your sample represents the population and then draw reliable insights. A low standard error indicates that the sample means are close to the population mean, suggesting well-represented samples, and vice versa.

Fig 6: Standard error example — By Author

The standard error is normally smaller than the standard deviation of the individual values, because averaging over a sample smooths out individual variation: for a sample of size n, the standard error equals the standard deviation divided by the square root of n.

The Coefficient of Variation (CV) can be understood simply as a measure of relative variability: it relates the standard deviation to the scale of the data. Why do I say this? Let’s take a very simple example.

Suppose I have two data sets, A and B, as below. For each data set, I can easily compute the mean and standard deviation. It turns out that the two data sets have the same standard deviation. However, does that mean they have a similar spread? The answer is no.

Looking at list A, you can see that the maximum value is 10 times greater than the minimum. Meanwhile, in list B, the maximum value is only 1.009 times bigger than the minimum. So it’s obvious that, relative to their size, the data points in list B are much closer to each other than those in list A. In this case, the standard deviation alone is no longer helpful, since it is expressed in absolute terms. We need another measure that takes the scale of the data set into account, and that is the CV.

The CV is calculated by dividing the standard deviation by the mean. The CV of A is 0.94, while the CV of B is only 0.004. Clearly, this result shows us a completely different picture than the standard deviation alone.

Fig 7: CV — By Author
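Below is a minimal sketch of the CV calculation on two made-up lists (not the exact values from the figure, so the numbers differ slightly): both lists have the same standard deviation, yet their CVs are worlds apart.

```python
import numpy as np

# Made-up lists A and B, chosen so the standard deviations match
# while the scales differ greatly (not the author's exact data).
A = np.array([1.0, 3.0, 5.0, 7.0, 10.0])                 # max is 10x the min
B = np.array([1000.0, 1003.0, 1005.0, 1007.0, 1009.0])   # max is ~1.009x the min

for name, data in [("A", A), ("B", B)]:
    std = data.std(ddof=1)
    cv = std / data.mean()   # coefficient of variation = std / mean
    print(f"{name}: mean = {data.mean():.2f}, std = {std:.2f}, CV = {cv:.4f}")
```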

Generally, the CV is more helpful than the standard deviation when we want to compare data sets measured in different units, such as amounts of money in Zimbabwean dollars vs. US dollars.

There is so much more to cover in statistics, and I will come back with more posts soon. I hope I have made these concepts a little bit clearer. Thanks for bearing with me until the end.

Cheers.


