
Correlation vs covariance: it’s much simpler than it seems | by Giuseppe Mastrandrea | Jun, 2022



What is correlation? How can we compute the correlation between two continuous variables? And how does it differ from covariance?

Clearly, two people who understand what correlation is. Photo by Jill Wellington: https://www.pexels.com/it-it/foto/due-persone-in-piedi-nella-fotografia-di-sagoma-40815/

Machine learning is a wonderful field of study. Studying machine learning means taking the most interesting concepts from the most disparate fields (math, finance, biology, computer science, etc.) with the aim of producing accurate and reliable predictive models. In my experience as a Machine Learning and Data Science teacher at Datamasters, students have more than once gotten confused about basic concepts and indexes related to the world of data science. In my first article (available in Italian here) I wrote about some of these indexes: variance, standard deviation and covariance.

In this article we’re going to study another index that can sound confusing, but let me tell you this: it’s definitely not rocket science. We’re talking about the correlation coefficient. Correlation looks a lot like covariance, and its purpose is very specific: to tell us whether a relation between two random variables exists and, if so, of what kind. The unusual thing is that under the term “correlation” you can find many formulas and coefficients, very different from each other. The choice of one coefficient over another depends on the type of variables for which we want to calculate the correlation.

Seems like a big deal, uh? Well, maybe. Under certain circumstances, which turn out not to be restrictive at all, reality is much simpler than you might expect. Let’s start with two random variables, the good old “weight” and “height” of six people:

Weight (kg): 100, 80, 75, 56, 66, 79
Height (cm): 194, 182, 184, 162, 171, 189

Let’s visualize these points:

Visualization of our dataset made with pyplot. Image by the author.

Before we start, let’s make a statement. These variables are numerical, i.e. variables that can assume any value in a numeric set or in an interval of that set. They are not categorical variables (variables whose possible values belong to a predefined set, e.g. “hair color”, which can only take values in the set [“brown”, “blonde”, “black”, etc.]). For numerical variables like the ones we introduced earlier, the most common choice of correlation coefficient is the Pearson coefficient. The formula is:

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)

As we can see, the Pearson correlation is nothing more than a fraction: the covariance in the numerator, and the product of the two variables’ standard deviations in the denominator.

The Pearson coefficient is used to detect a linear relation between two continuous random variables. If you want to capture a monotonic but non-linear relation between two random variables, you have to use other coefficients (e.g. the Spearman coefficient). The formula for two samples is slightly more complex, but at its core we always use a covariance measurement “normalized” by the standard deviations of the variables. In the end, to compute the correlation between two random variables (X, Y) we have to do the following (see the Python sketch after the list):

  • Compute each variable’s mean
  • Compute each variable’s standard deviation:
    – compute the squared difference between every sample and the variable’s mean
    – sum all these squares
    – divide by the number of samples
    – take the square root of this fraction
  • Compute the covariance between X and Y:
    – for each entry in the dataset, multiply the difference between the X-component and the mean of X by the difference between the Y-component and the mean of Y
    – sum these products
    – divide by the number of samples
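
If it helps to see the recipe as code before we do it by hand, here is a minimal from-scratch sketch in plain Python (the helper names are our own, and it uses the divide-by-n population formulas from the list above):

from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def std_dev(xs):
    # square root of the average squared deviation from the mean
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def covariance(xs, ys):
    # average product of the paired deviations from the two means
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

weight = [100, 80, 75, 56, 66, 79]
height = [194, 182, 184, 162, 171, 189]
print(pearson(weight, height))  # ≈ 0.9293, matching the NumPy result later on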

Easier done than said. Let’s move on:

Compute the means for “Weight” and “Height”:

μ_w = (100 + 80 + 75 + 56 + 66 + 79) / 6 = 456 / 6 = 76 kg
μ_h = (194 + 182 + 184 + 162 + 171 + 189) / 6 = 1082 / 6 ≈ 180.33 cm

Compute the std. deviation for the Weight: square the deviations from the mean (24² + 4² + (−1)² + (−20)² + (−10)² + 3² = 1102), divide by 6 and take the square root:

σ_w = √(1102 / 6) ≈ 13.5523 kg

Same thing for the height std. deviation:

σ_h = √(701.33 / 6) ≈ 10.8115 cm

To compute the covariance we consider each point (i.e. the single rows/couples of the initial table: [100, 194], [80, 182], [75, 184], …), compute the difference between each component and its mean, multiply the two differences, and sum all these products. Lastly, we divide by 6:

cov(w, h) = [(24)(13.67) + (4)(1.67) + (−1)(3.67) + (−20)(−18.33) + (−10)(−9.33) + (3)(8.67)] / 6 ≈ 817 / 6

What we get is the covariance between Weight and Height:

cov(weight, height) ≈ 136.17 kg·cm

Note that we divided by n = 6, the population formula, to stay consistent with the standard deviations above. Dividing by n − 1 = 5 (the sample formula) would give ≈ 163.4 kg·cm instead; the Pearson coefficient comes out the same either way, because the same factor appears in the standard deviations and cancels out.

Now we can compute the Pearson correlation between Weight and Height:

ρ(weight, height) = 136.17 / (13.5523 × 10.8115) ≈ 0.9293

Scared by all these numbers?

Well, if you’re into Python, you can use NumPy to get the same result with just a few lines of code:

import numpy

weight = [100, 80, 75, 56, 66, 79]
height = [194, 182, 184, 162, 171, 189]

# corrcoef returns the whole 2x2 correlation matrix; [0, 1] selects the weight-height entry
pearson_corr = numpy.corrcoef(weight, height)[0, 1]
print(pearson_corr)  # ≈ 0.92932799

Now, take a break and notice two things. First of all, correlation is a unitless number: the kg·cm in the numerator cancels out with the measurement units of the standard deviations (kg and cm) in the denominator. This feature alone makes correlation very interesting and flexible to use; we’ll verify it with a quick numeric check right after the following list. But the real game-changer is that correlation has a well-defined range: it is always a number between -1 and 1. Its sign is interpreted just like the sign of the covariance:

  • when the correlation between X and Y is between -1 and 0, X and Y are inversely related: when X increases, Y decreases
  • when the correlation between X and Y is 0, X and Y have no linear relation
  • when the correlation between X and Y is between 0 and 1, X and Y are directly related: when X increases, Y increases too.
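
Here is that quick check of the “unitless” claim: a minimal sketch (the unit conversions are our own choice, purely for illustration). Rescaling the variables changes the covariance, but leaves the correlation untouched:

import numpy

weight_kg = [100, 80, 75, 56, 66, 79]
height_cm = [194, 182, 184, 162, 171, 189]

# Rescaling the variables (kg -> g, cm -> m) changes the covariance...
weight_g = [w * 1000 for w in weight_kg]
height_m = [h / 100 for h in height_cm]

# ...but the correlation is identical, because the units cancel out
print(numpy.corrcoef(weight_kg, height_cm)[0, 1])  # ≈ 0.9293
print(numpy.corrcoef(weight_g, height_m)[0, 1])    # ≈ 0.9293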

Moreover, the closer the correlation is to -1, the more evident the inverse relation will be; likewise, the closer it is to 1, the more evident the direct relation. In our case, 0.929 is very close to 1, which indicates a very strong direct correlation between Weight and Height: we are basically saying that the taller a person is, the heavier they are. It makes sense, after all. We could have spotted the same relation between Weight and Height at a glance from the chart:

Quite a positive correlation, uh? Image by the author.

Let’s draw the other two cases of correlation between variables. Here’s a correlation very close to -1:

Correlation close to -1. Image by the author.

Here’s a correlation very close to 0:

Correlation close to 0. Image by the author.
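
The charts above come from the author’s own data; if you want to reproduce similar cases yourself, here is one possible sketch with synthetic data (the slope, noise level, and seed are arbitrary choices):

import numpy

rng = numpy.random.default_rng(42)
x = rng.uniform(0, 10, 100)

y_inverse = -2 * x + rng.normal(0, 1, 100)  # strong inverse relation plus a bit of noise
y_unrelated = rng.normal(0, 1, 100)         # no relation to x at all

print(numpy.corrcoef(x, y_inverse)[0, 1])   # close to -1
print(numpy.corrcoef(x, y_unrelated)[0, 1]) # close to 0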

Above each chart we also printed the correlation matrix with Python, which is similar to the covariance matrix: a square table that holds the correlation between each pair of variables. On the main diagonal we find the correlation of a variable with itself, which is of course the maximum possible value: 1. In the other cells we have the correlation between the row variable and the column variable. But more on this topic in another article.
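
In the meantime, here is a minimal sketch of how such a matrix can be printed with NumPy for our two variables:

import numpy

weight = [100, 80, 75, 56, 66, 79]
height = [194, 182, 184, 162, 171, 189]

# 2x2 correlation matrix: 1s on the diagonal, the Pearson coefficient elsewhere
print(numpy.corrcoef(weight, height))
# [[1.         0.92932799]
#  [0.92932799 1.        ]]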

Before the end of the article, let’s make a quick recap of the analogies and differences between covariance and correlation:

  • both measure the direction of the linear relation between two variables, and both are 0 when there is no linear relation
  • covariance carries the measurement units of the two variables (kg·cm in our example) and has no fixed bounds
  • correlation is unitless and always lies between -1 and 1, so it also tells us how strong the relation is and makes different pairs of variables comparable

I hope this article was useful for you readers out there. You’re welcome!

