
Correlation Coefficient, and How to Misunderstand a Relationship | by Marcin Kozak | Jan, 2023



STATISTICS

Interpreting correlation is far more difficult than most people think. Misinterpreting it, on the other hand, is not.

Perfect correlation! Photo by Matt Seymour on Unsplash

Correlation… Who has not heard about correlation? We use this term so often and yet we don’t know too much about it. When you say, “These two things are correlated,” what do you actually mean? Or what do you mean when you say that the correlation between two things is strong?

For data scientists and statisticians, correlation is even more important than it is for others. We are the ones who work with data, so we are responsible for understanding any phenomena that occur in them, and for explaining them to business colleagues, fellow researchers, or whoever we’re working with.

I’ve found myself in such situations a hundred times, if not more. In my experience, although correlation is a simple concept, explaining it to others is disproportionately difficult. Why? Because most educated people have some understanding of this concept, and more often than not, they are surprisingly attached to it. Therefore, we often have to break through this wall first, and only then will we be able to explain what a given relationship means.

Today, we will discuss what the strength of correlation means. You will learn that correlation is not as simple a topic as it may seem at first glance.

We will talk about Pearson’s correlation coefficient. The conclusions we draw for it, however, apply equally to Spearman’s correlation coefficient, which represents rank correlation, and to any other coefficient of correlation, whether parametric or not, and whether linear or not.

What does a correlation coefficient of 0.91 mean? And a correlation of -0.41? Of 0.27? Of -0.65? Of 0.07?

Many people would be quick to interpret these coefficients as follows:

  • 0.91 → strong positive correlation
  • -0.41 → weak negative correlation
  • 0.27 → weak positive correlation
  • -0.65 → medium negative correlation
  • 0.07 → lack of correlation

Why is that? First of all, most statistics teachers suggest using such or similar terms and propose what I’ll call correlation guidelines. For instance, at this web page:

you will find the following guidelines:

Typical guidelines for the interpretation of the correlation coefficient. Image by author, based on this source

I’ve seen many such guidelines. They may use slightly different wording: here, we consider small/medium/large strength of association; other guidelines may use weak/medium/strong association or correlation; still others use something similar. But basically, such guidelines offer intervals for the correlation coefficient that can be used to decide how strong the corresponding association is.

How accurate are such guidelines? I will address this question from two perspectives:

  • of clusterization
  • of context

Clusterization, and the resulting approximation

By clusterization, I mean creating several clusters of values and using these clusters to interpret a correlation coefficient. Above, we had three such clusters: large, medium and small; or rather six:

  • negative large, medium and small correlation
  • positive large, medium and small correlation

So, a value of 0.66 means strong and positive correlation. A value of -0.22 means weak and negative correlation.

This is quite an approximation. A correlation of 0.51 is considered of large strength, just like one of 0.99. But a correlation of 0.49 is, unlike that of 0.51, only of medium strength… I know, these are just numbers, so let me show some data. On the graph below, you can see three pairs of variables x and y, each pair being normally distributed (to be precise, each follows a multivariate normal distribution). As you can see, their correlations are 0.48, 0.53 and 0.99.

Look at the graph and analyze it. What can you say about the three correlations?

Three pairs of values with different correlations. Image by author
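If you want to look at such data yourself, the following is a minimal Python sketch that simulates three pairs of variables with target correlations of 0.48, 0.53 and 0.99. It is not the code behind the figure: the sample size, the seed, and the helper names `pearson` and `bivariate_normal` are assumptions made for illustration, and each pair is drawn from a standard bivariate normal distribution.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bivariate_normal(rho, n, rng):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        xs.append(x)
        # Mixing x with independent noise gives corr(x, y) = rho.
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * noise)
    return xs, ys


rng = random.Random(2023)  # arbitrary seed, assumed for reproducibility
estimates = {}
for rho in (0.48, 0.53, 0.99):
    x, y = bivariate_normal(rho, 5000, rng)  # sample size is an assumption
    estimates[rho] = pearson(x, y)
    print(f"target {rho:.2f} -> sample r = {estimates[rho]:.3f}")
```

Plotting the first two samples side by side should make it hard to tell them apart, while the third one looks like an almost straight line.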

A funny thing is that according to the above-mentioned guidelines, the first pair of variables is correlated differently than the second pair because

  • the strength of the first pair’s association is medium because correlation is 0.476
  • the strength of the second pair’s association is large because correlation is 0.528

But the second pair has a strength of association very similar to that of the third pair because

  • the strength of the second pair’s association is large because correlation is 0.528
  • the strength of the third pair’s association is also large because correlation is 0.986

Sorry, but that’s not something I can accept. It doesn’t make much sense. Any sense, actually.
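The clusterization described above can be written down in a few lines of Python, which makes the problem easy to see. This is only a sketch: the function name `guideline_label` is hypothetical, and of the cut-offs only the 0.5 bound for "large" is stated in the text. The 0.3 and 0.1 bounds are assumptions, chosen because they are commonly quoted in guidelines of this kind.

```python
def guideline_label(r):
    """Label the strength of a correlation using fixed cut-offs on |r|."""
    size = abs(r)
    if size >= 0.5:   # "large" bound, as quoted in the text
        return "large"
    if size >= 0.3:   # assumed cut-off for "medium"
        return "medium"
    if size >= 0.1:   # assumed cut-off for "small"
        return "small"
    return "none"


labels = {r: guideline_label(r) for r in (0.476, 0.528, 0.986)}
for r, label in labels.items():
    print(f"r = {r:.3f} -> {label}")
```

A difference of 0.052 changes the label, while a difference of 0.458 does not.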

This is one thing I don’t like in such guidelines, but not the only one. However important it is, the next one is even more important.

Such correlation guidelines assume it’s enough for you to know a value of the correlation coefficient to know the strength of this association. It doesn’t even matter what you’re measuring. The population doesn’t matter, and the sample doesn’t matter; what matters is this single value.

This means we’re ignoring the context of the correlation. Whatever we’re measuring, it doesn’t matter. It can be anything. It can be everything.

Some correlation guidelines at least attempt to stress that it matters what you’re measuring. They may be accompanied by a sentence explaining that you should take the context of the relationship into account. For instance, under the above table, you can read the following sentence (source):

Remember that these values are guidelines and whether an association is strong or not will also depend on what you are measuring.

Okay, so on the one hand, you should consider the strength of association between 0.5 and 1.0 strong, but on the other hand, “you should remember that this also depends on what you are measuring.”

So… What? How? I mean… What? Does this mean I should change the guidelines depending on what I’m measuring? But how? Move the limits? Get rid of them? Any example?

Very helpful.

Indeed, correlation guidelines like these fully ignore the context of the phenomenon we study. Using them, you would use the same interpretation for, say,

  • correlation between the heights of identical twins, and
  • correlation between the heights of pairs of independent people.


I know that the second of these correlations does not seem to make much sense. I mean, no one would even think of estimating such a correlation, but this is why it’s a perfect example of null correlation. If you conducted a methodologically correct study to measure this correlation, it would be null or close to null.

For the moment, let’s assume we have estimated these two correlations in the population of people on the globe based on two big samples (two different correlations, two different samples). The correlation between identical twins’ heights should be close to 1.0, and a value of, say, 0.5 would be amazingly weak and should not happen. On the other hand, the correlation between the heights of pairs of independent people should be close to 0.0, and a value of 0.5 would be amazingly strong and should not happen.

Did you notice what I just did? I used the very same value of the correlation coefficient between two variables representing a person’s height, 0.5, for two different situations, and it turned out that in one of them this value was amazingly weak while in the other it was amazingly strong. The same value of 0.5 should not happen in either of these two situations. In one, it should not happen because it would mean a far-too-weak correlation. In the other, it should not happen because it would mean a far-too-strong correlation.
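The thought experiment above can be simulated with a toy model. The sketch below does not use real height data: the mean of 170 cm, the standard deviation of 10 cm, the small individual deviation of 2 cm for each twin, the sample size and all variable names are assumptions made only to illustrate the two expected correlations.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)  # arbitrary seed
n = 5000                # assumed "big sample"

# Toy model of identical twins: a shared height plus a small individual deviation.
shared = [rng.gauss(170, 10) for _ in range(n)]
twin_a = [s + rng.gauss(0, 2) for s in shared]
twin_b = [s + rng.gauss(0, 2) for s in shared]

# Pairs of independent people: two unrelated draws from the same distribution.
person_a = [rng.gauss(170, 10) for _ in range(n)]
person_b = [rng.gauss(170, 10) for _ in range(n)]

r_twins = pearson(twin_a, twin_b)
r_independent = pearson(person_a, person_b)
print(f"twins:       r = {r_twins:.3f}")
print(f"independent: r = {r_independent:.3f}")
```

Under these assumptions, the first estimate lands close to 1.0 and the second close to 0.0, so an estimate of 0.5 would be surprising in both settings, for opposite reasons.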

What does this say about these guidelines?

Quite a lot. Such guidelines are worth next to nothing. They make people stop thinking. You should never use them unless you like the feeling of being misled, confused, falsified. Instead of using them — use your brain. Think.


Don’t use such correlation guidelines. Don’t tell yourself, “I will just have a short look so that I know what wording I should use, but I won’t use these guidelines.” Stop cheating yourself.

Instead, think of the phenomenon you’re analyzing; think what sort of correlation you should expect; and think how far the estimate you got is from this expected correlation. If it’s much smaller, the association is likely weak. If it’s much bigger, it’s likely strong. When it’s similar, it’s likely expected.

This is the very word I miss in such guidelines — and in the interpretation of correlation coefficient in general. Expected. To be honest, for the moment I don’t know how to put this word into the knowledge we have of correlation. It does not fit, not yet.

But let’s take it step by step, without rush. Let’s think about this. I’ve been thinking about correlation for so many years that some more time will do no harm.

Did you notice the word “likely” I used above? As in this sentence: “If it’s much bigger, it’s likely strong.” I did this because when you obtain such unexpected results, first of all you should check how the sample was collected. The correlation coefficient is sensitive to sample size. Too small a sample can lead to inaccurate estimates. Thus, don’t get too attached to a correlation coefficient estimated from a small sample. Such an estimate may turn out to be very far from the real (population) one. I wouldn’t expect this to happen when you draw a big sample, while in the case of a small sample, it’s not only possible, it’s likely. I feel like writing a longer article about the meaning of sample size in correlation, so expect one in the future.
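The effect of sample size is easy to check with a small simulation. The sketch below repeatedly draws samples from a population in which the true correlation is 0.5 and records the smallest and largest estimates obtained. The population correlation, the two sample sizes (10 and 1000), the number of repetitions, and the helper names are all assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bivariate_normal(rho, n, rng):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
    return xs, ys


def estimate_range(rho, n, reps, rng):
    """Smallest and largest sample correlation over reps samples of size n."""
    rs = [pearson(*bivariate_normal(rho, n, rng)) for _ in range(reps)]
    return min(rs), max(rs)


rng = random.Random(42)  # arbitrary seed
small = estimate_range(0.5, 10, 200, rng)    # small samples
large = estimate_range(0.5, 1000, 200, rng)  # big samples
print(f"n = 10:   r ranges from {small[0]:.2f} to {small[1]:.2f}")
print(f"n = 1000: r ranges from {large[0]:.2f} to {large[1]:.2f}")
```

With small samples, the estimates scatter widely around 0.5, while with big samples they stay close to it.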

What I haven’t mentioned is that we often treat the association between two variables as linear, while it does not have to be so. Maybe you obtained an unexpectedly small value of the coefficient of linear correlation because the association was not linear? This is yet another thing to check, but it’s a different story…
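A classic illustration of this last point is a perfectly deterministic but non-linear relationship. In the sketch below, y is exactly the square of x, yet the coefficient of linear correlation is practically zero because x is symmetric around zero. The data and the helper name `pearson` are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


xs = [i / 10 for i in range(-50, 51)]  # evenly spaced values from -5.0 to 5.0
ys = [x ** 2 for x in xs]              # y depends on x exactly, but not linearly

r = pearson(xs, ys)
print(f"Pearson r for y = x^2: {r:.6f}")
```

A scatter plot would reveal the relationship immediately, which is one more reason to look at the data and not only at the coefficient.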


