Statistical independence for beginners | by Jae Kim | Mar, 2023

Photo by Naser Tamimi on Unsplash

Statistical independence is a fundamental concept in statistics. It forms part of the underlying assumptions of a wide range of (supervised) machine learning algorithms, such as the logit model and naive Bayes classifiers. It is also closely related to key methods in artificial intelligence, such as maximum entropy and neural networks. See, for example, this post for further insights.

In this post, I explain the definitions of statistical independence, with intuitive interpretations, examples, and resources for testing independence statistically (R code and an Excel function).

For simplicity, consider two events A and B and define the following probabilities:

Prob(A): (marginal) probability of event A

Prob(B): (marginal) probability of event B

Prob(A ∩ B): joint probability of events A and B, the probability that A and B occur at the same time;

Prob(A|B): conditional probability of event A given B, the probability of A given that the event B has already occurred;

Prob(B|A): conditional probability of event B given A.

These probabilities are related as

Image created by the author
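In the notation used here, the relationship is Prob(A|B) = Prob(A ∩ B)/Prob(B), and likewise Prob(B|A) = Prob(A ∩ B)/Prob(A).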

As indicated in the Venn diagram below, Prob(A|B) represents the proportion of event B's probability that is accounted for by its overlap with event A (the yellow area, A ∩ B).

Image created by the author

For example,

A: married, B: male

P(A|B) = the probability of marriage among males;

A: unemployed, B: university graduates

P(A|B) = the probability of unemployment among university graduates

There are two equivalent conditions for statistical independence. First, the events A and B are statistically independent if

Prob(A ∩ B) = Prob(A) × Prob(B)

The probability of A and B occurring at the same time is the product of their individual probabilities. This means that if they occur at the same time, it is purely by chance: there is no systematic association between the two.

Second, the events A and B are statistically independent if

Prob(A|B) = Prob(A).

This condition follows from the relationship given above:

Image created by the author
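That is, under the first condition, Prob(A|B) = Prob(A ∩ B)/Prob(B) = [Prob(A) × Prob(B)]/Prob(B) = Prob(A).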

The probability of A conditional on event B is the same as the unconditional probability of A. That is, knowing that event B has already occurred tells you nothing about the probability of event A.

Similarly, independence implies Prob(B|A) = Prob(B).

Simple example

You toss two fair coins in succession, each coin showing either H (head) or T (tail) with Prob(H) = Prob(T) = 0.5. The possible outcomes are:

(H, H), (H, T), (T, H), (T, T).

For example,

Prob(H ∩ T) = 0.25, and this is equal to

Prob(H) × Prob(T) = 0.5 × 0.5.

That is, if you have the outcome (H, T), it is purely by chance with no systematic association. Alternatively,

Prob(T | H) = Prob(T ∩ H)/Prob(H) = 0.25/0.5 = 0.5 = Prob(T)

If you have H from the first coin, that has no bearing on the probability of having T or H from the second coin.
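As a quick numerical check, here is a minimal R sketch (my own illustration, not from the original post; the variable names are hypothetical) that simulates this experiment and confirms both conditions:

# Simulate two independent fair coins and check both independence conditions
set.seed(123)                          # for reproducibility
n <- 100000
coin1 <- sample(c("H", "T"), n, replace = TRUE)
coin2 <- sample(c("H", "T"), n, replace = TRUE)
mean(coin1 == "H" & coin2 == "T")      # Prob(H, T): about 0.25 = 0.5 x 0.5
mean(coin2[coin1 == "H"] == "T")       # Prob(T | H): about 0.5 = Prob(T)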

Real-world examples

A: married, B: male

Prob(A ∩ B) = Prob(A) × Prob(B); a randomly selected person being both male and married is a joint occurrence that happens purely by chance.

Prob(A|B) = Prob(A); the probability of marriage among males is the same as the overall probability of marriage. Being male has no impact on the probability of being married.

Testing for Independence: chi-square test

A survey is conducted to examine whether there is any association between individuals' marital status and their gender. Of 100 randomly selected individuals, 40 are male and 60 are female. Among them, 75 are married and 25 are unmarried. The contingency table below presents the joint frequencies of marital status and gender.

Image created by the author
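For readers without the image, the observed frequencies implied by the figures quoted in the text (and by the R matrix below) are:

              Married (Y)   Unmarried (N)   Total
Male (M)           25             15          40
Female (F)         50             10          60
Total              75             25         100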

For example,

Prob(Y ∩ M) = 25/100; Prob(M) = 40/100

Prob(Y|M) = Prob(Y ∩ M)/Prob(M) = 25/40
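Note that Prob(Y|M) = 25/40 ≈ 0.63 is noticeably lower than the marginal Prob(Y) = 75/100 = 0.75, a first hint that marital status and gender may not be independent in this sample.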

These frequencies are compared with the expected frequencies under statistical independence:

Image created by the author
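Under independence, each expected frequency equals (row total × column total)/N, and the values work out to:

              Married (Y)   Unmarried (N)   Total
Male (M)           30             10          40
Female (F)         45             15          60
Total              75             25         100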

The expected joint probabilities under independence are listed above.

For example,

Prob(Y ∩ M) = Prob(Y) × Prob(M) = 75/100 × 40/100 = 0.3;

Prob(Y ∩ F) = Prob(Y) × Prob(F) = 75/100 × 60/100 = 0.45.

The actual frequencies are similar to the expected values, but are they close enough to be consistent with statistical independence? To decide, we formally test the null hypothesis of independence.

The chi-square test is widely used for this purpose. It compares the observed frequencies (Oi) with the expected frequencies (Ei):

Image created by the author
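In the text's notation, the statistic shown above is

X² = Σ (Oi − Ei)²/Ei = Σ (Oi − N × pi)²/(N × pi), summed over i = 1, …, n,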

where n is the number of cells in the table, N is the total number of responses, and pi is the expected probability (or relative frequency) of cell i under independence. The statistic follows a chi-square distribution with degrees of freedom df = (Rows − 1) × (Cols − 1), where Rows and Cols are the numbers of rows and columns of the table.
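For the 2 × 2 table above, df = (2 − 1) × (2 − 1) = 1, and a direct calculation gives

X² = (25 − 30)²/30 + (15 − 10)²/10 + (50 − 45)²/45 + (10 − 15)²/15 ≈ 0.833 + 2.500 + 0.556 + 1.667 ≈ 5.556,

which matches the R output below.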

The R code below shows the test results, with the test statistic and p-value. The object table is defined as the 2 × 2 matrix of observed frequencies shown above and passed to the function chisq.test. At the 5% level of significance, the null hypothesis of independence between gender and marital status is rejected, with a p-value of 0.018 and a test statistic of 5.56. The option correct = FALSE turns off the continuity correction, so that the result is consistent with the Excel function below.

> table <- matrix(c(25, 50, 15, 10), nrow = 2)  # rows: male, female; columns: married (Y), unmarried (N)
> table
     [,1] [,2]
[1,]   25   15
[2,]   50   10
> chisq.test(table, correct = FALSE)

Pearson's Chi-squared test

data: table
X-squared = 5.5556, df = 1, p-value = 0.01842

The Excel function CHISQ.TEST returns the p-value of the test. It takes the range of actual (observed) frequencies and the range of expected frequencies as inputs, as shown below:

Image created by the author
Image created by the author
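For example, if the observed frequencies were entered in cells B2:C3 and the expected frequencies in cells B6:C7 (illustrative cell placements, not necessarily those used in the screenshots above), the formula

=CHISQ.TEST(B2:C3, B6:C7)

would return 0.01842, matching the p-value from the R output.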

