
Introduction to p-value and Significance Testing with Examples | by Neeraj Krishna | Jan, 2023



Great products aren’t built overnight; rather, they’re refined and polished through years of iteration. The most successful teams follow a feedback loop while developing a product. First, they develop an idea, deploy it to production, and monitor the process. Then, based on the data collected, they analyse the results and determine whether the idea is successful. The insights gained in the analysis inform the next iteration of development. François Chollet, the creator of Keras, calls this the loop of progress [2].

The loop of progress. Image by the author.

The role of statistics is crucial in the analysis part. It helps us test hypotheses and arrive at decisions based on the monitored data. There are various hypothesis testing methods tailored for different scenarios, but in this article, we’ll understand the idea and the process behind the hypothesis testing framework through a simple example.

Photo by Edge2Edge Media on Unsplash

In statistics, we call the competing explanations hypotheses. Often, the hypotheses aren’t clearly defined in the real world. Let me illustrate. Consider two scenarios:

  1. In scenario one, you’re given a die and are asked to find out if the die is fair or loaded.
  2. In scenario two, you’re given two dice: one fair and one loaded. However, this time you know the probability distribution of the loaded die. If you randomly pick a die and toss it, can you determine which die you’ve picked based on the face it lands on?

Both scenarios are hypothesis testing problems. The difference between the two is that in the latter we know exactly what the alternate hypothesis looks like.

Scenario two is easier to deal with. There are well-defined methods, like the likelihood ratio test, that help us arrive at a decision based on even a single observation.

However, most real-world cases look like scenario one.

In scenario one, we don’t know what the alternate hypothesis looks like. We just assume the alternate hypothesis is that the die is loaded; we don’t know how loaded it is. So the goal of hypothesis testing comes down to rejecting, or failing to reject, the null hypothesis, which in this case is the assumption that the die is fair.

This form of hypothesis testing is also referred to as significance testing and is the focus of this article.

If you want to understand how to deal with problems of the second kind, consider reading my other article, where we learn hypothesis testing from first principles.

Let’s start with an example and build our way up.

Photo by Jizhidexiaohailang on Unsplash

You’re given a coin and are asked to find out if it’s fair or biased

Let 𝜃 represent the probability of heads. The null hypothesis is 𝜃 = 1/2 while the alternate hypothesis is 𝜃 ≠ 1/2.

Notice how the alternate hypothesis isn’t clear. 𝜃 can be any value between 0 and 1.

The key idea in hypothesis testing is that the outcomes of the experiment depend on which hypothesis is true. So by observing the outcomes, we can draw a conclusion.

In this example, let’s toss the coin n times and observe the outcomes.

Let the outcomes be represented by X1, X2, X3, …, Xn where Xi is a random variable that represents the outcome of the ith toss. Xi = 1 if the coin lands on heads, else Xi = 0.
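As a concrete sketch, this sampling setup can be simulated in a few lines of Python (the seed and variable names are my own, not from the article):

```python
import random

random.seed(0)  # fixed seed for reproducibility

n = 1000
theta = 0.5  # probability of heads; 1/2 under the null hypothesis

# X[i] = 1 if the i-th toss lands on heads, else 0
X = [1 if random.random() < theta else 0 for _ in range(n)]

S = sum(X)  # total number of heads, used as the test statistic below
print(S)
```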

Rejection region and critical ratio

Let S = X1 + X2 + X3 + … + Xn represent the total number of heads. S is called a statistic, and it essentially summarizes the observations. Now, if the null hypothesis is true and the coin is fair, we expect to observe about n/2 heads. If it’s not, the difference |S - n/2| would tend to be large. So a reasonable decision rule is:

reject the null hypothesis if |S - n/2| > 𝜉

It means we reject the null hypothesis or the assumption the coin is fair if the absolute difference between the number of heads observed and n/2 is greater than a certain value 𝜉. 𝜉 is called the critical ratio.

Based on this decision rule, we can split the observation space into two regions: the rejection region and the acceptance region. The values of S for which the decision rule |S - n/2| > 𝜉 is satisfied fall into the rejection region, while the remaining values fall into the acceptance region.

Image by the author

The boundary between the regions depends on 𝜉. Let’s see how we can choose its value.

Significance Level

Under the null hypothesis, the probability that the observations will fall in the rejection region is called the significance level. It can be computed as:

P(reject H0 ; H0) = 𝛼

H0 is the null hypothesis. P(reject H0 ; H0) represents the probability of observations in the rejection region under H0. 𝛼 is the significance level.

In our example, it can be represented as:

P(|S - n/2| > 𝜉 ; H0) = 𝛼

Now, 𝛼 depends on 𝜉. As 𝜉 increases, the size of the rejection region decreases and so does 𝛼.

If we fix the value of 𝛼 and know the distribution of the statistic under the null hypothesis, we can solve for 𝜉.

Let’s run a simulation and see it in action.

In our example, S follows a binomial distribution with parameters n and 𝜃. However, as n increases, S approaches a normal distribution according to the central limit theorem [4].

The binomial distribution tends to be normal as n increases

So assuming S follows a normal distribution, we can standardize it by subtracting the mean and dividing by the standard deviation. The mean of the binomial is n * 𝜃 while its standard deviation is sqrt(n * 𝜃 * (1-𝜃)).

from math import sqrt

def standardize(S, n, theta):
    return (S - n * theta) / sqrt(n * theta * (1 - theta))

Let’s toss the coin n=1000 times. Under the null hypothesis the coin is fair, so 𝜃=1/2. The value of 𝜉 for a significance level of 𝛼=0.05 can be computed as shown below:

   P(|S - n𝜃| > 𝜉 ; H0) = 𝛼

Divide both sides of the inequality by sqrt(n * 𝜃 * (1-𝜃)), the standard deviation of S:

=> P(|S - n𝜃| / sqrt(n * 𝜃 * (1-𝜃)) > 𝜉 / sqrt(n * 𝜃 * (1-𝜃)) ; H0) = 𝛼

Let Z = (S - n𝜃) / sqrt(n * 𝜃 * (1-𝜃))
Z is a standard normal variable as discussed above

=> P(|Z| > 𝜉 / sqrt(n * 𝜃 * (1-𝜃)) ; H0) = 𝛼
=> P(|Z| > 𝜉 / sqrt(1000 * 1/2 * (1-1/2))) = 0.05
=> P(|Z| > 𝜉 / sqrt(250)) = 0.05

According to the standard normal tables, P(|Z| > 1.96) = 0.05,
i.e. Z > 1.96 or Z < -1.96. This is illustrated in the diagram below.

The minimum value of 𝜉 can be calculated as:
𝜉 / sqrt(250) = 1.96

=> 𝜉 = 1.96 * sqrt(250) ≈ 31

The standard normal distribution
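The same computation can be checked numerically. Here is a minimal sketch using only Python’s standard library (statistics.NormalDist supplies the normal quantile; no external packages are assumed):

```python
from math import sqrt
from statistics import NormalDist

n = 1000
theta = 0.5   # value of 𝜃 under the null hypothesis
alpha = 0.05  # significance level

sd = sqrt(n * theta * (1 - theta))  # std of S under H0, i.e. sqrt(250)

# Two-sided test: each tail holds alpha/2, so we need the 97.5% quantile.
z = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96

xi = z * sd
print(round(xi))  # → 31
```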

So for a significance level of 𝛼 = 0.05, the critical ratio is 𝜉 = 31. What does it mean? Our decision rule is:

reject the null hypothesis if |S - n/2| > 𝜉

If I perform the experiment and toss the coin n=1000 times and observe the number of heads S=472, then |472 - 500| = 28 < 31.

So we say the null hypothesis H₀ is not rejected at the 5% significance level.

Notice we say H₀ is not rejected instead of saying H₀ is accepted. That’s because we don’t have strong grounds to accept H₀: it isn’t possible to prove 𝜃 is exactly equal to 0.5 based on the data, since 𝜃 could just as well be 0.51 or 0.49. So instead we say the observation S=472 isn’t enough to disprove the null hypothesis H₀ at a 5% significance level.

In this case, where 𝜉 = 31, the observations S < 469 or S > 531 fall into the rejection region. Moreover, if I increase the value of 𝜉, the rejection region shrinks, as illustrated in the graph below:

The rejection region decreases as 𝜉 increases.

When we fix the significance level 𝛼, say at 5%, it means that under the model governed by H₀, the observations fall in the rejection region only 5% of the time. So if an observation does end up in the rejection region, it provides strong evidence that H₀ may be false.
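This interpretation of 𝛼 can be verified with a Monte Carlo sketch: repeat the fair-coin experiment many times and count how often the decision rule fires. (The trial count and seed are my own choices for illustration.)

```python
import random

random.seed(1)

n, xi, trials = 1000, 31, 2000
rejections = 0
for _ in range(trials):
    # one experiment: toss a fair coin n times and count heads
    S = sum(random.random() < 0.5 for _ in range(n))
    if abs(S - n / 2) > xi:  # the decision rule from the article
        rejections += 1

rate = rejections / trials
print(rate)  # close to the chosen significance level of 0.05
```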

The formal definition of p-value is:

the p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption the null hypothesis is correct. — Wikipedia [3]

Unlike the significance level which is fixed before performing the experiment, the p-value depends on the outcome of the experiment.

Let’s say the outcome of the coin toss experiment is S=430. Then the outcomes that are “at least as extreme” as this result under the assumption the null hypothesis is correct are S ≤ 430 or S ≥ 570, because the distribution under the null hypothesis is symmetric and centred around the mean of 500.
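Under the normal approximation, the p-value for S=430 can be sketched as the two-sided tail probability of the standardized statistic (again using only the standard library):

```python
from math import sqrt
from statistics import NormalDist

n, theta = 1000, 0.5
s = 430  # observed number of heads

# standardize the observed statistic under H0
z = (s - n * theta) / sqrt(n * theta * (1 - theta))  # ≈ -4.43

# two-sided p-value: probability of an outcome at least this extreme
p_value = 2 * NormalDist().cdf(-abs(z))
print(p_value)  # roughly 1e-05, far below 𝛼 = 0.05
```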

I’ve simulated the outcome s by varying 𝜃 and the p-value for each outcome is shown below:

p-value based on the outcome as 𝜃 varies

Essentially, the p-value is the value of 𝛼 for which s would be exactly at the threshold between rejection and non-rejection.

In the above example, we know the outcomes follow a binomial distribution with parameters n and 𝜃. So we can directly compute the critical ratio 𝜉 based on 𝛼 using any stats library without needing the normal approximation.
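For instance, the exact binomial computation can be sketched with nothing more than math.comb: sum the pmf over the candidate rejection region and grow 𝜉 until the rejection probability drops to 𝛼 or below. (The helper names here are my own.)

```python
from math import comb

n, alpha = 1000, 0.05

# exact pmf of Binomial(n, 1/2): P(S = k) = C(n, k) / 2^n
pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]

def rejection_prob(xi):
    # P(|S - n/2| > xi) under the null hypothesis
    return sum(p for k, p in enumerate(pmf) if abs(k - n / 2) > xi)

# smallest xi whose rejection probability is at most alpha
xi = next(x for x in range(n) if rejection_prob(x) <= alpha)
print(xi)  # matches the normal-approximation value of 31
```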

In most cases, however, we won’t know the exact distribution under the null hypothesis. Fortunately, many test statistics approach a normal distribution as the sample size increases, so it’s important to use large sample sizes in hypothesis testing experiments.

Often in science, a scientist comes up with a theory and other scientists test it by carrying out experiments. They perform hypothesis testing where the original idea is the null hypothesis.

The general framework for hypothesis testing can be summarized as follows [1]:

  1. Choose a statistic S that is representative of the observed data. It’s a scalar random variable that depends on the observations. Often, the sample mean or the sample variance is used as the statistic.
  2. Come up with a decision rule to reject the null hypothesis H₀. The decision rule is a function of the statistic S and the critical ratio 𝜉. Based on the decision rule and 𝜉, the observation space can be divided into the rejection region and the acceptance region.
  3. Choose the significance level 𝛼, which is the probability that an observation falls into the rejection region under the null hypothesis.
  4. Compute the critical ratio 𝜉 based on the significance level 𝛼. The distribution under the null hypothesis needs to be known to perform this computation. However, as we discussed, most statistics can be approximated by a normal distribution if the sample size is large. Once the value of 𝜉 is known, the rejection region can be determined.

Once we perform the experiment and the observations are recorded, we need to do the following:

  1. Calculate the value s of the statistic S.
  2. Reject the null hypothesis if s belongs to the rejection region.
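The whole framework, applied to the coin example, fits in one small function. This is a sketch under the normal approximation, not the article’s own code, and the function name is hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def significance_test(s, n, theta0=0.5, alpha=0.05):
    """Two-sided test of H0: theta = theta0 for a coin tossed n times,
    using the normal approximation. Returns (reject, p_value)."""
    sd = sqrt(n * theta0 * (1 - theta0))
    xi = NormalDist().inv_cdf(1 - alpha / 2) * sd  # critical ratio
    z = (s - n * theta0) / sd
    p_value = 2 * NormalDist().cdf(-abs(z))
    return abs(s - n * theta0) > xi, p_value

print(significance_test(472, 1000))  # H0 not rejected at the 5% level
print(significance_test(430, 1000))  # H0 rejected at the 5% level
```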

The beauty of hypothesis testing is its flexibility: we are free to design the experiment and choose the null and alternate hypotheses.

Let's Connect

Hope you've enjoyed the article. Please clap and follow if you did.

You can also reach out to me on LinkedIn and Twitter.




