
Charting the Non-Parametric Odyssey: Statistical Frameworks for Distribution-Free Hypothesis Testing

By Naman Agrawal, May 2023



Gauging the Mechanics of the Sign and Wilcoxon Signed Rank Tests

Photo by Kiwihug on Unsplash
  1. Introduction
  2. Tool 1: One-Sample Sign Test
  3. Tool 2: Two-Sample Sign Test
  4. Limitations of the Sign Test
  5. Tool 3: One-Sample Wilcoxon Signed Rank Test
  6. Wilcoxon Signed Rank Test: Possible Complications
  7. Tool 4: Two-Sample Wilcoxon Signed Rank Test
  8. Conclusion

Statistics is the corpus of tools that allow us to draw inferences from data, including, but not limited to, estimating parameters, constructing confidence intervals, and testing hypotheses to validate our assumptions. In this article, we will learn about frameworks that allow us to test hypotheses about the values of different data quantiles, namely the Sign Test and the Wilcoxon Signed Rank Test. What’s unique about these frameworks is that, unlike popular hypothesis tests such as the z-test or the t-test, they don’t require any distributional assumption on the data, whether posited directly or enforced via the Central Limit Theorem; i.e., they are distribution-free, or non-parametric. All you need is data coming from a symmetric and continuous distribution, and you will be equipped with the tools to test claims such as:

Let’s start with a straightforward test that only involves counting the number of observations on one side of a hypothesized threshold: the Sign Test (also known as the Binomial Test). In particular, we consider a sample X₁, …, Xₙ of size n, together with a simple null hypothesis and the corresponding alternative hypothesis, stated as follows:

where m denotes the median of the given data. As a starting point, let’s think about how we’d approach the problem in a traditional t-test setting. In such a case, we would assume that n is large, allowing the Central Limit Theorem to kick in. Further, since the distribution is assumed to be symmetric, testing for the median is equivalent to testing for the mean. Thus, we can define the test statistic and the corresponding critical region as follows:

That’s simple enough and quite powerful (as we’ll see later), but it rests on the assumption that the test statistic follows a t distribution with n − 1 degrees of freedom. This may not hold, especially if n isn’t large enough. We therefore need an alternative framework that lets us test the hypothesis without imposing distributional assumptions on the data. This leads us to the Sign Test, which simply involves counting the total number of observations greater than m⁰. For example, consider the following data of n samples:

The hypotheses are given as follows:

As discussed above, we first calculate the number of Xᵢ greater than m⁰. Let N⁺ denote the random variable for the number of Xᵢ greater than m⁰, and let n⁺ be its realized value for the given sample:

Now, we determine the distribution of this test statistic under the null. Since m = m⁰ under the null, the probability that any given sample exceeds m⁰ must be 0.5. Thus, N⁺ measures the number of samples greater than m⁰, where each sample has probability 0.5 of contributing to N⁺ under the null hypothesis. In other words, N⁺ counts the number of successes, where each success occurs with probability 0.5. This defines a binomial random variable parametrized by the number of samples, n = 10, and the probability of success, p = 0.5. Thus,

Now that the distribution of the test statistic under the null is known, we can proceed to calculate the p-value for the given sample. Recall that the p-value is the probability, computed under the assumption that the null hypothesis is true, of observing a value of the test statistic at least as extreme as the one in our sample. Thus,

Evidently, the p-value is quite large. So, even at a 10% significance level, since the p-value > 0.10, we fail to reject the null hypothesis! And that concludes this simple, yet useful application of the sign test. This also leads us to another important question: What if the alternative hypothesis pointed in the other direction, or what if it were two-sided? The process remains the same, except that we count the number of observations less than m⁰ (denoted N⁻). In particular, suppose we were interested in testing:

Then,

Note that the distribution is still binomial with the same parameters, as the reasoning carries over unchanged from the previous case. Finally, we discuss the two-sided hypothesis case:

For this case, we compute both N⁺ and N⁻ for the sample:

As before, the p-value gives us the probability of an outcome at least as extreme as the one observed: 6 or more values greater than 0.5. Because this is a two-sided test, an extreme result is either 6 or more values greater than 0.5, or 4 or fewer values greater than 0.5.
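To make the counting concrete, here is a minimal Python sketch of the one-sample sign test. The sample values are hypothetical stand-ins, chosen so that 6 of the 10 observations exceed m⁰ = 0.5, matching the counts in the worked example; only the counting and the binomial tail calls matter:

```python
import numpy as np
from scipy.stats import binom

def sign_test(x, m0, alternative="greater"):
    """One-sample sign test for the median m against m0."""
    x = np.asarray(x, dtype=float)
    x = x[x != m0]                  # values equal to m0 carry no sign information
    n = len(x)
    n_plus = int(np.sum(x > m0))    # realized value of N+
    if alternative == "greater":    # p-value = P(N+ >= n_plus) under Bin(n, 0.5)
        return binom.sf(n_plus - 1, n, 0.5)
    if alternative == "less":       # p-value = P(N+ <= n_plus)
        return binom.cdf(n_plus, n, 0.5)
    # two-sided: P(N+ >= n_plus) + P(N+ <= n - n_plus), as in the text
    return min(1.0, binom.sf(n_plus - 1, n, 0.5) + binom.cdf(n - n_plus, n, 0.5))

# Hypothetical sample of n = 10 observations, 6 of them above m0 = 0.5:
x = [0.9, 0.7, 0.8, 1.2, 0.6, 0.55, 0.4, 0.3, 0.45, 0.2]
print(sign_test(x, m0=0.5, alternative="greater"))    # ~0.377
print(sign_test(x, m0=0.5, alternative="two-sided"))  # ~0.754
```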

Just as before, the p-value is quite large. So, even at a 10% significance level, since the p-value > 0.10, we fail to reject the null hypothesis! Isn’t this fascinating? Just simple counting and one of the most basic probability distributions allow us to test such a hypothesis. But can we generalize the results? In other words, can we use this test for any specific quantile, not just the median? Sure: the methodology remains exactly the same. For example, suppose that for the same dataset as above, we are given the following hypotheses:

Note that π₀.₂₅ denotes the 0.25 quantile, i.e., the 25th percentile or first quartile, of the data. Just as before, we count the number of Xᵢ less than 3:

Now, we determine the distribution of this test statistic under the null. Since π₀.₂₅ = 3 under the null, the probability that any given sample is less than 3 must be 0.25. Thus, N⁻ measures the number of samples less than 3, where each sample has probability 0.25 of contributing to N⁻ under the null hypothesis. In other words, N⁻ counts the number of successes, where each success occurs with probability 0.25. This defines a binomial random variable parametrized by the number of samples, n = 10, and the probability of success, p = 0.25. Thus,

Consequently,

That’s a lower p-value, but still not low enough to reject at a 5% or 10% significance level. With this, we conclude our discussion of the One-Sample Sign Test.
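In code, only the success probability changes. A short sketch; the realized count below is hypothetical, since the actual data sit in the article’s tables:

```python
from scipy.stats import binom

# Sign test for a general quantile: under H0 that the 0.25-quantile equals 3,
# each observation falls below 3 with probability p = 0.25, so the count
# N- of observations below 3 follows Bin(n, 0.25).
n, p = 10, 0.25
n_minus = 1   # hypothetical realized count of observations below 3

p_upper = binom.sf(n_minus - 1, n, p)  # P(N- >= n_minus): evidence the quantile < 3
p_lower = binom.cdf(n_minus, n, p)     # P(N- <= n_minus): evidence the quantile > 3
print(p_upper, p_lower)                # ~0.944, ~0.244
```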

In this section, we will generalize the sign test presented earlier to the case of two samples. Previously, we were given data consisting of observations from a single sample X, and we could test whether the median (or any quantile) was larger than, smaller than, or not equal to a given threshold. We now extend this idea to two samples. In particular, suppose we are given independent pairs of sample data (x₁, y₁), · · ·, (xₙ, yₙ). The null hypothesis states that, within each pair, the two observations are equally likely to be the larger one, i.e., the difference in the medians is 0, whereas the alternative hypothesis suggests that there is a difference between the two samples, i.e., the difference in the medians is positive, negative, or simply non-zero. Mathematically, the null and alternative hypotheses are given as follows:

Example: Suppose we are given the following paired data:

The hypotheses are given as follows:

The process of deriving the test statistic and finding its distribution remains similar. In the previous case, we simply checked which samples were above or below the threshold (the value under the null). In this case, we check the sign obtained by comparing each pair of observations. In other words, we check the sign of Wᵢ = Xᵢ − Yᵢ for each pair from i = 1 to n:

Table 1: Image by Author

The test statistic is given by:

By the previous logic, the distribution of the test statistic is Bin(n, 0.5) under the null (think of the value of the test statistic as the number of successes in n trials, where each success, i.e., Xᵢ > Yᵢ, has probability 0.5, as both observations in a pair are equally likely to be the larger). The p-value is therefore given by:

Indeed, the p-value is much smaller, and we can reject the null hypothesis at a 10% significance level! Finally, let’s try to calculate the p-value for the same data if the alternative hypothesis was two-sided:

Because this is a two-sided test, an extreme result can be either 8 or more positive signs or 2 or fewer positive signs:

Thus, we won’t be able to reject the null at a 10% significance level for the two-sided case. This concludes our discussion of conducting the Sign Test for the one- and two-sample cases.
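The paired version is just as short. The pairs below are hypothetical, constructed so that 8 of the 10 differences are positive, as in the worked example:

```python
import numpy as np
from scipy.stats import binom

x = np.array([5.1, 6.0, 7.2, 4.8, 5.5, 6.3, 7.9, 5.0, 4.2, 6.6])
y = np.array([4.0, 5.1, 6.0, 4.1, 4.9, 5.8, 7.0, 4.6, 4.9, 7.1])

w = x - y
w = w[w != 0]                   # ties carry no sign information
n = len(w)
n_plus = int(np.sum(w > 0))     # here: 8 positive signs out of 10

p_one_sided = binom.sf(n_plus - 1, n, 0.5)                                  # P(N+ >= 8) ~ 0.055
p_two_sided = binom.sf(n_plus - 1, n, 0.5) + binom.cdf(n - n_plus, n, 0.5)  # ~0.109
print(p_one_sided, p_two_sided)
```

Note how the one-sided p-value falls just under 0.10 while the two-sided one lands just above it, reproducing the reject/fail-to-reject split described above.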

In the previous two sections, we used the sign test to conduct different kinds of hypothesis tests. While the test is quite useful owing to its non-parametric (distribution-free) nature, it generally isn’t very powerful: for the same significance level, the probability of a Type II error remains quite high. The main reason is that the sign test considers only the sign of the values, i.e., whether they are below the threshold value under the null (one-sample) or below their counterpart (two-sample). It does not consider the magnitude of the difference, i.e., how large or small the deviation from 0 is. In practice, the sign test is rarely used, but its immense simplicity makes it a great starting point for discussing non-parametric hypothesis testing. This leads us to a new hypothesis-testing framework: the Wilcoxon Signed Rank Test, an extension of the Sign Test that accounts for the magnitude of the deviations from 0 by assigning ranks to the sample values. The test is described in greater detail in the next section.

The theory behind the Wilcoxon Signed Rank Test is slightly convoluted. But, don’t worry. Let’s revisit the example from the previous section and apply the Wilcoxon test sequentially:

Step 1: Find the absolute value of the difference between Xᵢ and m⁰ for each i:

Table 2: Image by Author

Step 2: Rank the values computed in Step 1, i.e., assign a rank (Rᵢ) from 1 to n to every |Xᵢ − m⁰|:

Table 3: Image by Author

Step 3: Compute the signed rank (Rₛᵢ) for each sample, where the sign is given by the sign of Xᵢ − m⁰ (here m⁰ = 4; this is the same sign that was used in the Sign Test), i.e.,

Table 4: Image by Author

Step 4: Compute the Wilcoxon Signed Rank Test statistic, which is given by the sum of the signed ranks for all the data points:

Before we proceed to discuss the distribution of W, let’s pause for a moment and think about the logic behind these steps. In the sign test, we simply counted the number of samples above or below m⁰. Here, we weight the count by rank, which measures the relative absolute deviation from m⁰ across samples. This gives higher weight to values that are much smaller or much larger than m⁰, helping us overcome the limitation of the Sign Test.
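The four steps translate directly into a few lines of Python. A minimal sketch, with hypothetical data and scipy’s rankdata doing the ranking:

```python
import numpy as np
from scipy.stats import rankdata

def signed_rank_statistic(x, m0):
    """Wilcoxon signed-rank statistic W: the sum of the signed ranks of |X_i - m0|."""
    d = np.asarray(x, dtype=float) - m0
    d = d[d != 0]                     # zero deviations are dropped (see the complications later)
    abs_ranks = rankdata(np.abs(d))   # Steps 1-2: rank |X_i - m0|; ties get average ranks
    return float(np.sum(np.sign(d) * abs_ranks))  # Steps 3-4: sign the ranks and sum

# Hypothetical sample, tested against m0 = 4:
x = [4.5, 3.2, 5.1, 2.8, 4.9, 3.9, 6.0, 1.5, 4.2, 3.6]
print(signed_rank_statistic(x, m0=4))
```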

Now, let’s evaluate the distribution of W. It’s important to mention that there is NO closed-form formula for the distribution of W (there are closed-form expressions for its moment-generating function, but no such expression exists for its mass function). One way to approach this problem is to use tables (yes, there are tables of critical values of W for different sample sizes). The alternative is to approximate the distribution (CDF) of W using the Lyapunov CLT:

where Φ is the cumulative distribution function of the standard normal distribution. The precise math is quite involved and beyond the scope of this article, but we can still make sense of the result. Let’s calculate the expectation and variance of W:

where Sᵣ denotes the sign of the rᵗʰ rank (all rank values will go from 1 to n). Thus,

If we standardize W, we obtain:

Indeed, it’s quite similar to the expression inside Φ. In fact, by the Lyapunov CLT (not the traditional CLT: even though the terms rSᵣ are independent, they are not necessarily identically distributed), we have:

This leads to the Φ term in the distribution of W. You may also notice that the CDF of W contains an additional term of 1. This is a continuity correction. Recall that W is a discrete random variable. Whenever a discrete random variable is approximated by a continuous one (e.g., the normal distribution), it’s important to add a correction term, usually a half-unit correction: the interval endpoints are shifted by 0.5 units so that the probabilities assigned by the continuous distribution more closely match those of the discrete distribution, improving the accuracy of the approximation.

For approximating W, however, we use a full-unit correction instead of the traditional half-unit one. This is because the values of W can differ from one another only in multiples of 2. Think of it this way: if the value of W is w and you want to decrease it, you must flip the sign of a positive rank, which changes W by at least 2. For instance, in our example, W was calculated to be 3. If we want to decrease W, the least we can do is flip the +1 to a −1, which decreases W by 2 (−1 − 1 = −2). Similar logic holds for increasing W. Thus, to account for the spacing of the values W can take, a whole-unit correction (0.5 × 2 = 1) is used instead of the traditional half-unit correction (0.5 × 1 = 0.5). After applying the whole-unit correction and the Lyapunov CLT, we obtain:

Now that we have an approximate expression for the distribution of W, we proceed to calculate the p-value for our example. Recall that the p-value is the probability, under the null hypothesis, of observing a value of the test statistic at least as extreme as the one in our sample. Thus,

Evidently, the p-value is quite large. So, even at a 10% significance level, since the p-value > 0.10, we fail to reject the null hypothesis! And that concludes this simple, yet useful application of the Wilcoxon Signed Rank Test. Now let’s consider the case where the alternative hypothesis points in the other direction. In particular, suppose we were interested in testing:

The test statistic remains the same. However, since the direction favoring the alternative hypothesis changes, the p-value calculation is adjusted accordingly (as per the definition of the p-value):

Again, the p-value is large and we fail to reject the null at a 10% significance level. Finally, let’s look at the two-sided case:

The test statistic again remains the same, and since the alternative is now two-sided, the p-value calculation is adjusted accordingly (as per the definition of the p-value):

Again, the p-value is large, and we fail to reject the null at a 10% significance level. Now, let’s generalize the test to any specific quantile, not just the median. The methodology remains exactly the same. For example, suppose that for the same dataset as above, we are given the following hypotheses:

Just as before, we construct the table and sum up the signed ranks to get the test statistic:

Table 5: Image by Author

Now, we determine the distribution of the above test statistic under the null:

If we standardize W, we obtain

By Lyapunov CLT, we have:

Thus,

That’s a lower p-value, but still not low enough to reject at a 5% or 10% significance level. With this, we conclude our discussion of the One-Sample Wilcoxon Signed Rank Test.
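Putting the pieces together, here is a sketch of the full one-sample test using the normal approximation derived above, with E[W] = 0, Var(W) = n(n + 1)(2n + 1)/6, and the full-unit continuity correction. The sample is the same hypothetical one as before:

```python
import numpy as np
from scipy.stats import norm, rankdata

def wilcoxon_signed_rank_pvalue(x, m0, alternative="greater"):
    """One-sample Wilcoxon signed-rank test via the Lyapunov-CLT normal
    approximation: W ~ N(0, n(n+1)(2n+1)/6), with a full-unit correction."""
    d = np.asarray(x, dtype=float) - m0
    d = d[d != 0]                                   # drop zero deviations
    n = len(d)
    w = float(np.sum(np.sign(d) * rankdata(np.abs(d))))
    sd = np.sqrt(n * (n + 1) * (2 * n + 1) / 6.0)
    if alternative == "greater":      # P(W >= w) ~ 1 - Phi((w - 1) / sd)
        return norm.sf((w - 1) / sd)
    if alternative == "less":         # P(W <= w) ~ Phi((w + 1) / sd)
        return norm.cdf((w + 1) / sd)
    # two-sided: W is symmetric about 0 under H0, so double the upper tail at |w|
    return min(1.0, 2 * norm.sf((abs(w) - 1) / sd))

x = [4.5, 3.2, 5.1, 2.8, 4.9, 3.9, 6.0, 1.5, 4.2, 3.6]
print(wilcoxon_signed_rank_pvalue(x, m0=4, alternative="two-sided"))
```

For production use, scipy.stats.wilcoxon implements a closely related version of this test (its statistic is the sum of the positive ranks rather than the sum of all signed ranks, so the numbers differ even though the two formulations are equivalent).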

There are two possible complications that may arise while using the Wilcoxon Signed Rank test:

  1. For some i ≠ j, |Xᵢ − m⁰| = |Xⱼ − m⁰|, i.e., two absolute deviations from m⁰ are the same. In this case, which of the observations should be assigned the higher rank? Firstly, under the assumption that the data follow a continuous distribution, the same value theoretically shouldn’t occur more than once. In practice, however, this happens often. There are many ways to approach this problem, but the most common strategy is to assign each tied sample the average of the ranks involved. For instance, suppose 4 sample values have the same absolute deviation from m⁰ and would occupy ranks 7, 8, 9, and 10; we assign each of them the average rank (7 + 8 + 9 + 10)/4 = 8.5. (Both complications are shown in code after the example below.)
  2. If |Xᵢ − m⁰| = 0 for some i, we exclude that sample from the analysis altogether. We then use the reduced sample (and the correspondingly reduced sample size) to compute the test statistic and its distribution.

For example, if we are given the following data and hypothesis:

We exclude the 5th sample (since |X₅ − 3| = 0) and use the remaining samples to calculate the average ranks:

Table 6: Image by Author
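Both complications are straightforward to handle in code: rankdata averages tied ranks by default, and zero deviations are filtered out before ranking. A sketch with hypothetical data exhibiting both issues for m⁰ = 3:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical sample: the 5th value equals m0 = 3 exactly (zero deviation),
# and several values tie in absolute deviation from m0.
x = np.array([4.2, 1.8, 4.2, 5.0, 3.0, 2.5, 3.5, 6.1, 0.9, 3.4])
m0 = 3

d = x - m0
d = d[d != 0]                      # complication 2: exclude |X_i - m0| = 0
ranks = rankdata(np.abs(d))        # complication 1: tied |deviations| share the average rank
print(ranks)                       # the three |d| = 1.2 values all get rank 5.0
print(np.sum(np.sign(d) * ranks))  # test statistic on the reduced sample (n = 9)
```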

As our last non-parametric hypothesis-testing tool, let’s look at the two-sample extension of the Wilcoxon Signed Rank Test. As before, the null hypothesis states that the difference in the medians is 0, whereas the alternative hypothesis suggests that there is a difference between the two samples, i.e., the difference in the medians is positive, negative, or simply non-zero. Mathematically, the null and alternative hypotheses are given as follows:

Let’s first look at the general framework and then consider an example. Suppose we are given data from two samples: X₁, X₂, · · ·, Xₙ₁ and Y₁, Y₂, · · ·, Yₙ₂. We combine the two samples into a single sample and arrange all observations in ascending order. We then assign each observation a rank (or the average rank in case of ties) from 1, 2, · · ·, n₁ + n₂. Finally, we calculate the sum of the ranks of all observations from X, which we refer to as our test statistic W. For example, suppose we are given the following two samples:

The hypotheses are given as follows:

Step 1: Combine Xᵢ and Yᵢ:

24.27, 8.63, 16.76, 21.92, 29.59, 4.01, 7.28, −7.75, −6.61, 13.05, 13.47, 24.6, −4.97, 0.07, 6.96, −0.53, 7.26, −11.7, −5.01, −4.43

Step 2: Arrange the combined sample in ascending order (remember to keep track of which observations come from which sample):

Table 7: Image by Author

Step 3: Assign the ranks:

Table 8: Image by Author

Step 4: Calculate the test statistic by summing up ranks from sample X:

Next, we find the distribution of W under the null. As before, there is NO closed-form formula for the distribution of W, so we approximate its distribution (CDF) using the Lyapunov CLT:

where Φ is the cumulative distribution function of the standard normal distribution. We can check this by calculating the expectation and variance of W. Note that this calculation is quite extensive, so you may skip it and proceed directly to the p-value calculation; we include it for completeness. Let Rˣᵢ denote the rank assigned to the iᵗʰ sample from X, and Rʸⱼ the rank assigned to the jᵗʰ sample from Y:

Under the null, since the distributions are identical, Rˣᵢ and Rʸⱼ must be identically distributed for all 1 ≤ i ≤ n₁; 1 ≤ j ≤ n₂. Thus,

The calculation for the variance is slightly more complex. We make use of the following facts:

Thus,

Similarly, for i ≠ j:

Therefore, by the definition of variance and covariance, we have:

Therefore, by the extended sum of variances:

Thus, if we standardize W and apply CLT, we obtain:

Thus,

Thus, at a 10% significance level, since the p-value < 0.10, we can reject the null hypothesis! Finally, let’s try to calculate the p-value for the same data if the alternative hypothesis were two-sided:

The test statistic again remains the same, and since the alternative is now two-sided, the p-value calculation is adjusted accordingly (as per the definition of the p-value):

Thus, we won’t be able to reject the null at a 10% significance level for the two-sided case. This concludes our discussion of conducting the Wilcoxon Signed Rank Test for the one- and two-sample cases.
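Finally, the entire two-sample procedure fits in a short script. One assumption here: the first ten values of the combined list in Step 1 come from X and the remaining ten from Y (the article’s tables make the true assignment explicit). The moments E[W] = n₁(n₁ + n₂ + 1)/2 and Var(W) = n₁n₂(n₁ + n₂ + 1)/12 are taken from the derivation above:

```python
import numpy as np
from scipy.stats import norm, rankdata

# Assumed split of the combined sample from Step 1 (first 10 from X, last 10 from Y):
x = np.array([24.27, 8.63, 16.76, 21.92, 29.59, 4.01, 7.28, -7.75, -6.61, 13.05])
y = np.array([13.47, 24.6, -4.97, 0.07, 6.96, -0.53, 7.26, -11.7, -5.01, -4.43])

n1, n2 = len(x), len(y)
ranks = rankdata(np.concatenate([x, y]))   # ranks 1..(n1 + n2); ties would get averages
w = ranks[:n1].sum()                       # sum of the ranks of sample X; here w = 124.0

mean_w = n1 * (n1 + n2 + 1) / 2.0          # E[W] = 105
sd_w = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)

z = (w - mean_w) / sd_w                    # continuity correction omitted for simplicity
p_one_sided = norm.sf(z)                   # ~0.076: rejects at the 10% level
p_two_sided = 2 * norm.sf(abs(z))          # ~0.151: fails to reject at the 10% level
print(w, p_one_sided, p_two_sided)
```

Under this assumed split, the one-sided and two-sided p-values straddle 0.10, matching the conclusions above.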

In this article, we familiarised ourselves with some of the most well-known non-parametric, or distribution-free, hypothesis-testing frameworks. We took a look at the Sign Test and the Wilcoxon Signed Rank Test for both the one-sample and the two-sample case and compared their performance on sample observations. We explored the one-sided and two-sided cases for each test, as well as how to generalize them to test any given quantile of the data. The Sign Test, although less powerful in general, is quite simple, relying only on the CDF of the binomial distribution to derive the associated p-values. The Wilcoxon Signed Rank Test, on the other hand, is far more complex in its theory but tends to give better results, as it considers not only whether the values are smaller or bigger than the threshold but also the relative magnitude of their absolute deviations.

Hope you enjoyed reading this article! In case you have any doubts or suggestions, do reply in the comment box.

Please feel free to contact me via mail. If you liked my article and want to read more of them, visit this link.

Note: All images have been made by the author.

