
If I simulate the one-sample Wilcoxon test (two-sided) for samples of size 5 (1000 replications), the null hypothesis is never rejected. Why?

I know the probability of the test statistic being equal to zero is $1/32 = 0.03125$, which is less than $\alpha=0.05$. However, I never get p-values less than $\alpha$.

Alexis
    Toss a fair coin $5$ times: the probability they are all the same (a two-sided measure - pun intended) is $0.0625>0.05$ – Henry Sep 21 '22 at 13:33
    To add to @Henry's point, the probability of all 5 1st-group measures being greater than all 5 2nd-group measures is 1/32 = 0.03125, but there is also the probability of all 5 1st-group measures being less than all 5 2nd-group measures, which is also 0.03125. So you can only fail to reject a two-sided null hypothesis at the $\alpha=0.05$ level with a sample size of 5, because the smallest possible p-value is 0.0625. Put another way: multiply all the p-values in your simulation by 2. :) – Alexis Sep 21 '22 at 14:37
    Though I suspect the extreme case here should be $2\times \dfrac{1}{10 \choose 5}\approx 0.0079$ – Henry Sep 21 '22 at 14:48
    @Henry That's right--for the two-sample test. Those who don't want to work it out can have R confirm your value with wilcox.test(1:5, 6:10). For the one-sample test of the question, wilcox.test(1:5) returns 0.0625, as you first indicated. Performing this calculation before running the simulation would have indicated what the simulation is able to achieve. – whuber Sep 21 '22 at 16:36

1 Answer


I'll write "differences" when describing the observed values (as if it were a paired test), but for a one-sample test just read 'difference' as 'observation' or 'value'.

The test statistic (here using Wilcoxon's original definition: the sum of the ranks of the negative differences) can take the values $\{0, 1, ..., 15\}$ ($0$ being the value when all five differences are positive, and $15$ being the value when all five differences are negative).

These have probability $\frac{1}{32}$ times $(1,1,1,2,2,3,3,3,3,3,3,2,2,1,1,1)$ respectively (the probabilities are symmetric around the mean sum of ranks of $\frac{15}{2}$).
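These counts are easy to verify by brute force: under $H_0$ each of the $2^5 = 32$ sign patterns is equally likely, so we can simply enumerate them. A minimal Python sketch (variable names are my own, not part of the original discussion):

```python
from itertools import product

n = 5
# Null distribution of W- (sum of ranks of the negative differences):
# each of the 2^n sign patterns is equally likely under H0.
counts = [0] * (n * (n + 1) // 2 + 1)  # possible values 0..15
for signs in product([False, True], repeat=n):
    w_minus = sum(rank for rank, neg in zip(range(1, n + 1), signs) if neg)
    counts[w_minus] += 1

print(counts)  # [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1, 1, 1]
```

Note the symmetry of the counts about $w = 15/2$, exactly as stated above.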

The smallest possible two-tailed p-value is therefore $\frac{1}{32}+\frac{1}{32} = \frac{1}{16} = 0.0625 > 0.05$. You cannot observe a smaller p-value than this.

Consequently, you cannot reject $H_0$ at the $5\%$ level with $n=5$; the smallest available two-tailed significance level is $\frac{1}{16}$.
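To confirm that no outcome whatsoever can reject at the $5\%$ level, one can compute the exact two-sided p-value for every attainable value of the statistic. Here is a sketch using exact fractions (again, the helper name `p_two_sided` is mine):

```python
from fractions import Fraction
from itertools import product

n = 5
total = 2 ** n
# Exact null distribution of W- over all 2^n sign patterns
dist = {}
for signs in product([False, True], repeat=n):
    w = sum(rank for rank, neg in zip(range(1, n + 1), signs) if neg)
    dist[w] = dist.get(w, 0) + 1

def p_two_sided(w):
    """Exact two-sided p-value: twice the smaller tail, capped at 1."""
    lower = sum(c for v, c in dist.items() if v <= w)
    upper = sum(c for v, c in dist.items() if v >= w)
    return min(Fraction(2 * min(lower, upper), total), Fraction(1))

smallest = min(p_two_sided(w) for w in dist)
print(smallest)  # 1/16, attained at W = 0 and W = 15
```

Every one of the 32 possible outcomes yields a p-value of at least $1/16$, so a simulation at $\alpha = 0.05$ can never reject, no matter how many replications are run.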


This is a common problem with extremely small samples and various commonly used discrete test statistics (e.g. those used in sign tests, Fisher exact tests, rank correlation tests, the usual rank tests of location, etc). I have on multiple occasions seen researchers conduct tests at small sample sizes in these circumstances without being aware that there was no possible sample that could lead them to reject $H_0$.

If significance levels are lower, such as when there are corrections for multiple testing, this problem will occur more often.

This is one reason (of several) why I encourage people to explicitly choose a rejection rule based on a test statistic, not simply to compare $p$ to some $\alpha$ (a potentially dangerous practice, as we see here). Explicitly identifying a rejection rule at least increases the chance that the problem (that no such rule exists at the desired significance level) will be recognized.

Indeed, when the statistic is discrete I encourage people to identify the set of available significance levels (or at least the ones near desired significance levels), and to choose from among them.
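For this statistic at $n=5$, the attainable two-sided levels (for symmetric rejection regions of the form $W \le k$ or $W \ge 15-k$) can be listed directly. An illustrative Python sketch, reusing the exact enumeration of the null distribution:

```python
from fractions import Fraction
from itertools import accumulate, product

n = 5
max_w = n * (n + 1) // 2  # 15
counts = [0] * (max_w + 1)
for signs in product([False, True], repeat=n):
    counts[sum(r for r, neg in zip(range(1, n + 1), signs) if neg)] += 1

# Size of the test that rejects when W <= k or W >= max_w - k
cum = list(accumulate(counts))
levels = [Fraction(2 * cum[k], 2 ** n) for k in range((max_w + 1) // 2)]
print([str(level) for level in levels])
# ['1/16', '1/8', '3/16', '5/16', '7/16', '5/8', '13/16', '1']
```

The smallest attainable level is $1/16$; there is simply nothing available at or below $0.05$, which is exactly the situation the answer describes.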

There are sometimes ways to mitigate the discreteness issue, at least with some statistics, without altering the interpretation of what the statistic measures -- and without losing the distribution-free property of permutation statistics when performing such tests. Such approaches can't help in this case, though.

Glen_b