Why does the Wilcoxon rank-sum care about data variance with uneven group sizes?

Question

I was playing a bit with simulations to get a better picture of how the unpaired t-test and Welch's t-test compare on same-mean data with unequal variances. As a third test, I included the rank-sum test. I noticed that when I repeatedly compare groups of 60 and 30 normally distributed samples (mean 0, the first group having a stdev of 1, the second a stdev of 2), the rank-sum test tends to give false positives, with p<0.05 in a fraction of 0.08-0.09 of the simulation repeats (a similar trend to the non-Welch t-test). When the two groups have the same size of 30 (with stdevs of 1 and 2 again), the false positive rate is just a shade over the expected 0.05.

Where does the high false positive rate of the rank-sum test come from in the case with uneven group sizes? I think I understand what the null hypothesis is, but still don't immediately see how the group size affects the test.

Simulation code for just the rank-sum test in Matlab below:

nRepeats = 1e4;
%% We measure fraction of false positives for data with uneven variance, with:
% a) same-size group
% b) first group twice as large as the second one
meanVal = 0;
sd1 = 1;
sd2 = 2;
nSamples = 30;
pRSsame = zeros(nRepeats, 1);
pRSdifferent = zeros(nRepeats, 1);
for iRepeat = 1:nRepeats
    % same-size
    data1 = randn(nSamples, 1) * sd1 + meanVal;
    data2 = randn(nSamples, 1) * sd2 + meanVal;
    pRSsame(iRepeat) = ranksum(data1, data2);
    % first group twice as large as the second
    data1 = randn(2*nSamples, 1) * sd1 + meanVal;
    data2 = randn(nSamples, 1) * sd2 + meanVal;
    pRSdifferent(iRepeat) = ranksum(data1, data2);
end
fractionFPsame = sum(pRSsame < 0.05)/nRepeats % ~0.055
fractionFPdifferent = sum(pRSdifferent < 0.05)/nRepeats  %~0.088

Thank you!

It assumes exchangeability under the null, as with any permutation test, and it's sensitive to that for much the same reason the t-test is: it's less likely to see a difference when one sample tends to stick out at both ends at the same time. This isn't necessarily a problem with samples with different spread, because typically the spreads are related to the locations; if the spread increases fairly moderately as the location moves up, there's no problem (e.g. you may just be dealing with a scale difference rather than a location difference). It can be perfectly consistent with exchangeability. — Glen_b, Oct 08 '21 at 05:06
@Glen_b Just to double-check, is it right that for this reason, one cannot say the null hypothesis is P(X>Y) = P(X<Y), but that it requires the assumption of the same distribution at the same time? — TJ27, Oct 08 '21 at 17:18
That's a reasonable way to write the null. You need the distribution to be the same when the null is true, or the basis for permuting the labels is lost*, but that's not the null, that's an assumption (in the same sense that in an ordinary two-sample t-test the null is about means but the assumptions include having the same variance & independence). Obviously the distribution is not the same under the alternative; indeed that's the whole point of the test! ... [*(we typically assume independence as well as the same distribution, but exchangeability is a weaker condition than independence)] — Glen_b, Oct 09 '21 at 01:33
It's not clear to me why many authors insist on importing the assumptions into the null for nonparametric tests when they don't do it with parametric ones. The circumstances are analogous. — Glen_b, Oct 09 '21 at 01:34
The data might well have different distributions. The issue is what you assume would happen under the null, which the data usually can't help you with (since the null is usually false). It's more about "what do I think a null effect is". Consider a treatment that you expect either to modify some measured response relative to the standard it's being compared to, or (under the null) to be ineffective -- not to have any effect on the distribution at all. If the treatment works, the populations -- and hence the samples -- might well have increasingly different shapes at bigger effect sizes. — Glen_b, Oct 09 '21 at 03:56
That "does nothing to change the distribution" assumption about the null won't always be suitable and then you might need to do something else -- like consider a bootstrap test for the difference in means (or whatever other quantity is of interest). — Glen_b, Oct 09 '21 at 03:57
That's very true actually, it does seem very analogous when you mention it!
I guess the difference might be that with e.g. the t-test, people just say "don't use it" if the variances are unequal, but with the Wilcoxon rank-sum, people use it even for data with very different distributions (which is not a problem, as you say, but it seems the interpretation is different)?

(also thank you for your responses - I removed the question probably while you were writing, sorry, because I realised it wasn't very clear with regards to what I meant). — TJ27, Oct 09 '21 at 03:59
The same thing is the case with the equal variance t-test. Imagine that under the null we have the same (~close to) normal distribution as assumed, but under the alternative the variance and even the shape for the treated group begin to change as the effect size moves away from 0. The test still has the right significance level, and as long as the distribution doesn't change very quickly with effect size, will still behave in sensible ways - the power will still increase as the effect size gets larger. The biggest problem will be that the 'standard' power calculations won't work as is. — Glen_b, Oct 09 '21 at 04:05
It's true, I never realised the degree of similarity there, thanks a lot for bringing up that angle of view.
With power calculations I find it can be the most straightforward to simulate that anyway, based on how the data actually look (if we have an idea) and the test I plan to use... — TJ27, Oct 09 '21 at 04:13
Maybe I wasn't too clear about how the simulations look - it's not necessarily bootstrapping, which clearly has the issue you raise. What I meant is that one can sometimes have a decent idea of what the data look like in general (based on relevant prior work, such as when you use a particular experimental technique in a slightly new setting and there is no reason to think the data will be distributed differently). There can also be things like unpaired experiments of treatment vs control, where you know that the treatment group tends to have more of a spread, — TJ27, Oct 10 '21 at 02:32
but you need to know if there is a shift in means - so you would want to do a t-test (say the distributions are reasonably normal) - but you anticipate that in the end you'll use Welch's t-test rather than the standard one. So rather than making "classical" power calculations with the normal t-test, I'd say it's better to just generate data like you expect to see in the future (in a way not limited to the particular observed values) and get the rates of type I and type II errors using repeated simulations for various parameters. — TJ27, Oct 10 '21 at 02:36
P.S. Of course power shouldn't be calculated at the observed effect size (although post-hoc power can be interesting sometimes to demonstrate bias in publications :)) - but again you can often have an a priori guess that informs a range to be explored. If a drug increases heart rate in normal hearts by 20%, that is a sensible first guesstimate for the heart rate change in diabetic hearts, as long as there is no evidence that it should act differently. Sure, one would explore the power for 10% and 30% as well, but there is no need to consider a 200% change. — TJ27, Oct 10 '21 at 02:39
@Glen_b Sorry for perhaps being slow and reviving this again - was just reading other threads on this and ran a couple extra simulations and perhaps confused myself again. Here you wrote that Mann-Whitney isn't sensitive to changes in variance with equal mean: https://stats.stackexchange.com/questions/56649/mann-whitney-null-hypothesis-under-unequal-variance/56653#56653 - but my impression is that the simulations in my post and the work by Zimmermann 2004 (https://files.eric.ed.gov/fulltext/EJ848306.pdf, and others) show otherwise. — TJ27, Oct 20 '21 at 03:51
Assuming we take the more "general" application of the test (testing whether one group is stochastically larger), taking the null hypothesis as P(X<Y) = P(X>Y), then I just don't understand again why the data variance matters when mean is the same and the distribution is symmetrical. If I take the example in the simulation above, having 60 normally distributed numbers with mean 0 & sd 1 versus 30 normally distributed numbers with mean 0 & sd 2, in this case, the null hypothesis holds, no? Given the perfect symmetry, I do believe that neither group is stochastically larger? — TJ27, Oct 20 '21 at 03:51
But more than 5% of the ranksum tests on such data give p-value smaller than 0.05, giving a high number of false positives. This is not nearly as bad when the groups have the same size. This seems to have been discussed a number of times (e.g. Zimmermann 2004 https://files.eric.ed.gov/fulltext/EJ848306.pdf - third paragraph of "Further implications" sounds relevant for this discussion), but I do not see the mathematical explanation for the simulation results. — TJ27, Oct 20 '21 at 03:52
... Or is it that this is simply a result of the violation of the assumption of same distribution of the two populations? — TJ27, Oct 20 '21 at 04:01
The phrasing there could be clearer but the other answer of mine that you link to is making the claim that the test doesn't respond to alternatives where you hold the mean constant and change the variance (i.e. that it doesn't have good power against pure changes in spread rather than changes in location). That's demonstrably true, but unrelated to anything being said here. — Glen_b, Oct 20 '21 at 12:20
@Glen_b: Ok, I see. Right, so that was indeed somewhat unrelated. That said - how would you then phrase the assumptions and the null hypothesis of the test? Trying to read up more about the test, I find it puzzling how different formulations are provided across different resources (some of them being clearly problematic). — TJ27, Oct 21 '21 at 02:20
There are several ways to phrase it (and it might vary depending on which alternatives are particularly of interest), but I'd typically say for the usual assumptions: independence, continuity, same shape and scale under the null, with a null $P(X>Y) = \frac12$ vs $P(X>Y) \neq \frac12$ for the two-sided alternative, where $X$ and $Y$ represent a random observation from the first and second population, respectively. However, somewhat narrower classes of alternatives are common (such as ordering in the cdfs). — Glen_b, Oct 21 '21 at 05:17
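To see where the group sizes enter, a short variance calculation may help. This is a sketch added for illustration under the independence and continuity assumptions above, not part of the original comments; the statistic $U$ (the Mann-Whitney form of the rank sum), the sample sizes $m$ and $n$, the independent copies $X'$ and $Y'$, and the symbols $p$, $p_2$, $p_3$ are notation introduced here.

$$
U = \sum_{i=1}^{m}\sum_{j=1}^{n} \mathbf{1}\{X_i > Y_j\}, \qquad p = P(X > Y), \quad p_2 = P(X > Y,\ X > Y'), \quad p_3 = P(X > Y,\ X' > Y)
$$

$$
E[U] = mnp, \qquad \operatorname{Var}(U) = mn\left[\,p(1-p) + (n-1)(p_2 - p^2) + (m-1)(p_3 - p^2)\,\right]
$$

If the two distributions are identical, $p = \frac12$ and $p_2 = p_3 = \frac13$, and the bracket collapses to $(m+n+1)/12$, giving the usual $\operatorname{Var}(U) = mn(m+n+1)/12$ that the p-value is calibrated against. If only $p = \frac12$ holds, $p_2$ and $p_3$ are free to differ from $\frac13$. For two normal populations with equal means, $p_3 > \frac13 > p_2$ when $Y$ has the larger spread, so a large $m$ (many observations from the narrower group) puts more weight on the inflated term, the true variance of $U$ exceeds the nominal one, and rejections occur more often than the nominal level. With $m = n$ the two deviations largely offset each other, which would be consistent with the near-nominal rate in the equal-size simulation.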
At this point I should probably turn my comments into an answer, though it's largely a rejection of the premises of the question as it stands (a frame challenge as some stackexchange sites like to say). — Glen_b, Oct 21 '21 at 05:21
Thank you! Right, with these assumptions I see how the original simulation violates these so one wouldn't expect the nominal significance level to hold.
I think a part of my confusion about how to phrase this was that I often see the test used for comparing distributions that are clearly not same-shape (e.g. left-skewed versus right-skewed) on the grounds of non-normality, so I think I was subconsciously looking for phrasing that would not require same shape & scale, but would preserve the nominal level. — TJ27, Oct 21 '21 at 17:04
I think your insights are very valuable here, as this is not something that would be that often discussed even in many statistics books. I have no issue at all with frame challenge personally. — TJ27, Oct 21 '21 at 17:05
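As a rough numerical check on the simulated rates in the question, the sketch below predicts the rejection rate of the rank-sum test from a large-sample normal approximation. It is written in Python with only the standard library rather than in Matlab, and it rests on several assumptions that are not from the thread: both groups are normal with mean 0, the test uses the normal approximation without continuity correction, and the count $U$ of pairs with $X_i > Y_j$ has a variance made up of a per-pair term plus covariances between pairs that share an observation. For normal data those covariances follow from the orthant probability $\frac14 + \arcsin(\rho)/(2\pi)$. The function name `predicted_rejection_rate` and its parameters are hypothetical.

```python
from math import asin, pi, sqrt
from statistics import NormalDist


def predicted_rejection_rate(m, n, sd_x, sd_y, alpha=0.05):
    """Approximate two-sided rejection rate of the rank-sum test when
    X ~ N(0, sd_x^2) has m draws and Y ~ N(0, sd_y^2) has n draws.

    Illustrative helper (not from the thread); large-sample normal
    approximation, no continuity correction.
    """
    total_var = sd_x ** 2 + sd_y ** 2
    # Covariance of 1{X>Y} and 1{X>Y'} (pairs sharing the same X):
    # X-Y and X-Y' are bivariate normal with correlation sd_x^2 / total_var.
    cov_shared_x = asin(sd_x ** 2 / total_var) / (2 * pi)
    # Covariance of 1{X>Y} and 1{X'>Y} (pairs sharing the same Y).
    cov_shared_y = asin(sd_y ** 2 / total_var) / (2 * pi)
    # Actual variance of U when P(X>Y) = 1/2 but the spreads may differ.
    true_var = m * n * (0.25 + (n - 1) * cov_shared_x + (m - 1) * cov_shared_y)
    # Variance the test assumes (identical distributions).
    nominal_var = m * n * (m + n + 1) / 12
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # The test rejects when |U - mn/2| exceeds z_crit * sqrt(nominal_var),
    # while U really has standard deviation sqrt(true_var).
    return 2 * (1 - NormalDist().cdf(z_crit * sqrt(nominal_var / true_var)))


print(predicted_rejection_rate(60, 30, 1, 2))  # narrow group is larger
print(predicted_rejection_rate(30, 30, 1, 2))  # equal group sizes
print(predicted_rejection_rate(30, 60, 1, 2))  # wide group is larger
```

Under these assumptions the prediction comes out at roughly 0.086 for 60 vs 30 and roughly 0.059 for 30 vs 30, in the same region as the simulated 0.088 and 0.055. Swapping the sizes so that the wider group is the larger one pushes the predicted rate below 0.05, which suggests the distortion depends on which group is larger and not just on the sizes being unequal.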
