
I was wondering if it is possible to have 2 samples (A and B), where A is stochastically dominant over B (via Mann Whitney, or Kruskal Wallis), but where conversely B has a significantly higher mean than A (via 2-sample Welch test).

A and B cannot be symmetric for this to be possible, and probably cannot have the same shape either. A and B also would not be normal, but the t-test is valid per the CLT (say sample size 30 or 40). And this is probably more likely for one-sided tests than two-sided ones. I have tried various ways, but no luck so far...

Does such a counter-example exist, or is it the case that if A is dominant over B, then the mean of A can never be statistically inferior to the mean of B (it may not be statistically significantly superior, but it cannot be inferior)? That is, can both MW and Welch be significant, but in opposite directions?

Nick Cox
jginestet
  • It seems you have a typo in the next to last line. One of the B's should be A. – David Smith Jan 13 '24 at 03:03
  • Is there any constraint on the shape of the distribution? Just add a very large value to one group. – Jeremy Miles Jan 13 '24 at 04:05
  • @JeremyMiles adding a very large value may be counterproductive as it will inflate the standard deviation more than it does the mean, tending to make the t move nearer to 1, rather than become larger. – Glen_b Jan 13 '24 at 06:15
  • However the presence of the second sample does make the situation a little more complicated than that, so it's maybe not that straightforward. – Glen_b Jan 13 '24 at 07:25
  • Thanks Glen_b. Yes, I tried this, but as you said it inflates the sd much more than the mean (square!), and so makes things worse... – jginestet Jan 13 '24 at 22:27
  • @David Smith, thanks for catching the typo. Corrected – jginestet Jan 13 '24 at 22:28
  • @jginestet Yeah, for some examples the influence function of a single observation on the test statistic reaches a peak, then heads back toward 1 or -1. With the Welch test, though, because the degrees of freedom are changing as you go, you can also get a fun little bump in the shape of the influence function of the p-value coming in as well (it happens in some cases related to my second example). – Glen_b Jan 14 '24 at 09:31

3 Answers


(via Mann Whitney, or Kruskal Wallis),

While the test statistic in Mann-Whitney does correspond to a measure of $P(X>Y)$, once you go beyond two samples Kruskal-Wallis doesn't quite do that in general. Three pairwise Mann-Whitney tests are sensitive to (/can detect) non-transitive dominance (where $X>Y$, $Y>Z$, $Z>X$; "pairwise dominance" in this sense is a non-transitive relation) but Kruskal-Wallis does not detect that cyclical relationship.
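For instance, here is a minimal sketch (using the classic non-transitive "dice" values, chosen purely for illustration) of three groups with a dominance cycle that pairwise Mann-Whitney comparisons are sensitive to but that Kruskal-Wallis is not designed to detect:

# Each group pairwise "dominates" the next with probability 5/9, and the cycle closes.
A <- c(2, 2, 4, 4, 9, 9)
B <- c(1, 1, 6, 6, 8, 8)
C <- c(3, 3, 5, 5, 7, 7)
p_gt <- function(u, v) mean(outer(u, v, ">"))   # sample estimate of P(U > V)
c(A_vs_B = p_gt(A, B), B_vs_C = p_gt(B, C), C_vs_A = p_gt(C, A))   # all 5/9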

A and B also would not be normal, but the t-test is valid per CLT (say sample size 30 or 40)

$n=40$ gives no guarantee that the test statistic has a distribution very close to a t-distribution; and certainly nothing in the central limit theorem says that it will at that specific sample size. This is not especially important, as we can test for a difference in population means without needing a t-statistic to have a t-distribution. Besides tests based on other parametric assumptions (i.e. other than normality of the parent distributions), there are also nonparametric tests of means.
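As one illustration of the latter, a minimal sketch of a permutation test that uses the difference in sample means as its statistic (the helper name perm_mean_test is mine; it assumes exchangeability under the null):

perm_mean_test <- function(x, y, B = 10000) {
  obs <- mean(x) - mean(y)                    # observed difference in means
  z <- c(x, y); n <- length(x)
  perm <- replicate(B, {                      # reshuffle group labels and recompute
    s <- sample(length(z), n)
    mean(z[s]) - mean(z[-s])
  })
  mean(abs(perm) >= abs(obs))                 # two-sided permutation p-value
}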

For the moment, however, let's agree to take as given that the situation is such that the t-statistic is approximately distributed as t with the usual degrees of freedom, even though that won't always be the case.

I was wondering if it is possible to have 2 samples (A and B), where A is stochastically dominant over B [...] but where conversely B has a significantly higher mean than A (via 2-sample Welch test).

Yes. You can have populations where $P(X>Y)>0.5$ and yet with $\mu_Y>\mu_X$. With large enough samples you can have high power to detect both effects.

It's not just means and $P(X>Y)$ that this can happen with; you can do this with almost any pair of non-equivalent tests. You could, for example, have the Mann-Whitney go in the opposite direction to a difference in medians if you wanted.
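A tiny sketch of that last point (toy values of my own choosing): the sample medians and the pairwise-dominance proportion can already disagree in very small samples.

a <- c(9, 9, 9, 20, 20)      # median 9
b <- c(0, 0, 10, 10, 10)     # median 10
median(a); median(b)          # median(b) > median(a)
mean(outer(a, b, ">"))        # 0.64: yet a exceeds b in most of the 25 pairwise comparisons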

Does such a counter-example exist,

Do you mean as data sets? Counterexamples to the idea they should go in the same direction are easy to construct.

Finding very small samples for which the mean difference is in one direction and the Mann-Whitney "difference" is in the opposite direction isn't difficult; of course with small samples the p-values won't be small.

Attaining significance is then just a matter of accumulating the same sorts of differences enough times to make the standard errors of those differences small.

Here's one I just constructed in R:

# 1. create a pair of small samples where the Mann-Whitney and 
#   t-test differences go in opposite directions
x <- c(-39, -41, 9, 10, 11, 50)  # mean is 0
y <- -x # here the means are equal but x slightly dominates y (20/36)
x <- x-6  # here we move x so the means differ in the opposite direction   
          # without shifting enough to change the ordering

# 2. create many copies of each sample (pushing p-values down)
#    ... but with a very small amount of noise added (to remove ties)
rp <- 40
x1 <- rep(x, each=rp) + rnorm(rp*length(x), 0, .02)
y1 <- rep(y, each=rp) + rnorm(rp*length(y), 0, .02)

The large x-sample still dominates the y-sample 55% of the time (same proportion as in the small samples), and the large x-sample still has mean about 6 smaller than the y-sample (same as in the small samples).

Except now these samples are big enough that each of the differences is significant at the 5% level (both p-values are between 0.03 and 0.04). I used the ordinary equal-variance t-test for the exercise*, rather than the Welch t test -- not that it matters in this case; the sample variances are effectively identical, so the two tests are the same. Of course you can use the same strategy to attain even smaller p-values.
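To see the two tests disagree on the replicated samples, something like this should do (a sketch using x1 and y1 from the code above; the p-values will wobble slightly with the added noise):

t.test(x1, y1, var.equal = TRUE)   # equal-variance t-test: mean of x1 significantly below mean of y1
wilcox.test(x1, y1)                # Mann-Whitney: x1 significantly "dominates" y1 (about 55% of pairs)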


Example 2

X sample (sorted, n=27):

28.3, 28.4, 28.9, 29.6, 30.2, 30.4, 30.6, 31.1, 31.3, 107.8,   
107.9, 108, 108.3, 108.4, 108.5, 108.8, 109.1, 109.2, 109.4,   
109.5, 110, 110.2, 110.7, 110.8, 110.9, 111.1, 111.8  

Y sample (sorted,n=22):

 88.4, 88.9, 89.1, 90.9, 97.9, 98.5, 98.9, 99.2, 99.3,   
 99.5, 100.1, 100.6, 100.7, 100.9, 101.8, 102.4, 102.8, 102.9,   
103.4, 104.2, 105.4, 107.7  

Fully 2/3 of the $(X_i,Y_j)$ pairs have the X value exceeding the Y value, but the mean of the X's is well below the mean of the Y's. Both the Welch t-test and the Mann-Whitney test are significant at the 5% level for two-tailed tests, but they're detecting differences in opposite directions.
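To check the pairwise proportion directly, a quick sketch (assuming the listed values have been read into vectors x and y):

mean(outer(x, y, ">"))   # proportion of (X_i, Y_j) pairs with X_i > Y_j: about 2/3
mean(x) - mean(y)        # negative: the mean of the X's is below the mean of the Y's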

> t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = -2.1901, df = 27.243, p-value = 0.03725
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -31.597020  -1.036313
sample estimates:
mean of x mean of y 
 82.93333  99.25000 

> wilcox.test(x,y,conf.int=TRUE)

    Wilcoxon rank sum exact test

data:  x and y
W = 396, p-value = 0.04704
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 0.2 9.1
sample estimates:
difference in location 
                   6.7 

Here are confidence intervals for each test; the Mann-Whitney (MW, in blue) has A "higher" than B and the Welch t test (tW, in red) has A "lower" than B:

[Figure: 95% CIs for the location difference A-B from both tests; the Wilcoxon-Mann-Whitney interval (blue) shows A > B while the Welch t interval (red) shows A < B.]


* The specific data, statistics and p-values are not shown for the first example. It has 240 observations per sample, so I don't find it interesting in itself; the important part is understanding the construction method, with which you can make as many examples as you like. I was also worried people would focus too much on the specific appearance of this lone case and, either consciously or unconsciously, generalize features of that single example (i.e. assume that all cases would be of this form), as if its specific features characterized the situation. For example, I had some concern that they might conclude that the opposite directions of skewness in the first example were a necessary condition. The second, smaller example has quite different features, which helps with that concern.

Glen_b
  • thanks. That is exactly what I was looking for, and indeed my problem finding a counterexample was that the sd was always so large that the tests were not significant. But of course, if I just replicate the data, with a bit of noise, the power increases, and it does the trick. – jginestet Jan 13 '24 at 22:35
  • I was looking for this to ask students a "philosophical" question. If you have 2 treatments A and B, A being significantly dominant over B, but B having a significantly higher mean than A, which would/should be the recommended treatment? My own answer to this is that if I am a public health official, I would pick the one with the highest mean (overall average population health will improve). But if I am an MD, treating 1 patient, I would recommend the dominant one (my patient, most of the time, will be better off on it...). Thanks for the answer – jginestet Jan 13 '24 at 22:45
  • Ah, this is a very good question to ponder, and it's great to have some data to make it more concrete. For the purpose of making the values seem more realistic, there would be no problem with shifting and scaling both samples by the same amount (e.g. you could multiply the values in both samples by, say, 0.5 and add some value like 90, or 150, or whatever would suit). If I come up with a better example, I will come back and give it. To my recollection I have invented a smaller example at some point in the past (likely using a similar 'replication' trick) but I recall nothing about it. – Glen_b Jan 13 '24 at 22:55
  • Actually, now I write about it rather than just idly ponder, starting with Efron dice makes more sense than what I began with; that might be what I did before. I think perhaps one important lesson there is to ponder these questions at the time you're designing your studies, so that you make sure you answer the right question for the circumstances at hand. – Glen_b Jan 13 '24 at 22:57
  • Starting with thinking about Efron dice helped; I came up with x = -5,4,4 and y = 3,3, from which I've got a new set with n1=30 and n2=20, which is much more promising as an example. Just playing about with it a bit to make it look nice, and I might see if I can push the Mann-Whitney p-value down further and maybe reduce the sample sizes as a result; there's plenty of play in the t-test p-value. – Glen_b Jan 14 '24 at 01:28
  • Post has been updated with a second, much smaller example (total of 49 data points) constructed in more or less the same fashion as before with a few cosmetic tweaks. – Glen_b Jan 14 '24 at 02:18
  • @jginestet The "philosophical question" is maybe worth its own question. Personally I think that in such a case (a) one would need to look at the data to see how small the majority is that draws the mean in the other direction than the rank sum, and (b) it is really important what exactly this is about, i.e., what is measured, and what consequences a slightly worse result for a majority has vs. a much better result for a few. Note also that "significantly dominant" isn't good wording for the MW result as in these cases there is no domination (in the sense of "stochastically larger"). – Christian Hennig Jan 14 '24 at 11:01
  • The Mann-Whitney is directly based on an estimate of the probability a random observation from one population exceeds a random observation from the other (estimating $P(A>B)$ by $\frac{U}{mn}$). I believe the OP means larger in that specific sense; you can't get a significant Mann-Whitney statistic in that direction without the sample exhibiting that specific form of dominance (as a sample proportion rather than a probability, naturally) – Glen_b Jan 16 '24 at 00:43

It's also easy to construct transformed Normal examples. If $Y\sim N(-.1,2)$ and $X\sim N(0,1)$ then the Mann-Whitney test says $X\succ Y$, and therefore $\exp(X)\succ \exp(Y)$, but the mean of $\exp(Y)$ is greater than the mean of $\exp(X)$.
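A quick simulation sketch of this (reading $N(-.1,2)$ as mean $-0.1$ and variance $2$; the variable names here are mine):

set.seed(1)
n  <- 10000
x0 <- rnorm(n, 0, 1)
y0 <- rnorm(n, -0.1, sqrt(2))
mean(x0 > y0)                    # > 0.5: X tends to exceed Y, hence exp(X) tends to exceed exp(Y)
mean(exp(x0)); mean(exp(y0))     # but the mean of exp(Y) is the larger one
wilcox.test(exp(x0), exp(y0))    # points in the exp(X) > exp(Y) direction
t.test(exp(x0), exp(y0))         # Welch test points the other way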

From a medical point of view, I use the example of a preventive treatment that increases the cost for most people but substantially reduces it for people who would otherwise have had the prevented disease, e.g.

x <- rep(1, 1000)                               # cost with the preventive treatment: everyone pays 1
y <- rbinom(1000, 1, .1) * runif(1000, 10, 20)  # cost without it: ~10% incur a cost of 10-20, the rest pay 0
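A quick check of that construction (assuming, as the description suggests, that x is the cost with the preventive treatment and y the cost without):

mean(x > y)        # about 0.9: most people pay more with the treatment
mean(x); mean(y)   # but the average cost with it (1) is below the average cost without (about 1.5)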

There's an example of this happening in real life here with corticosteroid maintenance therapy in asthma

Thomas Lumley
  • With a Mann Whitney U test you compare whether one distribution is or is not stochastically greater than the other.

    Mann Whitney tests the hypothesis $$P(X>Y) = P(Y>X)$$ or $$\text{median}(Y-X) = 0$$

  • With a t-test you compare the means.

    A t-test tests the hypothesis $$\mu_X = \mu_Y$$ or $$\text{mean}(Y-X) = 0$$

These two tests both relate to a single, more general hypothesis that states that the two distributions are the same, since for two identical distributions we have $P(X>Y) = P(Y>X)$ and also $\mu_X = \mu_Y$.

So both tests can be used as a test of that single hypothesis, but they are sensitive to different alternative hypotheses (the one to the median of $Y-X$, the other to the mean of $Y-X$). If you assume that the change occurs only by means of a location shift, then stochastic order and difference in means tell the same story, since a shift has the same effect on the mean and the median. But in more general situations (differences beyond just a shift) the two can point in opposite directions: the difference $Y-X$ can have a mean and median of opposite signs.
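As a tiny illustration (toy numbers of my own) of a difference whose median and mean have opposite signs:

d <- c(-1, -1, -1, -1, 6)   # think of these as differences Y - X
median(d)                   # -1: the typical difference is negative
mean(d)                     #  0.4: the average difference is positive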

Related question 1

A related question is Have I presented this Mann-Whitney U test appropriately?

An illustrative example is the contingency table below.

Intuitively you can consider the Mann-Whitney U test as comparing something like an empirical joint distribution (the numbers in the cells are the product of the numbers in the margins, e.g. the upper left number $147 = 7 \times 21$):

$$\begin{array}{cc | cccccccc} &&\text{SD} &\text{D}&\text{N}&\text{A}&\text{SA}\\ & &7 & 15& 28 & 13 & 8\\ \hline \text{SD}&21& \color{gray}{147} & \color{blue}{315} & \color{blue}{588} & \color{blue}{273} & \color{blue}{168}\\ \text{D}&17& \color{red}{119} & \color{gray}{255} & \color{blue}{476} & \color{blue}{221} & \color{blue}{136} \\ \text{N}&82& \color{red}{547} & \color{red}{1230} & \color{gray}{2296} & \color{blue}{1066} & \color{blue}{656}\\ \text{A}&34& \color{red}{238} & \color{red}{510} & \color{red}{952} & \color{gray}{442} & \color{blue}{272} \\ \text{SA}&18 & \color{red}{126} & \color{red}{270}& \color{red}{504} & \color{red}{234} &\color{gray}{144} \\ \end{array}$$

And the question is: Do I get more observations in the upper right corner (men more often higher than women, blue) or in the lower left corner (women more often higher than men, red)?

The t-test answers the question whether the marginal means of the two distributions are different. You can imagine that moving around the values in those two areas can change the means while keeping the stochastic dominance the same.

Below is an example where the means are 3 for both X and Y:

$$\begin{array}{cc | cccccccc} &&&&&X \\ & &&\text{1} &\text{2}&\text{3}&\text{4}&\text{5}\\ && &50 & 50& 50 & 50 & 50\\ \hline &\text{1}&50& \color{gray}{10} & \color{blue}{10} & \color{blue}{10} & \color{blue}{10} & \color{blue}{10}\\ &\text{2}&50& \color{red}{10} & \color{gray}{10} & \color{blue}{10} & \color{blue}{10} & \color{blue}{10} \\ Y&\text{3}&50& \color{red}{10} & \color{red}{10} & \color{gray}{10} & \color{blue}{10} & \color{blue}{10}\\ &\text{4}&50& \color{red}{10} & \color{red}{10} & \color{red}{10} & \color{gray}{10} & \color{blue}{10} \\ &\text{5}&50 & \color{red}{10} & \color{red}{10}& \color{red}{10} & \color{red}{10} &\color{gray}{10} \\ \end{array}$$

By changing the distribution while keeping the sums of the red and blue cases the same, we can change the means:

$$\begin{array}{cc | cccccccc} &&&&&X \\ & &&\text{1} &\text{2}&\text{3}&\text{4}&\text{5}\\ && &50 & 40& 40 & 40 & 80\\ \hline &\text{1}&50& \color{gray}{10} & \color{blue}{0} & \color{blue}{10} & \color{blue}{10} & \color{blue}{20}\\ &\text{2}&60& \color{red}{20} & \color{gray}{10} & \color{blue}{0} & \color{blue}{10} & \color{blue}{20} \\ Y&\text{3}&60& \color{red}{10} & \color{red}{20} & \color{gray}{10} & \color{blue}{0} & \color{blue}{20}\\ &\text{4}&60& \color{red}{10} & \color{red}{10} & \color{red}{20} & \color{gray}{10} & \color{blue}{10} \\ &\text{5}&20 & \color{red}{0} & \color{red}{0}& \color{red}{0} & \color{red}{10} &\color{gray}{10} \\ \end{array}$$

In the table above we have $P(X>Y) = P(X<Y)$, but $E[X] = 3.24$ and $E[Y] = 2.76$
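The second table can be checked directly; here is a sketch (the matrix simply copies the joint counts above, with rows Y = 1..5 and columns X = 1..5):

joint <- rbind(c(10,  0, 10, 10, 20),
               c(20, 10,  0, 10, 20),
               c(10, 20, 10,  0, 20),
               c(10, 10, 20, 10, 10),
               c( 0,  0,  0, 10, 10))
vals <- 1:5
n <- sum(joint)
sum(colSums(joint) * vals) / n   # E[X] = 3.24
sum(rowSums(joint) * vals) / n   # E[Y] = 2.76
cmp <- outer(vals, vals, function(y, x) sign(x - y))   # +1 where X > Y, -1 where X < Y
c(P_XgtY = sum(joint[cmp == 1]) / n,
  P_XltY = sum(joint[cmp == -1]) / n)                  # both 0.4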

Related question 2

Another example is given below from an answer to the question Which statistical analysis should I perform if the data sets are not normally distributed?

Keep in mind that a Mann-Whitney U test is answering a different question than the question whether two populations have the same means or not.

This is demonstrated in the example below. For some funny shaped population distribution (chosen to make the outcome more extreme; with other types of distributions the effect will be less), we repeatedly draw two samples of size 50 (10^4 times in the code below) and compare them with a t-test and a Mann-Whitney test at the 5% level. Overall, each test rejects the null hypothesis about 5% of the time, but they do so at the same time in only 2% of the cases.

For this particular case it means that if you reject the null hypothesis whenever either the Mann-Whitney test or the t-test has a p-value below 0.05, you reject not in 5% of the cases but in 8% of the cases. (And that is the 'problem' of cherry-picking and peeking at multiple types of test, instead of deciding beforehand what sort of test is appropriate to use.)

[Figure: scatter plot of t-test p-values against Mann-Whitney p-values, titled "comparing two different tests", produced by the code below.]

ns <- 50      # samples of size 50
nt <- 10^4    # compare 10^4 tests

# pU and pT will contain the p-values of the tests
pU <- rep(0, nt)
pT <- rep(0, nt)

# simulate data and perform tests nt times
for (i in 1:nt) {

  # some funny distribution with three modes
  xy <- c(-1,0,0,0,1)[1+rbinom(ns,4,0.5)]
  y  <- rnorm(ns, xy, 0.1)
  xz <- c(-1,0,0,0,1)[1+rbinom(ns,4,0.5)]
  z  <- rnorm(ns, xz, 0.1)

  # perform tests
  pT[i] <- t.test(y,z)$p.value
  pU[i] <- wilcox.test(y,z)$p.value
}

# plot results of different p values
plot(pT, pU, xlim = c(0,0.3), ylim = c(0,0.3),
     xlab = "p value t-test", ylab = "p value Mann-Whitney test",
     main = "comparing two different tests",
     pch = 21, col = 8, bg = 8, cex = 0.5)

# plotting percentage of points in different regions
lines(c(0.05)*c(1,1), c(0,1), col = 2, lty = 2)
lines(c(0,1), c(0.05)*c(1,1), col = 2, lty = 2)
text(0.025, 0.025, paste0(100*sum((pT <= 0.05)*(pU <= 0.05))/nt, " %"), cex = 0.7, col = 2)
text(0.15,  0.025, paste0(100*sum((pT >  0.05)*(pU <= 0.05))/nt, " %"), cex = 0.7, col = 2)
text(0.025, 0.15,  paste0(100*sum((pT <= 0.05)*(pU >  0.05))/nt, " %"), cex = 0.7, col = 2)

# plotting the shape of the population distribution
# from which the samples were drawn
t <- seq(-2, 2, 0.01)
plot(t, 0.5^4*dnorm(t,-1,0.1) + 0.5^4*dnorm(t,1,0.1) + (1-0.5^3)*dnorm(t,0,0.1),
     type = 'l', xlab = "value", ylab = "density", main = "funny distribution")