
I have two vectors, A and B, which I want to compare using the MWU test. Both vectors have size 995, with A having a mean and standard deviation of 10.50050 and 2.82287, respectively. The mean and std for B are 10.19397 and 2.87137. The histogram and KDE for the two vectors are shown below.

[histogram and KDE of vectors A and B]

All of this would tell me that the distributions of the two vectors are very similar; however, MWU (as implemented in Python SciPy) returns a p-value of 0.01764, which I deem relatively low.

Could someone please explain what key concept I'm missing?

Thank you.

hvta
  • You're missing the fact that such tests as these become more sensitive to small effects as the sample size increases. – Galen Mar 18 '23 at 22:27
  • I'm experimenting with SciPy's mannwhitneyu method. The default alternative hypothesis is 'two-sided', which returns the low p-value. Once it is set to 'less', the p-value is 0.99 (see the sketch after these comments). – hvta Mar 18 '23 at 22:28
  • None of this is mysterious. Your sample size is large, so a small difference can easily come out significant, but of course only if it is in the direction of the specified alternative. – Christian Hennig Mar 18 '23 at 23:04
  • Could someone please explain what key concept I'm missing? -- sample size. This is addressed in many dozens of questions on site (about a range of tests, but the reason is the same each time). Briefly, standard errors decrease with increasing sample size (almost all common test statistics have a standard error proportional to $\frac{1}{\sqrt{n}}$), so when $n$ is very large, even quite small differences are sufficient to be inconsistent with $H_0$. The end. – Glen_b Mar 18 '23 at 23:27
  • 1
    Your histograms have a Moiré pattern. Possibly it is similar to this question/answer. If you change the bin sizes then you might prevent it. – Sextus Empiricus Mar 18 '23 at 23:59
  • 1
    The histograms are very similar but even by eye their lower tails look different. It would be better to overlay the histograms and the densities in the same plot. Also, since the number of observations is the same -- are the observations perhaps paired? – dipetkov Mar 19 '23 at 00:40
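
To illustrate hvta's comment about the alternative parameter, here is a minimal sketch using simulated stand-ins for A and B that match the reported means and standard deviations (the raw vectors aren't shown). The exact p-values will differ from the question's, but the pattern across alternatives is the same.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
A = rng.normal(10.50050, 2.82287, 995)  # hypothetical stand-ins for the
B = rng.normal(10.19397, 2.87137, 995)  # real vectors A and B

for alt in ("two-sided", "less", "greater"):
    res = mannwhitneyu(A, B, alternative=alt)
    print(f"{alt:>9}: p = {res.pvalue:.5f}")
# 'less' tests whether A is stochastically smaller than B; since A's values
# tend to be larger, that one-sided p-value comes out near 1.
```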

2 Answers

  • With many measurements (like 995), even small differences may become significant: a small effect size can still produce a low p-value.

    Also, a t-test will show a significant difference: the means differ by about $0.3$ and the t-statistic is around $2.4$, giving a p-value of around $0.0166$ (a sketch reproducing this from the summary statistics follows this list).

    Are smaller p-values more convincing?

    Why is "statistically significant" not enough?

  • Btw, the Mann-Whitney U test is not the same as a t-test and can be significant, even when the means are the same (https://stats.stackexchange.com/a/470512/).

    The MWU test is testing whether $P(X>Y) = P(Y>X)$. This is the case when, in a PP-plot, the curve divides the unit square into two equal areas. So, in relation to an MWU test, a PP-plot might be a better way to visualize the difference than two histograms.
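
As a quick check of both points, here is a minimal sketch: it reproduces the t-test from the summary statistics reported in the question (via SciPy's ttest_ind_from_stats) and draws the PP-plot described above. The vectors are simulated stand-ins, since the raw data aren't available.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind_from_stats

# Reproduce the t-test directly from the reported summary statistics.
t, p = ttest_ind_from_stats(mean1=10.50050, std1=2.82287, nobs1=995,
                            mean2=10.19397, std2=2.87137, nobs2=995)
print(f"t = {t:.2f}, p = {p:.4f}")  # roughly t = 2.40, p = 0.0166

# PP-plot from hypothetical stand-in data matching those summary statistics.
rng = np.random.default_rng(0)
A = rng.normal(10.50050, 2.82287, 995)
B = rng.normal(10.19397, 2.87137, 995)

def ecdf(sample, points):
    """Empirical CDF of `sample` evaluated at `points`."""
    return np.searchsorted(np.sort(sample), points, side="right") / len(sample)

grid = np.sort(np.concatenate([A, B]))
plt.plot(ecdf(A, grid), ecdf(B, grid), label="PP curve")
plt.plot([0, 1], [0, 1], "k--", label="equal areas under $H_0$")
plt.xlabel("ECDF of A")
plt.ylabel("ECDF of B")
plt.legend()
plt.show()
```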


One thing I'd like to add to the answer by @SextusEmpiricus is that a useful statistic to report alongside a Wilcoxon-Mann-Whitney test is the probability that an observation in one group is greater than an observation in the other group.

There are several variants of this statistic. Two that report the probability directly are Vargha and Delaney's A and Grissom and Kim's probability of superiority.

These can be transformed to a -1 to 1 scale, like an r value. Examples are the Glass rank biserial coefficient and Cliff’s delta.

I don't know whether any of these are available in Python out of the box, but they are relatively easy to compute, as in the sketch below.
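
For instance, here is a minimal sketch that derives Vargha and Delaney's A and the rank-biserial correlation from SciPy's U statistic. The two vectors are simulated stand-ins matching the question's summary statistics, since the raw data aren't shown.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
A = rng.normal(10.50050, 2.82287, 995)  # hypothetical stand-ins for the
B = rng.normal(10.19397, 2.87137, 995)  # real vectors

U1, p = mannwhitneyu(A, B, alternative="two-sided")
n1, n2 = len(A), len(B)

# Vargha and Delaney's A: P(A > B) + 0.5 * P(A == B), i.e. the probability
# of superiority with ties split evenly.
vd_a = U1 / (n1 * n2)

# Rank-biserial correlation (equal to Cliff's delta when ties are split
# evenly), rescaling A from the [0, 1] scale to [-1, 1].
rank_biserial = 2 * vd_a - 1

print(f"p = {p:.5f}, A = {vd_a:.3f}, rank-biserial = {rank_biserial:.3f}")
```

With these parameters, A comes out around $0.53$, only slightly above the $0.5$ expected under the null, which is consistent with a small effect.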

My suspicion is that you will find these effect size statistics suggest only a small degree of stochastic dominance of one distribution over the other.

Sal Mangiafico