I'm testing two groups for equivalence in exercise self-efficacy using the TOSTER package in R. I ran a Mann-Whitney U test (wilcox_TOST). The lower limit p value is significant but not the NHST and upper limit p values. The confidence interval partially falls in the predefined equivalence limit (the lower side of it falls in the equivalence limit). The confidence interval also includes zero. How should I interpret this? I know this means that the equivalence tests did not reject both null hypotheses; we cannot reject effects more extreme than the equivalence limits but how should we interpret the NHST p value?
1 Answer
You are doing equivalence testing via two one-sided tests (that's where the TOST comes from). You haven't explained anything about your data so I'm going to use the TOSTER::wilcox_TOST example, which as luck would have it demonstrates this exact scenario.
The problem with hypothesis tests is that they are always conducted under the assumption that $H_0$ is true, and a low probability of observing a certain test statistic under this assumption is seen as evidence against it. When testing equivalence, however, you're trying to prove exactly what is usually the null, $H_0:\theta_1-\theta_2=0$, and a 'regular' hypothesis test doesn't allow you to infer that.
This is why such equivalence tests are actually conducted as two tests against some equivalence boundary $\Delta$ (because you can still only ever achieve some probabilistic limit and not exactly $0$). Their hypotheses are:
$$ H_{0-}:\theta_1-\theta_2\le{-}\Delta\ \mathrm{vs.}\ H_{A-}:\theta_1-\theta_2>{-}\Delta $$
$$ H_{0+}:\theta_1-\theta_2\ge\Delta\ \mathrm{vs.}\ H_{A+}:\theta_1-\theta_2<\Delta $$
Rejection of both of the above nulls allows you to conclude that ${-}\Delta<\theta_1-\theta_2<\Delta$, establishing equivalence. If each test is conducted at level $\alpha$, this is equivalent to the $1-2\alpha$ confidence interval of $\theta_1-\theta_2$ falling entirely within $]{-}\Delta,\ \Delta[$. See also here for a bit more detail on, for example, why you don't need to correct this for multiplicity.
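As a minimal sketch of that confidence-interval rule: the helper `ci_within_bounds` and the two intervals below are made up for illustration and do not come from TOSTER or from any real data.

```r
# Hypothetical helper: TRUE if the CI lies entirely inside ]-delta, delta[
ci_within_bounds <- function(ci, delta) {
  ci[1] > -delta && ci[2] < delta
}

alpha <- 0.05
conf_level <- 1 - 2 * alpha   # TOST at alpha = 0.05 pairs with a 90% CI

# Made-up 90% confidence intervals, assuming delta = 3
ci_inside  <- c(-1.2, 2.1)    # fully inside the bounds
ci_partial <- c(-9.5, -2.0)   # crosses the lower bound

ci_within_bounds(ci_inside, delta = 3)    # equivalence can be claimed
ci_within_bounds(ci_partial, delta = 3)   # equivalence cannot be claimed
```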
With that background established, let's dive into a data example from TOSTER that uses $\Delta=3$:
TOSTER::wilcox_TOST(mpg ~ am, data = mtcars, eqb = 3)
> Test Statistic p.value
> NHST 42.0 0.002
> TOST Lower 73.0 0.975
> TOST Upper 18.5 < 0.001
These results resemble your scenario with the lower/upper test results swapped, although here the NHST is also significant, which it is not in yours. But what does this mean?
- The default null hypothesis that $\theta_1=\theta_2$ was rejected at $\alpha=0.05$. This does not matter, because they may very well not be exactly the same but still be equivalent up to that margin $\Delta=3$. Honestly, I don't know why TOSTER gives you this result.
- We did not reject $H_{0-}$ above, so there is not much evidence to support that $\theta_1-\theta_2>{-}3$.
- $H_{0+}$ was rejected, so there is support for $\theta_1-\theta_2<3$.
This being a Wilcoxon test there isn't really a single summary statistic of the 'stochastic exceedance' that it tests (unlike the t-test and the mean for example), but we can use the median as a first approximation:
with(mtcars, tapply(mpg, am, median))
> 0 1
> 17.3 22.8
A ballpark estimate of $\theta_1-\theta_2$ is therefore on the order of $-5.5$, which matches what our hypothesis tests were telling us above.
Edit: There is a better summary statistic and confidence interval for these tests in the pseudomedian (one sample) or Hodges-Lehmann shift estimator (difference of two samples). This is calculated by wilcox.test (with conf.int = TRUE), though it will not be exact if there are ties or zeros in your data. The exact HL shift in this case is $-6.8$.
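A rough sketch of how both versions can be obtained in base R, assuming the same mtcars example and $\Delta=3$. The Hodges-Lehmann shift is taken here as the median of all pairwise differences between the two groups; the variable names are only illustrative.

```r
x <- mtcars$mpg[mtcars$am == 0]   # automatic transmission
y <- mtcars$mpg[mtcars$am == 1]   # manual transmission

# Hodges-Lehmann shift: median of all pairwise differences x_i - y_j
hl_exact <- median(outer(x, y, "-"))

# wilcox.test reports its own estimate and CI; the data contain ties, so it
# uses a normal approximation and warns, which is suppressed here
fit <- suppressWarnings(
  wilcox.test(mpg ~ am, data = mtcars, conf.int = TRUE, conf.level = 0.90)
)
fit$estimate   # approximate difference in location
fit$conf.int   # 90% CI, the level that pairs with TOST at alpha = 0.05

# Equivalence at delta = 3 needs the whole CI inside ]-3, 3[
delta <- 3
within <- fit$conf.int[1] > -delta && fit$conf.int[2] < delta
within
```

The 90% level is used because it corresponds to two one-sided tests at $\alpha=0.05$ each.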
Finally, you can very easily run these tests by calling the base R wilcox.test (or t.test, or..) directly:
## 'NHST'
wilcox.test(mpg ~ am, data=mtcars)
> W = 42, p-value = 0.001871
## Lower
wilcox.test(mpg ~ am, data=mtcars, mu=-3, alternative="greater")
> W = 73, p-value = 0.9749
## Upper
wilcox.test(mpg ~ am, data=mtcars, mu=3, alternative="less")
> W = 18.5, p-value = 3.013e-05
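The two one-sided calls above can be wrapped into a small function. This is only a sketch: `tost_wilcox` is a hypothetical helper, not something provided by TOSTER or base R, and it declares equivalence when the larger of the two one-sided p-values is below $\alpha$.

```r
# Hypothetical helper: TOST built from two one-sided wilcox.test calls
tost_wilcox <- function(formula, data, eqb, alpha = 0.05) {
  # H0-: difference <= -eqb, tested against 'greater'
  p_lower <- suppressWarnings(
    wilcox.test(formula, data = data, mu = -eqb, alternative = "greater")$p.value
  )
  # H0+: difference >= eqb, tested against 'less'
  p_upper <- suppressWarnings(
    wilcox.test(formula, data = data, mu = eqb, alternative = "less")$p.value
  )
  # Both nulls must be rejected, so the larger p-value decides
  p_tost <- max(p_lower, p_upper)
  list(p_lower = p_lower, p_upper = p_upper,
       p_tost = p_tost, equivalent = p_tost < alpha)
}

res <- tost_wilcox(mpg ~ am, data = mtcars, eqb = 3)
res$p_lower      # about 0.975, as in the 'Lower' call above
res$p_upper      # about 3e-05, as in the 'Upper' call above
res$equivalent   # FALSE: only one of the two nulls is rejected
```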
Addendum
I realise that I didn't answer the question 'but what if we don't reject $H_0$?'. This does not establish equivalence because of the reasons I mentioned in my second paragraph. It may very well mean that you haven't estimated $\theta_1-\theta_2$ with sufficient precision to say that it isn't zero, but that doesn't mean that it is.
Continuing my example, we were interested in demonstrating that manual & automatic transmission cars were within 3 miles per gallon of one another. If I tell you, yeah, the difference is somewhere between -40 and +30 mpg (including zero! P=0.87 under the null), is that good evidence that they are indeed equal? No. If it were, you could simply underpower your study and always end up with that result.
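An extreme, made-up illustration of that last point: with only three observations per group, the exact two-sided Wilcoxon test cannot produce a p-value below 0.1, however far apart the groups are. The numbers below are invented purely for this sketch.

```r
# Two tiny, completely separated groups (made-up data)
x <- c(1, 2, 3)
y <- c(101, 102, 103)

# No ties and small samples, so wilcox.test computes an exact p-value.
# Only choose(6, 3) = 20 rank arrangements exist, so the smallest possible
# two-sided p-value is 2/20 = 0.1
p_nhst <- wilcox.test(x, y)$p.value
p_nhst

# 'Not significant' at alpha = 0.05, yet the groups differ by about 100 units
p_nhst < 0.05
```

Failing to reject $H_0$ here reflects the lack of data, not similarity of the groups.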
Thanks for your elaborate response! it's getting clearer now. My only issue is that both my NHST p value and upper limit p value are not significant but the lower limit p value is. I mean how can this happen? shouldn't the NHST p value be significant when at least one of the lower/upper limit p value is also significant? – Lily Dec 15 '23 at 18:52
@Lily No, returning to the CI definition and referring to this image, the example was B (CI excludes 0 but covers one equivalence bound) whereas yours is D (CI does not exclude 0 and covers one equivalence bound). Sorry, I wrote it up & only then realized that the null hypothesis test differed from yours. Notice how the null says nothing about equivalence either (A vs. C). – PBulls Dec 16 '23 at 06:29
There is one scenario not shown in that image where none of the hypotheses would be rejected, which is a CI that covers both bounds plus zero. This would be my addendum of a way underpowered estimate, from which you can conclude just about nothing. I also edited in a better point estimate and CI for the Wilcoxon test as the pseudomedian and Hodges-Lehmann location shift (check out that blog link for how to calculate the 'exact' CI). – PBulls Dec 16 '23 at 06:30