Why is the pseudomedian better than the median in a Wilcoxon test?

Question

The wilcox.test function in R calculates the pseudomedian and a confidence interval for it when conf.int=TRUE. Another question, "Wilcoxon signed rank test - help on interpretation of pseudo median", describes how the pseudomedian is calculated, but I don't understand why the confidence interval is built around the median of the pairwise means rather than around the ordinary median.

You could calculate a confidence interval for the median. [Of the paired differences ? ] But that isn't what the function does. In general, the Wilcoxon signed rank test isn't a test of the median, so it wouldn't quite match for the function to report a confidence interval for the median. — Sal Mangiafico, Apr 06 '23 at 19:07
You might clarify in your question what it is you are actually looking for. In a paired samples case, it's quite possible to look at the median of the differences and look at the confidence intervals for this statistic. This is related to implementations of the sign test in R. But this is essentially unrelated to the Wilcoxon signed rank test. — Sal Mangiafico, Apr 07 '23 at 16:36

score 1 · Accepted Answer · answered Apr 06 '23 at 22:44

I assume you are asking about the Wilcoxon signed-rank test.

Consider a one-sample location problem where $X_1,X_2,\ldots,X_n$ are i.i.d. with distribution function $F(x-\theta)$, under the assumptions that $F$ is continuous, that $F(\cdot)$ is symmetric about $0$, and that $\theta$ is the unique population median.

In this setup, the sample pseudomedian is the Hodges-Lehmann estimator of $\theta$. It is a consistent and median-unbiased estimator. Note that the population pseudomedian coincides with the population median because $F$ is symmetric.

The Wilcoxon signed-rank statistic for testing $H_0:\theta=0$ is

$$T_n=\sum_{i=1}^n I(X_i>0)R_i^+\,,$$

where $R_i^+$ is the rank of $|X_i|$ among $\{|X_1|,|X_2|,\ldots,|X_n|\}$.

This can be rewritten in the form

$$T_n=\sum_{1\le i\le j\le n}I\left(\frac{X_i+X_j}{2}>0\right)$$

In other words, $$T_n=\sum_{k=1}^{m}I(Z_k>0)\,,$$

where $Z_1=X_1,\,Z_2=\frac{X_1+X_2}{2},\,Z_3=\frac{X_1+X_3}{2},\,\ldots,Z_m=X_n$ with $m=n+\binom{n}{2}=\frac{n(n+1)}{2}$.
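The equivalence of the two forms can be checked numerically. Within a pair $(i,j)$ the sign of $X_i+X_j$ is the sign of the member with the larger absolute value, so each positive $X_i$ contributes exactly $R_i^+$ positive pairwise averages. The following is a minimal sketch in Python (standard library only), assuming no ties and no zeros among the observations; the function names are illustrative and not taken from any package.

```python
import random
from itertools import combinations_with_replacement


def signed_rank_statistic(xs):
    """T_n as the sum of the ranks of |X_i| over the positive X_i.

    Assumes no ties among the |X_i| and no zeros (continuous F).
    """
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i]))
    rank = {i: r for r, i in enumerate(order, start=1)}
    return sum(rank[i] for i, x in enumerate(xs) if x > 0)


def positive_walsh_count(xs):
    """T_n as the number of pairwise averages (X_i + X_j)/2, i <= j, that are positive."""
    return sum((a + b) / 2 > 0 for a, b in combinations_with_replacement(xs, 2))


random.seed(0)
sample = [random.gauss(0.3, 1.0) for _ in range(12)]  # arbitrary illustrative data
print(signed_rank_statistic(sample), positive_walsh_count(sample))
```

The two printed numbers should agree for any sample without ties or zeros.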

Suppose the alternative hypothesis is $H_1:\theta>0$, so that a right-tailed test based on $T_n$ is appropriate.

Now, under $H_0$, the distribution of $T_n=T_n(X_1,\ldots,X_n)$ is symmetric about $E_{H_0}(T_n)=\frac{n+\binom{n}{2}}{2}=\frac{n(n+1)}{4}$.
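For small $n$ the symmetry can be verified by brute force. The sketch below relies on the standard fact that, under $H_0$ with $F$ continuous and symmetric, the signs of the $X_i$ are independent fair coin flips that are independent of the ranks of the $|X_i|$, so $T_n$ is distributed as a sum in which each rank $1,\ldots,n$ is included with probability $1/2$. The function name and the choice $n=6$ are only for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def null_distribution(n):
    """Exact null distribution of T_n by enumerating all 2^n sign patterns.

    Each rank 1..n is added to the statistic when its sign is positive.
    """
    counts = Counter(
        sum(rank for rank, positive in zip(range(1, n + 1), signs) if positive)
        for signs in product((False, True), repeat=n)
    )
    total = 2 ** n
    return {t: Fraction(c, total) for t, c in sorted(counts.items())}


n = 6
dist = null_distribution(n)
mean = sum(t * p for t, p in dist.items())
max_t = n * (n + 1) // 2

print(mean == Fraction(n * (n + 1), 4))  # mean equals n(n+1)/4
print(all(dist.get(max_t - t) == p for t, p in dist.items()))  # P(T=t) = P(T=max_t-t)
```

Flipping every sign maps $T_n$ to $\frac{n(n+1)}{2}-T_n$, which is why the probabilities mirror each other around $\frac{n(n+1)}{4}$.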

So, when the true location is $\theta$, the distribution of $T_n(X_1-\theta,\ldots,X_n-\theta)$ is also symmetric about $\frac{n(n+1)}{4}$, because the shifted observations $X_i-\theta$ are symmetric about $0$.

While constructing the Hodges-Lehmann estimator of location, we estimate $\theta$ by $\hat\theta$ such that

$$T_n(X_1-\hat\theta,\ldots,X_n-\hat\theta)\approx \frac{n(n+1)}{4}$$

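Since $T_n(X_1-t,\ldots,X_n-t)$ counts the pairwise averages $\frac{X_i+X_j}{2}$ that exceed $t$, and $\frac{n(n+1)}{4}$ is half of their total number $m$, the balancing value of $t$ is the one that leaves half of these averages above it. It can be shown that this leads to the sample pseudomedian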

$$\hat\theta=\operatorname*{med}_{1\le i\le j\le n}\left\{\frac{X_i+X_j}{2}\right\}$$
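A small Python sketch (standard library only) of this estimator, together with a check that the shifted statistic lands at about $\frac{n(n+1)}{4}$. The function names and the simulated data are hypothetical; this is not how wilcox.test is implemented, only an illustration of the definition above.

```python
import random
from itertools import combinations_with_replacement
from statistics import median


def walsh_averages(xs):
    """All pairwise averages (X_i + X_j)/2 with i <= j, including each X_i itself."""
    return [(a + b) / 2 for a, b in combinations_with_replacement(xs, 2)]


def hodges_lehmann(xs):
    """Sample pseudomedian: the median of the pairwise averages."""
    return median(walsh_averages(xs))


def shifted_statistic(xs, theta):
    """T_n(X_1 - theta, ..., X_n - theta): number of pairwise averages above theta."""
    return sum(w > theta for w in walsh_averages(xs))


random.seed(1)
sample = [random.gauss(2.0, 1.0) for _ in range(15)]  # simulated, true location 2.0
n = len(sample)
theta_hat = hodges_lehmann(sample)

print(theta_hat, median(sample))
print(shifted_statistic(sample, theta_hat), n * (n + 1) / 4)
```

As a hand-checkable case, for the sample 1, 2, 4 the pairwise averages are 1, 1.5, 2, 2.5, 3, 4, so the pseudomedian is 2.25 while the ordinary sample median is 2.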
