As whuber points out in the comments, there is a range of possible answers to the question of how we might estimate the means in this case, but ultimately we would need to investigate the quality of any proposed estimators by looking at their statistical properties (e.g., bias, consistency, MSE). Consequently, the only real way to answer this question is to set out some proposed estimators for this case and suggest that their properties be examined. The general problem you will encounter is that an "adjustment" used to incorporate the known order of the distribution means is likely to bias the estimators, though it might also confer other beneficial properties. Ultimately, you would need to compare any proposed estimators against the standard (unconstrained) estimators to decide whether they are an improvement.
If we want to proceed in the greatest generality, without assuming any particular distributional form, it is worth noting that for IID data we usually estimate a distribution mean with the sample mean. So if we have IID samples from the two distributions (assumed also to be independent of each other), a natural starting point is to estimate the distribution means with the corresponding sample means. However, if the sample means fall in the "wrong" order relative to the known order of the distribution means, we might adjust our estimators by estimating both distribution means as some point between the two sample means.
Let's proceed on this basis and make a start on formulating proposed estimators and examining one of their basic properties. Suppose we have $n$ observations $X_1,...,X_n$ and $m$ observations $Y_1,...,Y_m$. Using some proportion $0 \leqslant p(n,m) \leqslant 1$ we might decide to form the estimators:
$$\hat{\mu}_X = \begin{cases}
\bar{X}_n & & & \text{if } \bar{X}_n < \bar{Y}_m, \\[6pt]
p(n,m) \bar{X}_n + (1-p(n,m)) \bar{Y}_m & & & \text{if } \bar{X}_n \geqslant \bar{Y}_m. \\[6pt]
\end{cases} \\[24pt]
\hat{\mu}_Y = \begin{cases}
\bar{Y}_m & & & \text{if } \bar{X}_n < \bar{Y}_m, \\[6pt]
p(n,m) \bar{X}_n + (1-p(n,m)) \bar{Y}_m & & & \text{if } \bar{X}_n \geqslant \bar{Y}_m. \\[6pt]
\end{cases}$$
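To make the proposal concrete, here is a minimal sketch of these estimators in Python (the function name and NumPy usage are my own; the answer only specifies the formulas, and any choice of the proportion $p$ could be substituted):

```python
import numpy as np

def constrained_means(x, y, p=0.5):
    """Order-constrained estimates of (mu_X, mu_Y), where mu_X <= mu_Y is known.

    If the sample means already respect the known order, return them
    unchanged; otherwise replace both by the same convex combination
    p * xbar + (1 - p) * ybar, which trivially respects the order.
    """
    xbar, ybar = np.mean(x), np.mean(y)
    if xbar < ybar:
        return xbar, ybar
    pooled = p * xbar + (1 - p) * ybar
    return pooled, pooled
```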
It can easily be shown that $\hat{\mu}_X \leqslant \hat{\mu}_Y$, so the estimators respect the known order of the distribution means (if $\bar{X}_n < \bar{Y}_m$ the estimators are just the sample means, which are in the correct order; otherwise both estimators equal the same convex combination). To find the expected values of the estimators, define the quantities:
$$\begin{align}
K(n,m) &\equiv \mathbb{P}(\bar{X}_n < \bar{Y}_m), \\[6pt]
L_X(n,m) &\equiv \mathbb{E}(\bar{X}_n|\bar{X}_n < \bar{Y}_m), \\[6pt]
L_Y(n,m) &\equiv \mathbb{E}(\bar{Y}_m|\bar{X}_n < \bar{Y}_m), \\[6pt]
U_X(n,m) &\equiv \mathbb{E}(\bar{X}_n|\bar{X}_n \geqslant \bar{Y}_m), \\[6pt]
U_Y(n,m) &\equiv \mathbb{E}(\bar{Y}_m|\bar{X}_n \geqslant \bar{Y}_m), \\[6pt]
\end{align}$$
and note the identities that follow from the law of total expectation (since $\mathbb{E}(\bar{X}_n) = \mu_X$ and $\mathbb{E}(\bar{Y}_m) = \mu_Y$):
$$\begin{align}
\mu_X &= L_X(n,m) \cdot K(n,m) + U_X(n,m) \cdot (1-K(n,m)), \\[6pt]
\mu_Y &= L_Y(n,m) \cdot K(n,m) + U_Y(n,m) \cdot (1-K(n,m)). \\[6pt]
\end{align}$$
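These quantities are generally not available in closed form, but they are easy to approximate by simulation once we assume a model. Below is a minimal Monte Carlo sketch, assuming (purely for illustration) unit-variance normal data so that $\bar{X}_n \sim \text{N}(\mu_X, 1/n)$ and $\bar{Y}_m \sim \text{N}(\mu_Y, 1/m)$; the function name and default parameter values are my own:

```python
import numpy as np

def order_quantities(n, m, mu_X=0.0, mu_Y=0.5, sims=500_000, seed=0):
    """Monte Carlo approximation of K, L_X, L_Y, U_X, U_Y under an assumed
    unit-variance normal model, so Xbar ~ N(mu_X, 1/n), Ybar ~ N(mu_Y, 1/m)."""
    rng = np.random.default_rng(seed)
    xbar = rng.normal(mu_X, 1 / np.sqrt(n), size=sims)
    ybar = rng.normal(mu_Y, 1 / np.sqrt(m), size=sims)
    lower = xbar < ybar                      # the event {Xbar < Ybar}
    return {'K':   lower.mean(),
            'L_X': xbar[lower].mean(),  'L_Y': ybar[lower].mean(),
            'U_X': xbar[~lower].mean(), 'U_Y': ybar[~lower].mean()}
```

As a sanity check, the output should satisfy the identities above; for example, `L_X * K + U_X * (1 - K)` should be approximately $\mu_X$.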
Using these quantities we have the expected values:
$$\begin{align}
\mathbb{E}(\hat{\mu}_X)
&= \mathbb{E}(\hat{\mu}_X|\bar{X}_n < \bar{Y}_m) K(n,m) + \mathbb{E}(\hat{\mu}_X|\bar{X}_n \geqslant \bar{Y}_m) (1-K(n,m)) \\[6pt]
&= L_X(n,m) K(n,m) + U_X(n,m) p(n,m) (1-K(n,m)) \\[6pt]
&\quad \quad + U_Y(n,m) (1-p(n,m)) (1-K(n,m)), \\[12pt]
\mathbb{E}(\hat{\mu}_Y)
&= \mathbb{E}(\hat{\mu}_Y|\bar{X}_n < \bar{Y}_m) K(n,m) + \mathbb{E}(\hat{\mu}_Y|\bar{X}_n \geqslant \bar{Y}_m) (1-K(n,m)) \\[6pt]
&= L_Y(n,m) K(n,m) + U_X(n,m) p(n,m) (1-K(n,m)) \\[6pt]
&\quad \quad + U_Y(n,m) (1-p(n,m)) (1-K(n,m)). \\[6pt]
\end{align}$$
We therefore have:
$$\begin{align}
\text{Bias}(\hat{\mu}_X)
&= [U_Y(n,m) - U_X(n,m)] (1-p(n,m)) (1-K(n,m)), \\[12pt]
\text{Bias}(\hat{\mu}_Y)
&= [U_X(n,m) - U_Y(n,m)] p(n,m) (1-K(n,m)). \\[6pt]
\end{align}$$
As you can see, these estimators will generally be biased, though they have the advantage of respecting the known ordering of the underlying distribution means. The amount of bias depends on probabilities and conditional moments of the underlying distributions (through the quantities $U_X$, $U_Y$ and $K$). Note that $U_X(n,m) \geqslant U_Y(n,m)$ by construction of the conditioning event, so $\hat{\mu}_X$ is biased downward and $\hat{\mu}_Y$ is biased upward. The bias also depends on the proportion $p(n,m)$ used in the estimator, which is under our control; in particular, if we take $p(n,m) = \tfrac{1}{2}$ then the two biases have the same magnitude (with opposite signs).
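These bias expressions are easy to check by simulating the estimators directly under an assumed model. Here is a minimal sketch, reusing the illustrative unit-variance normal setup from above (all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, sims = 10, 10, 0.5, 500_000
mu_X, mu_Y = 0.0, 0.5                        # true means, with mu_X <= mu_Y

# Simulate the sampling distributions of the two sample means directly.
xbar = rng.normal(mu_X, 1 / np.sqrt(n), size=sims)
ybar = rng.normal(mu_Y, 1 / np.sqrt(m), size=sims)

# Apply the constrained estimators defined earlier.
lower = xbar < ybar
pooled = p * xbar + (1 - p) * ybar
mu_X_hat = np.where(lower, xbar, pooled)
mu_Y_hat = np.where(lower, ybar, pooled)

# Simulated biases: equal in magnitude and opposite in sign at p = 1/2,
# with mu_X_hat biased downward and mu_Y_hat biased upward.
print(mu_X_hat.mean() - mu_X, mu_Y_hat.mean() - mu_Y)
```

With these parameter values both biases are small, and they shrink as $n$ and $m$ grow, since $K(n,m) \rightarrow 1$ when $\mu_X < \mu_Y$.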
With some more analysis you could investigate other properties of these estimators (e.g., consistency, or MSE relative to the unconstrained sample means), or formulate alternative estimators and examine their properties. We have only scratched the surface here, but hopefully this gives you an idea of how to proceed with this kind of analysis.