
I'm not super experienced in statistics, so sorry if some of the terminology is off.

I'm trying to find the mean of some distribution, call it $P$. The problem is, the samples aren't directly visible. For each sample $x_i \sim P$, I only know whether $x_i > y_i$, where $y_i$ is another random variable drawn from a different distribution $P'$. All variables are independent of one another, if that makes a difference. To be explicit, each $y_i$ is completely known and redrawn for each $x_i$. Thanks!

1 Answer


Here are two cases with clear answers.

  1. The $x$'s are known to be distributed normally, and there are enough observations with $y$'s near $q_1$ and $q_2$ that we can estimate $P[x < q_1] = p_1$ and $P[x < q_2] = p_2$. This might happen if there are only two possible values of $y$.

To analyze this, let $Q$ be $\Phi^{-1}$, the standard normal quantile function. Then we have $$\mu + Q(p_1)\sigma = q_1$$ $$\mu + Q(p_2)\sigma = q_2$$ We get the mean of the normal distribution by solving these: $$\mu = \frac{q_1 Q(p_2) - q_2 Q(p_1)}{Q(p_2) - Q(p_1)}$$
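Here is a minimal sketch of that closed form in Python (not part of the original answer), assuming SciPy is available; the thresholds $q_1, q_2$ and probabilities $p_1, p_2$ below are made-up values purely for illustration.

```python
# Sketch: recover the normal mean from two threshold probabilities.
# The numbers here are hypothetical, just to exercise the formula.
from scipy.stats import norm

q1, q2 = 0.0, 1.0      # the two possible threshold values (assumed)
p1, p2 = 0.31, 0.69    # estimated P[x < q1] and P[x < q2] (assumed)

z1, z2 = norm.ppf(p1), norm.ppf(p2)   # Q(p1), Q(p2): standard normal quantiles

mu = (q1 * z2 - q2 * z1) / (z2 - z1)  # mean, from solving the two equations
sigma = (q2 - q1) / (z2 - z1)         # the scale follows from the same system
print(mu, sigma)                      # here: 0.5 and about 1.01
```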

  2. The $x$'s are known to be distributed exponentially. Then the mean estimated by MLE can be approximated as a nice linear function of the $y$'s.

Let the $y_i$'s with $x_i<y_i$ be $a_1, \ldots, a_m$. Let the $y_j$'s with $x_j>y_j$ be $b_1, \ldots, b_n$.

Let the distribution of the $x$'s have mean $1/\lambda$. Then the probability of the observed outcome is $$\left(\prod_i \left(1-e^{-\lambda a_i}\right)\right) \left(\prod_j e^{-\lambda b_j}\right)$$ We can maximize this by maximizing its log: $$\sum_i \ln\left(1-e^{-\lambda a_i}\right) - \lambda \sum_j b_j$$ The maximum occurs where the derivative with respect to $\lambda$ is 0, i.e. where $$\sum_i \frac{a_ie^{-\lambda a_i}}{1-e^{-\lambda a_i}}= \sum_j b_j$$ This can be solved numerically. Alternatively, for small $\lambda a_i$, each term on the left equals $\frac{a_i}{e^{\lambda a_i}-1} \approx \frac{1}{\lambda} - \frac{a_i}{2}$ by Taylor expansion, so the left-hand side is approximately $m/\lambda - \sum_i a_i/2$. Setting that equal to $\sum_j b_j = n\bar{b}$ gives the maximum likelihood estimate of the mean as approximately $$\frac{1}{\lambda}\approx \frac{1}{2}\bar{a} + \frac{n}{m}\bar{b}$$
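For this case, here is a small Python sketch (again not from the original answer) that solves the score equation numerically with `scipy.optimize.brentq` and compares it with the linear approximation; the threshold distribution, sample size, and true rate are assumptions chosen for illustration.

```python
# Sketch: exact MLE vs. the linear approximation for exponential x's.
# The data-generating choices below are arbitrary assumptions.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
lam_true = 0.5
x = rng.exponential(1 / lam_true, size=2000)  # latent samples (never observed)
y = rng.uniform(0.5, 1.5, size=2000)          # thresholds, fully known

a = y[x < y]                  # thresholds where x_i < y_i
b = y[x > y]                  # thresholds where x_j > y_j
m, n = len(a), len(b)

def score(lam):
    # derivative of the log-likelihood with respect to lambda
    return np.sum(a * np.exp(-lam * a) / (1 - np.exp(-lam * a))) - np.sum(b)

lam_hat = brentq(score, 1e-8, 1e3)               # exact MLE for lambda
mean_exact = 1 / lam_hat
mean_approx = a.mean() / 2 + (n / m) * b.mean()  # the linear approximation
print(mean_exact, mean_approx, 1 / lam_true)
```

With $\lambda a_i$ around $0.25$ to $0.75$, as here, the two estimates should land close to each other and to the true mean of $2$.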

I like this because the final result is both simpler and less obvious than might be expected. For instance, it means that if $x_i<y_i$ and $x_i>y_i$ occur about equally often (so $m \approx n$), then the observations with $x_i>y_i$ carry about twice the weight of the others in estimating the mean.

Matt F.