
I have two random variables: $n_1$ and $n$.

$$n_1 \sim \textrm{Bin}(m, p_1g)\\ n\sim \textrm{Bin} (m, (p_1g+p_0(1-g)))$$

What would be the distribution (or at least the expectation and variance) of $n_1/n$?

$p_0,p_1,g \in (0,1).$

This is based on a small model that I am working on, but this stats part is where I am stuck.


Model: There's a population of size $N$. A fraction $g$ is of type $1$ and the rest are of type $0$.

$m$ individuals are approached at random; a type $1$ individual is available with probability $p_1$ and a type $0$ individual with probability $p_0$.

It is observed that a total of $n$ are available, of which $n_1$ are of type $1$.

I am looking to estimate $g$ from this data, but first I wish to prove that $n_1/n$ is a biased estimate of $g$. Further, my hunch is that it may not even be consistent. And even if it is, the variance would decrease more slowly than it would have had $p_0=p_1$.

Hopefully my derivations of the distributions of $n$ and $n_1$ are correct. They are clearly not independent.

I have also run simulations and the intuition about variance seems to be correct. I am still lost about how to get a good estimate of g.
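For readers who want to reproduce this kind of check, here is a minimal simulation sketch of the model as described above. The function name `simulate_ratio`, the parameter values, and the choice to drop samples with $n=0$ (where the ratio is undefined) are illustrative assumptions, not the code actually used for the simulations mentioned.

```python
import numpy as np

def simulate_ratio(m, g, p1, p0, reps=5000, seed=0):
    """Simulate n_1/n for `reps` independent samples of m approached individuals.

    Hypothetical helper: G marks type-1 individuals, R marks available ones.
    """
    rng = np.random.default_rng(seed)
    # G[i, j] is True if individual j in replication i is of type 1
    G = rng.random((reps, m)) < g
    # R[i, j] is True if that individual is available; the probability depends on type
    R = rng.random((reps, m)) < np.where(G, p1, p0)
    n = R.sum(axis=1)          # total available
    n1 = (R & G).sum(axis=1)   # available and of type 1
    ok = n > 0                 # assumption: drop samples where n_1/n is undefined
    return n1[ok] / n[ok]

g = 0.3
ratios = simulate_ratio(m=200, g=g, p1=0.8, p0=0.4)
ratios_equal = simulate_ratio(m=200, g=g, p1=0.5, p0=0.5)

print("p1 != p0: mean", ratios.mean(), "variance", ratios.var())
print("p1 == p0: mean", ratios_equal.mean(), "variance", ratios_equal.var())
```

With these illustrative values, the first mean should sit well away from $g$, while the second should be close to it.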

Dayne
  • Do you need an analytical solution (doubt) or will an approximation do? – user2974951 Nov 22 '22 at 07:38
  • Preference is for analytical but I'd appreciate approximation as well...also be good bounds if can be provided @user2974951 – Dayne Nov 22 '22 at 07:58
  • I am thinking of doing simulations but still some mathematical result be great – Dayne Nov 22 '22 at 07:59
  • Where is the dependence between $N$ and $N_1$? There is no direct analytical formula but it is always feasible to compute $$\mathbb P(N_1/N=a/b)$$ when $a=0,\ldots,m$ and $b=1,\ldots,m$. The rv is undefined when $N=0$. – Xi'an Nov 22 '22 at 09:33
  • Yeah so I am not very sure about the dependence thing. Let me describe the full model as soon as I get time. Basic idea is that data for both come from same sample of size m and their parameters are related to g, p_1, p_0 @Xi'an – Dayne Nov 22 '22 at 10:39
  • @Xi'an: I have added the details of the model. I have also voted to reopen. Please do comment if you think the question is still incomplete. – Dayne Nov 26 '22 at 07:02
  • The model as described is incomplete, as it does not spell out the dependence between $N_1$ and $N$. I would suggest introducing a latent variable $M_1\sim\text{Bin}(m,g)$ and, conditional on $M_1$, $N_1\sim\text{Bin}(M_1,p_1)$ and $N_0=N-N_1\sim\text{Bin}(m-M_1,p_0)$. And there is always the issue of $N_1/N$ being undefined when $N=N_1=0$. – Xi'an Nov 26 '22 at 08:20
  • @Xi'an: I thought this part was obvious from the description. There is one sample and n and n1 are both from this sample. Let me be more explicit: let j be the index of the individual approached. Rj=1 if he's available. Gj=1 if he's of type 1. $n = \sum_1^m R_j$. And $n_1=\sum_1^m R_jG_j$ – Dayne Nov 26 '22 at 09:17
  • Rj and Gj are both Bernoulli – Dayne Nov 26 '22 at 09:18
  • Some more description: $Pr(R_j=1 | G_j=1) = p_1$, $Pr(R_j=1 | G_j=0) = p_0$. @Xi'an – Dayne Nov 26 '22 at 09:21
  • I think it would be better if you dropped all the maths, and just described your actual physical problem. People are wasting time addressing a likely incorrect mathematical description – seanv507 Nov 26 '22 at 10:49
  • Can you confirm or infirm that the model I gave in my last comment applies? – Xi'an Nov 26 '22 at 10:49
  • @SextusEmpiricus: I agree that my first post was wrong. But I thought the part I added later answers the queries raised. The multinomial description sounds interesting. But the probabilities of the events in multinomial are not independent. Is the model described in terms of Rj and Gj okay? n_1, n are realizations of binomial distributed random variables. If you think these are still insufficient will add more. Thanks for your comments – Dayne Nov 26 '22 at 10:50
  • @Xi'an I guess not. Is there a problem with the Rj, Gj description? If so I'd better delete this and post again maybe. – Dayne Nov 26 '22 at 10:52
  • Yes, repost a new question with a fully specified model, including the dependence between $N$ and $N_1$. (In my understanding, $M_1=\sum_i G_i$, so I still do not get why this is not the model of interest.) – Xi'an Nov 26 '22 at 10:55
  • @SextusEmpiricus: So I am still not very sure about the multinomial thing. Basically we have 4 possibilities: an individual is type 1 but is not available, type 0 unavailable and same with available. I modelled this with two separate random variables. These two are not really independent variables as their conditional probabilities are different exogenously given (so not the product of probabilities of the events separately). In any case I have made a mess here so would appropriately edit the question. – Dayne Nov 26 '22 at 17:34
  • @SextusEmpiricus Meanwhile please share the link to your answer you are referring to – Dayne Nov 26 '22 at 17:34

1 Answer


Your initial problem description does not entirely match the description in the second part. But based on that second description I arrive at the following:

You have a pool of $N$ people, of which a fraction $g$ is of type 1 and a fraction $1-g$ is of type 0. From this pool you sample $m$ people, but you only register the people that are 'available'. The different types have different probabilities of being available (and thus of actually being observed).


I translate this into the following problem.

Assuming that the sample is much smaller than the population size, $m \ll N$, sampling from the population is similar to sampling with replacement, and we can regard each sampled person as approximately following a categorical distribution.

You sample the following cases:

  • $n_1$, type 1 and available, with probability $gp_1$

  • $n_0$, type 0 and available, with probability $(1-g)p_0$

  • $n_u$, unavailable, with probability $1-p_0-gp_1+gp_0$

You cannot have a combination of these: each sampled person falls into exactly one of the three categories. So the counts follow a multinomial distribution with the respective probabilities, with

$$\begin{array}{rcl} \mu_{n_1} &=& mgp_1\\ \mu_{n_0} &=& m(1-g)p_0\\ \operatorname{Var}({n_1})&=& mgp_1(1-gp_1)\\ \operatorname{Var}({n_0}) &=& m(1-g)p_0(1-(1-g)p_0)\\ \operatorname{Cov}(n_1,n_0) &=& -mgp_1(1-g)p_0 \end{array}$$
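As a quick sanity check, the multinomial moments can be compared with simulated counts. This is only a sketch: the parameter values and the variable names (`pi1`, `pi0`, `pi_u`, `theory`, `empirical`) are arbitrary choices made for the illustration.

```python
import numpy as np

# Illustrative parameter values (assumptions, not taken from the question)
m, g, p1, p0 = 100, 0.3, 0.8, 0.4

pi1 = g * p1             # type 1 and available
pi0 = (1 - g) * p0       # type 0 and available
pi_u = 1 - pi1 - pi0     # unavailable

rng = np.random.default_rng(1)
draws = rng.multinomial(m, [pi1, pi0, pi_u], size=200_000)
n1, n0 = draws[:, 0], draws[:, 1]

# Moments of a multinomial with cell probabilities pi1, pi0, pi_u
theory = {
    "mean_n1": m * pi1,
    "mean_n0": m * pi0,
    "var_n1": m * pi1 * (1 - pi1),
    "var_n0": m * pi0 * (1 - pi0),
    "cov_n1_n0": -m * pi1 * pi0,
}

cov = np.cov(n1, n0)  # 2x2 sample covariance matrix
empirical = {
    "mean_n1": n1.mean(),
    "mean_n0": n0.mean(),
    "var_n1": cov[0, 0],
    "var_n0": cov[1, 1],
    "cov_n1_n0": cov[0, 1],
}

for key in theory:
    print(f"{key}: theory={theory[key]:.3f}, simulated={empirical[key]:.3f}")
```

The simulated values should agree with the theoretical ones up to Monte Carlo error.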


Distribution of $\frac{n_1}{n_1+n_0}$

When $m$ is large, you can approximate the joint distribution of $n_1$ and $n_1+n_0$ by a multivariate normal distribution, and use the Delta method to approximate the distribution of the ratio.

A description of this is given in this question: A/B testing ratio of sums.

For the location of the Normal distribution that approximates the distribution of the ratio we use $$\frac{\mu_{n_1} }{\mu_{n_1} + \mu_{n_0}} = \frac{mgp_1}{mgp_1+m(1-g)p_0} = g \frac{1}{g + (1-g)\frac{p_0}{p_1}}.$$

For the variance the computation is a bit longer but you can already see that this distribution will be biased.
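For completeness, here is a sketch of that variance computation under the multinomial model above. The shorthand $\pi_1 = gp_1$, $\pi_0 = (1-g)p_0$ and $r = \pi_1/(\pi_1+\pi_0)$ is introduced only for this derivation. With $f(n_1,n_0) = n_1/(n_1+n_0)$, the first-order Delta method gives

$$\begin{array}{rcl} \operatorname{Var}\left(\frac{n_1}{n_1+n_0}\right) &\approx& \frac{\mu_{n_0}^2\operatorname{Var}(n_1) + \mu_{n_1}^2\operatorname{Var}(n_0) - 2\mu_{n_1}\mu_{n_0}\operatorname{Cov}(n_1,n_0)}{(\mu_{n_1}+\mu_{n_0})^4}\\ &=& \frac{m^3\pi_1\pi_0\left[\pi_0(1-\pi_1) + \pi_1(1-\pi_0) + 2\pi_1\pi_0\right]}{m^4(\pi_1+\pi_0)^4}\\ &=& \frac{\pi_1\pi_0}{m(\pi_1+\pi_0)^3} = \frac{r(1-r)}{m(\pi_1+\pi_0)} \end{array}$$

So, to first order, the variance still shrinks at rate $1/m$, but the ratio concentrates around $r$, which equals $g$ only when $p_0=p_1$.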

User1865345
  • I hope you have no problem with the edits done. – User1865345 Nov 27 '22 at 10:39
  • Thanks. So multinomial also looks like a great option but I still have this lingering doubt that the unavailable comprises of two categories although we cannot differentiate between these two. In any case multinomial approach seems to make life easier. I will check the link to your other answer for details about the variance. – Dayne Nov 27 '22 at 10:40
  • Btw, in the final formula there should be a other than the one in denominator. – Dayne Nov 27 '22 at 10:41