
Let's say all people on earth can be divided into five categories depending on their favorite fruit or vegetable (so we have a discrete probability distribution), and we know the true distribution.

For example, 2% like apples the most, 30% like bananas the most, 40% like cherries, 20% like dates and 8% like eggplants.

I have surveyed $N$ people and found that $x_a$ of them like apples, $x_b$ like bananas, and so on for $x_c$, $x_d$ and $x_e$.

How do I compute the probability of obtaining such a set of measurements, given that I know the global distribution?

[This is not a textbook question, I just rephrased my real world problem as fruit and veggies, to save on domain-specific terminology.]

Marcin
  • 113

1 Answer


If $N \ll M,$ where $M$ is the total number of people in the world, then you can simply use the Multinomial Distribution as a perfectly valid approximation. (The assumption here is that the category probabilities don't change as you draw samples. This isn't exact: each time you draw a person, there is one fewer member of that group in the remaining population, so the probabilities change slightly.)
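As a minimal sketch of the multinomial calculation, the snippet below evaluates $\frac{N!}{\prod_j x_j!} \prod_j q_j^{x_j}$ in log space using only the Python standard library. The shares are the ones from the question; the survey counts and the function name `multinomial_pmf` are made up for illustration.

```python
from math import exp, lgamma, log

def multinomial_pmf(counts, probs):
    """Probability of seeing `counts` in sum(counts) independent draws
    from fixed category probabilities `probs`."""
    n = sum(counts)
    # log of the multinomial coefficient N! / (x_1! ... x_K!)
    log_p = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    for x, q in zip(counts, probs):
        if x > 0:
            if q == 0.0:
                return 0.0  # observed a category that has zero probability
            log_p += x * log(q)
    return exp(log_p)

# Shares from the question: apples, bananas, cherries, dates, eggplants.
probs = [0.02, 0.30, 0.40, 0.20, 0.08]
# Hypothetical survey of N = 50 people (illustrative counts, not real data).
counts = [1, 15, 20, 10, 4]

p = multinomial_pmf(counts, probs)
print(p)
```

Working with `lgamma` rather than explicit factorials keeps the intermediate numbers small, which matters once $N$ reaches the hundreds or thousands.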

On the other hand, if $N$ and $M$ are of comparable size (which, I know, is extremely unrealistic if we're talking about surveys, but let's pretend it could be for the sake of completeness) then you simply want to solve a combinatorics problem. Suppose there are $K$ groups, and $x_j$ is the number of members sampled from group $j$ where $j = 1, \dots, K.$ Also, let $M_j$ be the number of people in the world in group $j$, so that $\sum_{j=1}^K M_j = M.$ Then, if our observation is ${\bf x} = \{x_1, x_2, \dots, x_K\},$ with $\sum_j x_j = N,$ the number of possible ways to have gotten such an observation, assuming each individual in the world has an equal chance of being selected, is

$$ n({\bf x}) = \prod_{j=1}^K {M_j \choose x_j}. $$

The number of possible choices of any subgroup of $N$ people is simply $$ {M \choose N}. $$

Thus the exact probability of observing the outcomes given by $\bf x$ is

$$ P({\bf x}) = \frac{\prod_{j=1}^K {M_j \choose x_j} } {M \choose N}. $$ This is the multivariate hypergeometric distribution. (Thanks @Glen_b for suggesting I include the name.)
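The formula for $P({\bf x})$ can be evaluated exactly with integer binomial coefficients when the population is small. The sketch below does that with `math.comb`; the population of 100 people, the sample, and the function name `mv_hypergeom_pmf` are hypothetical and only serve as an example.

```python
from math import comb

def mv_hypergeom_pmf(counts, group_sizes):
    """Exact probability of drawing `counts` members from groups of size
    `group_sizes` when sampling without replacement."""
    n = sum(counts)        # sample size N
    m = sum(group_sizes)   # population size M
    ways = 1
    for x, size in zip(counts, group_sizes):
        ways *= comb(size, x)  # comb(size, x) is 0 when x > size
    return ways / comb(m, n)

# Hypothetical tiny world of M = 100 people with the question's shares.
group_sizes = [2, 30, 40, 20, 8]
# Hypothetical sample of N = 10 people.
counts = [0, 3, 4, 2, 1]

print(mv_hypergeom_pmf(counts, group_sizes))
```

Because Python integers are arbitrary precision, this stays exact until the final division, but it becomes slow for very large $M$.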

Computing these factorials directly is usually impractical for large values of $M$ and $N,$ so approximate factorial calculations (such as Stirling's approximation) should be used. However, suppose we take the original assumption that $N \ll M.$ Let's further assume that this is true for each subgroup of the population, i.e. $x_j \ll M_j$ for all $j.$ Then,

$$ {M_j \choose x_j} = \frac{M_j!}{x_j! (M_j - x_j)!} \approx \frac{M_j^{x_j}}{x_j!}, $$ and similarly, $$ {M \choose N} \approx \frac{M^N}{N!}. $$
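Spelling out the step behind this approximation (the first-order error term is an addition for context, not part of the argument above):

$$ \frac{M_j!}{(M_j - x_j)!} = \prod_{i=0}^{x_j - 1} (M_j - i) = M_j^{x_j} \prod_{i=0}^{x_j - 1} \left(1 - \frac{i}{M_j}\right) \approx M_j^{x_j} \left(1 - \frac{x_j (x_j - 1)}{2 M_j}\right). $$

Each factor $1 - i/M_j$ is close to $1$ when $x_j \ll M_j,$ and to first order the relative error from dropping the whole product is about $x_j (x_j - 1) / (2 M_j).$ The same reasoning with $M$ and $N$ in place of $M_j$ and $x_j$ gives the approximation for ${M \choose N}.$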

So $$ P({\bf x}) \approx \frac{N!}{\prod_j x_j!} \prod_j \left(\frac{M_j}{M}\right)^{x_j} = \frac{N!}{\prod_j x_j!} \prod_j q_j^{x_j} $$ where $q_j = M_j/M$ is the probability of selecting a member of group $j.$ Thus we have shown that the exact answer reduces to the approximate answer that I suggested at the beginning, i.e. the Multinomial distribution, when the number of people sampled from each subgroup is much smaller than the true size of that subgroup.
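A quick numerical check of this limit, as a sketch: the snippet below evaluates both distributions in log space (via `lgamma`, so large $M$ is not a problem) for a fixed sample and increasing population sizes. The sample counts, the population sizes, and all function names are assumptions chosen for illustration; the shares are the ones from the question.

```python
from math import exp, lgamma, log

def log_choose(n, k):
    """log of the binomial coefficient C(n, k), valid for 0 <= k <= n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def mv_hypergeom_pmf(counts, group_sizes):
    """Multivariate hypergeometric probability, computed in log space."""
    n, m = sum(counts), sum(group_sizes)
    log_p = sum(log_choose(s, x) for x, s in zip(counts, group_sizes))
    return exp(log_p - log_choose(m, n))

def multinomial_pmf(counts, probs):
    """Multinomial probability, computed in log space (probs assumed > 0)."""
    n = sum(counts)
    log_p = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    log_p += sum(x * log(q) for x, q in zip(counts, probs) if x > 0)
    return exp(log_p)

shares = [0.02, 0.30, 0.40, 0.20, 0.08]   # q_j from the question
counts = [1, 15, 20, 10, 4]               # hypothetical sample, N = 50

approx = multinomial_pmf(counts, shares)
rel_errors = []
for m in (1_000, 100_000, 10_000_000):    # hypothetical population sizes M
    sizes = [round(q * m) for q in shares]  # M_j = q_j * M
    exact = mv_hypergeom_pmf(counts, sizes)
    rel_errors.append(abs(exact - approx) / approx)
    print(m, exact, approx, rel_errors[-1])
```

The relative gap between the two probabilities should shrink as $M$ grows with $N$ held fixed, which is the limit derived above.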

  • 1
    It might be worth noting the name of the distribution you discuss in the later part (the multivariate hypergeometric). – Glen_b Sep 19 '17 at 14:08