Binomial Approximation to Hypergeometric Probability

Question

I am trying to understand how to apply the binomial distribution to a simple probability problem. I can solve the problem directly via the classical definition of probability but, when trying to interpret the problem as sampling from a binomial distribution, I get different results.

Problem statement:

The prevalence of some disease in a given country is $p$. A sample of $n<N$ people is selected from a city with $N$ inhabitants (in that same country). What is the probability that exactly $k$ people in this sample have the disease?

Method 1 (favourable/total): There are $\binom N n$ possible samples from the population. Out of those, we are interested in the ones with $k$ infected (out of $pN$) and $n-k$ healthy people (out of $(1-p)N$), which account for $\binom {pN} k\binom {(1-p)N} {n-k}$ possibilities. Thus, the probability is $$ P = \frac {\binom {pN} k\binom {(1-p)N} {n-k}} {\binom N n} $$

Method 2 (binomial): It seems that this problem can be cast as sampling from a binomial distribution, with success probability $p$ and $n$ repetitions. We are interested in $k$ successes, thus we should have
$$ P(k) = \binom {n} k p^k (1-p)^{n-k} $$

If we take concrete numbers, e.g. $N=200$, $p=0.1$, $n=20$, $k=2$, we end up with $P\approx0.30$ for method 1, while method 2 gives $P \approx 0.285$.
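As a quick numerical check, both formulas can be evaluated directly in R. This is only a sketch: the variable names are illustrative, and it assumes $pN$ is a whole number so that the count of infected people is well defined.

```r
# Concrete numbers from the example above
N <- 200; p <- 0.1; n <- 20; k <- 2
K <- round(p * N)  # infected people in the city (assumes p * N is an integer)

# Method 1: favourable samples / all possible samples
method1 <- choose(K, k) * choose(N - K, n - k) / choose(N, n)

# Method 2: binomial formula with success probability p
method2 <- choose(n, k) * p^k * (1 - p)^(n - k)

print(c(method1 = method1, method2 = method2))
```

The two values differ in the second decimal place, which is the discrepancy asked about here.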

Why are these numbers different?
What is wrong with the binomial solution?
Should it somehow depend on the population size $N$?

If you select someone presumably you won't select them again in which case once they're in your sample the probability for those left for you to select from is changed. — Glen_b, Aug 01 '18 at 07:22
Thanks! Now I see that we are selecting from the population without replacement - which would be described by a hypergeometric distribution (I am also happy to notice that method 1 above derives exactly the hypergeometric pmf...) — Joseph Greenpie, Aug 01 '18 at 17:13

BruceET · Accepted Answer · 2018-08-01T22:32:04.437

The exact probability is hypergeometric, as in the displayed equation in your Question. It assumes sampling without replacement. (That is, the same person cannot be chosen twice.)

If $n$ is very much smaller than $N$, then a binomial model, which assumes sampling with replacement, may be useful. (The approximation rests on the low chance that the same person would be chosen more than once when only a few people, $n$, are drawn from a large population of $N$. A common rule of thumb is that the binomial approximation is useful when $n/N < 0.1$.)
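To see how the rule of thumb plays out, one can hold the sample fixed and let the population grow. The sketch below is illustrative only: the values $n = 20$, $k = 2$, $p = 0.1$ and the grid of population sizes are assumptions chosen for the example, each with $pN$ a whole number.

```r
n <- 20; k <- 2; p <- 0.1
N <- c(200, 2000, 20000, 200000)  # population sizes (p * N is an integer for each)
K <- round(p * N)                 # infected people in each population

exact <- dhyper(k, K, N - K, n)   # sampling without replacement
binom <- dbinom(k, n, p)          # sampling with replacement; does not depend on N

result <- data.frame(N = N, ratio = n / N, exact = exact,
                     binomial = binom, abs_err = abs(exact - binom))
print(result)
```

The `abs_err` column shrinks as $n/N$ falls, which is the behaviour the rule of thumb is meant to capture.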

Let's look at specific numbers to see how this plays out computationally. Let $N = 100{,}000,\ n = 500,\ k = 10,\ p = 0.02.$

Hypergeometric: The number of infected individuals in the city is $pN = 2000$ and the remaining 98,000 are uninfected, so $P(X = 10) = 0.1267$ and $P(X \le 10) = 0.5831.$ Computations in R, where dhyper and phyper are the PMF and CDF of a hypergeometric distribution (arguments: value, number infected, number uninfected, sample size):

> dhyper(10, 2000, 98000, 500)
[1] 0.1266969
> phyper(10, 2000, 98000, 500)
[1] 0.5830506

Binomial approximation: Here $Y \sim \mathsf{Binom}(n = 500, p = .02).$ Then $P(Y = 10) = 0.1264$ and $P(Y \le 10) = 0.5830.$

> dbinom(10, 500, .02)
[1] 0.1263798
> pbinom(10, 500, .02)
[1] 0.583044

In these examples the binomial approximations are very good. The plot below shows this hypergeometric distribution (blue bars) and its binomial approximation (red). Within the resolution of the plot, it is difficult to distinguish between the two.
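A comparison plot of this kind can be redrawn with a few lines of base R. This is a sketch rather than the original plotting code: the range 0 to 25, the axis labels, and the variable names are assumptions made for illustration.

```r
x <- 0:25                              # counts covering most of the probability
hyp <- dhyper(x, 2000, 98000, 500)     # exact hypergeometric PMF
bin <- dbinom(x, 500, 0.02)            # binomial approximation

# Blue vertical bars for the hypergeometric, red dots for the binomial
plot(x, hyp, type = "h", lwd = 3, col = "blue",
     xlab = "Number infected in sample", ylab = "Probability")
points(x, bin, col = "red", pch = 19)

max_diff <- max(abs(hyp - bin))        # largest pointwise gap between the two PMFs
print(max_diff)
```

The printed `max_diff` gives a numerical sense of how closely the red dots sit on top of the blue bars.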

Note: With huge population sizes, the binomial coefficients in the hypergeometric PDF can become so large that they overflow R's floating-point range. R's hypergeometric functions are written to minimize this difficulty, but even so, there are limits on what can be computed. R can return log probabilities (log = TRUE in dhyper, log.p = TRUE in phyper) to avoid overflow; you can then exponentiate to get the answers.
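A minimal sketch of the log-probability route, using the same numbers as above. The hand-built version with lchoose is included only to illustrate the idea and is not a claim about how dhyper works internally.

```r
# Direct use of the binomial coefficient overflows for these sizes
print(choose(100000, 500))                # Inf in double precision

# Log-scale PMF from R, then exponentiate
log_p <- dhyper(10, 2000, 98000, 500, log = TRUE)
p_exact <- exp(log_p)

# Same quantity assembled by hand from log binomial coefficients
log_manual <- lchoose(2000, 10) + lchoose(98000, 490) - lchoose(100000, 500)
p_manual <- exp(log_manual)

print(c(dhyper = p_exact, manual = p_manual))
```

Working on the log scale keeps every intermediate quantity within range; only the final, moderate-sized probability is exponentiated.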

Joseph Greenpie · Answer 2 · 2018-08-02T01:58:45.980

Method (2) is invalid because we are sampling from the population without replacement - which leads to a hypergeometric distribution instead of the binomial. Notice that method (1) ends up corresponding precisely to the hypergeometric probability mass function.

As a side note, notice that for $N\gg n$ replacement should make almost no difference, and thus we end up with a binomial. This limit has been calculated elsewhere.
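A brief sketch of that limit, writing $K = pN$ for the number of infected people (a symbol introduced here for convenience) and holding $n$, $k$ and $p$ fixed. Expanding each binomial coefficient as a falling product divided by a factorial gives

$$ \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} = \binom{n}{k}\, \frac{\prod_{i=0}^{k-1}(K-i)\,\prod_{j=0}^{n-k-1}(N-K-j)}{\prod_{l=0}^{n-1}(N-l)} $$

The fraction on the right has $n$ factors in the numerator and $n$ in the denominator. As $N \to \infty$, each ratio of the form $(K-i)/(N-l)$ tends to $p$ and each ratio of the form $(N-K-j)/(N-l)$ tends to $1-p$, so the whole expression tends to $\binom{n}{k} p^k (1-p)^{n-k}$, the binomial probability of method (2).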
