I am trying to understand how to apply the binomial distribution to a simple probability problem. I can solve the problem directly with the classical definition of probability, but when I recast it as sampling from a binomial distribution, I get a different result.
Problem statement:
The prevalence of some disease in a given country is $p$. A sample of $n<N$ people is selected from a city with $N$ inhabitants (in that same country). What is the probability that exactly $k$ people in this sample have the disease?
Method 1 (favourable/total): there are $\binom N n$ possible samples from the population. Assuming the city contains exactly $pN$ infected and $(1-p)N$ healthy people, the samples of interest are those with $k$ infected (out of $pN$) and $n-k$ healthy people (out of $(1-p)N$), which account for $\binom {pN} k\binom {(1-p)N} {n-k}$ possibilities. Thus, the probability is
$$
P = \frac {\binom {pN} k\binom {(1-p)N} {n-k}} {\binom N n}
$$
Method 2 (binomial): it seems that this problem can be cast as sampling from a binomial distribution with success probability $p$ and $n$ repetitions. We are interested in $k$ successes, so we should have
$$
P(k) = \binom {n} k p^k (1-p)^{n-k}
$$
If we take concrete numbers, e.g. $N=200$, $p=0.1$, $n=20$, $k=2$, method 1 gives $P\approx 0.300$, while method 2 gives $P \approx 0.285$.
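For reference, here is a minimal Python sketch of how the two numbers can be computed. The function names `method1` and `method2` are just labels chosen for this post, and `method1` assumes that $pN$ is (close to) an integer, i.e. that the city contains exactly $pN$ infected people. It should reproduce the two values quoted above up to rounding.

```python
from math import comb  # available in Python 3.8+


def method1(N, p, n, k):
    """Favourable/total count: n people drawn without replacement from N."""
    infected = round(p * N)   # assumption: exactly p*N infected in the city
    healthy = N - infected
    return comb(infected, k) * comb(healthy, n - k) / comb(N, n)


def method2(p, n, k):
    """Binomial formula: n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N, p, n, k = 200, 0.1, 20, 2
print(f"method 1: {method1(N, p, n, k):.3f}")
print(f"method 2: {method2(p, n, k):.3f}")
```

Note that only `method1` takes $N$ as an argument; `method2` has no way to depend on it.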
- Why are these numbers different?
- What is wrong with the binomial solution?
- Should it somehow depend on the population size $N$?
