5

The hypergeometric distribution is defined for $\max(0, n+K-N)\leq k\leq \min(K,n).$ But when we use Vandermonde's identity to prove that the probabilities sum to $1$, we use the range $0\leq k \leq n.$ How is this justified?

Stat
  • 7,486
Silent
  • 489
  • In Vandermonde's identity (as given at your link), what happens, for example, when $r>m$? (It's quite simple ... and it's the same in the hypergeometric. Take care not to confuse the two different uses of $n$ in the two formulas though.) – Glen_b May 29 '14 at 14:31

2 Answers

5

The reference mentions that this identity is from combinatorics: that is, it counts things.

What does it count? Consider $N$ objects. Once and for all, divide those $N$ objects into a group of $K$ of them, which I will call "red," and the remainder, which I will call "blue." Each subset of $n$ such objects determines, and is determined by, its red objects and its blue objects. The number of such sets with $k$ red objects (and therefore $n-k$ blue objects) equals the number of ways to choose $k$ red objects from all $K$ of them (written $\color{red}{\binom{K}{k}}$) times the number of ways to choose the remaining $n-k$ blue objects from all $N-K$ of them (written $\color{blue}{\binom{N-K}{n-k}}$).

Now if $k$ is not between $0$ and $K$, then there is no $k$-element subset of $K$ things, so $\binom{K}{k}=0$ in such cases. Similarly, $\binom{N-K}{n-k}=0$ if $n-k$ is not between $0$ and $N-K$. (This not only makes sense, it is actually how good software will evaluate these quantities. Ask R, for instance, to compute choose(5,6) or choose(5,-1): it will return the correct value of $0$ in both cases.)
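To make that concrete, here is the check in an R session (just the two cases mentioned above, plus one ordinary in-range case of my own for contrast):

```r
# Out-of-range binomial coefficients evaluate to 0 in R.
choose(5, 6)    # choosing 6 things from 5: returns 0
choose(5, -1)   # choosing a negative number of things: returns 0
choose(5, 3)    # an ordinary in-range case, for contrast: returns 10
```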

Summing over all possible numbers $k$ shows that

$$\binom{N}{n} = \sum_k \color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}$$

and as you read this you should say to yourself "any $n$ objects consist of some number $k$ of red objects and the remaining $n-k$ blue objects."

The sum needs to include all $k$ for which both the terms $\color{red}{\binom{K}{k}}$ and $\color{blue}{\binom{N-K}{n-k}}$ are nonzero, but it's fine to include any other values of $k$ because they will just introduce some extra zeros into the sum, which does not change it. We just need to make sure all relevant $k$ are included. It suffices to find an obvious lower bound for it ($0$ will do nicely and is more practicable than $-\infty$!) and an obvious upper bound ($N$ works because we cannot find more than $N$ objects altogether). A slightly better upper bound is $n$ (because $k$ counts the red objects in a set of $n$ things). Thus, writing these bounds explicitly and dividing both sides by $\binom{N}{n}$, we obtain

$$1 = \sum_{0\le k\le n}\frac{\color{red}{\binom{K}{k}}\color{blue}{\binom{N-K}{n-k}}}{\binom{N}{n}} .$$

Despite the notation, this formula does not implicitly assert that all values of $k$ in the range from $0$ to $n$ can occur in this distribution. About the only reason to fiddle with the inequalities and figure out what the smallest possible range of $k$ can be would be for writing programs that loop over these values: that might save a little time adding up some zeros.
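As a concrete check (a minimal sketch in R; the particular values of $N$, $K$, and $n$ are my own, not from the question), summing over the loose range $0\le k\le n$ still gives exactly $1$, because the impossible values of $k$ contribute only zero terms:

```r
N <- 10; K <- 7; n <- 5   # here the support of k is max(0, n+K-N) = 2 to min(K, n) = 5
k <- 0:n                  # the loose range used with Vandermonde's identity
sum(choose(K, k) * choose(N - K, n - k)) / choose(N, n)   # 1
sum(dhyper(k, K, N - K, n))   # also 1: the built-in pmf does the same zero-padding
```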

whuber
  • 322,774
  • 1
    Just a small comment about the R choose function: it actually doesn't use the combinatorial definition if the top number is negative or not an integer. For example choose(-2, 4) is nonzero. I was once stuck on a calculation for weeks until I discovered this. – Flounderer May 30 '14 at 02:01
  • 1
    @Flounderer For positive integers $n$, $\binom{-n}{k}$ actually does have a combinatorial interpretation and it should be nonzero when $k\ge 0$. Even when $n$ is nonintegral--even when it is a complex number!--this has a clear definition and makes sense: it is the coefficient of $x^k$ in the Binomial expansion (Taylor series) of $(1+x)^n$. – whuber May 30 '14 at 02:04
  • 1
    Yes, but if it's to be "the number of ways of choosing k objects from -n objects", I still think that it should be zero. I believe it even was zero (or maybe undefined) in earlier versions of R, but they changed it. – Flounderer May 30 '14 at 02:11
  • Thank you, so much, Sir! Your explanations are always exhaustive and fun to read. – Silent May 30 '14 at 04:00
2

I don't think there is a problem here.

Note that $\min(n,K)=n$ when $n\leq K$ and $\min(n,K)=K$ when $K<n$. Whenever $n\leq K$, there is no problem, because the upper bound in Vandermonde's identity is $n$. When $K<n$, we have $k\leq K<n$: $k$ is the number of successes and $K$ is the number of success states in the population, so obviously $k\leq K$. Therefore in this case too there is no problem, because the upper bound in Vandermonde's identity is $n$.

Also, $\max(0,n+K-N)=0$ when $n+K-N\leq 0$ and $\max(0,n+K-N)=n+K-N$ when $n+K-N>0$. Therefore, if $n+K-N\leq 0$, there is no problem, because the lower bound in Vandermonde's identity is $0$. Now for the final case, suppose $n+K-N>0$, so $K>N-n\geq 0$. We know that $k\leq K$, so one of the following holds:

  1. $K\geq k>N-n\geq 0$
  2. $K>N-n\geq k\geq 0$ or
  3. $K>N-n\geq 0 \geq k$.

Looking at 1, 2 and 3 above, we can see that in case 1, in case 2, and in case 3 with $k=0$, there is again no problem, because the lower bound in Vandermonde's identity is $0$. Note that in case 3, $k<0$ is not possible, since the number of successes cannot be negative.
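For what it's worth, here is a small brute-force check in R (a sketch of my own, not part of the answer) that every $k$ giving a nonzero term already satisfies the hypergeometric bounds, so the range $0\le k\le n$ neither misses nor adds any probability:

```r
# Enumerate small populations and confirm that any k with a positive term
# lies between max(0, n+K-N) and min(K, n).
for (N in 1:8) for (K in 0:N) for (n in 0:N) {
  k <- 0:n                                                 # Vandermonde range
  nonzero <- k[choose(K, k) * choose(N - K, n - k) > 0]    # k with positive terms
  stopifnot(nonzero >= max(0, n + K - N), nonzero <= min(K, n))
}
```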

Stat
  • 7,486