2

Suppose we sample (uniformly, with replacement) $t$ times a set of $N$ items. What is the probability $x$ that the sample contains $y$ different items?

This is representative of a real-life scenario I am facing, trying to determine if a certain issue impacts the whole population, or is specific to a certain subset. For example, I have a population of 73 specimens ($N$). One issue has manifested 250 times ($t_1$) on 47 different specimens ($y_1$), while another has manifested 64 times ($t_2$) on 17 different specimens ($y_2$). Intuitively, issue 2 is much more likely to be specific to a subset of the specimens, but I would like to quantify this before trying to analyse further.

One way forward I struggled with for a while was to use the inclusion probability. There are formulas for including one item $x_k$ or two items $x_{kl}$ (below), but I couldn't generalize for $y$ items.

$$x_k = (1-\frac{1}{N})^t $$ $$x_{kl} = 2(1-\frac{1}{N})^t - (1-\frac{2}{N})^t $$

I also have a little simulation to get some taste of what the distribution would look like.

Any help or direction would be appreciated!

Ben
  • 124,856

1 Answers1

4

This is an example of the classical occupancy problem. Let $1 \leqslant K \leqslant \min(t,N)$ denote the number of different items sampled. Then the probability mass function for this random variable is the classical occupancy distribution:

$$\mathbb{P}(K=k) = \frac{(N)_k \cdot S(t,k)}{N^t} \quad \quad \quad \text{for all } 1 \leqslant k \leqslant \min(t,N),$$

where $(N)_k = N(N-1)(N-2) \cdots (N-k+1)$ are the falling factorials and $S(t,k)$ are the Stirling numbers of the second kind. Letting $E_r \equiv (1-r/N)^t$, the mean and variance of the distribution are:

$$\mathbb{E}(K) = N (1-E_1) \quad \quad \quad \quad \quad \mathbb{V}(K) = N [(N-1) E_2 + E_1 - N \cdot E_1^2].$$

The higher-order moments and other broader properties of this distribution are well-known (see e.g., O'Neill 2019). For large values of $t$ and $N$ you can approximate it well by a normal distribution with identical mean and variance, with approximation accuracy shown in the cited paper.

Ben
  • 124,856