I learned a while ago about an interesting place that $e$ shows up in probability: if there are $n$ items and you sample $n$ times with replacement, you would expect the fraction of items drawn at least once to be $1 - e^{-1} ≈ 0.63$, and the fraction of items that never gets drawn to be $e^{-1} ≈ 0.37$ (assuming sufficiently large $n$).
This comes up in the context of "bagging" in machine learning. A "bagged" model does not train on all $n$ data points, but trains on $n$ samples drawn randomly with replacement. If there are 100 data points, about 63 will be drawn on average (i.e. many of them will be drawn more than once while about 37 are never drawn).
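A quick Monte Carlo check agrees with the $0.63$ figure (just a sketch; the trial count and the use of `random.randrange` are arbitrary choices):

```python
import random

def distinct_fraction(n, trials=1000):
    """Sample n times with replacement from n items and return
    the average fraction of distinct items that were drawn."""
    total = 0.0
    for _ in range(trials):
        drawn = {random.randrange(n) for _ in range(n)}  # set keeps distinct draws
        total += len(drawn) / n
    return total / trials

print(distinct_fraction(1000))  # typically ≈ 0.632
```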
I wanted to derive this using a random variable. I started with a random variable $X$ representing the number of items that are drawn at least once. The goal is to compute $E[X]$.
$$E[X] = \sum_{x=1}^{n}x \cdot p_{X}(x) = \sum_{x=1}^{n}x \cdot \frac{{n \choose x} {n - x + x - 1 \choose x - 1}}{{n + n - 1 \choose n}}$$
The term $n \choose x$ comes from choosing which $x$ of the $n$ items are drawn at least once.
The term $n - x + x - 1 \choose x - 1$ comes from the following: we have exactly $x$ items that are drawn from, and we first allot one draw to each so that every one of them is guaranteed at least one draw. That leaves $(n - x)$ remaining draws to allot into the $x$ buckets. We use the stars and bars technique to take $n-x$ indistinct draws with replacement from $x$ items, which gives this binomial coefficient.
The term $n + n - 1 \choose n$ counts all the ways you can take $n$ indistinct draws with replacement from $n$ samples.
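For small $n$, these counts can be sanity-checked by brute-force enumeration (a sketch using `itertools.combinations_with_replacement`, with ${n - x + x - 1 \choose x - 1}$ written in its simplified form ${n - 1 \choose x - 1}$):

```python
from itertools import combinations_with_replacement
from math import comb

n = 5
# Every multiset of n draws from n items, each listed exactly once.
multisets = list(combinations_with_replacement(range(n), n))
assert len(multisets) == comb(2 * n - 1, n)  # the denominator: all multisets

for x in range(1, n + 1):
    # Multisets in which exactly x distinct items appear.
    count = sum(1 for m in multisets if len(set(m)) == x)
    assert count == comb(n, x) * comb(n - 1, x - 1)
print("counts match for n =", n)
```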
When I compute this for reasonably sized $n$, I do not get $63\%$ of items drawn; instead I get closer to $50\%$. Here is some Python:
>>> from math import comb
>>> n = 100
>>> sum(x * comb(n, x) * comb(n - 1, x - 1) for x in range(1, n + 1)) / comb(n + n - 1, n)
50.25125628140704
I know I am using a valid distribution because I can sum just the probabilities: $\sum_{x=1}^{n}\frac{{n \choose x} {n - x + x - 1 \choose x - 1}}{{n + n - 1 \choose n}} = 1.0$
>>> sum(comb(n, x) * comb(n - 1, x - 1) for x in range(1, n + 1)) / comb(n + n - 1, n)
1.0
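Running the same sum for a few values of $n$ shows the expected fraction hugging $1/2$ rather than $0.63$:

```python
from math import comb

def expected_fraction(n):
    """E[X] / n under the pmf above."""
    e_x = sum(x * comb(n, x) * comb(n - 1, x - 1)
              for x in range(1, n + 1)) / comb(2 * n - 1, n)
    return e_x / n

for n in (10, 50, 200):
    print(n, expected_fraction(n))  # each value is just above 0.5
```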
I have two questions:
- My expression must be modeling the wrong thing. How would I have to change it to get $E[X] / n = 1 - e^{-1} ≈ 0.63$? I know there are alternative proofs of this fact, but I want to understand why my derivation fails.
- Even if my random variable $X$ does not correctly model this problem, I am curious whether it corresponds to some standard distribution: $X \sim \mathrm{Mystery}(n); \quad p_{X}(x) = \frac{{n \choose x} {n - x + x - 1 \choose x - 1}}{{n + n - 1 \choose n}}$