I learned a while ago about an interesting place that $e$ shows up in probability: if there are $n$ items and you sample $n$ times with replacement, you would expect the fraction of items drawn at least once to be $1 - e^{-1} ≈ 0.63$, and the fraction of items that never gets drawn to be $e^{-1} ≈ 0.37$ (assuming sufficiently large $n$).
This comes up in the context of "bagging" in machine learning. A "bagged" model does not train on all $n$ data points, but trains on $n$ samples drawn randomly with replacement. If there are 100 data points, about 63 will be drawn on average (i.e. many of them will be drawn more than once while about 37 are never drawn).
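A quick Monte Carlo check agrees with the $0.63$ figure (just a sketch; the trial count and the use of `random.randrange` are arbitrary choices):

```python
import random

def distinct_fraction(n, trials=1000):
    """Sample n times with replacement from n items and return
    the average fraction of distinct items that were drawn."""
    total = 0.0
    for _ in range(trials):
        drawn = {random.randrange(n) for _ in range(n)}  # set keeps distinct draws
        total += len(drawn) / n
    return total / trials

print(distinct_fraction(1000))  # typically ≈ 0.632
```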
I wanted to derive this using a random variable. I started with a random variable $X$ representing the number of items that are drawn at least once. The goal is to compute $E[X]$.
$$E[X] = \sum_{x=1}^{n}x \cdot p_{X}(x) = \sum_{x=1}^{n}x \cdot \frac{{n \choose x} {n - x + x - 1 \choose x - 1}}{{n + n - 1 \choose n}}$$
The term $n \choose x$ comes from choosing which $x$ of the $n$ items are drawn at least once.
The term $n - x + x - 1 \choose x - 1$ comes from the following: we have exactly $x$ items that are drawn from, and we first allot one draw to each so that every one of them is guaranteed at least one draw. That leaves $(n - x)$ remaining draws to allot into the $x$ buckets. We use the stars and bars technique to take $n-x$ indistinct draws with replacement from $x$ items, which gives this binomial coefficient.
The term $n + n - 1 \choose n$ counts all the ways you can take $n$ indistinct draws with replacement from $n$ samples.
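For small $n$, these counts can be sanity-checked by brute-force enumeration (a sketch using `itertools.combinations_with_replacement`, with ${n - x + x - 1 \choose x - 1}$ written in its simplified form ${n - 1 \choose x - 1}$):

```python
from itertools import combinations_with_replacement
from math import comb

n = 5
# Every multiset of n draws from n items, each listed exactly once.
multisets = list(combinations_with_replacement(range(n), n))
assert len(multisets) == comb(2 * n - 1, n)  # the denominator: all multisets

for x in range(1, n + 1):
    # Multisets in which exactly x distinct items appear.
    count = sum(1 for m in multisets if len(set(m)) == x)
    assert count == comb(n, x) * comb(n - 1, x - 1)
print("counts match for n =", n)
```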
When I compute this for reasonably sized $n$, I do not get $63\%$ of items drawn; instead I get closer to $50\%$. Here is some Python:
>>> from math import comb
>>> n = 100
>>> sum(x * comb(n, x) * comb(n - 1, x - 1) for x in range(1, n + 1)) / comb(n + n - 1, n)
50.25125628140704
I know I am using a valid distribution because I can sum just the probabilities: $\sum_{x=1}^{n}\frac{{n \choose x} {n - x + x - 1 \choose x - 1}}{{n + n - 1 \choose n}} = 1.0$
>>> sum(comb(n, x) * comb(n - 1, x - 1) for x in range(1, n + 1)) / comb(n + n - 1, n)
1.0
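Running the same sum for a few values of $n$ shows the expected fraction hugging $1/2$ rather than $0.63$:

```python
from math import comb

def expected_fraction(n):
    """E[X] / n under the pmf above."""
    e_x = sum(x * comb(n, x) * comb(n - 1, x - 1)
              for x in range(1, n + 1)) / comb(2 * n - 1, n)
    return e_x / n

for n in (10, 50, 200):
    print(n, expected_fraction(n))  # each value is just above 0.5
```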
I have two questions:
- My expression must be modeling the wrong thing. How would I have to change it to get $E[X] / n = 1 - e^{-1} ≈ 0.63$? I know there are alternative proofs of this fact, but I want to understand why my derivation fails.
- Even if my random variable $X$ does not correctly model this problem, I am curious whether it corresponds to some standard distribution: $X \sim \mathrm{Mystery}(n); \quad p_{X}(x) = \frac{{n \choose x} {n - x + x - 1 \choose x - 1}}{{n + n - 1 \choose n}}$