4

Suppose I have a population of size $n$ in which each element has a probability $p$ of being sick. By sampling this population without replacement, how many elements would I have to sample on average (expected value) to find all the sick ones?

Simulating this in Python with $n=100$ and $p=0.5$, the expected value is $\approx 99$.
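A minimal sketch of such a simulation, using NumPy (the function name and defaults are illustrative, not necessarily the exact code I ran):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_last_sick_index(n=100, p=0.5, trials=100_000):
    """Monte Carlo estimate of the expected index of the last sick element
    when the whole population is sampled in a fixed order (0 if nobody is sick)."""
    sick = rng.random((trials, n)) < p          # each element sick with probability p
    any_sick = sick.any(axis=1)
    # 1-based index of the last sick element in each trial (0 when there is none)
    last = np.where(any_sick, n - np.argmax(sick[:, ::-1], axis=1), 0)
    return last.mean()

print(mean_last_sick_index())   # ~99.0 for n=100, p=0.5
```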

faken
  • 243
  • 2
  is this homework? – Aksakal Aug 03 '23 at 00:49
  • @Aksakal, you can easily deduce that by looking at my Stack Exchange profile. Takes less than a minute. In any case, I'm still waiting for the correct answer for this apparently trivial question. If you know the answer, please provide it and I'll be glad to upvote it and give it the green light. Cheers. – faken Aug 03 '23 at 11:43
  • Sampling without replacement is a meaningless distinction: you are simply flipping a coin but limiting it to 100 flips and asking whether, if you continued to a full 100 flips, all the remaining results would be tails. Thus, if you stop after $k$ flips, the chance the remaining $100-k$ flips are tails is $2^{k-100}.$ Evidently, even if you stop after 99 flips there's still a 50% chance you haven't observed all the heads. – whuber Aug 03 '23 at 14:11
  • 2
    What @Aksakal is trying to get at is that routine textbook problems, whether homework or not, are treated differently than real statistical problems. It might take less than a minute to read about our policy here ;-). BTW, your CV profile is anonymous and uninformative, so please don't complain that we don't know you. – whuber Aug 03 '23 at 18:23
  • 1
    @whuber, I'm well aware of the policy. My public profile shows that I'm a Stack Exchange (SE) user for almost 13 years (with mostly answer-built reputation on other SE sites), also showing links to my GitHub and Twitter accounts. So, even though it's allowed to ask about homework, it can come across as rude when the answer is so easy to find. In any case, the real answer I came here for is still M.I.A. (although Hunaphu below is close), so I would ask that the focus of this interaction would be on the actual answer, as I'm still scratching my head around it. Cheers. – faken Aug 03 '23 at 18:51
  • It sounds like you didn't read my first comment, which I believe answers it, unless I have misinterpreted your question ;-). Your points are well taken but please bear in mind that some active community members are not native English speakers. As I'm sure you know from your long experience, it is well not to read anything into a comment that isn't explicitly there and to assume that comments are volunteered out of friendliness and a desire to be constructive. – whuber Aug 03 '23 at 19:36
  • 1
    @whuber, point taken. Regarding your second comment, I've read it and understand it for $p=0.5$, but I can't generalize it to $p\neq 0.5$ and make a connection with Hunaphu's answer below (which contains a formula that perfectly fits the simulation), nor am I understanding how that formula relates to the mentioned hypergeometric distribution. Any hints? – faken Aug 04 '23 at 00:11
  • The hypergeometric distribution is irrelevant for your probability model (although of course it supplies a good approximation, because intuitively, sampling from a population with a fraction $p$ of successes is similar to sampling from a population where each subject has a chance $p$ of success). The generalization to arbitrary $p$ replaces "$0.5$" with "$p.$" – whuber Aug 04 '23 at 12:45
  • @whuber, except that in your second comment there's no "0.5". There is a 2, which I would imagine to be $2 = 1/0.5 = 1/p$, but that won't work with $p \neq 0.5$: I've tried it analytically and in simulation. At this time the only thing that works is Hunaphu's formula in his answer below, but I don't have a clue how he got to that result. Thus far, no one has been able to provide a complete answer. Any more hints? – faken Aug 04 '23 at 12:59
  • How, exactly, does it not work? Yes, I used "2" instead of "0.5" for simplicity. The general formula is $(1-p)^{100-k}$ for the chance that the final $100-k$ observations will be non-successes. – whuber Aug 04 '23 at 13:03
  • 2
  @faken a simple "no" would suffice. The question is not trivial, and neither is its answer, btw. – Aksakal Aug 04 '23 at 21:55
  • 1
    @whuber, please check my answer below. Cheers. – faken Aug 06 '23 at 01:01

3 Answers

3

Cumulative distribution function (CDF)

Based on @whuber's comments, their formula gives the probability that the final $n-k$ observations are all non-successes, which is the same as the first $k$ observations containing all existing successes (i.e., the CDF of the number of observations needed).

$$ P(k)=(1-p)^{n-k} $$
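A quick Monte Carlo spot-check of this CDF for one $(n, p, k)$ triple (an illustrative sketch; the specific numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, trials = 20, 0.3, 15, 200_000

sick = rng.random((trials, n)) < p
# empirical probability that the final n-k elements are all non-sick
empirical = (~sick[:, k:]).all(axis=1).mean()
print(empirical, (1 - p) ** (n - k))   # the two values should roughly agree
```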

Probability mass function (PMF)

The PMF gives the probability that exactly $k$ observations are needed, i.e., that the last success occurs at position $k$. We can derive it from the CDF (for $k \ge 1$; the case $k=0$, where no element is sick, has probability $P(0)=(1-p)^n$ and contributes nothing to the expectation below):

$$ \begin{align*} p(k)&=P(k)-P(k-1)\\ &=(1-p)^{n-k}-(1-p)^{n-(k-1)}\\ &=(1-p)^{n-k}\left[1-(1-p)\right]\\ &=p(1-p)^{n-k} \end{align*} $$

i.e.

$$ p(k)=p(1-p)^{n-k} $$
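As a sanity check, a short sketch (my own, NumPy-based) confirming that this PMF matches the CDF differences and that, together with $P(0)=(1-p)^n$, it sums to 1:

```python
import numpy as np

n, p = 100, 0.3

k = np.arange(1, n + 1)
pmf = p * (1 - p) ** (n - k)        # p(k) for k >= 1
p0 = (1 - p) ** n                   # P(k = 0): no element is sick

# probabilities sum to 1, and the PMF equals the CDF differences
assert np.isclose(p0 + pmf.sum(), 1.0)
cdf = (1 - p) ** (n - np.arange(0, n + 1))
assert np.allclose(np.diff(cdf), pmf)
```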

Expected value

To determine the expected value (mean) of $k$ under this probability mass function, we use the general expression for the expected value:

$$ E[X]=\sum_i x_i\,p(X=x_i) $$

where:

  • $E[X]$ is the expected value of the random variable $X$
  • $x_i$ represents the possible values of the random variable $X$
  • $p(X=x_i)$ is the probability mass function of the random variable $X$, which gives the probability that $X$ takes on the value $x_i$

Substituting the terms from our problem, we get:

$$ \begin{align*} E[k]&=\sum_{k=0}^{n}kp(1-p)^{n-k}\\ &=p(1-p)^n\sum_{k=0}^{n}k(1-p)^{-k} \end{align*} $$

We need to analytically simplify the summation $\sum_{k=0}^{n}k(1-p)^{-k}$. If we define $r=(1-p)^{-1}$, the summation becomes:

$$ \sum_{k=0}^{n} kr^k $$

To simplify our summation, we can now use the sum of the first $n$ terms of the arithmetico-geometric sequence, which has the form:

$$ \sum_{k=1}^{n}\left[a+(k-1)d\right]br^{k-1}=\frac{ab-(a+nd)br^n}{1-r}+\frac{dbr(1-r^n)}{(1-r)^2} $$

If we set $a=d=1$ and $b=r$, the expression becomes:

$$ \sum_{k=1}^{n}kr^k=\frac{r-(1+n)r^{n+1}}{1-r}+\frac{r^2(1-r^n)}{(1-r)^2} $$
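A quick numerical spot-check of this closed form against a direct summation (an illustrative sketch; the test values are arbitrary):

```python
from math import isclose

# spot-check the closed form of sum_{k=1}^{n} k r^k against a direct summation
for n, r in [(10, 0.7), (50, 2.0), (100, 1 / (1 - 0.3))]:
    direct = sum(k * r**k for k in range(1, n + 1))
    closed = (r - (1 + n) * r**(n + 1)) / (1 - r) + r**2 * (1 - r**n) / (1 - r) ** 2
    assert isclose(direct, closed, rel_tol=1e-9)
```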

Substituting this into our $E[k]$ expression, while remembering that $r=(1-p)^{-1}$, we get:

$$ E[k]=p(1-p)^n\left[\frac{(1-p)^{-1}-(1+n)(1-p)^{-n-1}}{1-(1-p)^{-1}}+\frac{(1-p)^{-2}(1-(1-p)^{-n})}{(1-(1-p)^{-1})^2}\right] $$

Using SymPy to simplify the previous expression, we obtain:

$$ E[k]=\frac{p \left(n - \left(1 - p\right)^{n} + 1\right) + \left(1 - p\right)^{n} - 1}{p} $$
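A sketch of how this simplification might be reproduced (the symbol names are mine, and `simplify` may return an algebraically equivalent form):

```python
import sympy as sp

n, p = sp.symbols('n p', positive=True)
r = 1 / (1 - p)

# E[k] as written above, before simplification
E = p * (1 - p)**n * (
    (r - (1 + n) * r**(n + 1)) / (1 - r)
    + r**2 * (1 - r**n) / (1 - r)**2
)

print(sp.simplify(E))
```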

This can be rewritten further to arrive at @Hunaphu's formula:

$$ \begin{align*} E[k]&=\frac{p \left(n - \left(1 - p\right)^{n} + 1\right) + \left(1 - p\right)^{n} - 1}{p}\\ &=n-(1-p)^n+1+\frac{(1-p)^n-1}{p}\\ &=n+\frac{p-p(1-p)^n+(1-p)^n-1}{p}\\ &=n+\frac{(1-p)^n(1-p)-(1-p)}{p}\\ &=n+\frac{(1-p)^{n+1}-(1-p)}{p} \end{align*} $$

Therefore, the answer to my question is:

$$ E[k]=n+\frac{(1-p)^{n+1}-(1-p)}{p} $$
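A short check (illustrative sketch; `expected_k` is just a name I chose) that the closed form agrees with the direct PMF summation and with the simulation value mentioned in the question:

```python
import numpy as np

def expected_k(n, p):
    """Closed-form E[k] = n + ((1 - p)**(n + 1) - (1 - p)) / p."""
    return n + ((1 - p) ** (n + 1) - (1 - p)) / p

n, p = 100, 0.5
k = np.arange(1, n + 1)
# the closed form agrees with the direct PMF summation ...
assert np.isclose(expected_k(n, p), np.sum(k * p * (1 - p) ** (n - k)))
# ... and with the simulation result from the question
print(expected_k(n, p))   # ~99.0
```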

This is exactly @Hunaphu's formula, but it doesn't seem related at all to the hypergeometric distribution they suggested.

In any case, thanks to @whuber for providing an idea of how to start solving the problem, and to @Hunaphu for presenting the final answer, although without any explanation of how they got there.

If there's a shorter path to this result, please post it as another answer, as I would like to know about it.

faken
  • 243
2

Problem formulation

Index the subjects $1, 2, \ldots, n$ in the order to be observed and for each $k=1,2,\ldots, n$ let $X_k$ be the indicator of whether subject $k$ is sick. The index of the last sick subject, $K,$ is a function of the $X_k$ and therefore is also a random variable. (When no subjects are sick, you don't have to sample any at all, in which case we will define $K=0.$) You want to know the expected value of $K.$

Solution

A simple, direct method begins with the distribution function of $K$ defined by

$$F_K(k) = \Pr(K\le k) = \Pr(X_{k+1}=\cdots=X_n= 0) = (1-p)^{n-k}$$

assuming the $\{X_k\}$ are independent. Now

$$E[K] = \int_0^\infty (1 - F_K(k))\,\mathrm dk = \sum_{k=0}^{n-1} (1 - (1-p)^{n-k}) = n - (1-p)\frac{1 - (1-p)^n}{p}.$$

(This is algebraically equivalent to the formulas given in the existing answers, but -- as requested -- is derived in a short, simple manner.)
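A quick numerical check of this identity (an illustrative Python sketch; `expected_last_sick` is just an ad-hoc name):

```python
from math import isclose

def expected_last_sick(n, p):
    """Closed form: n - (1 - p) * (1 - (1 - p)**n) / p."""
    return n - (1 - p) * (1 - (1 - p) ** n) / p

# the survival-function sum and the closed form agree
for n, p in [(1, 0.3), (10, 0.05), (100, 0.5)]:
    survival_sum = sum(1 - (1 - p) ** (n - k) for k in range(n))
    assert isclose(survival_sum, expected_last_sick(n, p))
```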


Because it's easy to make off-by-one errors, let's do some quick checks of simple cases.

  1. $n=0.$ By definition $K=0,$ so the formula had better give $0$ for its expectation. Indeed, $$0 - (1-p)\frac{1 - (1-p)^0}{p} = -(1-p)\frac{0}{p} = 0$$ provided $p\ne 0$ (an exception taken care of below).

  2. $n=1.$ The unique subject is sick with probability $p,$ where $K=1;$ otherwise $K=0.$ Thus the expectation must be $p(1) + (1-p)(0) = p.$ The formula also gives $$1 - (1-p)\frac{1 - (1-p)^1}{p}=p,$$ again provided $p\ne 0.$

  3. $p=1.$ All subjects are sick and so $K=n$ almost surely. The formula also gives $$n - (1-1)\frac{1 - (1-1)^n}{1} = n.$$

  4. $p=0.$ No subjects are sick and so $K=0$ almost surely. The formula is undefined, but we may take the limit as $p\to 0$ from above via L'Hopital's Rule, $$\lim_{p\to 0^+}n - (1-p)\frac{1 - (1-p)^n}{p} = n - \lim_{p\to 0^+} \frac{-1 + (n+1)(1-p)^n}{1} = 0.$$

whuber
  • 322,774
0

The number of sick patients in a specific sample follows a hypergeometric distribution. The expectation, once calculated, is $$ N+\frac{(1 - p)^{N + 1} - (1-p)}{p} $$

Hunaphu
  • 2,212
  • 1
    Thank you for your answer. It would be correct if $p$ was the number of sick patients, but in this case $p$ is the probability of each patient being sick. If we convert the formula to $(N+1)\frac{pN}{pN+1}$, results become very similar to those obtained in simulation, but not quite the same. Therefore, while your answer seems to be on the right path, it's not correct.

    Another question, I was previously looking into the hypergeometric distribution, but the expected value formula is not quite that one. How did you derive that formula? Thanks.

    – faken Aug 03 '23 at 11:32
  • Yes sorry, I have corrected it. – Hunaphu Aug 03 '23 at 13:43
  • This still answers a different question. There's a distinction between sampling a population with, say, 50% sick people, and sampling a population in which each individual has a 50% chance of being sick. – whuber Aug 03 '23 at 14:10
  • @Hunaphu, that formula after your second edit, perfectly fits results from the simulation. However, I still don't understand how you derived it from the hypergeometric distribution. More than having a formula, I would like to understand how you got to it. Thanks in advance! – faken Aug 03 '23 at 16:29