2

Let $X_1, X_2, ..., X_k, ...$ be a sequence of i.i.d. random variables with $X_i \sim \mathcal{U}\{1, 2, ..., n\}$ (discrete uniform distribution). The parameter $n$ is unknown.

Let's say the first time you see an item already seen before is the k-th attempt (i.e. $X_k = X_m$, with $m<k$).

Then an estimation for $n$ seems to be:

$$n \approx \frac{1}{2}k^2 - \frac{5}{6} k + \frac{11}{18}$$


Here is the heuristic argument: let $U$ be the number of unique values seen in the sequence. The situation described here corresponds to $U = k-1$ (i.e. after $k$ attempts, the number of unique values seen is $k-1$). This answer suggests (both approach A. and B.) that an estimation for $n$ should be a real solution of $x (x-1)^k - (x-1) x^k = 0$, and we are then in "Regime $a$ very close to $k$" of this answer, and $n$ should be:

$$n \approx \frac{1}{2}k^2 - \frac{5}{6} k + \frac{11}{18}$$

How to turn this into a real argument (with probability/statistics theory/estimation theory)? (I'm not asking about the approximation of the root of the polynomial, this is well proved here)

Note: this is a specific case ($U=k-1$) of this, I think it will be easier to start with.

Application

Every day you visit a website which displays (randomly) 1 funny video among $n$ videos. For the first time today (it's your 20th visit), you see a video that you've already seen in the past (i.e. $k=20$, $U=19$). You are a bit disappointed that the stock is not infinite, but you think "Let's estimate how many videos there are on this website". The formula gives an estimated number of

$$n \approx \frac{1}{2}20^2 - \frac{5}{6} 20 + \frac{11}{18} \approx 183.94$$

videos.

Basj
  • 498
  • 1
  • 4
  • 16

0 Answers0