0

I have a list $L=\{x_1,x_2,\cdots,x_N \}$ of $N$ random integer numbers $x_i\in[a,b]$, $b\gt a$, $a,b\in\mathbb{N}$, $N\gg b$ and they follow a discrete uniform distribution.

I need to scan the list from its beginning and "collect" each number from $a$ to $b$ and I am interested in how far in the list I had to go (reaching the index $F$, $b\leq F\leq N$) in order to collect the full set. I am not able to formulate the problem in a more formal way so please accept some examples.

  1. $a=1$, $b=3$, $L=\{1,2,3\}=\{x_1,x_2,x_F\}$ $\Rightarrow F=3$
  2. $a=1$, $b=3$, $L=\{3,2,1\}=\{x_1,x_2,x_F\}$ $\Rightarrow F=3$
  3. $a=1$, $b=3$, $L=\{1,2,2,3,3,3\}=\{x_1,x_2,x_3,x_F,x_5,x_6\}$ $\Rightarrow F=4$
  4. $a=1$, $b=3$, $L=\{1,1,1,1,2,3\}=\{x_1,x_2,x_3,x_4,x_5,x_F\}$ $\Rightarrow F=6$

I did some simulations in Python:

import random
import matplotlib.pyplot as plt
import numpy as np

a is fixed at 1

b = 3 M = 10000

def sim(): s = set() cnt=0 while(len(s)!=b): s.add(random.randint(1,b)) cnt=cnt+1 return cnt

F = [] for i in range(1,M+1): F.append(sim())

fig, ax = plt.subplots() ax.hist(F,density=True,alpha=.5, color='grey',edgecolor='white',bins=np.arange(0, max(F) + 1.5) - 0.5,log=True) ax.set_title("$a=1$, $b=$"+str(b)+", "+str(M)+" simulations") ax.set_xlabel(str(min(F))+'$\leq F \leq$'+str(max(F))) ax.set_ylabel('density')

and I get some fancy plot for the distribution of $F$:

enter image description here

enter image description here

Is it possible to theoretically derive the distribution of $F$?

What is the expected value for $F$?

What is the largest value for $F$? The worst case scenario for a fixed size list is that I need to reach the last value of $L$ at the end of the list in order to collect all the numbers... but what happen if $N\rightarrow \infty$?

What are the keywords to search for in order to find papers for this kind of problem? Or, what is a more suitable title for this question?

  • 1
    This is the coupon collector problem. We have an extensive set of threads on many variations of it, too. One detail is unclear based on your "largest value" question: what is the support of this uniform distribution? Your initial description makes it sound like all the numbers are randomly and independently drawn from $[a,b],$ which means you might not see the entire interval $[a,b]$ among your sample. – whuber Sep 01 '22 at 14:03
  • 1
    @whuber Thank you very much for the clarification! – Alessandro Jacopson Sep 01 '22 at 14:15

0 Answers0