Is there a general relationship that gives the total number of drawn samples (d) required, on average, to observe at least a given fraction (p) of the distinct elements in a set of size N when sampling randomly with replacement?
If I have 200 distinct elements (N) in the bag and I sample with replacement, I estimate that around 10^2.5 draws (d) yield about 80% (p) of those elements. If I have 2000 elements, then about 10^3.5 draws yield 80%.
Here is the code behind the eyeball estimate of 10^3.5 for N = 2000 and p = 80%:
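library(ggplot2)
library(magrittr)  # provides %>%

# number of distinct elements in the bag
N <- 2000
# how many samples to draw, log-spaced from 10^2 to 10^4
samp_list <- 10^seq(from = 2, to = 4.0, length.out = 1000) %>% round()
res <- numeric(length(samp_list))
for (i in seq_along(samp_list)) {
  # draw with replacement
  y <- sample(1:N, samp_list[i], replace = TRUE)
  # fraction of the N elements that appear at least once
  res[i] <- length(unique(y)) / N
}
# plot the fraction seen against log10(number of draws)
df <- data.frame(i = samp_list %>% log10(), y = res)
ggplot(df, aes(x = i, y = y)) +
  geom_point() + geom_smooth(col = "Red") +
  geom_hline(yintercept = 0.8)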
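To take the eyeballing out of it, the same experiment can be averaged over repeated runs and the threshold crossing read off numerically. This is only a rough sketch: the helper names `coverage` and `draws_needed`, the `reps` argument, and the coarser 200-point grid are my own choices for illustration, and the result is only as fine as the grid.

```r
# Sketch: numerically estimate the smallest d whose average coverage reaches p.
# `coverage`, `draws_needed`, and `reps` are illustrative names, not standard functions.
set.seed(1)

coverage <- function(N, d) {
  # fraction of the N elements seen at least once in d draws with replacement
  length(unique(sample.int(N, d, replace = TRUE))) / N
}

draws_needed <- function(N, p, reps = 50) {
  # same log-spaced range as above, on a coarser grid to keep the run time short
  d_grid <- round(10^seq(from = 2, to = 4, length.out = 200))
  # average the coverage over `reps` independent runs at each grid point
  mean_cov <- sapply(d_grid, function(d) mean(replicate(reps, coverage(N, d))))
  # first grid value whose average coverage reaches p (NA if none does)
  d_grid[which(mean_cov >= p)[1]]
}

d_hat <- draws_needed(N = 2000, p = 0.8)
log10(d_hat)
```

For N = 2000 and p = 0.8 this should land close to the 10^3.5 read off the plot, but it is still a simulation rather than the formula I am after.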
Is there an algebraic expression that, given N and p, yields d?
References and derivation appreciated. Also, if it is known by a different name, please let me know the proper terminology.