1

Is there a general relationship that gives the total number of drawn samples (d) on average required to have at least percent (p) of distinct samples in a set size N when sampling randomly with replacement?

If I have 200 distinct elements (N) in the bag, and I sample with replacement, I can estimate that around 10^2.5 drawn samples (d) yields about 80% (p) of those elements. If I have 2000 samples then about 10^3.5 yields 80%.

Here is code that allows the eyeball estimate of 10^3.5 from 2000 and 80%:

#total samples
N <- 2000

#how many to draw samp_list <- 10^seq(from=2, to=4.0, length.out=1000) %>% round()

res <- 0*samp_list

for(i in 1:length(samp_list)){

#draw y <- sample(1:N, samp_list[i], replace=T)

#what percent of unique is drawn res[i] <- (length(unique(y))/N)

}

#plot it df <- data.frame(i=samp_list %>% log10(), y=res)

ggplot(df, aes(x=i, y=y)) + geom_point() + geom_smooth(col="Red") + geom_hline(yintercept = 0.8)

Is there an algebraic expression that given n and p yields d?

References and derivation appreciated. Also, if it is known by a different name, please let me know the proper terminology.

EngrStudent
  • 9,375
  • The meanings of your terms "samples" and "values" are ambiguous. Would a "sample" be a single object and a "value" be an outcome? Is it possible for distinct objects in the population to have the same "value"? – whuber Oct 24 '22 at 16:19
  • @whuber - is this less bad? – EngrStudent Oct 26 '22 at 17:18
  • 1
    Would it be correct to view this as closely related to the distribution in the Coupon Collector's problem? – whuber Oct 26 '22 at 18:47
  • @whuber - yes! Didn't know about coupon collector. Partial fulfillment means there are a multiplicity of ways to not get everything. If there are 100 coupons, while there is one way to get 100% of them, there are 100 ways to get 99% of coupons. There are (I think) 4950 ways to get 98%. If you are looking for 80% then there are 5e20 ways to get it. – EngrStudent Oct 26 '22 at 20:02
  • 1
    I believe Ben answers your question -- or at least sketches a way to answer it, with references -- at https://stats.stackexchange.com/questions/257924. – whuber Oct 26 '22 at 20:12

0 Answers0