partial set coverage sampling rule

Question

Is there a general relationship that gives the total number of drawn samples (d) on average required to have at least percent (p) of distinct samples in a set size N when sampling randomly with replacement?

If I have 200 distinct elements (N) in the bag, and I sample with replacement, I can estimate that around 10^2.5 drawn samples (d) yields about 80% (p) of those elements. If I have 2000 samples then about 10^3.5 yields 80%.

Here is code that allows the eyeball estimate of 10^3.5 from 2000 and 80%:

#total samples
N <- 2000
#how many to draw
samp_list <- 10^seq(from=2, to=4.0, length.out=1000) %>% round()
res <- 0*samp_list
for(i in 1:length(samp_list)){
#draw
  y <- sample(1:N, samp_list[i], replace=T)
#what percent of unique is drawn
  res[i] <- (length(unique(y))/N)
}
#plot it
df <- data.frame(i=samp_list %>% log10(), y=res)
ggplot(df, aes(x=i, y=y)) +
  geom_point() + geom_smooth(col="Red") +
  geom_hline(yintercept = 0.8)

Is there an algebraic expression that given n and p yields d?

References and derivation appreciated. Also, if it is known by a different name, please let me know the proper terminology.

The meanings of your terms "samples" and "values" are ambiguous. Would a "sample" be a single object and a "value" be an outcome? Is it possible for distinct objects in the population to have the same "value"? — whuber, Oct 24 '22 at 16:19
Would it be correct to view this as closely related to the distribution in the Coupon Collector's problem? — whuber, Oct 26 '22 at 18:47
@whuber - yes! Didn't know about coupon collector. Partial fulfillment means there are a multiplicity of ways to not get everything. If there are 100 coupons, while there is one way to get 100% of them, there are 100 ways to get 99% of coupons. There are (I think) 4950 ways to get 98%. If you are looking for 80% then there are 5e20 ways to get it. — EngrStudent, Oct 26 '22 at 20:02
I believe Ben answers your question -- or at least sketches a way to answer it, with references -- at https://stats.stackexchange.com/questions/257924. — whuber, Oct 26 '22 at 20:12

partial set coverage sampling rule

0 Answers0