I have a limited set of repeated experimental values, roughly 100. The original experiments are expensive, so creating more data points is not an option.

If I use bootstrapping to estimate the mean and standard error, then each bootstrap iteration draws N points with replacement and so covers only about 63% of the distinct data elements, with repeats filling out the N points.

Alternatively, I could sample K elements, where K > N. For example, if I sample 300 elements with replacement from the 100 elements, then the expected coverage of the 100 points is about 95% in each iteration.
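A quick sketch of both coverage figures (numpy assumed; the variable names are illustrative): the expected fraction of the N original points appearing at least once in a resample of size K is 1 - (1 - 1/N)^K, which gives ~0.63 for K = N = 100 and ~0.95 for K = 300.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100  # number of original experimental values

for K in (100, 300):
    # Closed-form expected coverage of the N points by a resample of size K
    analytic = 1 - (1 - 1 / N) ** K
    # Monte Carlo check: fraction of distinct indices drawn in each resample
    sims = [np.unique(rng.integers(0, N, size=K)).size / N for _ in range(10_000)]
    print(f"K={K}: analytic coverage {analytic:.3f}, simulated {np.mean(sims):.3f}")
```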

Is this a 'thing', with any papers written about it? Are there good reasons not to do it? Other solutions to this issue?

I understand it will contain more outliers. My concern with the original data is that 100 values may not reproduce the extreme statistics of the 'true' distribution. The usual bootstrap may also leave gaps in the sampled distribution.

TIA

Chris
  • This looks pointless, because the oversampling overstates the precision available in the data, whence anything you derive related to that -- including (but not limited to) p-values, standard errors, margins of error, confidence limits, prediction limits -- will be invalid [see the numerical sketch after these comments]. The bootstrap is usually computed (really, approximated) by obtaining a bunch of random resamples, but it is based on the distribution of all possible such resamples and thereby doesn't suffer from this "coverage" problem you seem to refer to. – whuber Aug 04 '23 at 21:02
  • Hi Chris. Does this answer your question? https://stats.stackexchange.com/questions/263710/why-should-boostrap-sample-size-equal-the-original-sample-size – J-J-J Aug 04 '23 at 21:03
  • (Continued) Consequently, your question might be best addressed by reading some of our threads about the bootstrap or consulting the literature about it, such as Efron's book. – whuber Aug 04 '23 at 21:03
  • I upvoted J-J-J's referral to that post, as it gave a clear reference to how using the original N leads to the same std error as the CLT value. The key thing is that that post and others hinge on "assuming the data is a good approximation of the theoretical distribution." – Chris Aug 06 '23 at 22:50
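To make whuber's point concrete, here is a minimal sketch (numpy assumed; the data are simulated stand-ins, since the original 100 values aren't available): bootstrap means of resamples of size K have standard deviation roughly s/sqrt(K), so using K = 3N shrinks the reported standard error by about sqrt(N/K) ≈ 0.58, overstating the precision the 100 observations actually support.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, B = 100, 300, 10_000
# Hypothetical stand-in for the 100 expensive experimental values
data = rng.normal(loc=10.0, scale=2.0, size=N)

def boot_se(sample_size: int) -> float:
    """Std. dev. of bootstrap means using resamples of the given size."""
    means = [rng.choice(data, size=sample_size, replace=True).mean() for _ in range(B)]
    return float(np.std(means, ddof=1))

se_n = boot_se(N)  # conventional bootstrap SE, ~ s / sqrt(N)
se_k = boot_se(K)  # oversampled SE, ~ s / sqrt(K): smaller, i.e. too optimistic
print(f"SE with K=N:  {se_n:.4f}")
print(f"SE with K=3N: {se_k:.4f} (ratio {se_k / se_n:.2f}, vs sqrt(N/K)={np.sqrt(N / K):.2f})")
```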

0 Answers