
Question

How do I perform a bootstrapped difference test when my bootstrapped sample sizes are smaller than the original sample sizes?

Typically I would do this by taking $p = \frac{1}{B}\sum_{i=1}^{B} \mathbf{1}\{t_{i}^{*} \geq t\}$, where

  • $t_{i}^{*}=\frac{\bar{x}^{*}-\bar{y}^{*}}{\sqrt{\sigma^{*2}_x/m + \sigma^{*2}_y/m}}$ for a pair of bootstrapped samples $x^{*}$, $y^{*}$, each of size $m$
  • $t=\frac{\bar{x}-\bar{y}}{\sqrt{\sigma^{2}_x/m + \sigma^{2}_y/m}}$

In this case, however, the obvious substitute for $t$ is

  • $t=\frac{\bar{x}-\bar{y}}{\sqrt{\sigma^{2}_x/n_1 + \sigma^{2}_y/n_2}}$, where $m \neq n_1 \neq n_2$

I am unsure if this is valid. Any help on this would be much appreciated!
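For concreteness, the equal-size recipe above can be sketched in Python. The data here are hypothetical placeholders; note also that standard presentations of the bootstrap t test (e.g. Efron & Tibshirani) recenter both samples so the null hypothesis holds before resampling, and that recentering step is included here as an assumption on top of the formulas as written:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for the two samples
x = rng.normal(0.0, 1.0, size=300)
y = rng.normal(0.2, 1.0, size=300)


def t_stat(a, b):
    """Welch-style t statistic for the difference in means."""
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    )


# Observed statistic
t_obs = t_stat(x, y)

# Recenter each sample at the pooled mean so that H0 (equal means)
# holds in the bootstrap world -- the usual extra step for a bootstrap test
pooled = np.concatenate([x, y]).mean()
x0 = x - x.mean() + pooled
y0 = y - y.mean() + pooled

B = 2000
t_star = np.empty(B)
for i in range(B):
    xs = rng.choice(x0, size=len(x), replace=True)
    ys = rng.choice(y0, size=len(y), replace=True)
    t_star[i] = t_stat(xs, ys)

# One-sided p-value: fraction of bootstrap statistics at least as extreme
p = np.mean(t_star >= t_obs)
```

This is a sketch of the equal-size case only; the question is precisely whether the denominator of $t$ can use $n_1, n_2$ when the resamples use a smaller $m$.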

Background

I'm using bootstrapping to estimate if the respective population parameters for two populations $P_1$ and $P_2$ are significantly different.

For this purpose, I have samples from each population; namely, a sample $x_1$ of size 900,000, and a sample $x_2$ of size 600,000. From these, I have done the following:

  • I've computed test statistics $\bar{x}_1$ and $\bar{x}_2$¹, as well as their difference
  • I've drawn 2000 bootstrap samples $x^{*}_{1,i}$ and 2000 bootstrap samples $x^{*}_{2,i}$, each of 500,000 items
  • I've computed 2000 test statistics $\overline{x^{*}_{1,i}}$ and $\overline{x^{*}_{2,i}}$, as well as the differences between each pair

¹These are not the means, but it is convenient to represent them this way. Anyone is free to change this if it is annoying.

  • Why are you only drawing 500K samples per group, rather than the true sample sizes of 900K and 600K? – Eoin Sep 21 '22 at 11:08
  • @Eoin Memory reasons. Computing the statistic for a given 500k-item sample runs a /lot/ faster on my machine. Beyond that risks going over 28GB of RAM. – David McKnight Sep 21 '22 at 11:11
  • It's worth noting that it takes about a couple months' worth of computation time to acquire these as-is – David McKnight Sep 21 '22 at 11:19
  • Wow, ok. Out of curiosity, can you say what the statistic is? I suspect the approach you've proposed overestimates the uncertainty involved, but how to deal with that is beyond me. – Eoin Sep 21 '22 at 11:37
  • It's the optimal value for one of the weights in a neural network when that network is acting on the sample in question. – David McKnight Sep 21 '22 at 11:38
  • For full disclosure, given that each of these samples are three-hot 49-dimensional vectors with an associated 1-dimensional vector, I'm not sure I really /have/ valid measures of variance for individual samples beyond the bootstrap distribution (I could of course take the variance of these, but I'm unsure if that would be meaningful), but even if I can't use this test I'm pretty curious at this point and imagine it might be of use to someone else – David McKnight Sep 21 '22 at 11:41

1 Answer


Bootstrap is used for assessing the uncertainty of a statistic. We know that this uncertainty shrinks as the sample size grows (the standard error of a mean decreases roughly as $1/\sqrt{n}$), so if you use bootstrap samples that are smaller than the data, you are overestimating the uncertainty. With a smaller bootstrap sample size you can be almost certain that the result will be off; how far off depends on the particular scenario.
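As a numerical illustration (with synthetic data, not the poster's), the spread of bootstrap means taken at a reduced resample size $m$ overstates the full-sample standard error by roughly $\sqrt{n/m}$, and rescaling by $\sqrt{m/n}$ recovers it:

```python
import numpy as np

rng = np.random.default_rng(1)

n, m, B = 20_000, 5_000, 2000  # original size, reduced bootstrap size, replicates
data = rng.normal(0.0, 1.0, size=n)

# Spread of bootstrap means when each resample has only m items
means_m = np.array(
    [rng.choice(data, size=m, replace=True).mean() for _ in range(B)]
)
se_m = means_m.std(ddof=1)

# Analytic standard error of the mean at the full sample size n
se_n_theory = data.std(ddof=1) / np.sqrt(n)

# se_m is inflated by about sqrt(n/m); rescaling recovers the full-sample SE
se_rescaled = se_m * np.sqrt(m / n)
```

This rescaling is the idea behind the m-out-of-n bootstrap, though whether it applies cleanly to a statistic as unusual as a fitted neural-network weight is another question.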

Tim