Which statistical test is appropriate to compare two parametric bootstrap samples drawn from model fits of two time series?

Question

In previous threads , parametric bootstrapping was suggested as a method to test for differences between time series at specific time points. I have followed the methods as described in this answer (see below for more specifics), however I am not sure which test to run after obtaining the parametric bootstrap samples drawn from the fitted models. I tried a two-sample t-test, however the results seems questionable (see below).

Specifics

To test whether there are significant differences at specific times between two time series of hourly air quality data, I have fitted two Gaussian Process models to the two time series(using the GauPro() function from the GauPro-package in R). Subsequently, I have used the sample() function from the same package to draw 30 000 datapoints from the probability distributions of both fitted models at a time point of interest. This is what the histogram of the drawn datapoints look like (text continued below):

To test if there is a difference in the means of the distributions, I tried running a t-test on the bootstrapped data points, however the highly significant p-value seems a bit unreasonable, given the histogram above. Have I used an appropriate test for this comparison?

Test output:

    Welch Two Sample t-test
data:  GP1_nox_boot[, 2] and GP2_nox_boot[, 2]
t = -44.422, df = 59916, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.340702 -1.227391
sample estimates:
mean of x mean of y 
 70.24509  71.52913

Edit: Answers that I have encountered mostly deal with the nonparametric version of this bootstrap test (i.e., comparing two observed dataset by resampling, instead of comparing two simulated datasets)

The reason your test is highly significant is that you have 60,000 samples, more if you arbitrarily increase your bootstrap. You should use the standard deviation of the samples (which doesn't scale with the number of bootstrap samples, it only becomes more precise) to conduct your test. — PBulls, Nov 29 '23 at 16:41
Thanks for your comment Pbulls! Could you clarify what you mean? As I understand it, the mean also doesn't scale with the sample size, it is just the test statistic that does, and both the mean and standard deviation are used to get the test statistic.
Ignoring the arbitrarily large sample size, would you say a t-test is appropriate here? — randomEcologist, Nov 30 '23 at 08:40
A t- or even Z-test is most likely appropriate (you already make parametric assumptions in the bootstrap itself). The problem is that the test you're using right now is based on $\sigma / \sqrt{n}$ with $n$ your number of bootstrap samples, meaning you can make the variance arbitrarily small. You should use $\sigma$ as the estimate of the standard error of your statistic instead. I'll write up an answer on this in a bit. — PBulls, Nov 30 '23 at 08:49

score 0 · Accepted Answer · answered Nov 30 '23 at 10:26

You can certainly use a parametric test to compare these two distributions, but the key point is that you cannot condition on the number of bootstrap samples in this test - by arbitrarily varying the number of resamples you could get whatever statistical conclusion you'd like otherwise.

Recall that the Z-test for comparing two sample means boils down to $\frac{\hat\mu_1-\hat\mu_2}{SE}$, i.e. the difference between the estimated means over some estimate of variability, which asymptotically follows a standard normal distribution under the null. The primary goal of your bootstrap is to estimate the standard error $SE$, which is based on the sample's variance $\sigma^2$.

If you throw all of your bootstrap means into R's t.test it will roughly calculate the above denominator as $\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}$. The actual calculation is more complex since it uses a t distribution and Welch correction by default, but the point still holds: $n_1$ and $n_2$, the number of resamples in each bootstrap, appear in the denominator of this expression. This is a problem because the more samples you draw the smaller t.test thinks your test denominator should be. We're exactly trying to estimate that denominator from the bootstrap however, so this is wrong.

Instead of the default $\sigma/\sqrt n$ you should use the $\sigma$ across means of your bootstrap resamples in the denominator: that $\sigma$ is in fact the standard error of your bootstrap means. Drawing more bootstrap samples will still stabilize this estimate, but it will not infinitely decrease as you increase the number of bootstrap samples.

From back-transforming your t-test it seems like the pooled standard deviation of your bootstrap samples is roughly $7.075$. You can estimate this yourself more accurately by calculating $\sqrt{\sigma^2_1+\sigma^2_2}$ as above - technically you should account for the balance between bootstrap sample sizes here, I'm assuming they are equal. Your Z-statistic becomes $\frac{\hat\mu_1-\hat\mu_2}{\sigma}=\frac{-1.284}{7.075}$ which results in $P\approx 0.86$. If you have an estimate available on the degrees of freedom (for example from the sample size of original data) you can turn this into a t-test, but again it's important not to use the number of bootstrap samples.

Which statistical test is appropriate to compare two parametric bootstrap samples drawn from model fits of two time series?

1 Answers1