
Let's say we want to calculate the two-sided $p$-value for a linear regression coefficient using bootstrapping. The null hypothesis is $\beta_1 = 0$. For the non-parametric (case-resampling) bootstrap, we would do the following steps:

  1. Calculate the original statistic $T_0$, e.g. the $t$-statistic of the coefficient of interest in the original sample.
  2. Sample the observations with replacement to create a bootstrap sample of the same size as the original sample.
  3. Calculate the regression on this bootstrap sample and store the bootstrap statistic $T^*$, i.e. the $t$-statistic of the coefficient of interest.
  4. Repeat steps 2 and 3 a large number of times, say $R$ (e.g. $R = 1000$). This gives us $R$ bootstrapped statistics $T_r^*$.
  5. Shift the distribution of $T_r^*$ to generate the null distribution.
  6. Calculate the two-sided $p$-value as $p_{\text{boot}} = \dfrac{\#\{|T_r^*| \geq |T_0|\} + 1}{R + 1}$, where $T_r^*$ now denotes the shifted statistics from step 5.

My question concerns step 5: Should we shift the bootstrapped statistics to $T_r^* - \bar{T}_r^*$ so that their distribution is centered around $0$, or to $T_r^* - T_0$ so that it's centered at the bias $\hat{B}^* = \bar{T}_r^* - T_0$?

The first version respects the bootstrap analogy that the sample is our population regarding the bootstrap samples. The second version makes sense because it corresponds exactly to the null value specified in the null hypothesis.

Here is a small R script that illustrates the procedures. The $p$-values calculated by the two versions are virtually identical here (the bias is only $0.12$), but I suspect that this is not always the case, especially when the bias is large.

# Load data
data(swiss)

# Calculate original model
mod <- lm(Infant.Mortality ~ ., data = swiss)

# Store sample size
N <- nobs(mod)

# Store original t-statistic for "Fertility"
stat_orig <- summary(mod)$coefficients["Fertility", "t value"]

# Start the bootstrap
R <- 10000 # Number of bootstrap samples

# Start resampling
set.seed(142857) # Reproducibility
stat_boot <- replicate(R, {
  boot_ind <- sample.int(N, replace = TRUE) # Sample indices with replacement
  tmp_mod <- lm(Infant.Mortality ~ ., data = swiss[boot_ind, ])
  summary(tmp_mod)$coefficients["Fertility", "t value"]
})

# Create null distribution by shifting the bootstrap distribution
stat_boot_1 <- stat_boot - mean(stat_boot) # Mean-centered
stat_boot_2 <- stat_boot - stat_orig       # Centered at the bias

# Calculate p-values
(pval_boot_1 <- (sum(abs(stat_boot_1) >= abs(stat_orig)) + 1)/(R + 1))
[1] 0.03019698
(pval_boot_2 <- (sum(abs(stat_boot_2) >= abs(stat_orig)) + 1)/(R + 1))
[1] 0.03149685

COOLSerdash
  • Is the centering (with or without removing the bias) sufficient to argue that we've generated a bootstrap sample under the null hypothesis? – dipetkov Jan 31 '23 at 20:46
  • @dipetkov I've seen multiple ways to calculate p-values for regression coefficients. Davison & Hinkley use the null model to generate data by resampling the residuals (see the sketch after this comment thread). I've seen the method shown here a few times on this site and also in an R package. It doesn't generate the sample under the null directly. If you have reservations about the method or some pointer to the literature, I'd be glad to hear it. – COOLSerdash Feb 01 '23 at 11:55
  • @dipetkov The approach I use here is used by Efron & Tibshirani in section 16.4, albeit not for regression. Implicitly, we assume that the distribution under the alternative is just a translated version of the null distribution. – COOLSerdash Feb 01 '23 at 12:30
  • "An Introduction to the Bootstrap" has a chapter on bootstrapping regression but the examples are about bootstrapping the estimates and their standard errors. I also looked at "Applied Regression Analysis and Generalized Linear Models" by J. Fox. That book has a section on bootstrapping tests but is explicit that the p-values are not under the null hypothesis. – dipetkov Feb 01 '23 at 12:58
  • J. Fox writes: "We have to be careful to draw a proper analogy here: Because the original-sample estimates play the role of the regression parameters in the bootstrap “population” (i.e., the original sample), the bootstrap analog of the null hypothesis—to be used with each bootstrap sample—is $H_0:\beta_1=\hat{\beta}_1,\ldots,\beta_k=\hat{\beta}_k$". This is in reference to the case bootstrap, ie sampling observations, not residuals. – dipetkov Feb 01 '23 at 12:58
  • Step 5 assumes that the statistic is pivotal--that the shift in position to the null doesn't affect the shape of its distribution. Are you safe in making that assumption? – EdM Feb 02 '23 at 03:22
  • @EdM If I'm not mistaken, the basis of the described procedure is detailed in Davison & Hinkley on pages 280 ff. The details differ, as they resample only the residuals, but they also consider $Z = (\hat{\gamma} - \gamma)/V^{1/2}$ a pivot. Upon reading the section again, they use $z^* = (\hat{\gamma}^* - \hat{\gamma})/v^{*1/2}$, effectively centering the bootstrap distribution on the observed statistic in the full sample. – COOLSerdash Feb 02 '23 at 09:26
  • "effectively centering the bootstrap distribution on the observed statistic in the full sample": So, is this the same as saying that the bootstrap resamples are not under the null hypothesis? – dipetkov Feb 02 '23 at 13:27
  • @dipetkov Sorry, I misspoke. $z^*$ is not centered on the observed test statistic. But the pivot $z^*$ uses $\hat{\gamma}$, the coefficient in the original sample, as the "population" value, I think. But yes, they resample the residuals from the full model, not the null model, in this version. – COOLSerdash Feb 02 '23 at 14:18
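
For context, here is a minimal sketch of the residual-resampling variant mentioned in the comments above (a sketch of the idea, not the exact algorithm from Davison & Hinkley): fit the null model without Fertility, generate new responses by adding resampled null-model residuals to the null-model fit, and refit the full model each time. It reuses swiss, R, and stat_orig from the question's script; mod_null, fit_null, res_null, and stat_null are illustrative names.

# Sketch: generate data under H0 (no Fertility effect) by resampling
# residuals of the null model, then refit the full model each time
mod_null <- lm(Infant.Mortality ~ . - Fertility, data = swiss)
fit_null <- fitted(mod_null)
res_null <- residuals(mod_null)

set.seed(142857)
stat_null <- replicate(R, {
  dat <- swiss
  # New responses under H0: null-model fit plus resampled residuals
  dat$Infant.Mortality <- fit_null + sample(res_null, replace = TRUE)
  summary(lm(Infant.Mortality ~ ., data = dat))$coefficients["Fertility", "t value"]
})

# Two-sided p-value: these resamples are generated under the null directly
(sum(abs(stat_null) >= abs(stat_orig)) + 1) / (R + 1)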

1 Answer


If I understand Hall and Wilson correctly,* then the problem is in your Step 3. You should not be calculating t-statistics with respect to a null hypothesis of $\beta_1=0$ in your bootstrap samples. Instead, for the bootstrap samples you should be calculating t-statistics with respect to a null hypothesis $\beta_1= \hat \beta_1$, where $\hat \beta_1$ is your coefficient estimate from the original model.

They summarize their general recommendations for bootstrap-based tests, with $\hat \theta$ and $\hat\sigma$ being the original sample estimates of location and scale and asterisks representing corresponding estimates from bootstrap samples, in their Second Guideline:

Base the test on the bootstrap distribution of $(\hat \theta^* -\hat \theta)/\hat \sigma^*$ , not on the bootstrap distribution of $(\hat \theta^* -\hat \theta)/\hat \sigma$ or of $(\hat \theta^* -\hat \theta)$.

That's called "bootstrap pivoting" and is what you quote from Davison and Hinkley in a comment. Hall and Wilson explain in their First Guideline that evaluating bootstrap differences from the original sample estimate, $(\hat\theta^*-\hat \theta)$, instead of from the original null hypothesis, $(\hat\theta^*-\theta_0)$, increases the power of the test, potentially by a lot if the original null hypothesis is far from correct.

In your example, you should do the shift $(\hat\beta_1^*-\hat\beta_1)$ before you divide by $\hat\sigma^*$ to get each of your bootstrapped t-statistics. That brings the center of the distribution to 0 while respecting the bootstrap principle, as in the sketch below.
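
In code, a minimal sketch of this pivot on the question's example, reusing mod, N, R, and swiss from the question's script (pivot_boot and pval_pivot are illustrative names): each bootstrap pivot centers the resampled coefficient at the original estimate and studentizes with the bootstrap standard error, and the original t-statistic is then compared against the pivot distribution, as discussed in the comments below.

# Hall & Wilson pivot: (beta* - beta_hat) / se*, compared against the
# original t-statistic for H0: beta_1 = 0
beta_orig <- coef(mod)["Fertility"]
t_orig <- summary(mod)$coefficients["Fertility", "t value"]

set.seed(142857)
pivot_boot <- replicate(R, {
  boot_ind <- sample.int(N, replace = TRUE)
  cf <- summary(lm(Infant.Mortality ~ ., data = swiss[boot_ind, ]))$coefficients
  (cf["Fertility", "Estimate"] - beta_orig) / cf["Fertility", "Std. Error"]
})

# Two-sided p-value, calculated like p_boot in the question
(pval_pivot <- (sum(abs(pivot_boot) >= abs(t_orig)) + 1) / (R + 1))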


*reference suggested by the late Michael Chernick in this answer

EdM
  • Thanks for the answer and for this awesome reference I didn't know. To be sure, $\hat{\sigma}^*$ would be the standard error of the coefficient in the bootstrap samples, right? Finally, you would compare the bootstrapped pivots with the original $t$ statistic in this example, correct? – COOLSerdash Feb 05 '23 at 21:51
  • @COOLSerdash yes. $\hat\sigma^*$ represents the set of coefficient standard error estimates among the bootstrapped samples, just as $\hat\beta_1^*$ represents the corresponding coefficient estimates. I hadn't read Hall and Wilson previously either; I came across it as Michael Chernick's recommendation with respect to a related question. – EdM Feb 05 '23 at 21:57
  • Thanks again. Final question: Hall & Wilson don't explain how one would calculate bootstrap $p$-values. I suspect one would compare the bootstrapped pivots with the original $t$-statistic of the model (calculated like $p_{\text{boot}}$ in my question). Does that sound reasonable to you? – COOLSerdash Feb 05 '23 at 22:05
  • @COOLSerdash yes, you use the distribution of $(\hat \theta^* -\hat \theta)/\hat \sigma^*$ as a reference for comparing against the original value of the t-statistic. See Equation 2.1 and the text that follows for a $p < 0.05$ cutoff estimate that can be extended to further values. I'm not sure how precise these estimates of p-values will be, as with low p-values you will be far out in the notoriously unreliable tails of bootstrap distributions. – EdM Feb 05 '23 at 22:12
  • Got it! Thanks for taking the time. – COOLSerdash Feb 05 '23 at 22:13
  • This answer is very clear and succinct, +1. Hall & Wilson have a follow-up article which describes the first guideline as "misleading": Bootstrap Hypothesis Testing Procedures. (It's not the answer's fault, but all this makes me question the value of bootstrap p-values over and beyond bootstrap CIs. CV threads on the topic also seem to be inconsistent.) – dipetkov Feb 06 '23 at 08:01
  • @dipetkov thanks for the link. I admit that I don't see much point in getting p-values by bootstrapping as CI are typically more informative. Also, anything lower than a p-value of 0.05 is likely to be unreliable in bootstrapping, given the problems with getting reliable results in the far tails of distributions. – EdM Feb 06 '23 at 14:01
  • @dipetkov the claim that the first guideline is "misleading" in the paper you linked is a quote attributed by the author Becher to Tibshirani, not a statement by Hall & Wilson. In their response to the Becher paper, Hall & Wilson stand firmly behind their guidelines. Becher's analysis seems to suffer from switching the tails of distributions, the problem introduced by percentile bootstraps when there is bias and skew. See this page for an (admittedly, not succinct) discussion that illustrates that tail switching. – EdM Feb 06 '23 at 14:51
  • Oh, I see. The second paper is a reaction/commentary to the other paper. And then there is a rejoinder too. I apologize for the confusion; I've been trying to understand this topic for some days, not sure I'm getting there. And thank you for pointing me to the other thread; to me it's more interesting because it's about bootstrap CIs. (And thorough is better than succinct.) – dipetkov Feb 06 '23 at 15:14