This is a great question because it is a variation of a problem/issue my students often pose when learning about confidence and prediction interval estimation with multiple regression (MR). The addition of the bootstrapping element for the parameters to the simulation protocol is indeed appropriate (and I'm surprised I haven't thought about it for class/lecture demonstration purposes).
My hope is to provide a useful explanation for why we should indeed bootstrap the parameters in order to simulate the response distribution. I will also provide a simpler protocol (one that draws on the prediction interval from an MR analysis). And lastly, I'll briefly indicate a scenario where the bootstrapping protocol detailed here would be the more applicable choice.
First, in the OP, you find the statement: “The true response $Y^+ = \beta x_+ + \epsilon$ does not vary for different realisations of our sample data, contrary to the predictor $\hat{Y}^+$, and so the variation in the parameter estimate should not be relevant to its simulated distribution.” This statement is not fully correct...but part of it is...and that part is the key to understanding the need for the simulation.
Let me use the convention that lower-case Greek letters represent the parameter (true) values of the population, and upper-case Latin letters represent sample estimates for these parameters. So, while it is true for all values of the population that
$$Y =\beta x + \epsilon$$
it is key to remember that when we obtain our parameter estimates from a sample, we do not have $\beta$, but rather $B$. And if these estimates are off by even a little bit, the error estimates will be off as well. So, what we have is
$$Y = B x + E$$
Think of this as a basic bivariate regression: if your slope estimate is off from the population slope by just a little bit, you can still find residuals that make your predicted values match your observed values of $Y$, but then either the predicted value or the error estimate will be too high (and the other too low). So, if you run a simulation using only the single estimated slope, you run the risk of making all of your simulated predictions a little too high or too low.
Thus, in a bootstrapping approach, we indeed would want to simulate a variety of the possible $B$ estimates we might obtain in order to make our prediction distribution.
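To make this concrete, here is a minimal sketch of that protocol in Python. Everything in it (the simulated data, the sample size, the error standard deviation, and the point $x_+$) is an assumption for illustration only; the idea is simply to resample cases, re-estimate $B$ each time, and add a resampled residual.

```python
# A minimal sketch of the "bootstrap the parameters" protocol on
# simulated data (all values here are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(42)

# Simulate a sample from Y = beta0 + beta1 * x + epsilon
n = 100
beta = np.array([1.0, 2.0])           # true (population) parameters
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept
y = X @ beta + rng.normal(0, 1.5, size=n)

x_plus = np.array([1.0, 5.0])         # point of interest (with intercept term)

n_boot = 5000
y_plus_sim = np.empty(n_boot)
for b in range(n_boot):
    # Resample cases with replacement and re-estimate B
    idx = rng.integers(0, n, size=n)
    Xb, yb = X[idx], y[idx]
    B, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
    # Resample one residual E from this bootstrap fit's residuals
    resid = yb - Xb @ B
    y_plus_sim[b] = x_plus @ B + rng.choice(resid)

# Percentile interval for a future response at x_plus
print(np.percentile(y_plus_sim, [2.5, 97.5]))
```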
And the nice part is that the normal theory behind ordinary least squares (OLS) MR has already answered this question of future prediction without needing to rely on bootstrapping the parameter estimates. Without elaborating the proof of these intervals here, I will give the confidence interval for the conditional mean and the prediction interval.
The confidence interval for the conditional mean is a range of values in which we would reasonably expect to find the mean value of $Y$ for some given value $x$ (that is to say, $\mu_Y$ is conditioned on knowing this value of $x$). This interval is
$$\mu_{Y|x_+} = \hat{Y}_+ \pm t_\text{c.v.} \cdot \hat{\sigma}_\epsilon \sqrt{x_+ (X^T X)^{-1} x_+^T}$$
This interval adds variability to our predicted (estimated) value to account for the fact that our parameter estimates for the regression $B$ may not have perfectly matched up with the population parameters $\beta$ in our MR model.
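As a quick illustration of this formula, here is how the interval can be computed directly (again on simulated data; all values here are assumptions for demonstration purposes):

```python
# A short demonstration of the conditional-mean interval on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])          # design matrix
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.5, size=n)

B, *_ = np.linalg.lstsq(X, y, rcond=None)     # parameter estimates
resid = y - X @ B
p = X.shape[1]
sigma_hat = np.sqrt(resid @ resid / (n - p))  # hat{sigma}_epsilon
XtX_inv = np.linalg.inv(X.T @ X)

x_plus = np.array([1.0, 5.0])                 # point of interest
y_hat_plus = x_plus @ B
t_cv = stats.t.ppf(0.975, df=n - p)           # two-sided 95% critical value

half_width = t_cv * sigma_hat * np.sqrt(x_plus @ XtX_inv @ x_plus)
print(y_hat_plus - half_width, y_hat_plus + half_width)
```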
Next, we can adjust the confidence interval formula slightly to get the prediction interval for any future value of $Y$ at this value of $x_+$:
$$Y|x_+ = \hat{Y}_+ \pm t_\text{c.v.} \cdot \hat{\sigma}_\epsilon \sqrt{1 + x_+ (X^T X)^{-1} x_+^T}$$
This interval accounts both for the variability in the parameter estimates and for the error variability, both of which have been estimated from the MR analysis.
And this gives us the simpler simulation protocol. All we need to do is simulate new responses at the point of interest $x_+$ by drawing $Y_+^{(b)}$ from the normal distribution with mean $\hat{Y}_+$ and standard deviation $\hat{\sigma}_\epsilon\sqrt{1+ x_+ (X^T X)^{-1} x_+^T}$.
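In code, this simpler protocol might look like the following sketch (same kind of illustrative simulated setup as before):

```python
# A sketch of the simpler protocol: draw simulated future responses
# directly from the normal distribution implied by the prediction interval.
import numpy as np

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1.5, size=n)

B, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ B
p = X.shape[1]
sigma_hat = np.sqrt(resid @ resid / (n - p))
XtX_inv = np.linalg.inv(X.T @ X)

x_plus = np.array([1.0, 5.0])
y_hat_plus = x_plus @ B

# Standard deviation taken from the prediction-interval formula above
sd_pred = sigma_hat * np.sqrt(1 + x_plus @ XtX_inv @ x_plus)

# Simulate future responses at x_plus in one vectorized draw
y_plus_sim = rng.normal(y_hat_plus, sd_pred, size=5000)
print(np.percentile(y_plus_sim, [2.5, 97.5]))
```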
However, as noted above, this is a fairly well-understood problem (with a well-understood solution), so I am unsure why the full bootstrapping protocol would be necessary unless some assumption of the MR model has been violated. And this brings me to my final point: when would this bootstrapping protocol be appropriate?
I would argue that this protocol would be appropriate if you are using some other estimation method to obtain the regression model, such as a robust estimator based on the median rather than the mean. Because the sampling distributions associated with such estimators are more complicated, a bootstrapping process that resamples the data and re-estimates the parameters prior to making each prediction would give you a better understanding of the distribution of future predicted values.
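As one hedged sketch of what that might look like, here is a case-resampling bootstrap around the Theil–Sen estimator (a median-of-slopes method available as scipy.stats.theilslopes). The estimator choice and the heavy-tailed simulated data are my own illustrative assumptions, not details from the original question:

```python
# Bootstrapping future predictions from a robust (median-based) fit,
# where the normal-theory intervals above do not directly apply.
import numpy as np
from scipy.stats import theilslopes

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)  # heavy-tailed errors

x_plus = 5.0
n_boot = 2000
y_plus_sim = np.empty(n_boot)
for b in range(n_boot):
    # Resample cases, refit the robust estimator, and simulate a response
    idx = rng.integers(0, n, size=n)
    slope, intercept, _, _ = theilslopes(y[idx], x[idx])
    resid = y[idx] - (intercept + slope * x[idx])
    y_plus_sim[b] = intercept + slope * x_plus + rng.choice(resid)

print(np.percentile(y_plus_sim, [2.5, 97.5]))
```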
I hope this answer proves useful, and I’m happy to elaborate further as needed.