
If I have a regression and I am bootstrapping the coefficients, if I end up with many calculated coefficients how do I calculate standard errors?

Is it the std of the bootstrapped samples? Or the std divided by the square root of the bootstrap sample size? Or do I compute N samples of size B, take the std of each sample, and average those stds? (Here std = standard deviation.)

I have found references suggesting we can simply take the std of the bootstrapped samples, but this assumes the bootstrapped samples are normally distributed. If the bootstrap samples are not normal, what can we do to get estimates of the standard errors?
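For concreteness, the three candidate computations can be written out on a toy vector of bootstrapped coefficients (a numpy sketch; the data and the "split into N groups" reading of the third option are illustrative, not part of the original question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for B bootstrapped coefficient estimates (illustrative only)
B = 1000
boot_coefs = rng.normal(loc=2.0, scale=0.3, size=B)

# Option 1: std of the bootstrapped coefficients
opt1 = boot_coefs.std(ddof=1)

# Option 2: that std divided by sqrt(B)
opt2 = opt1 / np.sqrt(B)

# Option 3 (one interpretation): split into N groups, take the std
# within each group, and average those stds
N = 10
groups = boot_coefs.reshape(N, -1)
opt3 = groups.std(axis=1, ddof=1).mean()

print(opt1, opt2, opt3)
```

Options 1 and 3 estimate roughly the same quantity here, while option 2 shrinks toward zero as B grows, which is what the answers below turn on.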

MHall
    "I have found references suggesting we can simply take the std of the bootstrapped samples but this is for normally distributed data."

    Why do you think this is only for normally distributed data?

    – Matthew Drury Jul 09 '20 at 17:25
  • I would imagine that the distribution of your bootstrapped statistics - regression coefficients, in your case - will always converge to normal, the more bootstrap iterations (samples) you have, somehow a manifestation of the central limit theorem. Am I confused? – ouranos Jul 09 '20 at 17:26
  • Oh sorry Matthew Drury, I should update the question: I mean only for normally distributed bootstrapped samples, not for normally distributed sample data. – MHall Jul 09 '20 at 19:07
  • Ouranos, the bootstrap I'm doing gives really large outliers, like -5000 or +3000, when the estimated coefficients are all under 2 in absolute value. It is an experimental model (causal/noncausal VAR) and I'm trying to get the bootstrap to make sense. With 1000 boot samples the distribution is not close to normal. I may have to try larger sample sizes. – MHall Jul 09 '20 at 19:25

2 Answers


You get the bootstrap standard errors for the coefficients by taking the standard deviation of the bootstrapped coefficients across replications:

. sysuse auto
(1978 Automobile Data)

. bs, reps(101) saving("bs_reg.dta", replace): reg price foreign mpg weight
(running regress on estimation sample)

Bootstrap replications (101)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
.

Linear regression                               Number of obs     =         74
                                                Replications      =        101
                                                Wald chi2(3)      =      58.55
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.4996
                                                Adj R-squared     =     0.4781
                                                Root MSE          =  2130.7695


------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |    3673.06   622.7476     5.90   0.000     2452.498    4893.623
         mpg |    21.8536   81.81615     0.27   0.789    -138.5031    182.2103
      weight |   3.464706   .7507974     4.61   0.000      1.99317    4.936242
       _cons |  -5853.696   3816.206    -1.53   0.125    -13333.32    1625.931
------------------------------------------------------------------------------


. use "bs_reg.dta", clear
(bootstrap: regress)

. summarize

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
  _b_foreign |        101    3599.271    622.7476   2129.934   4902.719
      _b_mpg |        101    36.65845    81.81615  -175.5054   288.2288
   _b_weight |        101    3.512753    .7507974   1.817149   5.068236
     _b_cons |        101   -6261.379    3816.206  -16303.59   2266.739
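For readers outside Stata, the same computation can be sketched in Python with numpy (simulated data standing in for the auto dataset; the key point is that the bootstrap standard error is the standard deviation of the replicated coefficients, not that quantity divided by the square root of the number of replications):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data (hypothetical, in place of a real dataset)
n = 74
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

def ols(X, y):
    # Least-squares coefficient estimates
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Bootstrap: resample observations (rows) with replacement, refit each time
reps = 1000
boot_coefs = np.empty((reps, X.shape[1]))
for r in range(reps):
    idx = rng.integers(0, n, size=n)
    boot_coefs[r] = ols(X[idx], y[idx])

# Bootstrap SE = std. dev. across the replicated coefficients
# (reps only controls Monte Carlo accuracy of this estimate)
se = boot_coefs.std(axis=0, ddof=1)
print(se)
```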

Of course, this relies on independence of observations (or of clusters of observations) to work. In your time-series context, the basic bootstrap will not work, even asymptotically, unless you adapt the procedure. Here's a nice answer that gets into the various ways to adapt the bootstrap to the time-series setting.

dimitriy

In many contexts the distribution of coefficient estimates is normal. In that case taking the standard deviation among the coefficient estimates makes sense.

It seems, however, that your particular application is not producing such a distribution of coefficient estimates. You can still calculate an empirical standard deviation among the coefficient estimates, but it won't necessarily have the usual interpretation in terms of coverage: you can't necessarily treat +/- 1 SD among the bootstrapped estimates as a 68.3% confidence interval about the mean. This happens when the statistic being bootstrapped isn't pivotal.

This answer is one of many on this site that discuss this problem and ways to deal with it; this search provides links to more. Briefly, there are methods that handle bias and skew in values calculated from bootstrap samples, although some problems remain intractable.
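As a concrete illustration of why +/- 1 SD can mislead for a skewed bootstrap distribution, here is a sketch comparing it to a simple percentile interval (simulated lognormal replicates standing in for non-normal bootstrapped coefficients; this is not the questioner's actual data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed stand-in for bootstrapped coefficient estimates
boot_coefs = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# Naive mean +/- 1 SD band: symmetric, and here dips below zero
# even though every replicate is positive
m, s = boot_coefs.mean(), boot_coefs.std(ddof=1)
sd_lo, sd_hi = m - s, m + s

# Percentile interval: follows the asymmetry of the bootstrap distribution
pct_lo, pct_hi = np.percentile(boot_coefs, [2.5, 97.5])

print(sd_lo, sd_hi)
print(pct_lo, pct_hi)
```

The percentile approach is only the simplest of the corrections; bias-corrected (BCa) intervals go further, as the linked answers explain.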

EdM