
I do backward elimination: I iteratively remove the variable with the largest p-value and refit, until the largest remaining p-value is < 0.157.
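
In code, the loop looks roughly like this (a minimal sketch, assuming an OLS model fitted with statsmodels; X is a pandas DataFrame of candidate predictors and y the response, both hypothetical names):

    import statsmodels.api as sm

    def backward_eliminate(X, y, threshold=0.157):
        """Drop the predictor with the largest p-value, refit, and repeat
        until all remaining p-values are below `threshold`."""
        kept = list(X.columns)
        while kept:
            fit = sm.OLS(y, sm.add_constant(X[kept])).fit()
            pvals = fit.pvalues.drop("const")   # p-values of the predictors only
            worst = pvals.idxmax()              # variable with the largest p-value
            if pvals[worst] < threshold:        # stopping rule
                return fit, kept
            kept.remove(worst)                  # eliminate it and refit
        return None, []                         # everything was eliminated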

Then I have a model whose displayed confidence intervals are too narrow: "The method yields confidence intervals for effects and predicted values that are falsely narrow; see Altman and Andersen (1989)."

To determine appropriate confidence intervals, I repeat the experiment by bootstrapping*. Is it correct to compute the confidence intervals this way (sketched in code after the list):

  • I store the values of a given parameter when it's selected** (let's say it's selected 70,041 times out of 100,000 bootstrap iterations);
• I sort that list and take, as the lower bound, the smallest parameter value such that at least 5% of the stored coefficients lie below it; likewise the 95th percentile for the upper bound;
• I display this as my confidence interval around the value I had without bootstrapping*** (i.e. from backward elimination on the original, non-bootstrapped dataset).
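
In the same hypothetical setup as above, the procedure I describe would look roughly like this (note that a coefficient is stored only when its variable survives selection, which footnote ** asks about):

    import numpy as np

    def bootstrap_selected_ci(X, y, n_boot=100_000, alpha=0.10, seed=0):
        rng = np.random.default_rng(seed)
        draws = {col: [] for col in X.columns}
        n = len(y)
        for _ in range(n_boot):                  # expensive: a full selection per sample
            idx = rng.integers(0, n, size=n)     # resample rows with replacement
            fit, kept = backward_eliminate(X.iloc[idx], y.iloc[idx])
            for col in kept:                     # store only the selected coefficients
                draws[col].append(fit.params[col])
        # Percentile interval: 5th and 95th percentiles of each stored list.
        return {col: (np.percentile(v, 100 * alpha / 2),
                      np.percentile(v, 100 * (1 - alpha / 2)))
                for col, v in draws.items() if v}  # skip never-selected terms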

*Are there situations where this is not valid?

**Why should I store the values of a given parameter only when it is selected and not when it is not? It seems like I'm biasing something there.

***That should be the same value as if I took the mean of this list (but the mean of the list would be less precise because I only did 100,000 bootstrap iterations), right?

Sources (scientific articles) are welcome.

  • My questions are: why are you doing this? Why are you doing variable selection? Why are you using p-values for this purpose, given that this is not their purpose at all? Why are you stopping at a p-value of 0.157? – user2974951 Aug 19 '22 at 08:40
  • What do you use for the parameter estimate in the bootstrap samples where a term is not selected (you didn't say, but I'd assume 0, right)? Why would you use the estimate from the naive model? Surely it should be the average of the bootstrapped estimates (otherwise the estimates of each parameter are too optimistic, aren't they)? – Björn Aug 19 '22 at 08:55
  • @user2974951 I'm doing this because I have more than 1 variable per 10 events. So I need some sort of variable selection. I know this is a bit of a rule of thumb, so probably not the best argument, but I hope it makes sense? – FluidMechanics Potential Flows Aug 19 '22 at 09:06
  • @user2974951 I'm using p-values and stopping at 0.157 because I'm following a scientific paper that suggests doing so (for backwards selection). – FluidMechanics Potential Flows Aug 19 '22 at 09:06
  • I'm not sure I understand what you mean @Björn – FluidMechanics Potential Flows Aug 19 '22 at 09:07
  • For what it's worth, if I read an article claiming that I should use backward elimination (especially with p-values) OR that I should stop at a p-value of 0.157... I would close that article immediately. – user2974951 Aug 19 '22 at 09:27
  • You have B bootstrap samples. In some, a term will not be selected into the model (effectively its coefficient will be 0), so the normal thing to do would be to use that 0 for the bootstrap samples where it's not selected. It would be wrong to work only with the coefficients from the bootstrap samples where the term is selected. Just as the naive CIs are over-optimistically narrow, the estimates would be over-optimistically far from zero if you take them from the naive model without accounting for model selection (by bootstrapping); see the sketch after these comments. – Björn Aug 19 '22 at 09:28
  • Ohhh, I understand, thank you @Björn – FluidMechanics Potential Flows Aug 19 '22 at 09:33
  • @user2974951 well it corresponds to the theoretical AIC threshold, is that a bad thing? – FluidMechanics Potential Flows Aug 19 '22 at 09:35
  • Variable selection has absolutely nothing to do with increasing the effective sample size. If your sample size is too small for full modeling, variable selection doesn't help. To the original question: you don't use just the selected parameters in confidence intervals. To get approximately correct CIs you need to include a zero estimate when a variable is not selected. This was studied in a paper by Peter Austin. I wish I had the reference. – Frank Harrell Aug 19 '22 at 11:36
  • Is that the one? https://onlinelibrary.wiley.com/doi/10.1002/sim.3104 – FluidMechanics Potential Flows Aug 19 '22 at 11:54
  • By adding 0s, my new confidence intervals aren't going to be centered around the initial coefficient estimate (from backward elimination on the non-bootstrapped dataset). Is that a problem? I believe this is discussed here: https://stats.stackexchange.com/questions/532796/why-are-stepwise-regression-coefficients-biased (is it, or am I drawing parallels where I shouldn't?) – FluidMechanics Potential Flows Aug 19 '22 at 11:56
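
Following Björn's and Frank Harrell's comments, here is a sketch of the corrected bookkeeping under the same hypothetical setup: whenever a term is eliminated in a bootstrap sample, record a 0 for it instead of skipping it, so the percentile interval reflects the selection step itself.

    import numpy as np

    def bootstrap_zero_filled_ci(X, y, n_boot=100_000, alpha=0.10, seed=0):
        rng = np.random.default_rng(seed)
        draws = {col: [] for col in X.columns}
        n = len(y)
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)     # resample rows with replacement
            fit, kept = backward_eliminate(X.iloc[idx], y.iloc[idx])
            for col in X.columns:
                # An eliminated term contributes a 0, not a gap, so the
                # interval accounts for the selection step.
                draws[col].append(fit.params[col] if col in kept else 0.0)
        return {col: (np.percentile(v, 100 * alpha / 2),
                      np.percentile(v, 100 * (1 - alpha / 2)))
                for col, v in draws.items()}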

0 Answers