Suppose I have a regression model and I want to identify the predictor variables that have a significant effect on my dependent variable (or that improve the fit).
I can fit a model with all parameters and do a stepwise backward elimination based on AIC, BIC, an F-test, or some other criterion, or I can fit a LASSO.
Either way, I obtain a model with fewer parameters, and I consider the remaining terms to significantly influence my response.
To check how robust this approach is, I can bootstrap my data and redo the parameter selection. I do this 1000 times and record the selected model terms each time.
I then have a frequency distribution of how often each term was selected.
To build my final model, I include all terms that are robustly selected, say in 99% of the bootstrap replicates.
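To make the procedure concrete, here is a minimal sketch of the bootstrap selection-frequency idea, using Python with scikit-learn's `Lasso` instead of R (the simulated data, the fixed penalty `alpha=0.1`, and all variable names are my own illustrative assumptions, not part of the question):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 7
X = rng.normal(size=(n, p))
# only the first two predictors truly affect the response
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

n_boot = 1000
counts = np.zeros(p)
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)      # resample rows with replacement
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    counts += fit.coef_ != 0              # note which terms were selected

freq = counts / n_boot                    # per-term selection frequency
selected = np.flatnonzero(freq >= 0.99)   # keep terms selected >= 99% of the time
print(freq.round(2), selected)
```

In R the same loop would wrap `step()` or `glmnet` inside `boot()`; the key output is the vector of per-term selection frequencies.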
Is this a valid approach, and can I apply corrections for multiple testing here? The 99% threshold is admittedly arbitrary, but can I interpret a 99% selection frequency as p = 0.01, collect such a p-value for each term in the original model, and then apply a Benjamini-Hochberg or Bonferroni correction to keep only the terms with, say, p < 0.05?
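Mechanically, the proposed correction step would look like the following (whether treating 1 minus the selection frequency as a p-value is statistically justified is exactly what I am asking; the selection frequencies below are made-up numbers for illustration):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH procedure."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # reject the k smallest p-values, where k is the largest index
    # with p_(k) <= (k / m) * alpha
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# hypothetical bootstrap selection frequencies for 7 terms
freq = np.array([1.00, 0.99, 0.95, 0.60, 0.55, 0.40, 0.30])
pseudo_p = 1 - freq                 # the proposed "p-values"
mask = benjamini_hochberg(pseudo_p)
print(mask)
```

(`statsmodels.stats.multitest.multipletests` implements the same procedure, with `method="fdr_bh"` for BH or `method="bonferroni"`.)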
Update: the model will be used for inference (identifying parameters that have an impact on the response), not for optimal prediction. It should be as parsimonious as possible, so I tend to be conservative in term selection and want to apply the p-value correction mentioned above. Typically the model includes ~7 predictor variables, and first-order interactions may be allowed (if that does not complicate things too much); say the full model is:
lm(y ~ (x1 + x2 + .. + x5 + fac1 + fac2)^2)
Thanks.