What is wrong with this model selection procedure?

Question

I have a set of ~400 observations and ~20 covariates. Some covariates are logged, square-rooted, or truncated versions of others, so there is a lot of dependence in my model matrix.
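
To illustrate how strong that dependence can be, here is a minimal sketch of the variance inflation factor (VIF) for a covariate paired with its own transform. The data are synthetic (a hypothetical covariate drawn uniformly on [1, 10]), and the formula VIF = 1 / (1 - r^2) applies only to the two-predictor case shown here.

```python
import math
import random

def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF when the model holds only these two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)

rng = random.Random(0)
# Hypothetical covariate, not the asker's data.
x = [rng.uniform(1.0, 10.0) for _ in range(400)]
log_x = [math.log(v) for v in x]
sqrt_x = [math.sqrt(v) for v in x]

vif_log = vif_two_predictors(x, log_x)
vif_sqrt = vif_two_predictors(x, sqrt_x)
print(f"VIF(x, log x)  = {vif_log:.1f}")
print(f"VIF(x, sqrt x) = {vif_sqrt:.1f}")
```

Under these assumptions, any model containing both a covariate and its log or square root would fail a max(VIF) > 5 screen, so that screen mostly removes such pairs rather than measuring predictive quality.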

My response is a proportion. I would like to find the best quasi-binomial model with no more than three covariates. I want to choose the best model for prediction, not inference. My current model selection strategy is:

1. Fit all models with no more than 3 covariates. (thousands of models)
2. Discard all models for which not all coefficients are significant at the 0.05 level. (130 models left)
3. Discard all models for which max(VIF) > 5. (15 models left)
4. Choose the model that aligns with my business intuition (coefficients have the signs that I expect). (one model left, final model)
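
As a rough illustration of why the p-value screen is fragile, the sketch below counts the candidate models and then simulates a screen on pure noise. Everything here is an assumption for illustration: the covariates are independent Gaussian noise (unlike the dependent covariates described above), only single-covariate tests are run, and the function names are hypothetical.

```python
import math
import random

def n_candidate_models(n_covariates, max_size=3):
    """Number of non-empty covariate subsets with at most max_size members."""
    return sum(math.comb(n_covariates, k) for k in range(1, max_size + 1))

def is_significant(x, y, z_crit=1.96):
    """Two-sided test of zero correlation at roughly the 0.05 level.

    Uses the t statistic r * sqrt((n - 2) / (1 - r^2)) with a normal
    critical value, a reasonable approximation for n in the hundreds.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return abs(t) > z_crit

def share_with_false_positive(n_obs=400, n_covariates=20, n_reps=100, seed=1):
    """Share of pure-noise datasets in which at least one covariate 'passes'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]      # response: noise only
        found = False
        for _ in range(n_covariates):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]  # covariate: noise only
            if is_significant(x, y):
                found = True
        hits += found
    return hits / n_reps

print(n_candidate_models(20))  # 1350
share = share_with_false_positive()
print(f"share of noise datasets with a 'significant' covariate: {share:.2f}")
```

With 20 independent tests at the 0.05 level, the chance of at least one false positive is 1 - 0.95^20, about 0.64, so a screen that keeps "whatever is significant" will usually keep something even when nothing is real.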

I suspect this is a very problematic model selection process, because it relies only on p-values and VIF rather than something like cross-validation error (a better metric, I feel, since I ultimately want the best model for making predictions). However, I don't have any "hard evidence" that it is truly a bad idea. What makes this model selection process a bad idea? I suspect that I am falling prey to the multiple testing problem, since I have thousands of p-values.
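
For comparison, a minimal sketch of ranking covariate subsets by k-fold cross-validation error might look like the following. It assumes synthetic data, equal weights for every observation, a logit link, and squared error on the proportion scale as the loss; the helper names are hypothetical. The quasi-binomial dispersion parameter only rescales standard errors, so the coefficients fitted here match those of a quasi-binomial fit with the same link.

```python
import itertools
import math
import random

def predict(beta, row):
    """Fitted proportion under a logit link; beta[0] is the intercept."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logit(rows, y, steps=100, lr=2.0):
    """Fit the mean model by gradient descent on the binomial deviance."""
    n, p = len(rows), len(rows[0])
    beta = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for row, target in zip(rows, y):
            resid = predict(beta, row) - target
            grad[0] += resid
            for j, value in enumerate(row):
                grad[j + 1] += resid * value
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta

def cv_error(X, y, subset, k=5, seed=0):
    """Mean squared out-of-fold prediction error for one covariate subset."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)          # same folds for every subset
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        beta = fit_logit([[X[i][j] for j in subset] for i in train],
                         [y[i] for i in train])
        sq_err += sum((predict(beta, [X[i][j] for j in subset]) - y[i]) ** 2
                      for i in fold)
    return sq_err / len(y)

# Synthetic data: only covariate 0 carries signal (an assumption for the demo).
rng = random.Random(42)
n_obs, n_cov = 100, 5
X = [[rng.gauss(0, 1) for _ in range(n_cov)] for _ in range(n_obs)]
y = [1.0 / (1.0 + math.exp(-(1.5 * row[0] + rng.gauss(0, 0.3)))) for row in X]

subsets = [s for size in (1, 2)
           for s in itertools.combinations(range(n_cov), size)]
scores = {s: cv_error(X, y, s) for s in subsets}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4))
```

One caveat on this design: the lowest cross-validation error among many candidates is itself optimistically biased, so an honest estimate of the chosen model's error would need an outer loop or a held-out set that plays no part in the selection.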

Can someone comment on the validity of this model selection procedure and point me in a better direction?

Are there any good resources on model selection for quasi-likelihood models? I assume AIC and BIC are not workable with quasi models.

I know that model selection is still very much an unsettled problem in statistics, but is there a better way?

Principal component analysis might be a better way of getting rid of multicollinearity if you don't care about interpretation. — Huy Pham, Nov 24 '18 at 20:04
What is special about the number 3 when it comes to forcing your model to include only 3 covariates? It feels like such an arbitrary choice! Personally, I would be highly skeptical of such a procedure unless you could justify it based on subject matter considerations. — Isabella Ghement, Nov 24 '18 at 20:28
Yeah, the number 3 is just a limit on model complexity that we have imposed on ourselves. — JTH, Nov 24 '18 at 20:37
What are you really trying to find out? In other words, if you throw away this arbitrary limitation on modelling complexity, what question would you like to be able to answer at the end of the day? — Isabella Ghement, Nov 24 '18 at 22:30
What's the best model (in terms of predictive accuracy and business intuition) we can make with 3 or fewer variables? — JTH, Nov 24 '18 at 23:05
