I have a set of ~400 observations and ~20 covariates. Some covariates are logged, square-rooted, or truncated versions of others, so there is a lot of dependence among the columns of my model matrix.
My response is a proportion. I would like to find the best quasi-binomial model with no more than three covariates. I want to choose the best model for prediction, not inference. My current model selection strategy is:
1. Fit all models with no more than 3 covariates (thousands of models).
2. Discard every model in which at least one coefficient is not significant at the 0.05 level (130 models left).
3. Discard every model for which max(VIF) > 5 (15 models left).
4. Choose the model that aligns with my business intuition, i.e. whose coefficients have the signs I expect (one model left, the final model).
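To give a sense of the scale of the search, here is a quick count of how many candidate models and coefficient p-values this procedure touches. It is only a sketch: the value 20 is an assumed stand-in for my "~20" covariates, the function name is made up for illustration, and intercept tests are left out of the count.

```python
from math import comb


def count_models_and_tests(n_covariates, max_size=3):
    """Count models with 1..max_size covariates, and the coefficient
    p-values inspected across all of them (intercepts excluded)."""
    sizes = range(1, max_size + 1)
    n_models = sum(comb(n_covariates, k) for k in sizes)
    # A model with k covariates contributes k coefficient tests.
    n_tests = sum(k * comb(n_covariates, k) for k in sizes)
    return n_models, n_tests


# 20 is an assumption; the exact number of covariates changes the totals.
n_models, n_tests = count_models_and_tests(20)
print(n_models, n_tests)  # 1350 3820
```

The totals grow quickly with the number of covariates, so a few extra transformed columns push the number of p-values well into the thousands.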
I suspect this is a problematic selection process, because it relies only on p-values and VIF rather than something like cross-validation error (which seems like a better metric, since I ultimately want the best model for making predictions). However, I don't have any "hard evidence" that it is truly a bad idea. What makes this model selection process a bad idea? I also feel that I am falling prey to the multiple testing problem, since I am screening thousands of p-values.
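For concreteness, the cross-validated alternative I have in mind would look roughly like the sketch below. It is not my real analysis: the data are simulated (200 rows, 4 covariates, 20 trials per proportion, all assumed values), every function name is made up for illustration, and the fit is a hand-rolled Newton/IRLS routine for a logit-link mean model. I am relying on the fact that quasi-binomial point estimates equal the binomial ones (the dispersion only rescales standard errors), so the held-out predictions are the same either way.

```python
import math
import random
from itertools import combinations


def sigmoid(eta):
    eta = max(-30.0, min(30.0, eta))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-eta))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_logit(X, y, iters=10):
    """Newton/IRLS for a logit-link mean model on proportions y in [0, 1]."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        # Tiny ridge on the diagonal keeps the Hessian invertible.
        hess = [[1e-8 if i == j else 0.0 for j in range(p)] for i in range(p)]
        for xi, yi in zip(X, y):
            mu = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            w = mu * (1.0 - mu)
            for i in range(p):
                grad[i] += xi[i] * (yi - mu)
                for j in range(p):
                    hess[i][j] += w * xi[i] * xi[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


def cv_mse(X, y, cols, k=5, seed=0):
    """k-fold cross-validated mean squared error on the proportion scale."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    design = [[1.0] + [X[r][c] for c in cols] for r in range(len(y))]
    err = 0.0
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [r for r in idx if r not in held]
        beta = fit_logit([design[r] for r in train], [y[r] for r in train])
        for r in fold:
            mu = sigmoid(sum(b * v for b, v in zip(beta, design[r])))
            err += (y[r] - mu) ** 2
    return err / len(y)


def simulate(n=200, p=4, m=20, seed=1):
    """Simulated data: only covariates 0 and 2 affect the response."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = []
    for row in X:
        mu = sigmoid(0.3 + row[0] - row[2])
        y.append(sum(rng.random() < mu for _ in range(m)) / m)
    return X, y


X, y = simulate()
subsets = [c for k in (1, 2, 3) for c in combinations(range(4), k)]
scores = {c: cv_mse(X, y, c) for c in subsets}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4))
```

Here every subset of at most three covariates is ranked by held-out squared error instead of being filtered by significance and VIF. In practice I would use an established GLM routine rather than this hand-written fit, and possibly a different loss.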
Can someone comment on the validity of this model selection procedure and point me in a better direction?
Are there any good resources on model selection for quasi-likelihood models? I assume AIC and BIC are not workable with quasi-likelihood models.
I know that model selection is still very much an unsettled problem in statistics, but is there a better way?