
I need to convince colleagues that performing variable selection on the same data you then use for inference is a bad idea. I know of some general references on the problems with model selection (listed below), but none is really suitable for people with limited time and no mathematical background.

I'm looking for a short, punchy empirical demonstration (i.e., a simulation) of what can go wrong with variable selection, ideally containing a single chart that illustrates the problem clearly. A video might well work better than a book here.


Altman, D. G., & Andersen, P. K. (1989). Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine, 8(7), 771–783. https://doi.org/10.1002/sim.4780080702

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45, 265–282. https://doi.org/10.1111/j.2044-8317.1992.tb00992.x

Harrell, F. E., Jr. (2016). Regression modeling strategies. Springer International Publishing.

Leeb, H., & Pötscher, B. M. (2005). Model Selection and Inference: Facts and Fiction. Econometric Theory, 21(1), 21–59. http://www.jstor.org/stable/3533623

Mohan
  • 2
    Variable selection isn't automatically bad: it has some benefits and some costs. However, doing it on the same data you plan to use for inference or prediction is definitely a problem. If you have those references, you will presumably be aware that many of the problems concern the properties of the resulting inference or prediction. Demonstrating that sort of problem will require simulation (or other methods, like algebra) – Glen_b Nov 26 '23 at 21:10
  • 1
    @Glen_b Edited to clarify. I am indeed looking for someone who has done the simulation + presents it in a nice way. – Mohan Nov 26 '23 at 23:49
  • 1
    @StephanKolassa Do you mean the main answer? It does a good job of explaining the conceptual issue, but I'm looking for a concrete example where someone has run the simulation. It will be more appropriate for a nontechnical audience. – Mohan Nov 26 '23 at 23:54
  • 1
    It would be best if someone could add an answer with such a simulation to the other thread, if there is indeed nothing there yet (I haven't looked at it in a while). That way, the information is all in one place. There is absolutely no problem with looking at a common issue from multiple points of view. I will try to find the time to write up this kind of answer. – Stephan Kolassa Nov 27 '23 at 06:00
  • @StephanKolassa FWIW the best source I have found so far is https://www.youtube.com/watch?v=CwGyoo-D8iY by Frank Harrell, from about the 56 minute mark. He mentions the Derksen and Keselman result in a compact way that is easy for people to understand. – Mohan Nov 27 '23 at 06:28
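
For concreteness, here is a minimal sketch of the kind of simulation being asked for. It is not taken from any of the references above, and every name and setting in it (`simulate`, 50 observations, 20 candidate predictors, picking the single predictor most correlated with the outcome, a Fisher z-test at the 5% level) is an assumption chosen for illustration. All predictors are pure noise, so any "significant" finding is a false positive. The sketch compares how often the selected predictor looks significant when tested on the same data used to select it versus on a fresh, independent sample.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def is_significant(r, n, alpha=0.05):
    """Two-sided test of zero correlation via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def simulate(n_sims=1000, n_obs=50, n_predictors=20, seed=0):
    """Return (false-positive rate on the selection data, rate on fresh data)."""
    rng = random.Random(seed)
    same_hits = 0
    fresh_hits = 0
    for _ in range(n_sims):
        # Outcome and all candidate predictors are independent noise.
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        xs = [[rng.gauss(0, 1) for _ in range(n_obs)]
              for _ in range(n_predictors)]

        # "Variable selection": keep the predictor most correlated with y,
        # then test it on the very same data.
        best_r = max((pearson_r(x, y) for x in xs), key=abs)
        same_hits += is_significant(best_r, n_obs)

        # Honest check: a new sample of the selected predictor and outcome.
        # Everything is noise, so this is simply a fresh independent draw.
        x_new = [rng.gauss(0, 1) for _ in range(n_obs)]
        y_new = [rng.gauss(0, 1) for _ in range(n_obs)]
        fresh_hits += is_significant(pearson_r(x_new, y_new), n_obs)
    return same_hits / n_sims, fresh_hits / n_sims


if __name__ == "__main__":
    same_rate, fresh_rate = simulate()
    print(f"'Significant' on the selection data: {same_rate:.0%}")
    print(f"'Significant' on fresh data:         {fresh_rate:.0%}")
```

With 20 independent noise predictors each tested at the 5% level, the chance that at least one looks significant is 1 - 0.95^20, roughly 64%, so the first rate should land far above the nominal 5% while the second stays close to it. A two-bar chart of those two rates might serve as the single illustrative figure.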

0 Answers