
I'm a long-time lurker and first-time poster to this forum...

I am currently working my way through An Introduction to Statistical Learning, and I have a question about the algorithms presented for best subset and stepwise selection. Am I correct in assuming that none of these algorithms checks for multi-collinearity among the predictor variables? Would the next step be to take the variables each method selects and confirm that there is no multi-collinearity among them? (I have experience detecting multi-collinearity with the variance inflation factor, VIF.)
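To make the question concrete, here is a minimal sketch of the workflow being asked about: a greedy forward stepwise selection that ranks candidates purely by residual sum of squares, followed by a separate VIF check on whatever it selects. The synthetic data, the function names `rss`, `forward_stepwise`, and `vif`, and the fixed subset size are all assumptions made for illustration; this is not the book's code, and the book chooses the final model size with criteria such as cross-validation, which is omitted here.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, k):
    """Greedily add the column that lowers RSS the most until k are chosen.

    Note that nothing here looks at correlation among the predictors.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    values = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        tss = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - rss(others, target) / tss
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

# Hypothetical data: x1 is nearly a copy of x0, x2 is independent,
# and x3 is unrelated to the response.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])
y = x0 + x1 + x2 + 0.1 * rng.normal(size=n)

selected = forward_stepwise(X, y, k=3)
vifs = vif(X[:, selected])
print("selected columns:", selected)
print("VIF of selected columns:", np.round(vifs, 1))
```

In this sketch the selection step and the VIF step are completely separate, which is the situation the question describes: any collinearity check has to be run afterwards on the chosen subset.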

I don't have a ton of experience in data science, but I have taken several graduate-level business courses with an analytics focus, and I am having trouble understanding where one would use best subset and stepwise selection. Should these be used only when there are many independent variables to explore and we are unsure which of them have a strong relationship with the dependent variable?

I am about to study the Lasso and Ridge regression sections of the book, and I think they may address multi-collinearity to the point where it isn't as much of a concern, but I wanted to make sure I was thinking about best subset and forward and backward stepwise selection the correct way.
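As a rough illustration of why ridge regression is often said to be less sensitive to this problem, the sketch below compares the condition number of the centered Gram matrix X'X with that of X'X + λI, the matrix ridge actually inverts. The data, the value `lam = 1.0`, and the helper `ridge_coefficients` are assumptions chosen for illustration, and predictor scaling is skipped for brevity; it shows only the numerical stabilization, not variable selection.

```python
import numpy as np

# Hypothetical data: two nearly identical predictors.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)
X = np.column_stack([x0, x1])
y = x0 + x1 + 0.1 * rng.normal(size=n)

def ridge_coefficients(X, y, lam):
    """Closed-form ridge slopes on centered data (intercept left unpenalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

Xc = X - X.mean(axis=0)
gram = Xc.T @ Xc
lam = 1.0  # assumed penalty; in practice chosen by cross-validation

cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + lam * np.eye(2))

print("condition number, OLS:  ", cond_ols)
print("condition number, ridge:", cond_ridge)
print("ridge coefficients:", ridge_coefficients(X, y, lam))
```

Adding λ to every eigenvalue of X'X pulls the smallest one away from zero, so the ridge system is better conditioned than the least-squares one whenever λ > 0.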

Blake
  • A search of our site on stepwise regression will give you helpful information. (It is rather telling that the titles in the top hits begin "Why avoid" and "Main drawbacks of" ...) – whuber Aug 11 '20 at 18:38
  • @whuber I apologize for asking a question that's been asked 100 times already! Please forgive me... I'll eventually ask some intelligent questions. Why does one of the #1 recommended textbooks for statistical learning even discuss it? Is it just to give context? – Blake Aug 11 '20 at 18:42
  • This site's users take an unusually dim view of stepwise regression, so bear that in mind. It's a technique that has been in use for a couple of generations and IMHO when used judiciously can be effective--but it has been supplanted by superior techniques based on regularization and cross-validation that were either inaccessible or impossible to carry out on older computing platforms. Since the largest, most popular university stats textbooks tend to stick around for 20-30 years past the dates they should be retired, don't be surprised to find this method discussed there. – whuber Aug 11 '20 at 18:47
  • BTW, your question is quite intelligent and meaningful. But I do encourage you to do a little searching of our site whenever you are inspired to ask a new question, because it is our intention for such searches to be helpful. – whuber Aug 11 '20 at 18:49

0 Answers