
In a linear regression context, suppose we observe that some independent variable can be approximately written as a linear combination of a set of other independent variables (e.g., with $R^2 > 0.95$, which implies $VIF > 20$). When would we want to keep this variable?
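
For concreteness, here is a minimal sketch of how one might check this (the data and variable names are made up, and `statsmodels` is assumed to be available). It uses the identity $VIF_j = 1/(1 - R_j^2)$, so $R_j^2 > 0.95$ gives $VIF_j > 20$:

```python
# Toy check of VIFs on a made-up design where x3 is nearly a linear
# combination of x1 and x2.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.1, size=n)  # near-linear combination

X = np.column_stack([np.ones(n), x1, x2, x3])  # include an intercept column
for j in range(1, X.shape[1]):
    print(f"VIF(x{j}) = {variance_inflation_factor(X, j):.1f}")
```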

What if we're using Lasso/Ridge? Should we just let regularization handle this variable for us, or should we remove it before training?

  • What if all other independent variables are uncorrelated with the response variable? That would make your variable the only one of any possible value in a model, regardless of any values of VIF, condition number, etc. – whuber May 09 '21 at 20:51
  • When you have predictive multicollinearity, you want to keep the variables that are involved. Examples: Predicting species sex (M or F) using length, height and width. In some species, it is the shape more than overall size that predicts. The variables are highly collinear (larger animals are larger all around), but all variables are needed to characterize the animal's shape. – BigBendRegion May 09 '21 at 21:17
  • For another more mundane example, polynomial regression when there is curvature. Centering helps, but it is not necessary. – BigBendRegion May 09 '21 at 21:18
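
To illustrate the predictive-multicollinearity point from the comments above, here is a toy simulation of my own (made-up data, scikit-learn assumed): "sex" depends on shape (length relative to width), while length and width are strongly correlated through overall size, so neither dimension predicts well on its own even though the pair does.

```python
# Toy "predictive multicollinearity": sex depends on shape (length - width),
# but length and width are themselves highly correlated via overall size.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 2000
size = rng.normal(size=n)                        # overall animal size
length = size + rng.normal(scale=0.3, size=n)    # both dimensions track size,
width = size + rng.normal(scale=0.3, size=n)     # so corr(length, width) is high
shape = length - width                           # the "shape" signal
sex = (shape + rng.normal(scale=0.2, size=n) > 0).astype(int)

X_both = np.column_stack([length, width])
X_len = length.reshape(-1, 1)

print("corr(length, width):", np.corrcoef(length, width)[0, 1].round(3))
print("CV accuracy, both:  ", cross_val_score(LogisticRegression(), X_both, sex, cv=5).mean().round(3))
print("CV accuracy, length:", cross_val_score(LogisticRegression(), X_len, sex, cv=5).mean().round(3))
```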

1 Answer


The rationale for dropping variables seems to go something like this.

  1. Having many parameters in the model risks overfitting.

  2. Thus, if we can reduce the parameter count, we might be able to guard against overfitting.

  3. When variables are related, dropping one would seem to retain much of the information available in both (or in a whole group), due to the relationship. In some sense, it is like giving up only a fraction of a variable's information in exchange for reducing the count by a full parameter.

  4. Therefore, if we drop one of those variables, we might be able to cut down on overfitting without sacrificing much of the information that is available in our features.

While it is true that a high parameter count can risk overfitting, it also is true that a low parameter count can risk underfitting, so it is not obvious that removing variables puts you in a better position. Further, as Frank Harrell discusses here, variable selection techniques tend not to be very good at what they claim to do.
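
As a rough illustration of that point (a toy example with made-up data, not a general result), dropping the high-VIF predictor can hurt out-of-sample performance when that predictor still carries signal of its own:

```python
# Toy comparison: "fixing" collinearity by dropping x3 versus keeping it.
# x3 is nearly a linear combination of x1 and x2 but still carries unique signal.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.3, size=n)     # VIF on x3 is roughly 20+
y = 1.0 * x1 + 1.0 * x2 - 2.0 * x3 + rng.normal(scale=1.0, size=n)

X_full = np.column_stack([x1, x2, x3])
X_drop = np.column_stack([x1, x2])               # drop the high-VIF variable

Xf_tr, Xf_te, Xd_tr, Xd_te, y_tr, y_te = train_test_split(
    X_full, X_drop, y, test_size=0.5, random_state=0)

mse_full = mean_squared_error(y_te, LinearRegression().fit(Xf_tr, y_tr).predict(Xf_te))
mse_drop = mean_squared_error(y_te, LinearRegression().fit(Xd_tr, y_tr).predict(Xd_te))
print(f"test MSE, full model: {mse_full:.3f}")
print(f"test MSE, x3 dropped: {mse_drop:.3f}")
```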

If you find yourself tempted to drop variables, ask yourself why you want to drop any and why you want to drop those particular variables.

To some extent, the above applies only to predictive modeling. If you want to interpret your model, the situation gets even worse. First, most variable selection distorts downstream inferences, so your confidence intervals and p-values on regression coefficients are not accurate. Second, omitting variables that are correlated with variables that enter the model risks omitted-variable bias in the coefficients you do estimate. Maybe you have a simpler model that reduces the VIF on your variable of interest, but:

  1. It is not a given that removing a correlated variable will shrink the confidence interval on your variable of interest, since the reduction in VIF competes with the overall error variance, which might be higher after you remove a variable (see the sketch after this list).

  2. You're perhaps giving a confidence interval for a biased estimate. Of all of the methods for doing biased estimation, it is not clear why this is the best or even a remotely competitive approach.
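
Here is a deliberately exaggerated toy sketch of point 1 (which also shows the bias in point 2). The coefficient on the dropped variable is made large so that, after dropping it, the residual variance grows enough that the standard error on the variable of interest does not shrink even though its VIF falls, while its estimate absorbs substantial omitted-variable bias. The numbers are purely illustrative; `statsmodels` is assumed.

```python
# Exaggerated toy: x2 is nearly a copy of x1 (VIF on x1 around 25) but has a
# large coefficient, so dropping x2 inflates the residual variance enough that
# se(b1) does not shrink, and b1 absorbs omitted-variable bias (about 1 + 30).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)           # highly correlated with x1
y = 1.0 * x1 + 30.0 * x2 + rng.normal(scale=1.0, size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
reduced = sm.OLS(y, sm.add_constant(x1)).fit()    # drop the collinear x2

print(f"full model: b1 = {full.params[1]:7.3f}, se(b1) = {full.bse[1]:.3f}")
print(f"x2 dropped: b1 = {reduced.params[1]:7.3f}, se(b1) = {reduced.bse[1]:.3f}")
```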

One of the major advantages of ridge and LASSO regression is that they work fine when you have huge variable counts. If you can pare down the parameter count using domain knowledge (knowing the literature or the scientific theory behind the study), that could be a reasonable way of reducing the variable count before you present data to the ridge and LASSO estimators. Aside from that, however, one of the points of using regularization techniques is to allow for large variable counts.
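
As a rough sketch of that last point (toy data, scikit-learn assumed), ridge and LASSO can be handed a large, highly collinear feature set directly, with the penalty chosen by cross-validation and no manual variable removal:

```python
# Toy example: 50 features generated from 10 latent factors, so the feature
# set is large and highly collinear. Ridge and lasso are fit with
# cross-validated penalties and evaluated by nested cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n, k = 300, 50
Z = rng.normal(size=(n, 10))                                      # latent factors
X = Z @ rng.normal(size=(10, k)) + 0.1 * rng.normal(size=(n, k))  # collinear features
y = X @ rng.normal(size=k) + rng.normal(scale=2.0, size=n)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 25))
lasso = LassoCV(cv=5, random_state=0, max_iter=5000)

print("ridge CV R^2:", cross_val_score(ridge, X, y, cv=5).mean().round(3))
print("lasso CV R^2:", cross_val_score(lasso, X, y, cv=5).mean().round(3))
```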

Dave