
I'm wondering about techniques like ridge regression with regard to both multicollinearity and outliers.

My understanding is that ridge regression is primarily used for multicollinearity, but that somehow it is robust to outliers.

If you have a multicollinearity problem, what exactly is it that ridge regression gives you? Isn't the solution to remove the problematic variables? Does ridge regression simply make one of your covariates non-significant and signify that you should remove it?

If you have an outlier problem, what does it mean that ridge regression is robust to outliers? I tried it, and ridge regression gave me more outliers in terms of standardized residual diagnostics than OLS did.
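Here is roughly the kind of check I mean: a minimal sketch with made-up data and an arbitrarily chosen penalty (using scikit-learn's LinearRegression and Ridge), not my actual analysis:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Toy data: two nearly collinear predictors plus a few injected outliers.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 2 * x1 + x2 + rng.normal(size=n)
y[:5] += 10                               # gross outliers in the response

def standardized_residuals(model, X, y):
    """Crude standardized residuals: residuals divided by their standard deviation."""
    resid = y - model.predict(X)
    return resid / resid.std(ddof=X.shape[1] + 1)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)       # penalty value picked arbitrarily

for name, model in [("OLS", ols), ("ridge", ridge)]:
    r = standardized_residuals(model, X, y)
    print(name, "- points with |standardized residual| > 2:", int((np.abs(r) > 2).sum()))
```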

Dave
fmtcs
  • Dropping a feature is one way to address multicollinearity, but why is multicollinearity a problem in the first place? – Dave Dec 07 '21 at 14:07
  • @Dave Just because you could have simply included highly correlated variables not knowing they were correlated? That's why I'm wondering what the advantage of ridge regression is, because if there's multicollinearity should you not be dropping the variable? Even if you get lower coefficients that generalize better to new data, wouldn't the model generalize even better if you just removed correlated variables? – fmtcs Dec 07 '21 at 14:26
  • @fmtcs you are totally correct! :) Please have a look at my answer below and give me an upvote if you think I am on the right track! – John Sonnino Aug 02 '22 at 16:15
  • @JohnSonnino The upside to dropping a variable is reduced variance. The downside is adding bias by forcing a coefficient to be zero. My answer discusses the interplay between these competing notions. Your stance seems to ignore the bias introduced by forcing a parameter estimate to be zero, even though the parameter probably is not zero. If you’ve ever seen how a constant can be an admissible estimator, even if a silly one, the idea here is similar. – Dave Aug 02 '22 at 16:20
  • Dave, again you need to research what multicollinearity means – John Sonnino Aug 02 '22 at 16:25
  • Perhaps you can say what definition you’re using. Better yet, perhaps you could post a new question about your concerns. – Dave Aug 02 '22 at 16:27
  • Dave, you need to restart your understanding of what multicollinearity is. Firstly, you do not need to include interrelated independent variables in the same regression; that is common sense. If you do so, you will end up with multicollinearity. The effects of multicollinearity lead to spurious confidence intervals and t-scores. To resolve this, one MUST drop interrelated independent variables based on either their background (they are one and the same with respect to Y) or on their performance in the overall model. – John Sonnino Aug 02 '22 at 16:39
  • Then you’re biasing the other coefficients and distorting their sampling distributions. // Yes, it is reasonable to drop a “length in kilometers” variable if another is “length in meters”. If you have “height” and “weight” as measures of size, then the bias-variance interplay is important, as those are not duplicates of information. // Again, I recommend posting a new question rather than arguing in the comments to multiple answers in multiple question threads, and I will be exiting this conversation until it comes up in a formal question. – Dave Aug 02 '22 at 16:43

1 Answer


I understand the argument about dropping variables when they are correlated. After all, if variables are correlated, the information contained in one variable is partially contained in another. Including fewer variables results in fewer parameters in the model, and this can lead to a reduction in variance. Combined, I understand the appeal of dropping a variable: retain most of the information but drop a parameter.

The trouble is that, while dropping a variable decreases the variance, it also introduces bias, perhaps enough that you end up worse off despite the reduced variance.
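As a rough illustration of that tradeoff, here is a small simulation sketch (the sample size, correlation, true coefficients, and penalty are all arbitrary choices, and I am assuming scikit-learn-style estimators; this is not a definitive benchmark). Two predictors are correlated but both carry signal, and we compare out-of-sample error for the full OLS fit, an OLS fit that drops one predictor, and a ridge fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

def one_replication(n=50, rho=0.95, n_test=1000):
    """Correlated predictors, both with truly nonzero coefficients."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    X_test = rng.multivariate_normal([0.0, 0.0], cov, size=n_test)
    beta = np.array([1.0, 0.5])                       # neither coefficient is zero
    y = X @ beta + rng.normal(size=n)
    y_test = X_test @ beta + rng.normal(size=n_test)

    full = LinearRegression().fit(X, y)               # keep both predictors
    dropped = LinearRegression().fit(X[:, [0]], y)    # drop the second predictor
    ridge = Ridge(alpha=5.0).fit(X, y)                # penalty picked arbitrarily

    return (
        mean_squared_error(y_test, full.predict(X_test)),
        mean_squared_error(y_test, dropped.predict(X_test[:, [0]])),
        mean_squared_error(y_test, ridge.predict(X_test)),
    )

results = np.array([one_replication() for _ in range(500)])
# Which approach wins depends on n, rho, the true coefficients, and the penalty.
print("mean test MSE (full OLS, dropped X2, ridge):", results.mean(axis=0).round(3))
```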

Regularization techniques are an alternative to dropping variables. They introduce bias, yes, but they do so in a way that does not completely drop variables. For instance, LASSO regularization tends to result in many coefficients calculated as $0$. If you consider that a feature selection step$^{\dagger}$ and run your regression on the "surviving" features that have nonzero coefficients, you will get different results, since (among other reasons) the "dead" features still contributed to the coefficient calculation.
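As a small sketch of that point (made-up design, arbitrary penalty, assuming scikit-learn's Lasso and LinearRegression): fit the LASSO, treat the features with nonzero coefficients as "selected", refit plain OLS on those survivors, and compare the coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)

# Made-up design with a nearly duplicated feature; only some true coefficients are nonzero.
n, p = 100, 6
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=n)   # feature 1 nearly duplicates feature 0
beta = np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.2).fit(X, y)                  # this penalty typically drives some coefficients to exactly zero
selected = np.flatnonzero(lasso.coef_)              # features "surviving" the LASSO
refit = LinearRegression().fit(X[:, selected], y)   # plain OLS on the survivors only

print("LASSO coefficients:    ", lasso.coef_.round(2))
print("selected features:     ", selected)
print("OLS refit on survivors:", refit.coef_.round(2))
# The refit coefficients generally differ from the nonzero LASSO coefficients,
# because the discarded ("dead") features still influenced the LASSO fit.
```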

Ridge regression does not shrink coefficients all the way to zero, and it seems to have a tendency to outperform LASSO.

None of these techniques (ridge, LASSO, and manually dropping variables) is inherently better than the others. In the link, Harrell gives his arguments for ridge regression but concedes that there are situations where LASSO can do better. If you have theoretical knowledge about the process, or a signal in the data screaming at you to drop a variable, then perhaps dropping it would work best.

$^{\dagger}$The link also discusses the issues with using LASSO to select features.

Dave
  • This answer is unfortunately incorrect. Please investigate the effects of multicollinearity. – John Sonnino Aug 02 '22 at 15:56
  • One should never sacrifice the reliability of their results for the inclusion of additional variables that contribute to a model that cannot be implemented effectively in real-life situations. Moreover, it would be impossible to select the final model, considering that the TSS will be chock full of overfitting variances. – John Sonnino Aug 02 '22 at 16:54
  • Unless you’re using a different definition, $TSS=\sum_{i=1}^n\big(y_i-\bar y\big)^2$ and makes no reference to a model, model features, or model predictions (except in a technical sense where $TSS$ can be seen as the performance of an intercept-only model that always predicts $\bar y$). – Dave Aug 02 '22 at 17:11
  • The TSS would be the result of a bad selection practice, since $ESS = TSS - RSS$ – John Sonnino Aug 02 '22 at 17:20
  • Here is a more detailed explanation: as an example, you decide to include an independent variable that normally has a t-score in the range $-2 < t < 2$ (not significant) but, after the inclusion of correlated independent variables, ends up with a spuriously large t-score. That variable should not be included but is nonetheless.

    Remember that $ESS = TSS - RSS$; however, any variable can erode the TSS whether or not it makes sense to include it in the model, so long as there is just a slight correlation with Y. Many economic variables make this kind of an overfitting trap possible with multicollinearity.

    – John Sonnino Aug 02 '22 at 17:47
  • Overfitting is bad, and having many features can contribute to overfitting, sure, but discarding features just because they correlate with other features ignores the risk of damaging your analysis with underfitting. – Dave Aug 02 '22 at 20:23
  • Let me explain it this way. If you have $Y$, $X_1$, and $X_2$ with $X_1 = X_2 + e$ (1) and $X_2 = e$, and you use $Y = C + b_1 X_1 + b_2 X_2 + e$ (2), you are effectively including the same variable twice in the regression and the standard errors will be spurious. If you believe that the error terms of both $X_1$ and $X_2$ are relevant, then include only $X_1$ in the regression, since this will include both features completely. This is because $X_1$ is higher in the causal chain (Gauss-Markov theorem) given formula (1). If you wrongly include $X_1$ instead of $X_2$ when the errors in $X_1$ are not relevant, you are making a misspecification. – John Sonnino Aug 04 '22 at 18:23
  • Underfitting is called Omitted Variable Bias (OVB). You will not get OVB by removing correlated variables. – John Sonnino Aug 04 '22 at 18:28
  • Putting (1) into (2): $Y = C + b_1(X_2 + e_2) + b_2 X_2 + e$. Does this look like a good regression to you? – John Sonnino Aug 04 '22 at 18:35
  • Edit: Since I have been guilty of doing this in the past, I will say this: you can ignore multicollinearity if the interpretation of the model requires both variables to be included, but you must show that the variables are individually relevant in the model by showing their t-scores prior to inclusion, i.e., using the example above, showing in $Y = C + b_1 X_1 + e$ that $b_1$ is significant. – John Sonnino Aug 04 '22 at 19:40
  • Cases where the interpretation requires two similar variables are, however, extremely rare and might only occur if you are instrumenting one variable to omit information from the first. What is most important is that you show they are both significant. The coefficients will not be affected, and in the case of machine learning no one will see the significance levels, but you are setting yourself up for overfitting nonetheless. – John Sonnino Aug 04 '22 at 19:43
  • Thank you for this discussion; it has been constructive for me. I have now gone through some past research and made note of these discussions for future reference. I hope you will too. – John Sonnino Aug 04 '22 at 19:45