
I'm working with an essentially linear unsupervised modeling approach which (predictably) has problems when there is (multi)collinearity. To avoid this, I've written some code that removes variables one by one, at each step dropping the variable whose removal reduces the maximum VIF the most. I.e.

Assume the function VIF(X, i) returns the VIF of column i of X when it is regressed against the other columns of X, and that we start with:

max_VIF = max([VIF(X, i) for i in columns of X])

while max_VIF > threshold:
    best_VIF = infinity
    remove_col = None
    for i in columns of X:
        Y = X without column i
        Y_VIF = max([VIF(Y, j) for j in columns of Y])
        if Y_VIF < best_VIF:
            best_VIF = Y_VIF
            remove_col = i
    remove remove_col from X
    max_VIF = best_VIF

Unfortunately, this requires recalculating the VIF many times (on the order of p² VIF computations per removed variable, where p is the number of remaining columns), and each VIF calculation is very slow. Is there a better way to do this?
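
For concreteness, here is a minimal runnable sketch of the pseudocode above in Python. It's only an illustration: the helper names (vif, max_vif, greedy_vif_prune) are made up, X is assumed to be a pandas DataFrame of numeric, non-constant columns, and the VIF is computed as $1/(1 - R^2)$ from an ordinary least-squares fit of each column on the other columns (with an intercept).

# Illustrative sketch only; names and the threshold default are assumptions.
import numpy as np
import pandas as pd


def vif(df: pd.DataFrame, col: str) -> float:
    """VIF of df[col] when regressed on the remaining columns of df."""
    y = df[col].to_numpy(dtype=float)
    others = df.drop(columns=col).to_numpy(dtype=float)
    design = np.column_stack([np.ones(len(df)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / max(1.0 - r2, 1e-12)  # guard against perfect collinearity


def max_vif(df: pd.DataFrame) -> float:
    """Largest VIF over the columns of df."""
    if df.shape[1] < 2:
        return 1.0  # a lone column has nothing to be collinear with
    return max(vif(df, c) for c in df.columns)


def greedy_vif_prune(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Repeatedly drop the column whose removal leaves the smallest max VIF."""
    X = X.copy()
    while X.shape[1] > 1 and max_vif(X) > threshold:
        best_vif, drop_col = np.inf, None
        for col in X.columns:
            candidate = max_vif(X.drop(columns=col))
            if candidate < best_vif:
                best_vif, drop_col = candidate, col
        X = X.drop(columns=drop_col)
    return X

This does exactly the same O(p²) VIF computations per removed column as the pseudocode, which is the cost I'd like to avoid.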

Edit: made the pseudocode clearer.

  • Linear unsupervised model? Which model is this? What is your goal, apart from fitting a model successfully? Why are you doing this? – user2974951 Feb 09 '23 at 16:42
  • It is not clear that you should do this at all. What problems are there when you have multicollinearity, and how does removing variables remedy those issues? Consider reading my CW post here, the latter half of which addresses feature dependence and links to additional material (which also contains further reading). – Dave Feb 09 '23 at 16:57
  • @user2974951 the model is a causal discovery algorithm which assumes linearity. This specific model isn't very well known and I only have access to it as a black box. However, I think the same issue would arise with the PC algorithm if it were implemented with linear regression to check for independence. My overall goal, at least at the moment, is to "find something interesting in the resulting causal graph." – roundsquare Feb 09 '23 at 17:37
  • @Dave in general, I agree. However, LASSO and ridge regression both require a target variable (at least as far as I know). When I have one, I do use these techniques pretty frequently, but right now I don't have one (this is all somewhat preliminary). I'm open to other techniques as well. – roundsquare Feb 09 '23 at 17:39
  • If you don't have a target variable, then what variance is being inflated by the VIF? – Dave Feb 09 '23 at 18:04
  • @Dave VIF doesn't require a target variable, right? You just take each variable, regress it against the other variables, and $VIF = 1/(1 - R^2)$. Even in the case of supervised learning, one just does this with the predictor variables - the target variable is ignored. – roundsquare Feb 09 '23 at 19:03
  • You can calculate the $R^2$ of a regression using all but one feature to predict the remaining feature, sure, but what, then, is the variance being inflated? // For various models, there are various extensions of the usual variance inflation factor, so while the $R^2$ of a regression using all but one feature to predict the remaining feature is always able to be calculated, it need not have the same meaning in all circumstances. In such a case, it is worth considering why you care about the classical VIF. – Dave Feb 09 '23 at 19:08
  • Somewhat unrelated to my previous comments, VIF is a property of a particular feature, not an entire model, so your while loop does not make sense. Perhaps you could edit your original post to clarify what you are doing with that loop. – Dave Feb 09 '23 at 19:17
  • @Dave I updated the pseudocode. It was, indeed, unclear. As regards your other point, I'm not sure I follow. I don't really understand the post (in part because it is heavy on R, which I haven't used in years). There is a 1992 paper there which I have printed out to read and try to understand. However, at the moment, the people who made the model are the ones suggesting reducing the (classical) VIF, which is why I'm looking for an efficient algorithm to do so. For now, it seems like my best bet is to randomly sample the data to speed up the VIF calculations. – roundsquare Feb 09 '23 at 20:46
  • 2
    But why care about the classical VIF? In linear models, multicollinearity has drawbacks, but removing features introduces issues that might make you worse off for having removed the features. However, at least VIF has an interpretation there, in that it literally refers to the factor by which coefficient variances are inflated compared to when the features are independent. When you are not in a situation where linear modeling is occurring, the interest in VIF does not necessarily make sense (despite the calculation of VIF only involving the features and not the outcome, if there even is one). – Dave Feb 09 '23 at 21:08

0 Answers