
In my dataset, all the predictor variables are highly correlated with each other (pairwise correlation coefficients > 0.95), yet their correlations with the dependent variable are very low (< 0.35). I checked the variance inflation factor (VIF), but it is 'inf' for some variables and very high (around six figures) for others, so I cannot filter with the usual threshold of 10, and setting a threshold around 1000 does not seem correct either. The response is a binary outcome. I'm trying to build a logistic regression model, but due to the high collinearity the coefficients are not interpretable. How can I handle multicollinearity in this case?

EDIT: I want to use the model to check which variables are significantly associated with the outcome and also to predict the outcome. The training set size is 128 (after splitting into train and test sets) and there are 96 variables in total. The data are highly imbalanced.
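A minimal numeric sketch of why the VIF comes out as 'inf': when one column is an exact linear combination of another, the R² of the auxiliary regression is 1, so 1/(1 − R²) diverges. (The `vif` helper and the synthetic data below are illustrative, computed by hand with NumPy rather than taken from any particular package.)

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns (plus an
    intercept) and return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    den = 1.0 - r2
    return np.inf if den <= 0 else 1.0 / den

rng = np.random.default_rng(1)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x, rng.normal(size=100)])  # col 1 duplicates col 0
print(vif(X, 0))   # effectively infinite: perfect collinearity with col 1
print(vif(X, 2))   # close to 1: independent noise column
```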

Dushi Fdz
  • 145
  • What is your goal with the model, what will you use it for? – user2974951 Mar 01 '23 at 15:46
  • I want to use it to check what risk factors are significantly associated with the outcome and also for the prediction of the outcome. – Dushi Fdz Mar 01 '23 at 15:50
  • How are "risk factors" related to the "variables" you refer to in this question? Are they synonyms, or are risk factors perhaps functions of one or more variables? – whuber Mar 01 '23 at 15:59
  • 1
    It would be helpful to see the singular values of the centered design matrix (aka the PCA "scree plot"). This will help us understand "how many" variables there "actually". – John Madden Mar 01 '23 at 16:06
  • How many predictor variables are involved? How many cases? – EdM Mar 01 '23 at 16:25
  • @whuber "risk factors" and "variables" here are synonymous. @ EdM, there are 96 variables and 128 cases in the training set. – Dushi Fdz Mar 01 '23 at 17:28

1 Answer


The correlations and VIFs are telling you the story. All of your features are extremely related to each other, so any notion of increasing one while holding others constant is nonsense, and I see no hope for untangling your features.

It’s nice to be able to have a clean interpretation, but the universe does not owe you one.

Further, you have way too many variables for your sample size. Even if you combined your training and testing data, you would have more variables than instances of the minority class. I fear that whoever tasked you with doing this has, probably unknowingly, sent you on a snipe hunt.

Finally, while VIF gives some idea of how features are related, in a logistic regression, the usual VIF does not have its literal interpretation as the factor by which the coefficient variance is inflated. When you work on a more reasonable logistic regression problem, you might want to consider generalizations of the usual VIF.
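As a rough sketch of the generalized VIF mentioned above (the Fox–Monette formulation: GVIF = det(R₁₁)·det(R₂₂)/det(R), computed from the predictor correlation matrix), assuming a hand-rolled `gvif` helper and synthetic data rather than any library implementation. For a single predictor this reduces to the ordinary VIF:

```python
import numpy as np

def gvif(X, cols):
    """Generalized VIF (Fox & Monette, 1992) for the predictor group
    `cols`: det(R11) * det(R22) / det(R), where R is the correlation
    matrix of all predictors, R11 that of the group, R22 that of the rest."""
    R = np.corrcoef(X, rowvar=False)
    rest = [j for j in range(X.shape[1]) if j not in cols]
    R11 = R[np.ix_(cols, cols)]
    R22 = R[np.ix_(rest, rest)]
    return np.linalg.det(R11) * np.linalg.det(R22) / np.linalg.det(R)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]   # make columns 0 and 1 highly correlated
print(gvif(X, [0]))   # large: column 0 is nearly redundant given column 1
print(gvif(X, [2]))   # near 1: column 2 is unrelated to the others
```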

Dave
  • 3
    Nicely said. (+1) This page provides links related to the generalized VIF appropriate for logistic regression. The data here are too few for a train/test split, just as you indicate. Substantial data reduction or penalization (ridge, LASSO) are called for. – EdM Mar 01 '23 at 21:09
  • Thank you so much! In the case of high multi-collinearity, is there a preferred method for feature selection if I use machine learning models for the prediction of the binary outcome? Can I still use methods like Recursive Feature Elimination with cross-validation (RFE-CV)? – Dushi Fdz Mar 02 '23 at 15:47
  • Feature selection, in general, is unstable and overrated. If you bootstrap your data and try the RFE-CV, you're likely to find little consistency in which features are selected. – Dave Mar 02 '23 at 16:04
  • Thank you @Dave. Given the small sample size (n=160) and high-class imbalance (80:20), should I use repeated or nested cross-validation instead of a held-out test set (20% of data)? – Dushi Fdz Mar 02 '23 at 16:32
  • That could be a reasonable next question to post. It warrants a full answer instead of just a comment. (Note, however, that such an answer is likely to mention the instability of feature selection.) – Dave Mar 02 '23 at 16:34
  • Thank you. I posted a separate question here: https://stats.stackexchange.com/questions/608184/can-i-only-use-cross-validation-when-sample-size-is-very-small-or-do-i-still-nee – Dushi Fdz Mar 02 '23 at 16:52
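As a rough illustration of the penalized logistic regression suggested in the comments (ridge here; L1 would instead zero coefficients out), using scikit-learn on made-up data sized to mirror the question; the latent-factor construction and all parameter values are assumptions for the sketch, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 160, 96                        # roughly the sizes described in the question
Z = rng.normal(size=(n, 5))           # few latent factors drive many columns,
X = Z @ rng.normal(size=(5, p))       # producing the extreme collinearity described
y = (Z[:, 0] + rng.normal(size=n) > 1).astype(int)   # imbalanced binary outcome

# Ridge (L2) shrinks all 96 coefficients rather than selecting among them;
# smaller C means a stronger penalty.
model = LogisticRegression(penalty="l2", C=0.1, max_iter=5000)
model.fit(X, y)
print(model.coef_.shape)              # one shrunken coefficient per column
```

The penalty makes the fit well-defined despite p ≈ n and perfect collinearity, but as the answer notes, the individual coefficients still should not be given causal or "significance" interpretations.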