I am currently working on three models that predict subjects' personality factors from three data sets. All three data sets consist of features extracted from video interviews: audio features, visual features, and a merged (audio-visual) set. The visual data set has 571 variables, the audio data set 791, and the merged data set 1362; all three have 5476 rows.
As you can imagine, there is high multicollinearity in the data. I trained and tested a Random Forest Regressor, a Support Vector Regressor, and a Partial Least Squares Regressor, and in every case the model trained/tested on the original data outperformed the model trained/tested on the data after feature elimination with VIF. I tried both VIF > 5 and VIF > 10 as cutoffs, which resulted in the same features being eliminated (I removed features recursively until all remaining variables had VIF < 5 or VIF < 10, respectively).
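For concreteness, here is a minimal sketch of that kind of recursive elimination (using statsmodels' `variance_inflation_factor`; the helper name `vif_eliminate` is just illustrative, not my exact code):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_eliminate(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Repeatedly drop the feature with the highest VIF until all VIFs fall below the threshold."""
    X = X.copy()
    while True:
        # Compute the VIF of every remaining column.
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            return X
        # Drop the worst offender, then recompute on the reduced set.
        X = X.drop(columns=vifs.idxmax())
```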
A couple of side notes: I used k-fold cross-validation with k = 5, and R² as the performance metric for comparing the models.
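Schematically, the comparison looks like this (a sketch assuming `X_full`, the reduced set `X_vif`, and the target `y` are already loaded; the hyperparameters shown are placeholders, not the ones I actually used):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

models = {
    "RF": RandomForestRegressor(random_state=0),
    "SVR": SVR(),
    "PLS": PLSRegression(n_components=10),
}

# Score each model on the full feature set vs. the VIF-reduced set,
# using 5-fold cross-validated R^2.
for name, model in models.items():
    for label, X in [("full", X_full), ("VIF-reduced", X_vif)]:
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name} on {label}: mean R^2 = {scores.mean():.3f}")
```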
Theoretically, using VIF for feature elimination should increase the explained variance/performance. Why is this not the case in my experiment?