I am currently working on three models that predict subjects' personality factors from three data sets. All three data sets consist of features extracted from video interviews: audio features, visual features, and a merged (audio-visual) set. The visual data set has 571 variables, the audio data set 791, and the merged data set 1362; all three have 5476 rows.
As you can imagine, there is high multicollinearity in the data. I trained and tested a Random Forest Regressor, a Support Vector Regressor, and a Partial Least Squares Regressor, and in every case the model trained/tested on the original data outperformed the model trained/tested on the data after feature elimination with VIF. I tried both VIF > 5 and VIF > 10 as cutoffs, which resulted in the same features being eliminated (I removed features recursively until all remaining variables had VIF < 5 or VIF < 10, respectively).
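For concreteness, here is a minimal sketch of that kind of recursive elimination (using statsmodels' `variance_inflation_factor`; the helper name `vif_eliminate` is just illustrative, not my exact code):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_eliminate(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Repeatedly drop the feature with the highest VIF until all VIFs fall below the threshold."""
    X = X.copy()
    while True:
        # Compute the VIF of every remaining column.
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            return X
        # Drop the worst offender, then recompute on the reduced set.
        X = X.drop(columns=vifs.idxmax())
```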
A couple of side notes: I used k-fold cross-validation with k = 5, and R² as the performance metric for comparing the models.
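Schematically, the comparison looks like this (a sketch assuming `X_full`, the reduced set `X_vif`, and the target `y` are already loaded; the hyperparameters shown are placeholders, not the ones I actually used):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

models = {
    "RF": RandomForestRegressor(random_state=0),
    "SVR": SVR(),
    "PLS": PLSRegression(n_components=10),
}

# Score each model on the full feature set vs. the VIF-reduced set,
# using 5-fold cross-validated R^2.
for name, model in models.items():
    for label, X in [("full", X_full), ("VIF-reduced", X_vif)]:
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name} on {label}: mean R^2 = {scores.mean():.3f}")
```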
Theoretically, using VIF for feature elimination should increase the explained variance/performance. Why is this not the case in my experiment?