Machine learning approach for inherent multicollinearity?

Question

I need to classify a categorical variable in R. My model has 121 numeric predictors, already filtered for likely importance. Many of them are collinear with each other because they are relative frequencies (which add up to 100) obtained by pivoting a categorical variable. For example, from a categorical variable with 10 modalities I created 10 numeric variables, each holding the frequency of one modality. I did this mainly to reduce my dataset from hundreds of rows per individual to a single row per individual. Which machine learning models or procedures should I use in this context?

score 1 · Answer 1 · answered Mar 23 '23 at 09:01

See "Is multicollinearity really a problem?" and "Why is multicollinearity not checked in modern statistics/machine learning". TL;DR: if your goal is a predictive model, multicollinearity should not be a problem in most cases. It would be a problem for a model that does no regularization (e.g. standard linear regression), but not for most machine learning models. If you run into convergence problems, you can start looking for fixes; otherwise, treat it as innocent until proven guilty.
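To make the point about regularization concrete, here is a minimal sketch in Python (the same idea carries over to R, for instance with a penalized model from the glmnet package). The data are made up for illustration: three hypothetical relative-frequency columns that sum to 100 per individual, plus an intercept. The helper names `rank` and `gram` and the penalty value `lam = 1` are my own assumptions, not anything from the question. Without a penalty the normal-equations matrix X'X is singular, so unregularized least squares has no unique solution; adding a ridge penalty to the diagonal makes it full rank.

```python
from fractions import Fraction


def rank(matrix):
    """Exact matrix rank via Gaussian elimination on Fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0])
    found = 0  # number of pivots found so far
    for col in range(n_cols):
        # Look for a nonzero pivot in this column, at or below row `found`.
        pivot = next(
            (r for r in range(found, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        # Eliminate the entries below the pivot.
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found


def gram(X):
    """Return X'X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


# Hypothetical data: 5 individuals, 3 relative frequencies summing to 100.
freqs = [
    [50, 30, 20],
    [10, 60, 30],
    [25, 25, 50],
    [70, 20, 10],
    [40, 40, 20],
]

# Design matrix with an intercept column: the three frequency columns add up
# to 100 times the intercept, so the four columns are linearly dependent.
X = [[1] + row for row in freqs]
G = gram(X)
rank_plain = rank(G)

# Ridge penalty (assumed value). For simplicity it is applied to every
# coefficient here, including the intercept, which real software usually
# leaves unpenalized.
lam = 1
p = len(G)
G_ridge = [[G[i][j] + (lam if i == j else 0) for j in range(p)] for i in range(p)]
rank_ridge = rank(G_ridge)

print(f"rank of X'X: {rank_plain} of {p}")
print(f"rank of X'X + lam*I: {rank_ridge} of {p}")
```

This only illustrates the linear-algebra side of the argument. Tree-based models and other common learners avoid the issue for different reasons, since they never invert X'X in the first place.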
