I read somewhere that, in binary classification problems, very strong correlation between features does not imply redundancy: for example, even if $X_i$ and $X_j$ have a correlation coefficient $\rho > 0.95$, dropping $X_j$ can still lose information and make the classifier less accurate.
Is that true? And if so, what use is the feature correlation matrix if you cannot drop highly correlated features?
Further info about my problem: I am classifying tuples of $50$ values as either signal or background. Many variables have pairwise correlation higher than $0.8$, and some are even more strongly correlated ($\rho > 0.9$). I doubt that dropping those variables is the right thing to do, but I cannot justify this theoretically.
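To make the claim concrete, here is a toy simulation (my own construction, not from my actual data) where two features share a common latent factor, so $\rho > 0.95$, yet the class signal lives almost entirely in their *difference*. Dropping either feature destroys the separation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, n)                    # binary labels (signal / background)
z = rng.normal(0.0, 1.0, n)                  # shared latent factor

# Both features are dominated by z, so they are very highly correlated;
# only x2 carries a small class-dependent shift.
x1 = z + rng.normal(0.0, 0.1, n)
x2 = z + 0.3 * y + rng.normal(0.0, 0.1, n)

rho = np.corrcoef(x1, x2)[0, 1]
print(f"correlation rho = {rho:.3f}")        # well above 0.95

X_both = np.column_stack([x1, x2])
X_one = x1.reshape(-1, 1)

# Cross-validated accuracy with both features vs. with x1 alone.
acc_both = cross_val_score(LogisticRegression(), X_both, y, cv=5).mean()
acc_one = cross_val_score(LogisticRegression(), X_one, y, cv=5).mean()
print(f"accuracy with both features: {acc_both:.3f}")
print(f"accuracy with x1 only:      {acc_one:.3f}")
```

With both features, the model can learn a weight pattern close to $x_2 - x_1$, which cancels the shared factor $z$ and isolates the class shift; with $x_1$ alone the accuracy collapses to roughly chance. This is the kind of situation I suspect might be happening in my data.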