I am working with a high-dimensional dataset (107 columns, 800 rows). All columns are binary, each indicating whether a specific value is present or not. I used pandas' Pearson correlation coefficient to check for multicollinearity among the independent variables. Suppose I am checking the correlation coefficient between columns A and B, where A is a somewhat irrelevant column for the feature set, while B is more important. If the correlation coefficient between A and B is highly positive, meaning it is close to 0.9 or 1, I could drop one of the columns. Basically, I can get rid of A (since it is somewhat contextually irrelevant) or combine it with B.

But if the correlation coefficient between A and B is highly negative, i.e., it is -0.9 or -1, what does that mean for the dataset? Can I still drop the less relevant column (column A) or combine it with column B?

pandi20

1 Answer

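If the correlation is $r_{A,B} = -0.9$, then $r_{A,-B} = 0.9$ and, likewise, $r_{-A,B} = 0.9$. For binary columns, negating a column is equivalent, as far as correlation is concerned, to taking its complement $1 - B$, because Pearson correlation is unchanged by adding a constant.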
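The sign flip is easy to check numerically. The following is a minimal sketch on synthetic data; the columns `a` and `b`, the 5% disagreement rate, and the seed are all assumptions made up for illustration, not values from the question.

```python
import numpy as np

# Synthetic binary columns (illustrative assumption, not the asker's data):
# a is the complement of b, except on roughly 5% of rows where they agree.
rng = np.random.default_rng(0)
b = rng.integers(0, 2, size=800)
agree = rng.random(800) < 0.05
a = np.where(agree, b, 1 - b)

r_ab = np.corrcoef(a, b)[0, 1]            # strongly negative
r_a_negb = np.corrcoef(a, -b)[0, 1]       # same magnitude, positive sign
r_a_compb = np.corrcoef(a, 1 - b)[0, 1]   # complement behaves like negation

print(r_ab, r_a_negb, r_a_compb)
```

Negating `b` or replacing it with `1 - b` changes only the sign of the coefficient, not its magnitude.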

In other words, you do the same thing with a high negative correlation as you would with a high positive correlation. The only difference will be the sign of the effect of the variables.
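In practice this means screening on the absolute value of the correlation rather than on its sign. Here is one way that could look with pandas; the helper name `highly_correlated_pairs`, the 0.9 threshold, and the toy frame are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_i, col_j, r) for column pairs with |r| >= threshold.

    Hypothetical helper: treats strong negative and strong positive
    Pearson correlations the same way by comparing abs(r).
    """
    corr = df.corr(method="pearson")
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy data (assumption): A is the exact complement of B, C is unrelated.
rng = np.random.default_rng(1)
b = rng.integers(0, 2, size=800)
toy = pd.DataFrame({"A": 1 - b, "B": b, "C": rng.integers(0, 2, size=800)})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

Each reported pair is a candidate for dropping or combining, whichever sign the coefficient has.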

So if A and B are strongly negatively correlated and you include A, it might have, say, a positive effect on the outcome. If you include B instead, it would have a negative effect on the outcome, but of similar magnitude.

I would try both models and cross-validate, but I'm guessing you'll have more success by keeping just the more relevant covariate, since the two carry essentially the same information.
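A rough sketch of that comparison, assuming scikit-learn is available and a binary outcome: the data, the logistic-regression model, the noise rates, and the 5-fold setup are all illustrative assumptions, not details from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): b drives the outcome y with 10% label noise,
# and a is the complement of b except on roughly 5% of rows.
rng = np.random.default_rng(0)
n = 800
b = rng.integers(0, 2, size=n)
a = np.where(rng.random(n) < 0.05, b, 1 - b)
y = np.where(rng.random(n) < 0.10, 1 - b, b)

def evaluate(feature):
    """Cross-validated accuracy and fitted coefficient for a one-column model."""
    X = feature.reshape(-1, 1)
    model = LogisticRegression()
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    coef = model.fit(X, y).coef_[0, 0]  # refit on all rows to read the sign
    return score, coef

score_a, coef_a = evaluate(a)
score_b, coef_b = evaluate(b)
print(f"A only: accuracy={score_a:.3f}, coefficient={coef_a:.2f}")
print(f"B only: accuracy={score_b:.3f}, coefficient={coef_b:.2f}")
```

In this toy setup the two coefficients come out with opposite signs, and the cross-validated scores show which single column is worth keeping.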

kqr