
I'm applying LogisticRegression to the breast cancer dataset.

Steps:

1- A correlation matrix showed that only four features have a positive (>0) correlation with the target.

2- I used these four features and got very low train and test accuracies (0.55-0.63) with LogisticRegression and some other models.

3- Thinking the model could not learn properly because I was not using enough features, I added 4 more features at random. These features have negative correlations with the target, in the range [-0.7, -0.3].

4- With 8 features in total, the train and test accuracies shot up to >0.9.

How can features that have a negative correlation with the target improve the model?

1 Answer


There’s nothing wrong with negative correlation. All negative correlation means is that, as one variable increases, the other tends to decrease. If this kind of relationship is strong, then the correlation will be close to $-1$. For instance, as speed increases, travel time decreases. I would still call aircraft speed a strong predictor of how long a flight takes, and I would include such a variable in a regression that predicts flight time, despite the negative correlation.
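The flight example can be sketched numerically. This is only an illustration on simulated data: the route length, speed range, and noise level below are made-up assumptions, not values from the question. It shows that the sign of the correlation says nothing about predictive strength: negating the feature flips the sign of $r$ but leaves $r^2$, and hence the fit of a simple linear regression, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one fixed route flown at varying cruise speeds.
distance_km = 5000.0
speed = rng.uniform(700, 900, size=200)                      # km/h
time = distance_km / speed + rng.normal(0, 0.05, size=200)   # hours, with noise

# Pearson correlation between speed and flight time: strongly negative.
r = np.corrcoef(speed, time)[0, 1]

# Negating the feature flips the sign of r but not its magnitude.
r_flipped = np.corrcoef(-speed, time)[0, 1]

# Simple least-squares line: time ~ slope * speed + intercept.
slope, intercept = np.polyfit(speed, time, 1)
pred = slope * speed + intercept
r2 = 1 - np.sum((time - pred) ** 2) / np.sum((time - time.mean()) ** 2)

print(f"r = {r:.3f}, r with negated speed = {r_flipped:.3f}")
print(f"slope = {slope:.5f}, R^2 of the linear fit = {r2:.3f}")
```

For a one-feature linear regression, $R^2 = r^2$, so a correlation near $-1$ is just as useful as one near $+1$.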

There are issues with univariate screening of features based on correlation with the outcome variable, yes, but screening only for positive correlation additionally puts you in a position where you can miss strong predictors.

Also note the drawbacks of using accuracy as a performance indicator.

Dave