6

I'm fitting a logistic regression model to predict probabilities from a set of variables. I'm comparing two such models, say M1 and M2. The only difference is that M2 includes all the variables of M1 plus a few more variables. The idea is to see which variables are useful in predicting my dependent variable.

I expected the AUC to be non-decreasing with the addition of new variables: if the new variables have predictive power they should increase the AUC, and if they don't, the AUC should be unaffected. But I find that the AUC actually decreases as I add a particular set of new variables. What could be the issue here?

I'm using predict() to get the predicted probabilities. Does it automatically drop all the statistically insignificant variables when calculating the predicted value? Could this be the cause of the drop in AUC?

Nick Stauner
user46768
  • The most important missing piece of information for me is whether you are evaluating the AUC on the training data, or whether you use cross-validation or a dedicated test set. – Erik Jun 27 '14 at 10:55
  • I calculate AUC on test data, not training. – user46768 Jun 27 '14 at 11:17
  • In this case I recommend reading up on overfitting. There should be some answers around here. – Erik Jun 27 '14 at 12:27

3 Answers

4

The effect of uninformative features depends largely on your modeling strategy. For some approaches they are irrelevant while for others they can dramatically decrease overall performance.

Your intuition that using more features should necessarily yield a better model is wrong.

Marc Claesen
  • Can you elaborate on why this happens? I have an econometrics background, and when you add variables there, R-squared, a measure of the model's fit, always goes up. Does predict() use the statistically insignificant variables as well when predicting values? That is the only reason I can think of for why this could happen. – user46768 Jun 27 '14 at 11:20
  • 3
    user46788: The coefficient of determination ($R^2$) doesn't necessarily increase on the test set when you include more predictors. See @Erik's comment. – Scortchi - Reinstate Monica Jun 27 '14 at 13:21
  • 2
    The reason varies depending on the modelling strategy. Some examples: (1) for distance-based methods such as kNN, adding irrelevant features may have a large impact on the relative distance between instances, (2) for logistic regression, adding features means adding coefficients that need to be estimated (which can degrade the estimates of coefficients related to good features if you don't have enough training instances). – Marc Claesen Jun 27 '14 at 13:49
1

Four years late, but I just had the same experience.

For logistic regression, the model should be smart enough to disregard useless variables. There is no constraint preventing the coefficients of these variables from being 0.

It is important to remember how logistic regression works. I believe the model optimises squared error, not AUC directly, so you might want to check whether your MSE improved when your AUC deteriorated. In my case, my MSE did improve despite my AUC getting worse.

I did notice that there is sometimes a very small increase in my MSE with more features. I suspect it comes down to one of the model's default parameters, perhaps the maximum number of iterations or the convergence tolerance. BTW, I am using logistic regression from sklearn.
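To see that a calibration-type loss and AUC need not move together, here is a tiny hand-built example (toy numbers, not from my model) where the Brier score, i.e. the mean squared error of the predicted probabilities, improves while the AUC gets worse:

```python
from sklearn.metrics import brier_score_loss, roc_auc_score

y = [0, 0, 1, 1]
p_old = [0.40, 0.45, 0.50, 0.90]  # perfect ranking, mediocre calibration
p_new = [0.10, 0.45, 0.40, 0.90]  # one negative better calibrated, one ranking error

print(roc_auc_score(y, p_old), brier_score_loss(y, p_old))  # 1.00  0.155625
print(roc_auc_score(y, p_new), brier_score_loss(y, p_new))  # 0.75  0.145625
```

The second set of predictions has a lower (better) Brier score because the first negative case moved much closer to 0, yet its AUC is lower because one positive now ranks below one negative. AUC only cares about the ordering of the scores, not their values.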

Joseph
  • Welcome to the site. Was this intended as an answer to the OP's question, or a comment contributing to the general discussion? Please only use the "Your Answer" field to provide answers to the original question. You will be able to comment anywhere when your reputation is >50. Since you're new here, you may want to take our [tour], which has information for new users. – gung - Reinstate Monica May 16 '18 at 01:33
  • 3
    LR does not optimize squared error. Generally they are fit by maximizing the likelihood. This should also maximize the AUC w/i the constraint that the model's form, set of variables, etc, remains constant. – gung - Reinstate Monica May 16 '18 at 01:35
0

Check whether you have missing values in the new variables. Logistic regression rejects cases with missing data and fits the model only on complete cases. You must make sure that you are comparing discrimination in the same cohorts.
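As a quick check (with hypothetical data), you can count how many complete cases each model would actually be fit on; if the new variables contain missing values, M2 is estimated and evaluated on a smaller cohort than M1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "y":  [0, 1, 0, 1, 1, 0],
    "x1": [0.1, 0.5, 0.2, 0.8, 0.7, 0.3],        # used by both M1 and M2
    "x2": [1.0, np.nan, 0.4, np.nan, 0.9, 0.2],  # new variable, with NAs
})

n_m1 = len(df[["y", "x1"]].dropna())        # complete cases available to M1
n_m2 = len(df[["y", "x1", "x2"]].dropna())  # complete cases available to M2
print(n_m1, n_m2)  # 6 4
```

If these counts differ, the two AUCs are computed on different subsets of the data and are not directly comparable.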

  • No, I'm sure I don't have missing values. Both the models are trained on the same dataset and prediction is done on identical test datasets. – user46768 Jun 27 '14 at 11:21