I would like to know how to choose between model fit (calibration) and AUC when building a predictive model. For example, if I have a predictor that improves the model fit but lowers the AUC (e.g., from 0.80 to 0.70), would you still include this predictor in the predictive model?

Thank you for the feedback!

  • By what measure does the model fit improve? – Dave Nov 28 '21 at 05:18
  • @Dave I am thinking of using the likelihood ratio test to compare models – R Beginner Nov 28 '21 at 18:14
  • Likelihood ratio testing considers the performance of the log loss, which is a strictly proper scoring rule. AUC is not strictly proper, so there is a sense in which log loss is preferred to AUC. Frank Harrell, for instance, considers log loss the gold standard for model performance. – Dave Nov 28 '21 at 18:24
  • Thank you! When it comes to model fit, between AIC and likelihood ratio testing, which one is better? – R Beginner Nov 28 '21 at 19:33
  • Do you mean AIC or AUC? – Dave Nov 28 '21 at 20:00
  • AIC (Akaike information criterion)! – R Beginner Nov 28 '21 at 20:49
  • @Dave Just want to follow up with one question: if I include a predictor in my model, it improves the model fit (i.e., the p-value from the likelihood ratio test between the nested model and my new model is <0.05), but the p-value of this predictor itself is >0.05. Should I still include this predictor in my model? – R Beginner Dec 01 '21 at 19:16
  • Variable selection is its own topic that warrants its own question (or textbook...or master's degree in statistics). Briefly, what you're doing is some kind of stepwise regression, which is problematic. You seem to be working with collinear features, too, which makes stepwise regression even more problematic. – Dave Dec 01 '21 at 19:46

1 Answer

Some of the trouble here is that $AUC$ and the likelihood ratio test are based on different ideas.

The $AUC$ measures the extent to which the predictions are separated by true category: the ability of the model to discriminate between the categories. Notably, if you divide the predictions by two, or apply any other monotonically increasing transformation (multiplying by $1/2$ is an increasing function), you do not change the order of the predictions, so you do not change how well the predictions separate the two categories. Consequently, the $AUC$ says nothing about output calibration, i.e., whether a predicted probability of $p$ corresponds to the event truly happening with probability $p$.
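
To make the monotone-invariance point concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the data are simulated so that the original probabilities are well calibrated) in which halving the predictions leaves the $AUC$ untouched but worsens the log loss:

```python
# AUC is invariant to monotone transformations of the scores;
# log loss (the likelihood-based metric) is not.
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=5000)  # "true" event probabilities
y = rng.binomial(1, p_true)                  # labels drawn from them, so p_true is calibrated
p_half = p_true / 2                          # monotone transformation: same ranking, broken calibration

print(roc_auc_score(y, p_true), roc_auc_score(y, p_half))  # identical AUC values
print(log_loss(y, p_true), log_loss(y, p_half))            # log loss is worse for p_half
```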

The likelihood, on the other hand, reflects both the model's ability to discriminate and the calibration of its outputs. Consequently, if adding a variable slightly lowers the ability to discriminate but dramatically improves the calibration, the likelihood favors adding that variable, even though the $AUC$ will suffer.
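
If you want to see the likelihood-based comparison in code, here is a sketch (assuming statsmodels, SciPy, and scikit-learn; the data and variable names are purely illustrative) of comparing nested logistic regressions with a likelihood ratio test while also reporting each model's $AUC$:

```python
# Compare nested logistic regressions by a likelihood ratio test
# (which scores the full predictive distribution) and also report AUC
# (which scores only the ranking of the predictions).
import numpy as np
import statsmodels.api as sm
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eta = -0.5 + 1.0 * x1 + 0.5 * x2             # illustrative linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))  # binary outcome

X_small = sm.add_constant(x1)                         # reduced model: intercept + x1
X_full = sm.add_constant(np.column_stack([x1, x2]))   # full model: intercept + x1 + x2

fit_small = sm.Logit(y, X_small).fit(disp=0)
fit_full = sm.Logit(y, X_full).fit(disp=0)

# Likelihood ratio statistic: twice the gain in log-likelihood,
# compared to a chi-squared distribution with df = number of added parameters.
lr_stat = 2 * (fit_full.llf - fit_small.llf)
p_value = stats.chi2.sf(lr_stat, df=1)

auc_small = roc_auc_score(y, fit_small.predict(X_small))
auc_full = roc_auc_score(y, fit_full.predict(X_full))
print(f"LR stat = {lr_stat:.2f}, p = {p_value:.4f}, AUC: {auc_small:.3f} -> {auc_full:.3f}")
```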

If you want to assess fit according to the likelihood but also want some kind of "absolute" measure of performance (it is hard to say that any particular score counts as "good", but it is nice to give some context to a likelihood value, which lacks the easy interpretation of a metric like mean absolute error), you might consider McFadden's $R^2$, which compares the log-likelihood of your model (the numerator of the fraction below) to the log-likelihood of a reasonable baseline model that always predicts the overall event rate (the denominator).

$$ R^2_{\text{McFadden}} = 1 - \frac{ \sum_{i=1}^{N}\left[ y_i\log(\hat y_i) + (1 - y_i)\log(1 - \hat y_i) \right] }{ \sum_{i=1}^{N}\left[ y_i\log(\bar y) + (1 - y_i)\log(1 - \bar y) \right] } $$

In the equation above, $y_i\in\left\{0, 1\right\}$ are the true labels, $\hat y_i$ are the predicted probabilities, and $\bar y$ is the overall probability of the event coded as $1$.
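
For reference, here is a minimal sketch of computing the quantity above directly with NumPy (the helper name `mcfadden_r2` is just illustrative, not a standard library function):

```python
import numpy as np

def mcfadden_r2(y, p_hat):
    """McFadden's R^2: 1 minus the ratio of the model's log-likelihood
    to that of a baseline that always predicts the overall event rate."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    ll_model = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    y_bar = y.mean()  # overall probability of the event coded as 1
    ll_null = np.sum(y * np.log(y_bar) + (1 - y) * np.log(1 - y_bar))
    return 1 - ll_model / ll_null
```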

While McFadden's $R^2$ does not seem to be as popular in machine learning circles as the $AUC$, it is part of the literature, and a big part of me thinks that it should be a more popular metric.

Dave