I've seen this question answered here but I do not understand the answer. Harrell recommends using deviance-based measures. David Hand (referenced in the thread) says that the AUC is inappropriate for comparing different models because it effectively applies different misclassification costs to different classifiers. I don't see how this would be the case when comparing lasso and elastic net, given that they are both trained on the same set of predictors.
1 Answer
Harrell's complaint about using ROCAUC to compare model performance is that it is not sensitive enough to catch real differences. Tests that lack the ability to notice what they are meant to detect are problematic. Getting away from statistics, imagine a cancer screening that has minimal ability to detect cancer: why bother with it?
One place where ROCAUC is deficient is that it says nothing about how well calibrated the predictions are. Your models predict probabilities of event occurrence. If an event is consistently predicted to happen with a probability of $0.9$ yet only happens $60\%$ of the time, then that predicted $0.9$ is, in some sense, not telling the truth. ROCAUC will not catch this: it depends only on how the predictions rank the observations, so you can halve (or divide by a billion) all of the predicted probabilities without changing it.
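As a quick illustration (a minimal simulated sketch, not from the original answer; it assumes scikit-learn and made-up data), halving the predictions leaves ROCAUC untouched because the ranking of the observations is unchanged, while the log loss immediately gets worse:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(42)
p_true = rng.uniform(0.05, 0.95, size=10_000)  # simulated "true" event probabilities
y = rng.binomial(1, p_true)                    # observed 0/1 outcomes

p_hat = p_true          # perfectly calibrated predictions
p_halved = p_hat / 2    # same ranking, badly miscalibrated

print(roc_auc_score(y, p_hat), roc_auc_score(y, p_halved))  # identical AUCs
print(log_loss(y, p_hat), log_loss(y, p_halved))            # halved version scores worse
```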
Comparing your LASSO and elastic net models on ROCAUC will tell you which method yields the model with the better ability to discriminate between the two categories, but it tells you nothing about the calibration of the output. If you plan to use the raw probabilities, and Harrell has argued in many places why they are useful, you will want to know how well calibrated they are.
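One way to look at calibration directly is a reliability curve, which plots observed event frequency against predicted probability. Here is a hedged sketch reusing the simulated data from above; `calibration_curve` and matplotlib are my choices for illustration, not anything implied by the original answer:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)
p_true = rng.uniform(0.05, 0.95, size=10_000)  # same simulated setup as above
y = rng.binomial(1, p_true)

# Observed event frequency vs. mean predicted probability in 10 bins
frac_pos, mean_pred = calibration_curve(y, p_true / 2, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="halved predictions")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("mean predicted probability")
plt.ylabel("observed event frequency")
plt.legend()
plt.show()
```

A well-calibrated model tracks the diagonal; the halved predictions sit well below it even though their ROCAUC is unchanged.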
Deviance-based statistics are related to the log loss, which is a strictly proper scoring rule that can be thought of as seeking out the true probabilities. Log loss considers all aspects of prediction quality, not just the ability to distinguish between the categories. In that regard, functions of the log loss, such as deviance, give a more complete sense of the performance.
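For binary 0/1 outcomes the deviance is just twice the summed per-observation log loss, so comparing models on out-of-sample log loss amounts to comparing them on deviance. Below is a rough sketch of scoring a LASSO and an elastic net logistic regression on both log loss and ROCAUC with cross-validation; the dataset, penalty strengths, and mixing parameter are hypothetical choices for illustration, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 2,000 observations, 50 predictors, 10 of them informative
X, y = make_classification(n_samples=2_000, n_features=50, n_informative=10,
                           random_state=0)

# Illustrative penalty settings only; in practice you would tune C (and l1_ratio)
lasso = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5_000)
enet = LogisticRegression(penalty="elasticnet", solver="saga", C=0.1,
                          l1_ratio=0.5, max_iter=5_000)

for name, model in [("LASSO", lasso), ("elastic net", enet)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    # neg_log_loss is the negative mean log loss; deviance is 2 * n * mean log loss
    nll = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss").mean()
    print(f"{name}: cross-validated AUC = {auc:.3f}, log loss = {nll:.3f}")
```

The log loss comparison is the deviance-based comparison Harrell recommends, while the AUC column only reflects discrimination.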