
I have a risk model that I want to evaluate on different (patient) groups in order to compare how well the model is working on each of them. The groups may differ in size, baseline / prevalence / class balance, disease subtype / difficulty to diagnose, and basically everything else you can imagine.

(Under which conditions) is it meaningful to compare within-group AUROCs between these groups and/or with the AUROC obtained over all data (in order to quantify per-group discriminative ability)?

Some notes and initial thoughts:

  • We know that AUROC is invariant to class balance, so that should not be an issue.
  • While I never got a really satisfactory answer to this question (it is possible that I simply did not understand the given answer correctly), it seems to me that sample size biases should also not be an issue. (Except for higher variance in smaller samples, of course, requiring some kind of AUROC uncertainty quantification for proper comparison. That's an issue I am not interested in here.)
  • The remaining potential problem, then, is other group differences, especially those relating to the risk distributions within the different groups. This paper seems to suggest that there is an issue with naively comparing within-group AUROCs between groups, but I honestly could never quite wrap my head around what exactly the problem is.
  • AUROC measures pure discriminative ability, so comparisons of within-group AUROC might hide wild miscalibration issues. I am aware of that; this question is about whether there are additional issues related to the assessment of the model's discriminative ability. There is some strange interplay with calibration here, however. Imagine risk estimates in one group being systematically far too low. Across all data, the model would be performing very poorly on this group (high FNR for many thresholds), but the within-group AUROC might still be high.
  • AUROC is of course equivalent to the probability of ranking a random positive example above a random negative example. This immediately suggests to me that within-group AUROC measures something very different from AUROC across all groups and provides the motivation for the xAUC metrics proposed in the paper linked above.
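
To make the last two points concrete, here is a minimal simulated sketch. Everything in it is an assumption for illustration only: the two groups, the normal score distributions, the shift of -2 for group B, and the helper names `auroc` and `simulate_group` are all made up, and the cross-group quantity is only my reading of the xAUC idea, not necessarily the paper's exact definition. Both groups have the same separation between positives and negatives, but all scores in group B are systematically lower.

```python
import random

def auroc(pos, neg):
    """Probability that a random positive scores above a random negative
    (ties count as 1/2), computed by brute force over all pairs."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def simulate_group(rng, n_pos, n_neg, shift):
    """Hypothetical scores: positives ~ N(1 + shift, 1), negatives ~ N(-1 + shift, 1)."""
    pos = [rng.gauss(1.0 + shift, 1.0) for _ in range(n_pos)]
    neg = [rng.gauss(-1.0 + shift, 1.0) for _ in range(n_neg)]
    return pos, neg

rng = random.Random(0)
pos_a, neg_a = simulate_group(rng, 300, 300, shift=0.0)
# Assumed scenario: scores in group B are systematically too low.
pos_b, neg_b = simulate_group(rng, 300, 300, shift=-2.0)

# Within-group AUROCs: positives and negatives from the same group.
auc_a = auroc(pos_a, neg_a)
auc_b = auroc(pos_b, neg_b)

# Pooled AUROC: all positives against all negatives, ignoring group.
auc_pooled = auroc(pos_a + pos_b, neg_a + neg_b)

# Cross-group ranking probabilities (in the spirit of xAUC, as I understand it).
cross_bpos_aneg = auroc(pos_b, neg_a)  # positives of B vs. negatives of A
cross_apos_bneg = auroc(pos_a, neg_b)  # positives of A vs. negatives of B

print(f"within A: {auc_a:.3f}, within B: {auc_b:.3f}")
print(f"pooled: {auc_pooled:.3f}")
print(f"B-pos vs A-neg: {cross_bpos_aneg:.3f}, A-pos vs B-neg: {cross_apos_bneg:.3f}")
```

In this setup the two within-group AUROCs come out similar and high, while positives from group B are ranked above negatives from group A only about half the time, and the pooled AUROC falls below both within-group values. With equal numbers of positives and negatives in each group, the pooled value is the average of the two within-group and the two cross-group probabilities.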

My current intuition (partially based on the paper linked above) is that there is indeed some fundamental problem with this very widely used approach, but my intuition is often wrong and I find it hard to put my finger on the exact issue. If there is a problem, an illustrative example that makes the issue very clear would be highly appreciated.

Eike P.