2

Some metrics are sample size-biased, i.e., they will give biased estimates in small sample sizes. Is there a formal proof that AUROC is either sample size-biased or unbiased? I.e., will the expectation over the AUROC computed over test sets of size 100 be the same as the expectation over the AUROC computed over test sets of size 10,000 drawn from the same distribution?

Eike P.
  • 3,048

1 Answer

2

As with almost any sample-derived statistic, the empirical AUC-ROC is indeed affected by sample size, but smaller samples do not bias it; a smaller sample size directly affects its variance. We should note that the calculation is exact (i.e. we know all the pairs), so there is no "under-" or "over-" estimation in the empirical calculation given a fixed (smaller or larger) sample size in the context of classification. The number of concordant pairs (i.e. Wilcoxon’s rank-sum test statistic) and the probability of concordance (i.e. the ratio of the number of concordant pairs over all possible pairs) are not biased by smaller or larger sample sizes.
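To make the "we know all the pairs" point concrete, here is a minimal sketch of the concordance-probability calculation (my own illustration, not taken from any paper), assuming binary labels `y` (1 = positive) and continuous scores `s`:

```python
import numpy as np

def auc_by_concordance(y, s):
    """Empirical AUC-ROC as the probability of concordance:
    P(score of a random positive > score of a random negative),
    with tied pairs counted as 1/2."""
    y, s = np.asarray(y), np.asarray(s)
    pos, neg = s[y == 1], s[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()  # concordant pairs
    ties = (pos[:, None] == neg[None, :]).sum()    # tied pairs count as 1/2
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

Because every positive/negative pair is enumerated, the statistic is an exact function of the sample; within the sample itself there is nothing left to estimate.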

We have thus an unbiased estimator of the empirical AUC-ROC; the recent(ish) paper Confidence Intervals for the Area Under the Receiver Operating Characteristic Curve in the Presence of Ignorable Missing Data (2019) by Cho et al. looks at this in more detail in a modern and approachable manner. To that effect, there has been a constant stream of papers on this since probably the 1970s, as the equivalence between the AUC-ROC and Wilcoxon’s rank-sum test statistic has been fertile ground for such work. sklearn uses trapezoidal integration of the ROC curve, which is equivalent to the concordance-based AUC because it too uses all the possible threshold values. Note that, unlike other integration tasks, we know that we have a "real" discontinuous step function, so no smoothing is needed; Hand & Till (2001)'s paper A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems comments on this further in Section 2 and cautions against smoothing. CV.SE has an excellent thread, "How to calculate Area Under the Curve (AUC), or the c-statistic, by hand", that is also relevant (and shows how smoothing leads to a slightly biased AUC-ROC estimate in one case). Oh, and if we want a "within CV.SE" derivation of why this AUC is the same as this probability of concordance, it is here.
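For completeness, a quick numerical check (not a proof) that sklearn's trapezoidal integration agrees with the concordance probability, reusing `auc_by_concordance` from the sketch above; the labels and scores are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)   # binary labels
s = y + rng.normal(size=200)       # scores mildly informative of the label

fpr, tpr, _ = roc_curve(y, s)      # the empirical (step-function) ROC curve
print(auc(fpr, tpr))               # trapezoidal integration of the ROC curve
print(roc_auc_score(y, s))         # sklearn's AUC directly
print(auc_by_concordance(y, s))    # probability of concordance (sketch above)
```

All three values coincide because no smoothing is applied anywhere.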

The above being said, we should not overlook the issue of variance, as Bounding Sample Size Projections for the Area Under a ROC Curve (2009) by Blume emphasises: "sample size projections dealing with AUCs are very sensitive to assumptions about the variance of the empirical AUC estimator". So while we get an unbiased estimate, the variance can really mislead us. Obviously, what we are dealing with is not the population ROC curve but rather a sample ROC curve, which is why it is important to be aware of its variance; a short informal commentary on sampling variation can also be found in the SAS blog post "ROC curves for a binormal sample". To that effect, Small-sample precision of ROC-related estimates (2010) by Hanczar et al. also comments that the variance can be significant. That is particularly noticeable in use cases with fewer than 100 samples, and for that reason they suggest a simulation approach to get more accurate variance estimates; in terms of bias, though, they find no real problem with any of the standard resampling approaches (bootstrap, LOO, 10-fold CV).
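Since the question is phrased in terms of test sets of size 100 versus 10,000, a small simulation sketch can illustrate both points at once; the binormal population below is my own toy setup, not taken from any of the cited papers. The mean empirical AUC should be essentially identical at both sizes (no bias), while the standard deviation at n = 100 is roughly ten times larger:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def sample_auc(n, reps=1000):
    """Empirical AUC over `reps` test sets of size `n`, drawn from
    a binormal population (positive scores shifted up by one unit)."""
    out = np.empty(reps)
    for i in range(reps):
        y = rng.integers(0, 2, size=n)
        s = rng.normal(loc=y)   # N(0,1) for negatives, N(1,1) for positives
        out[i] = roc_auc_score(y, s)
    return out

for n in (100, 10_000):
    a = sample_auc(n)
    print(f"n={n:>6}: mean AUC = {a.mean():.4f}, sd = {a.std():.4f}")
```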

usεr11852
  • 44,125
  • Interesting. Can you point me to a proof for the fact that smaller sample sizes do not cause bias? E.g., in the first paper you linked, they just write that "it is straightforward to have an unbiased estimator of the AUC as ..." and proceed to write down a rule for calculating AUROC, without any further explanation or citation regarding the unbiasedness. (Also, the calculation rule they write down differs from the standard TPR/FPR trapezoidal integration scheme that e.g. scikit-learn implements, and it is not obvious to me whether the unbiasedness would still hold for that implementation.) – Eike P. Jan 25 '23 at 10:39
  • Thank you for your comment, these are good clarifications to add. I should have said that the calculation is exact (i.e. we know all the pairs), so there is no "under-" or "over-" estimation in regard to the empirical calculations given a fixed (smaller or larger) sample size in the context of classification. This is what is described as the $c$-index / concordance index / Wilcoxon’s rank-sum test statistic, i.e. the count of concordant pairs. (Cont.) – usεr11852 Jan 25 '23 at 12:33
  • sklearn uses trapezoid integration of the ROC curve, which is equivalent to its AUC. CV.SE has a relevant thread here. (I will amend the main answer accordingly.) – usεr11852 Jan 25 '23 at 12:33
  • The exactness argument seems to suppose there is no larger population from which the sample is drawn. If the sample is from some larger (esp. infinite, continuous) distribution, then is the sample AUROC unbiased for the population AUROC? – Ben Reiniger Jan 25 '23 at 14:16
  • @BenReiniger Yes; if anything, I was careful to say "empirical" a number of times. And yes, that goes for all exact tests of course. And that's why I mention Blume's critique too. I also added a point about bias/variance from a simulation study. I hope this helps. – usεr11852 Jan 25 '23 at 15:34
  • @usεr11852 Thanks again for the comprehensive answer, all this info is really appreciated! I am, however, still missing a clear argument (or reference to one) for why the AUROC / c-statistic is sample size-unbiased. You write that it is unbiased, but why? Maybe I am just not understanding your argument? I investigated quite a bit but sadly was not able to find a clear formal statement to this effect, neither in the papers you linked nor in various others I consulted. (I understand that AUC is identical to e.g. the c-statistic, but then I am lacking a proof for the unbiasedness of that...) – Eike P. Feb 18 '23 at 13:48