Some metrics are sample-size-biased, i.e., they give biased estimates in small samples. Is there a formal proof that AUROC is either sample-size-biased or unbiased? That is, will the expectation of the AUROC computed over test sets of size 100 equal the expectation of the AUROC computed over test sets of size 10,000 drawn from the same distribution?
1 Answer
As with almost any sample-derived statistic, the empirical AUC-ROC is indeed affected by sample size, but smaller samples do not bias it; a smaller sample size directly inflates its variance. Note that the calculation is exact (i.e., we enumerate all positive-negative pairs), so there is no "under-" or "over-" estimation in the empirical calculation given a fixed (smaller or larger) sample size in the context of classification. Both the count of concordant pairs (i.e., Wilcoxon's rank-sum test statistic) and the probability of concordance (i.e., the ratio of the number of concordant pairs over all possible pairs) are unbiased regardless of sample size.
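To make this concrete, here is a minimal sketch (with hypothetical synthetic scores, not from any of the cited papers) showing that the fraction of concordant pairs coincides with sklearn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical scores: positives drawn with a higher mean than negatives.
pos = rng.normal(1.0, 1.0, size=20)   # scores of the positive class
neg = rng.normal(0.0, 1.0, size=30)   # scores of the negative class

# Probability of concordance: the fraction of (positive, negative) pairs
# in which the positive sample scores higher (ties counted as 1/2).
diffs = pos[:, None] - neg[None, :]
auc_pairs = (diffs > 0).mean() + 0.5 * (diffs == 0).mean()

y_true = np.concatenate([np.ones(20), np.zeros(30)])
y_score = np.concatenate([pos, neg])
print(auc_pairs, roc_auc_score(y_true, y_score))  # the two values coincide
```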
The empirical AUC-ROC is thus an unbiased estimator; the recent(ish) paper Confidence Intervals for the Area Under the Receiver Operating Characteristic Curve in the Presence of Ignorable Missing Data (2019) by Cho et al. looks at this in more detail in a modern and approachable manner.
To that effect, there has been a steady stream of papers on this since probably the 1970s, as the equivalence between the AUC-ROC and Wilcoxon's rank-sum test statistic has been fertile ground for such work.
sklearn uses trapezoidal integration of the ROC curve, which is equivalent to the pairwise AUC because it too uses all possible threshold values. Note that, unlike other integration tasks, we know we have a "real" discontinuous step function, so no smoothing is needed; Hand & Till (2001), in Section 2 of their paper A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems, comment on this further and caution against smoothing. CV.SE has an excellent relevant thread, "How to calculate Area Under the Curve (AUC), or the c-statistic, by hand", that is also relevant (and shows how smoothing leads to a slightly biased AUC-ROC estimate in one case). Oh, and if we want a within-CV.SE derivation of why this area equals the probability of concordance, it is here.
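A quick sketch of that equivalence (again with made-up scores; the trapezoid sum is written out by hand rather than via a library call):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)
# Noisy but informative scores (purely illustrative).
y_score = y_true + rng.normal(0.0, 1.0, size=200)

fpr, tpr, _ = roc_curve(y_true, y_score)  # one point per observed threshold
# Trapezoidal rule applied to the (step-function) ROC curve:
auc_trapezoid = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(auc_trapezoid, roc_auc_score(y_true, y_score))  # identical
```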
The above being said, we should not overlook the issue of variance. As Bounding Sample Size Projections for the Area Under a ROC Curve (2009) by Blume emphasises, "sample size projections dealing with AUCs are very sensitive to assumptions about the variance of the empirical AUC estimator"; so while we get an unbiased estimate, the variance can really mislead us. What we are dealing with is not the population ROC curve but a sample ROC curve, which is why it is important to be aware of its variance; a short informal commentary on sampling variation can also be found in the SAS blog post "ROC curves for a binormal sample". To that effect, Small-sample precision of ROC-related estimates (2010) by Hanczar et al. also note that the variance can be substantial, particularly in use cases with fewer than 100 samples, and for that matter they suggest a simulation approach to get more accurate variance estimates; in terms of bias, though, they find no real problem with any of the standard resampling approaches (bootstrap, LOO, 10-fold CV).
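This is also exactly the simulation that answers the original question. A small sketch, assuming a hypothetical binormal model (positives ~ N(1,1), negatives ~ N(0,1)), shows the mean empirical AUC agreeing across test-set sizes while only the spread changes:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

def auc_over_test_sets(n, n_reps=1000):
    """Mean and SD of the empirical AUC over n_reps test sets of size n,
    drawn from a fixed binormal model (positives ~ N(1,1), negatives ~ N(0,1))."""
    aucs = []
    for _ in range(n_reps):
        y = rng.integers(0, 2, size=n)
        if y.min() == y.max():  # skip degenerate draws containing a single class
            continue
        s = np.where(y == 1, rng.normal(1.0, 1.0, size=n),
                             rng.normal(0.0, 1.0, size=n))
        aucs.append(roc_auc_score(y, s))
    return np.mean(aucs), np.std(aucs)

for n in (100, 10_000):
    mean_auc, sd_auc = auc_over_test_sets(n)
    print(f"n = {n:>6}: mean AUC = {mean_auc:.3f}, SD = {sd_auc:.3f}")
# Both means sit near the population concordance Phi(1/sqrt(2)) ~ 0.760;
# only the spread around that value shrinks as n grows.
```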
sklearn uses trapezoid integration of the ROC curve which is equivalent to its AUC. CV.SE has a relevant thread here. (I will amend the main answer accordingly.) – usεr11852 Jan 25 '23 at 12:33