I have a binary classification problem and two ML models $A$ and $B$. I evaluate the performance of these models using the area under the ROC curve (AUROC). I want to assess whether model $A$ is significantly better than model $B$. My test set is rather small (156 samples). Since the test set is itself a random draw from an unknown distribution (to which I have no further access), the AUROC estimates are also somewhat random. The statistical test should tell me whether the difference in results is likely due to a better model ("statistically significant"), or whether it could well be due to the specific choice of test data.
I have the following approach for testing this using bootstrapping (a code sketch follows the list):
- I draw $n$ bootstrap samples $i\in\{1,\dots,n\}$ of the test set (resampled with replacement), with e.g. $n=1000$.
- For each bootstrap sample $i$, I calculate $\text{AUROC}^A_i$ and $\text{AUROC}^B_i$, as well as their difference $d_i = \text{AUROC}^A_i - \text{AUROC}^B_i$.
- From the distribution of differences $d_i$, I calculate a bootstrap confidence interval (e.g. a percentile interval).
- If this interval lies entirely above or entirely below zero, I call the difference in performance statistically significant; if the interval contains zero, I do not.
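
For concreteness, here is a minimal Python sketch of the procedure described above, assuming `y_true`, `scores_a`, and `scores_b` are the test labels and the two models' predicted scores (these names, and the use of scikit-learn's `roc_auc_score`, are just illustrative choices, not a prescribed implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auroc_diff(y_true, scores_a, scores_b, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CI for AUROC(A) - AUROC(B) on a shared test set."""
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        # Resample test indices with replacement; use the SAME indices for both models
        idx = rng.integers(0, n, size=n)
        # Skip resamples that contain only one class (AUROC undefined there)
        if len(np.unique(y_true[idx])) < 2:
            continue
        auc_a = roc_auc_score(y_true[idx], scores_a[idx])
        auc_b = roc_auc_score(y_true[idx], scores_b[idx])
        diffs.append(auc_a - auc_b)
    diffs = np.array(diffs)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example usage (placeholder data):
# lo, hi = bootstrap_auroc_diff(y_true, scores_a, scores_b)
# Call the difference "significant" at the 5% level if (lo, hi) excludes zero.
```

Note that both models are evaluated on the same resampled indices in each iteration, so the procedure is paired.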
Is this a sound way of testing the difference in performance?