I have a binary classification problem and two ML models $A$ and $B$. I evaluate the performance of these models using the area under the ROC curve (AUROC). I want to assess whether model $A$ is significantly better than model $B$. My test set is rather small (156 samples). Since the test set is itself a random draw from an unknown distribution (to which I have no further access), the AUROC estimates are also somewhat random. The statistical test should tell me whether the difference in results is likely due to a better model ("statistically significant"), or whether it could well be due to the specific choice of test data.
I have the following approach for testing this using bootstrapping (a code sketch follows the list):
- I draw $n$ bootstrap samples $i\in\{1,\dots,n\}$ of the test set (resampled with replacement), with e.g. $n=1000$.
- For each bootstrap sample $i$, I calculate $\text{AUROC}^A_i$ and $\text{AUROC}^B_i$, as well as their difference $d_i = \text{AUROC}^A_i - \text{AUROC}^B_i$.
- From the distribution of differences $d_i$, I calculate a bootstrap confidence interval (e.g. a percentile interval).
- If this interval lies entirely above or entirely below zero, I call the difference in performance statistically significant; if the interval contains zero, I do not.
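
For concreteness, here is a minimal Python sketch of the procedure described above, assuming `y_true`, `scores_a`, and `scores_b` are the test labels and the two models' predicted scores (these names, and the use of scikit-learn's `roc_auc_score`, are just illustrative choices, not a prescribed implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auroc_diff(y_true, scores_a, scores_b, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CI for AUROC(A) - AUROC(B) on a shared test set."""
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        # Resample test indices with replacement; use the SAME indices for both models
        idx = rng.integers(0, n, size=n)
        # Skip resamples that contain only one class (AUROC undefined there)
        if len(np.unique(y_true[idx])) < 2:
            continue
        auc_a = roc_auc_score(y_true[idx], scores_a[idx])
        auc_b = roc_auc_score(y_true[idx], scores_b[idx])
        diffs.append(auc_a - auc_b)
    diffs = np.array(diffs)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example usage (placeholder data):
# lo, hi = bootstrap_auroc_diff(y_true, scores_a, scores_b)
# Call the difference "significant" at the 5% level if (lo, hi) excludes zero.
```

Note that both models are evaluated on the same resampled indices in each iteration, so the procedure is paired.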
Is this a sound way of testing the difference in performance?