TL;DR: There are well-known tests for comparing classifiers. How can they be generalized to classifiers with random training steps?
I am comparing two classification algorithms, A and B. Empirically, algorithm A typically outperforms algorithm B in terms of out-of-sample error (error here meaning misclassification rate). I would like to test whether this difference is statistically significant, i.e., whether A is strictly better than B at some significance level.
I found good answers on this at Cross Validated SE. However, my setting differs in the following sense: these classifiers have random components in the training stage, so training them several times on the same training set yields different classifiers and hence different errors.
What I have for a given dataset is the following:
- For a given train/test split of this dataset, I have 1,000 out-of-sample errors for each method (A and B), where "out-of-sample error" means training the method on the training set (which includes the random component) and measuring the resulting classifier's misclassification rate on the test set.
- I have the above results for 100 random train/test splits of this dataset (a sketch of the resulting data layout follows below).
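For concreteness, here is a minimal sketch of the data layout I have in mind; the array names and the synthetic numbers are placeholders standing in for my real results:

```python
import numpy as np

n_splits, n_runs = 100, 1000          # 100 train/test splits, 1,000 random training runs each
rng = np.random.default_rng(0)

# Placeholder arrays standing in for the real out-of-sample errors:
# entry [s, r] is the misclassification rate of training run r on split s.
err_A = rng.beta(2, 8, size=(n_splits, n_runs))   # errors of algorithm A
err_B = rng.beta(2, 7, size=(n_splits, n_runs))   # errors of algorithm B
```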
What would be the best way to test significance in this setting? Thank you for your time!
EDIT: "probabilistic classifier" looks misleading. I meant, classifiers with random training components -- e.g., a decision tree where the variable to split is taken randomly out of top-x important features).
Thoughts: A naive approach is the following:
- Fix a train/test split.
- Compare the two samples of 1,000 errors and statistically test whether A < B.
- Record the resulting p-value (i.e., the largest significance level at which we would still fail to reject in favor of A < B).

Iterating this over all 100 train/test splits gives 100 p-values, which I can then plot as a histogram (a sketch follows below). Or is there instead a way to compare all $100 \times 1,000$ errors jointly?
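A minimal sketch of this naive approach, assuming a one-sided Mann-Whitney U test per split; the arrays are the placeholders from above, and the specific test is my own choice rather than a recommendation:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mannwhitneyu

n_splits, n_runs = 100, 1000
rng = np.random.default_rng(0)
err_A = rng.beta(2, 8, size=(n_splits, n_runs))   # placeholder errors of algorithm A
err_B = rng.beta(2, 7, size=(n_splits, n_runs))   # placeholder errors of algorithm B

# One one-sided test per train/test split; the alternative hypothesis is
# that A's errors tend to be smaller than B's on that split.
p_values = np.array([
    mannwhitneyu(err_A[s], err_B[s], alternative="less").pvalue
    for s in range(n_splits)
])

# Histogram of the 100 per-split p-values.
plt.hist(p_values, bins=20)
plt.xlabel("per-split p-value for A < B")
plt.ylabel("count")
plt.show()
```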
Another approach would instead be to check whether the expected error of algorithm A is smaller than the expected error of algorithm B: take the average of the 1,000 errors for each fixed split, and then compare these per-split average errors across the 100 train/test splits using the methods from the literature (a sketch follows below).
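A sketch of this second approach, assuming a paired one-sided Wilcoxon signed-rank test on the 100 per-split mean errors; again, the arrays are placeholders and the particular test is just one possible choice:

```python
import numpy as np
from scipy.stats import wilcoxon

n_splits, n_runs = 100, 1000
rng = np.random.default_rng(0)
err_A = rng.beta(2, 8, size=(n_splits, n_runs))   # placeholder errors of algorithm A
err_B = rng.beta(2, 7, size=(n_splits, n_runs))   # placeholder errors of algorithm B

# Average over the 1,000 random training runs within each split,
# giving one estimate of the expected error per split and per method.
mean_A = err_A.mean(axis=1)
mean_B = err_B.mean(axis=1)

# Paired one-sided test across the 100 splits; the alternative hypothesis
# is that A's expected error is smaller than B's.
stat, p = wilcoxon(mean_A, mean_B, alternative="less")
print(f"Wilcoxon signed-rank p-value: {p:.4g}")
```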