TL;DR: There are well-known tests for comparing classifiers. How can they be generalized to classifiers with random training steps?
I am comparing two classification algorithms, A and B. Empirically, algorithm A typically outperforms algorithm B in terms of out-of-sample error (error here meaning misclassification rate). I would like to test whether this difference is statistically significant, i.e., whether A is strictly better than B at some significance level.
I found good answers on this at Cross Validated SE. However, my setting differs in the following sense: these classifiers have random components in the training stage, so training them several times on the same training set yields different classifiers and hence different errors.
What I have for a given dataset is the following:
- For a given train/test split of this dataset, I have 1,000 out-of-sample errors for each method (A and B), where "out-of-sample error" means training the method on the training set (which includes the random component) and measuring the resulting classifier's misclassification rate on the test set.
- I have the above results for 100 random train/test splits of this dataset (a sketch of the resulting data layout follows below).
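For concreteness, here is a minimal sketch of the data layout I have in mind; the array names and the synthetic numbers are placeholders standing in for my real results:

```python
import numpy as np

n_splits, n_runs = 100, 1000          # 100 train/test splits, 1,000 random training runs each
rng = np.random.default_rng(0)

# Placeholder arrays standing in for the real out-of-sample errors:
# entry [s, r] is the misclassification rate of training run r on split s.
err_A = rng.beta(2, 8, size=(n_splits, n_runs))   # errors of algorithm A
err_B = rng.beta(2, 7, size=(n_splits, n_runs))   # errors of algorithm B
```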
What would be the best way to test significance in this setting? Thank you for your time!
EDIT: "probabilistic classifier" looks misleading. I meant, classifiers with random training components -- e.g., a decision tree where the variable to split is taken randomly out of top-x important features).
Thoughts: A naive approach is the following:
- Fix a train/test split.
- Compare the two samples of 1,000 errors and statistically test whether A < B.
- Record the resulting p-value (i.e., the largest significance level at which we would still fail to reject in favor of A < B).

Iterating this over all 100 train/test splits gives 100 p-values, which I can then plot as a histogram (a sketch follows below). Or is there instead a way to compare all $100 \times 1,000$ errors jointly?
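A minimal sketch of this naive approach, assuming a one-sided Mann-Whitney U test per split; the arrays are the placeholders from above, and the specific test is my own choice rather than a recommendation:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mannwhitneyu

n_splits, n_runs = 100, 1000
rng = np.random.default_rng(0)
err_A = rng.beta(2, 8, size=(n_splits, n_runs))   # placeholder errors of algorithm A
err_B = rng.beta(2, 7, size=(n_splits, n_runs))   # placeholder errors of algorithm B

# One one-sided test per train/test split; the alternative hypothesis is
# that A's errors tend to be smaller than B's on that split.
p_values = np.array([
    mannwhitneyu(err_A[s], err_B[s], alternative="less").pvalue
    for s in range(n_splits)
])

# Histogram of the 100 per-split p-values.
plt.hist(p_values, bins=20)
plt.xlabel("per-split p-value for A < B")
plt.ylabel("count")
plt.show()
```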
Another approach would instead be to check whether the expected error of algorithm A is smaller than the expected error of algorithm B: take the average of the 1,000 errors for each fixed split, and then compare these per-split average errors across the 100 train/test splits using the methods from the literature (a sketch follows below).
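A sketch of this second approach, assuming a paired one-sided Wilcoxon signed-rank test on the 100 per-split mean errors; again, the arrays are placeholders and the particular test is just one possible choice:

```python
import numpy as np
from scipy.stats import wilcoxon

n_splits, n_runs = 100, 1000
rng = np.random.default_rng(0)
err_A = rng.beta(2, 8, size=(n_splits, n_runs))   # placeholder errors of algorithm A
err_B = rng.beta(2, 7, size=(n_splits, n_runs))   # placeholder errors of algorithm B

# Average over the 1,000 random training runs within each split,
# giving one estimate of the expected error per split and per method.
mean_A = err_A.mean(axis=1)
mean_B = err_B.mean(axis=1)

# Paired one-sided test across the 100 splits; the alternative hypothesis
# is that A's expected error is smaller than B's.
stat, p = wilcoxon(mean_A, mean_B, alternative="less")
print(f"Wilcoxon signed-rank p-value: {p:.4g}")
```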