I am interested in selecting the best Algorithm (SVM, XGBoost, Random Forest, ...) for a classification problem.
At present my company is using one algorithm (let's call it ALG-A and let's say it's XGBoost), and it asked me to test other algorithms, possibly to find a better one.
Based on what I learned (see this and this) I used Nested (or Double) Cross Validation to perform a comparison between many algorithms.
Let's say the outer loop of the nested CV consists of 10 folds; this results in 10 different (I suppose independent) estimates of the classification performance (let's say the F1-Score). To summarise the results I computed the average over the 10 folds (as in a standard CV process) as well as the standard deviation, and I added a boxplot to show the distribution of the 10 estimates. A rough sketch of this setup is shown below.
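This is roughly the setup I used: a minimal sketch with scikit-learn on a toy dataset, where the estimators, parameter grids, and data are placeholders for my real pipeline rather than the actual configuration.

```python
# Minimal sketch of the nested CV comparison (placeholder estimators/grids/data).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # stand-in for the real data

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # performance estimation

# Each candidate algorithm is wrapped in a GridSearchCV (the inner loop);
# the outer loop then yields 10 F1 estimates per algorithm.
candidates = {
    "SVM": GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, scoring="f1", cv=inner_cv),
    "RandomForest": GridSearchCV(RandomForestClassifier(random_state=0),
                                 {"n_estimators": [100, 300]}, scoring="f1", cv=inner_cv),
}

fold_scores = {}
for name, tuned in candidates.items():
    fold_scores[name] = cross_val_score(tuned, X, y, scoring="f1", cv=outer_cv)
    print(name, fold_scores[name].mean(), fold_scores[name].std())

# Boxplot of the 10 outer-fold estimates per algorithm.
plt.boxplot(list(fold_scores.values()), labels=list(fold_scores.keys()))
plt.ylabel("F1-Score over 10 outer folds")
plt.show()
```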
Now let's say that, based on the average F1-Score, ALG-A ranked second, while ALG-B (let's say it's SVM) ranked first, so I am now only interested in these two algorithms:
| ALG | F1_avg | F1_std |
|---|---|---|
| ALG-B | 0.9515 | 0.0113 |
| ALG-A | 0.9445 | 0.0182 |
| ..... | ...... | ...... |
I showed these results to my supervisor, asserting that ALG-B would be a better choice than ALG-A due to its higher average score and lower standard deviation; also, looking at the boxplot, one can see that the ALG-B distribution is less scattered than that of ALG-A.
He said "Seems reasonable, but the choice of ALG-B over ALG-A is statistically significative?". In other words, he asked me to assess validity of Algorithm Selection.
My approach was the following:
- Storing, for both ALG-A and ALG-B, the F1-Score values over the 10 outer folds.
- Checking the normality assumption (with a Shapiro-Wilk Test) of the differences between the ALG-A and ALG-B metric values (the differences are indeed compatible with normality, with a p-value of 0.182).
- Conducting a Paired Samples T-Test to detect whether there is a difference between the algorithms' performances (unfortunately the test was not significant, with a high p-value of 0.321). A code sketch of these two tests follows this list.
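For reference, the two tests were run roughly like this: a sketch with scipy, where the fold scores are random stand-ins rather than my actual values; in practice they are the 10 paired outer-fold F1 scores stored above.

```python
# Sketch of the normality check and paired t-test on the outer-fold scores.
# The arrays below are random stand-ins, NOT the actual fold scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
f1_alg_a = rng.normal(0.944, 0.018, size=10)  # stand-in for the 10 ALG-A fold scores
f1_alg_b = rng.normal(0.951, 0.011, size=10)  # stand-in for the 10 ALG-B fold scores

# Shapiro-Wilk test on the paired differences (normality assumption).
diff = f1_alg_b - f1_alg_a
shapiro_stat, shapiro_p = stats.shapiro(diff)

# Paired Samples T-Test on the fold-wise scores.
t_stat, t_p = stats.ttest_rel(f1_alg_b, f1_alg_a)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}, paired t-test p-value: {t_p:.3f}")
```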
I have two questions:
- Is this the correct way to proceed? I am confident about the nested CV part and the summary of the results, but I am definitely more concerned about the T-Test part. I read many posts (for example this) asserting that it is not appropriate to conduct a Paired T-Test over CV results, mainly because of the violation of the independence assumption of the test. But they all refer to a standard CV procedure; I was not able to find anything about nested CV + significance test. As far as I understand, the estimates from each (outer) fold should be independent, so would it be reasonable to use a Paired T-Test as I described above? Or, based on what the literature apparently suggests, should I follow this scheme instead:
- Perform the inner-loop CV on the whole dataset in order to select a set of hyperparameters for both ALG-A and ALG-B (yielding CLF-A and CLF-B).
- Conduct a 5x2cv Paired T-Test (Dietterich) or a 5x2cv Combined F-Test (Alpaydin) to compare the performance of CLF-A vs CLF-B (a sketch of this 5x2cv comparison is included at the end of the post)?
This suggested scheme doesn't look fair to me, as I feel that the nested CV part is thrown away, and also because we end up comparing two classifiers rather than two algorithms, which is different from the original aim. Any thoughts that might change my mind?
- Both schemes (the simple Paired T-Test and the 5x2cv T-Test/F-Test) end with a failure to reject the null hypothesis (the p-value for the 5x2cv T-Test is almost identical to the previous one, 0.319). That means that, regardless of the choice of scheme, I will end up with the same answer: "ALG-B (CLF-B) has slightly better performance than ALG-A (CLF-A), both in terms of higher average score and lower standard deviation (and also almost 40 times lower average fitting time), but there is not enough statistical evidence to choose ALG-B over ALG-A". Could this be a good final answer to give?
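For completeness, this is roughly how I ran the 5x2cv comparison: a sketch assuming mlxtend's implementations of the Dietterich and Alpaydin tests and an installed xgboost; the dataset and hyperparameters are placeholders for CLF-A and CLF-B, not my actual configuration.

```python
# Sketch of the 5x2cv comparison between the two classifiers with fixed hyperparameters.
from mlxtend.evaluate import combined_ftest_5x2cv, paired_ttest_5x2cv
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from xgboost import XGBClassifier  # assumption: xgboost with its sklearn wrapper is installed

X, y = make_classification(n_samples=500, random_state=0)  # stand-in for the real data

clf_a = XGBClassifier()            # CLF-A: ALG-A with the hyperparameters chosen by the inner CV
clf_b = SVC(C=1.0, gamma="scale")  # CLF-B: ALG-B with the hyperparameters chosen by the inner CV

# Dietterich's 5x2cv Paired T-Test.
t, p_t = paired_ttest_5x2cv(estimator1=clf_a, estimator2=clf_b, X=X, y=y,
                            scoring="f1", random_seed=0)

# Alpaydin's 5x2cv Combined F-Test.
f, p_f = combined_ftest_5x2cv(estimator1=clf_a, estimator2=clf_b, X=X, y=y,
                              scoring="f1", random_seed=0)

print(f"5x2cv t-test p-value: {p_t:.3f}, 5x2cv F-test p-value: {p_f:.3f}")
```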