
I have to use a p-value test to compare two classifiers. I get the error vectors of the two hypotheses and use their difference as the input to MATLAB's ttest function:

[H,p] = ttest(error,'Tail','both');

I get H = 0 and p = 0.34. As far as I have read, this means that we cannot reject the null hypothesis in this case, but I am not sure how to interpret that. Can anybody help? And what would a one-tailed test do in this case?

KingJames

2 Answers


The $t$-test is not a very good choice to compare classifiers, since their performance is not normally distributed. I suggest considering the Wilcoxon signed-ranks test instead.
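
For instance, a minimal MATLAB sketch, assuming err1 and err2 are the paired error vectors of the two classifiers (the variable names are illustrative):

% Paired Wilcoxon signed-ranks test (Statistics and Machine Learning Toolbox)
% Null hypothesis: the paired differences err1 - err2 have zero median.
[p, h] = signrank(err1, err2);   % h = 1: reject equal performance at the default alpha = 0.05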

For a thorough read on this topic, please refer to this paper:

Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." The Journal of Machine Learning Research 7 (2006): 1-30.

The above paper lists multiple arguments against using the $t$-test for classifier comparison.

Marc Claesen

In the example you provided, ttest with a single input tests whether your vector "error" comes from a Gaussian distribution with zero mean, or in other words, whether both classifiers have the same accuracy. If you use two inputs you can also specify the tail. With 'Tail','right' (ttest(classifier1,classifier2,'Tail','right')) the alternative hypothesis is that the mean of classifier1 - classifier2 is greater than zero, i.e. classifier1 scores higher than classifier2 (better if the inputs are accuracies, worse if they are error rates); with 'Tail','left' it is the opposite. In all cases the null hypothesis is rejected when h = 1 (at the significance level you choose, typically 0.05).
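
A minimal sketch of these calls, where classifier1 and classifier2 are illustrative names for the two paired result vectors:

% One-sample test on the paired differences, then the tailed two-input variants
diffs = classifier1 - classifier2;                            % paired differences
[h, p]   = ttest(diffs);                                      % two-sided: is the mean difference zero?
[hR, pR] = ttest(classifier1, classifier2, 'Tail', 'right');  % alternative: mean(classifier1 - classifier2) > 0
[hL, pL] = ttest(classifier1, classifier2, 'Tail', 'left');   % alternative: mean(classifier1 - classifier2) < 0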

Take into consideration, however, that ttest is a parametric test and is not suitable in all cases. You should first check each variable for normality, e.g. with a Chi-square goodness-of-fit test: h = chi2gof(classifier). If neither test rejects normality (h = 0), you can treat the data as Gaussian and use the ttest; otherwise you should use a non-parametric test such as p = ranksum(classifier1,classifier2). The interpretation of the p-value is the same.
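
A sketch of that workflow, under the same assumption about the variable names:

% Check normality of each vector, then pick a parametric or non-parametric test
h1 = chi2gof(classifier1);      % null: the sample comes from a normal distribution
h2 = chi2gof(classifier2);
if h1 == 0 && h2 == 0
    [h, p] = ttest(classifier1, classifier2);    % paired t-test on the two vectors
else
    p = ranksum(classifier1, classifier2);       % Wilcoxon rank-sum test (non-parametric)
end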

ASantosRibeiro