I am looking for comprehensive instructions and ideally out-of-the-box solutions (preferably in Python) for evaluating different, already trained classifiers on a multiclass classification problem with an unbalanced dataset.
To illustrate further: I have about a dozen classifiers that are trained on the same unbalanced dataset with a handful of categories. Now I would like to
1) compare the classifiers against the ground truth:
How well do they perform on a per-class basis (compared to a chance-level model), and what is a sensible average of the per-class performances?
2) compare the classifiers against each other:
Are they significantly different in what they classify data instances as? Are they significantly different in their overall performance (e.g. in accuracy per class)?
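To make question 2) concrete: this is roughly the kind of pairwise comparison I have in mind, sketched on made-up toy data (I am not sure this use of McNemar's test and Cohen's kappa is appropriate for the multiclass case):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

# Toy stand-ins for a shared test set and the label outputs of two already-trained models.
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0, 2, 1])
pred_a = np.array([0, 1, 2, 1, 0, 1, 1, 0, 2, 2])
pred_b = np.array([0, 1, 1, 1, 0, 2, 0, 0, 2, 1])

# Chance-corrected agreement between the two models' predicted labels.
kappa_ab = cohen_kappa_score(pred_a, pred_b)

# McNemar on correct/incorrect: 2x2 table of (A right/wrong) x (B right/wrong).
a_correct = pred_a == y_true
b_correct = pred_b == y_true
table = [[np.sum(a_correct & b_correct),  np.sum(a_correct & ~b_correct)],
         [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)]]
result = mcnemar(table, exact=True)

print(kappa_ab, result.statistic, result.pvalue)
```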
I have looked into many test statistics; some of them are
- overall accuracy (bad for imbalanced datasets)
- Cohen's kappa
- Chi square goodness of fit
- McNemar
- AUROC
- Brier score
- Youden Index
- Informedness
- F-Score
However, I encountered conflicting accounts of whether these are suited to the imbalanced multiclass scenario and under which conditions they can be used. Most of the guides and explanations I read limit themselves to binary classification.
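For reference, this is how I currently compute some of the listed statistics for the multiclass case with scikit-learn (a sketch on toy data; I am not sure the averaging choices are the right ones for an unbalanced dataset):

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             f1_score, roc_auc_score)

# Toy stand-ins: ground truth, one model's hard predictions, and its per-class probabilities.
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 0, 1, 1, 0, 2, 2])
y_prob = np.random.default_rng(0).dirichlet(np.ones(3), size=10)  # rows sum to 1

print(balanced_accuracy_score(y_true, y_pred))    # macro-averaged recall
print(cohen_kappa_score(y_true, y_pred))          # chance-corrected agreement with ground truth
print(f1_score(y_true, y_pred, average=None))     # per-class F1
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
print(roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"))
```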
I did find the pycm package, which computes many statistics (including most of the above) for multiclass problems as well. But the documentation is rather sparse, and I am not sure whether it handles the unbalanced multiclass scenario correctly.
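This is roughly how I am using it at the moment (a minimal sketch on toy labels; I am assuming ConfusionMatrix(actual_vector=..., predict_vector=...) is the intended entry point):

```python
from pycm import ConfusionMatrix

# Toy labels; in my case these are the test-set ground truth and one model's predictions.
y_true = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 0, 1, 1, 0, 2, 2]

cm = ConfusionMatrix(actual_vector=y_true, predict_vector=y_pred)
print(cm)        # confusion matrix plus overall and per-class statistics
print(cm.Kappa)  # individual statistics also seem to be exposed as attributes
```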
Now I am looking for clear instructions on which tests I can apply to my case, or on how I need to format my data to suit a given test. (I have read about binarization of multiclass labels and "one vs. all" a couple of times, for example, but the approaches I found involved retraining the models (e.g. here), which is not an option for me.)
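To illustrate what I mean by binarization: as far as I can tell, something like the following only binarizes the labels for scoring and leaves the trained models untouched (a sketch on toy data; label_binarize and the per-class loop are my guess at how this is usually done):

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

# Toy stand-ins: labels and an already-trained model's per-class probabilities.
classes = [0, 1, 2]
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0, 2, 1])
y_prob = np.random.default_rng(0).dirichlet(np.ones(3), size=10)

# Binarize the *labels* only, then score each class one-vs-rest; no retraining involved.
y_bin = label_binarize(y_true, classes=classes)  # shape (n_samples, n_classes)
per_class_auc = [roc_auc_score(y_bin[:, k], y_prob[:, k]) for k in range(len(classes))]
print(per_class_auc)
```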
Edit:
I am not asking why accuracy is a bad metric. I am asking which tests are suited to the unbalanced multiclass case.
Comments (lo tolmencre):
- Apr 15 '19 at 07:25: score(predictedClassesModelA, predictedProbsModelB) instead of score(groundTruthClasses, predictedProbsModelB)? Is that valid? Because directly comparing two models can be done with McNemar (at least for dichotomous variables, not sure about multiclass) and Cohen's kappa. And I would like to end up with a test statistic like the kappa coefficient or a p-value.
- Apr 16 '19 at 10:55: sklearn.metrics.brier_score_loss. But under the hood it does a one-vs-all test. Is that the only way to do the Brier score? If you have $n$ classes, do you need to do $n$ tests, where in test $i$ you compare class $i$ against the classes $\{j \mid 1 \leq j \leq n,\ j \neq i\}$ in a binary way? Or is there a version of the Brier score that does not require this binarization?
- Apr 16 '19 at 10:56: With np.mean(np.sum((probs - targets)**2, axis=1)), where targets is an array of one-hot vectors, [[0. 1. 0. 0. 0.] [1. 0. 0. 0. 0.]], and probs is an array of vectors summing to one, [[0.07 0.41 0.35 0.11 0.06] [0.03 0.33 0.29 0.03 0.32]], I get a value between 0 and 2, it seems. That is mentioned on the Wikipedia page for the Brier score. But that does not mean what I am doing there is valid.
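For reference, a runnable version of the formula from the last comment (toy vectors copied from there; the range of [0, 2] comes from summing the squared differences over all classes, and some texts divide by the number of classes or report half of this value):

```python
import numpy as np

def multiclass_brier(probs, targets_one_hot):
    """Mean squared distance between predicted probability vectors and one-hot targets."""
    probs = np.asarray(probs)
    targets_one_hot = np.asarray(targets_one_hot)
    return np.mean(np.sum((probs - targets_one_hot) ** 2, axis=1))

# The toy vectors from the comment above.
targets = [[0., 1., 0., 0., 0.],
           [1., 0., 0., 0., 0.]]
probs   = [[0.07, 0.41, 0.35, 0.11, 0.06],
           [0.03, 0.33, 0.29, 0.03, 0.32]]
print(multiclass_brier(probs, targets))
```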