Unbalanced dataset - ROC curve to compare classifiers?

Question

I use the machine learning software WEKA for data mining on biological data. I would describe my dataset as unbalanced: it comprises around 2000 instances, split into classes of 900, 500, 350, and 160 that are very important to have in the dataset, plus some less important smaller classes that are nice to have but can be removed from the dataset if they confuse the learning too much.
Currently I am comparing many different classifiers. I am not a very experienced statistician, but I read that ROC curves are commonly used to evaluate the performance of machine learning classifiers. However, I also read that ROC has drawbacks when it comes to unbalanced datasets. Is there a better measure among the ones the WEKA output features (or one that can be calculated from them) for my dataset? This is what the output looks like (here with the iris dataset):

=== Stratified cross-validation === 

Correctly Classified Instances         144               96      %   
Incorrectly Classified Instances         6                4      %   
Kappa statistic                          0.94  
Mean absolute error                      0.035 
Root mean squared error                  0.1586
Relative absolute error                  7.8705 %
Root relative squared error             33.6353 %
Total Number of Instances              150    


=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.98      0          1         0.98      0.99       0.99     Iris-setosa
                 0.94      0.03       0.94      0.94      0.94       0.952    Iris-versicolor
                 0.96      0.03       0.941     0.96      0.95       0.961    Iris-virginica
Weighted Avg.    0.96      0.02       0.96      0.96      0.96       0.968


=== Confusion Matrix === 

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

Not an answer to your question, but please note that while there are extensions, ROC curves are conventionally plotted for binary classification (e.g. Dead/Alive), whereas in your case the outcome variable seems to be multinomial (e.g. Blue/Red/Yellow/Brown/ etc.). — Zhubarb, Aug 11 '14 at 16:18
@Zhubarb Interesting point. As I said I am not an experienced statistician. I edited my question, so that the example also shows the extended, detailed output. I think the problem you mention is "solved" by evaluating every class against all others and calculating the Average. — aldorado, Aug 11 '14 at 16:23
A good answer might be very long. I'll just add: (1) ROC curves are used in practice. Good to be aware of norms even if they don't make sense. (2) The c-statistic is a measure of discrimination rather than calibration, so it is an incomplete measure. (3) Most machine learning algorithms are not probabilistic. They will give you ROC curves, but I would not use them (though many do; random forests vs. boosted models is a great example). (4) The c-statistic may be high even when the model performs poorly on the cost function you're interested in. — charles, Aug 11 '14 at 17:45

score 1 · Answer 1 · edited Dec 14 '15 at 18:03

In some formulations of multi-class ROC AUC, the AUC estimate is sensitive to relative class frequencies, but this is not true of all multi-class ROC AUC formulations. Moreover, the ROC AUC formulation in the binary classification case is not sensitive to relative class frequencies. Numerous performance measures, such as accuracy, are sensitive to imbalanced data, but insensitivity to class imbalance is one of the very appealing advantages of ROC AUC.

Fawcett (2003) develops the idea that binary ROC AUC is insensitive to class composition, with extended discussion. The basic idea is that ROC is all about the rates, rather than the absolute numbers of each class. Because ROC analysis measures the relative ranking of examples, class imbalance doesn't change the ROC curve.
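To make the "rates, not counts" point concrete, here is a minimal sketch with made-up scores (the numbers and the 0.5 threshold are assumptions for illustration only). It computes binary AUC as the fraction of positive/negative pairs that are ranked correctly, then replicates the negative class tenfold to mimic a skewed dataset. Replication is an idealization of class skew: on a real sample the estimate would vary, but its expectation would not.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive gets the higher score; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def accuracy(pos_scores, neg_scores, threshold=0.5):
    """Accuracy when scores >= threshold are predicted positive."""
    correct = sum(s >= threshold for s in pos_scores)
    correct += sum(s < threshold for s in neg_scores)
    return correct / (len(pos_scores) + len(neg_scores))


# Hypothetical classifier scores, invented for this example.
pos = [0.9, 0.8, 0.45, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]

balanced_auc = roc_auc(pos, neg)        # 4 positives vs 4 negatives
skewed_auc = roc_auc(pos, neg * 10)     # 4 positives vs 40 negatives

balanced_acc = accuracy(pos, neg)
skewed_acc = accuracy(pos, neg * 10)

print(balanced_auc, skewed_auc)   # identical
print(balanced_acc, skewed_acc)   # accuracy shifts with the class ratio
```

The AUC is the same in both cases because every negative contributes the same proportion of wins and losses no matter how often it is repeated, whereas accuracy is pulled toward the true negative rate as the negative class grows.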

In the multi-class case, there are a couple of ways to represent the problem. The class-reference formulation, for example, is sensitive to relative class frequencies. Alternatively, there is a method of combining all 1 vs 1 ROC AUC estimates that is not sensitive to class compositions. This is developed by Hand & Till (2001).
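As a rough sketch of the 1 vs 1 idea, the Hand & Till measure averages, over every pair of classes (i, j), the two AUCs obtained by ranking only the instances of those two classes: once by the estimated probability of class i and once by that of class j. The function names, the data layout (one dict of class probabilities per instance), and the toy numbers below are assumptions made for illustration; they are not WEKA output.

```python
from itertools import combinations


def pair_auc(scores_own, scores_other):
    """Probability that a random member of the class being scored gets a
    higher score than a random member of the other class (ties = 1/2)."""
    wins = 0.0
    for s in scores_own:
        for t in scores_other:
            if s > t:
                wins += 1.0
            elif s == t:
                wins += 0.5
    return wins / (len(scores_own) * len(scores_other))


def hand_till_m(y_true, proba, classes):
    """Average of A(i, j) = (A(i|j) + A(j|i)) / 2 over all class pairs.
    proba[k][c] is the estimated probability of class c for instance k."""
    pairs = list(combinations(classes, 2))
    total = 0.0
    for i, j in pairs:
        rows_i = [p for y, p in zip(y_true, proba) if y == i]
        rows_j = [p for y, p in zip(y_true, proba) if y == j]
        a_i_given_j = pair_auc([p[i] for p in rows_i], [p[i] for p in rows_j])
        a_j_given_i = pair_auc([p[j] for p in rows_j], [p[j] for p in rows_i])
        total += 0.5 * (a_i_given_j + a_j_given_i)
    return total / len(pairs)


# Toy predictions, invented for this example.
classes = ["a", "b", "c"]
y_true = ["a", "a", "b", "b", "c"]
proba = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.4, "b": 0.5, "c": 0.1},
    {"a": 0.5, "b": 0.4, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.2, "b": 0.2, "c": 0.6},
]

m_original = hand_till_m(y_true, proba, classes)

# Make class "a" five times as frequent by repeating its two instances.
m_skewed = hand_till_m(y_true + ["a", "a"] * 4, proba + proba[:2] * 4, classes)

print(m_original, m_skewed)  # the two values agree
```

Each pairwise term looks only at the two classes involved and is itself a rate, so inflating one class leaves the average unchanged. A one vs rest average weighted by class prevalence would not have this property.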

Tom Fawcett, "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers" 2003. Intelligent Enterprise Technologies Laboratory, HP Laboratories (Palo Alto). (If I recall correctly, this paper was also eventually published in a peer-reviewed journal a few years later. Can't find the reference right now. Same author. Similar title.)

David Hand, Robert Till. "A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems." Machine Learning. November 2001, Volume 45, Issue 2, pp 171-186.

This is also developed in A. P. Bradley, "The use of area under the ROC curve in the evaluation of machine learning algorithms." Pattern Recognition, 30:1145-1159, 1997.

On p20 of your pdf (section 8.2.1), it says, "while [one vs all other] is a convenient formulation, it compromises one of the attractions of ROC graphs, namely that they are insensitive to class skew". Given that the OP has multicategory data, I wonder if this is what they are thinking of. — gung - Reinstate Monica, Dec 14 '15 at 17:37
You raise a very valid objection -- I was paying attention to the unqualified, blanket statement in OP's post about ROC AUC and class balance, and neglected the context that this is a multi-class problem. On the other hand, please note that the section you reference pertains particularly to the so-called class-reference formulation. So far as I can tell, the paper does not establish whether or not this deficiency is common to all multi-class AUC formulations. I'll research this further when I have a moment. — Sycorax, Dec 14 '15 at 17:42
No problem, I still upvoted this. In the next section (p 21) under AUCs for multiclass ROCs, they discuss a method of combining all 1 vs 1 AUROCs (Hand & Till, 2001) that "is insensitive to changes in class distribution". — gung - Reinstate Monica, Dec 14 '15 at 17:50
Thank you for pointing this out. I think my revisions better answer OP's question, as well as providing a more nuanced explanation of the contrasts between multi-class and binary AUC computations. — Sycorax, Dec 14 '15 at 18:00
