I've trained a Random Forest model on a dataset of 60 protein predictors for healthy controls (label 0) and cancer patients (label 1).
I then tested this model on a dataset of at-risk patients divided into those who later got cancer (label 1), and those who didn't (label 0).
My model's performance gave an AUC-ROC of 0.4.
Other threads and papers (linked below) say that for AUC < 0.5, a classifier has useful information but is applying it incorrectly, and people suggest reversing the labels to give an AUC-ROC of 0.6.
Can AUC-ROC be between 0-0.5
http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
However, would this be appropriate in this case? Reversing the test dataset labels would mean giving the at-risk individuals who stayed healthy a label of 1 (the same label as the cancer patients in the training data), which doesn't seem correct to me.
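For what it's worth, here is a minimal sketch (pure Python, with toy scores I made up) of why the "reverse it" advice works without relabelling anyone: AUC-ROC is a ranking statistic, the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. Flipping the model's scores (or equivalently its predicted labels) at evaluation time therefore turns an AUC of `a` into `1 - a`, while the true labels stay exactly as they are.

```python
from itertools import product

def auc(y_true, scores):
    """Mann-Whitney formulation of AUC-ROC: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy example: 5 negatives, 2 positives, scores chosen so the
# model ranks cases worse than chance.
y = [0, 0, 0, 0, 0, 1, 1]
s = [0.1, 0.2, 0.3, 0.4, 0.5, 0.35, 0.15]

print(auc(y, s))                    # 0.4
print(auc(y, [1 - x for x in s]))   # 0.6, i.e. 1 - AUC
```

So the question is really whether an AUC of 0.4 on this small test set reflects a genuinely inverted signal or just noise, not whether the healthy at-risk individuals should literally be relabelled as cancer cases.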
The test dataset contains 79 individuals who stayed healthy and 20 who later got cancer, so the distributions of the two groups are the same. Does this make any difference?
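On the class-balance point, a small self-contained check (toy scores I made up; the 4:1 split mirrors the 79/20 test set) illustrates a standard property of AUC-ROC: because it only depends on how positives rank against negatives, changing the prevalence of one class, here by duplicating the minority cases, leaves the AUC unchanged.

```python
from itertools import product

def auc(y_true, scores):
    """Mann-Whitney formulation of AUC-ROC (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# 4:1 imbalance, roughly like 79 healthy vs 20 cancer.
y = [0] * 8 + [1] * 2
s = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.55, 0.9]

# Duplicate the minority class three more times: prevalence changes
# from 20% to 50%, but the per-class score distributions do not.
y_bal = y + [1] * 6
s_bal = s + [0.55, 0.9] * 3

print(auc(y, s), auc(y_bal, s_bal))  # 0.8125 0.8125
```

So the imbalance itself does not bias the AUC, though with only 20 positives the estimate will be noisy.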
– David Cox Apr 30 '18 at 16:23