I am analysing how well some continuous variables (e.g. weight, height) predict the occurrence of a given disease after surgery. I have computed the area under the receiver operating characteristic curve (AUC) and also the cross-validation (CV) error using Matlab's kfoldLoss function (for one predictor at a time).
I was expecting 1-AUC to always be less than the CV error, i.e. the error labelling the data used for fitting to be lower than the error labelling new data. That does not seem to be the case, particularly for one variable whose AUC is close to 0.5 but whose CV error is around 0.3, the same as for other markers whose AUC is around 0.7. How can a marker be so much poorer at fitting the actual data, yet similarly good at labelling new data?
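To illustrate what puzzles me, here is a toy sketch in plain Python (not my actual Matlab code; the class counts match my data, the scores are made up): a marker with AUC near 0.5 can still be paired with a low plain error rate, because a classifier can simply favour the majority class.

```python
import random

random.seed(0)

# Toy data mimicking my setup: 60 diseased (label 1), 35 healthy (label 0)
labels = [1] * 60 + [0] * 35
# A useless marker: scores unrelated to the label
scores = [random.random() for _ in labels]

def auc(labels, scores):
    """Probability that a random positive scores above a random negative
    (ties count as 0.5) -- the rank-statistic definition of AUC."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# AUC of the useless marker is near 0.5 (exact value depends on the seed)
print(round(auc(labels, scores), 2))

# ...yet a rule that ignores the marker and always predicts "disease"
# misclassifies only the 35 healthy patients:
print(round(35 / 95, 2))  # 0.37, not far from the CV error I observe
```

So a near-chance AUC and a ~0.3 error rate are not necessarily contradictory; the two numbers answer different questions.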
I have fruitlessly checked my code many times for bugs, so I am now considering other explanations and would like your help with that.
- My data set (95 cases) contains more patients who experienced the disease (60) than patients who didn't (35). These class frequencies were used when training the model. Is this likely to be a problem?
- Would there be any benefit in using a more complicated loss function than just the fraction of mislabelled cases among those not used for training?
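To make the second bullet concrete, here is what I mean (again a plain-Python sketch using my class counts, not my actual code): with my imbalance, the trivial "always predict disease" rule already reaches an error close to what I observe, whereas a class-balanced error rate would expose it.

```python
# My class counts: 60 diseased, 35 healthy
n_pos, n_neg = 60, 35

# Plain misclassification error of the trivial "always predict disease" rule:
# it only gets the 35 healthy patients wrong.
baseline_error = n_neg / (n_pos + n_neg)
print(round(baseline_error, 3))  # 0.368

# A class-balanced error (mean of the per-class error rates) punishes the
# same rule, because it mislabels every single healthy patient.
err_pos = 0 / n_pos        # all diseased patients correctly labelled
err_neg = n_neg / n_neg    # every healthy patient mislabelled
balanced_error = (err_pos + err_neg) / 2
print(balanced_error)  # 0.5, i.e. no better than chance
```

So my CV error of ~0.3 is only slightly better than this 0.368 baseline, which is part of why I wonder whether the plain error rate is the right loss here.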