I am analysing how well some continuous variables (e.g. weight, height) predict the occurrence of a given disease after surgery. I have computed the area under the receiver operating characteristic curve (AUC) and also the cross-validation (CV) error using Matlab's kfoldLoss function (for one predictor at a time).
I was expecting 1-AUC to always be less than the CV error, i.e. the error labelling the data used for fitting to be lower than the error labelling new data. That does not seem to be the case, particularly for one variable whose AUC is close to 0.5 but whose CV error is around 0.3, the same as for other markers whose AUC is around 0.7. How can a marker be so much poorer at fitting the actual data, yet similarly good at labelling new data?
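To illustrate what puzzles me, here is a toy sketch in plain Python (not my actual Matlab code; the class counts match my data, the scores are made up): a marker with AUC near 0.5 can still be paired with a low plain error rate, because a classifier can simply favour the majority class.

```python
import random

random.seed(0)

# Toy data mimicking my setup: 60 diseased (label 1), 35 healthy (label 0)
labels = [1] * 60 + [0] * 35
# A useless marker: scores unrelated to the label
scores = [random.random() for _ in labels]

def auc(labels, scores):
    """Probability that a random positive scores above a random negative
    (ties count as 0.5) -- the rank-statistic definition of AUC."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# AUC of the useless marker is near 0.5 (exact value depends on the seed)
print(round(auc(labels, scores), 2))

# ...yet a rule that ignores the marker and always predicts "disease"
# misclassifies only the 35 healthy patients:
print(round(35 / 95, 2))  # 0.37, not far from the CV error I observe
```

So a near-chance AUC and a ~0.3 error rate are not necessarily contradictory; the two numbers answer different questions.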
I have fruitlessly checked my code many times for bugs, so I am now considering other explanations and would like your help with that.
- My data set (95 cases) contains more patients who experienced the disease (60) than patients who didn't (35). These class frequencies were used when training the model. Is this likely to be a problem?
- Would there be any benefit in using a more complicated loss function than just the fraction of mislabelled cases among those not used for training?
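To make the second bullet concrete, here is what I mean (again a plain-Python sketch using my class counts, not my actual code): with my imbalance, the trivial "always predict disease" rule already reaches an error close to what I observe, whereas a class-balanced error rate would expose it.

```python
# My class counts: 60 diseased, 35 healthy
n_pos, n_neg = 60, 35

# Plain misclassification error of the trivial "always predict disease" rule:
# it only gets the 35 healthy patients wrong.
baseline_error = n_neg / (n_pos + n_neg)
print(round(baseline_error, 3))  # 0.368

# A class-balanced error (mean of the per-class error rates) punishes the
# same rule, because it mislabels every single healthy patient.
err_pos = 0 / n_pos        # all diseased patients correctly labelled
err_neg = n_neg / n_neg    # every healthy patient mislabelled
balanced_error = (err_pos + err_neg) / 2
print(balanced_error)  # 0.5, i.e. no better than chance
```

So my CV error of ~0.3 is only slightly better than this 0.368 baseline, which is part of why I wonder whether the plain error rate is the right loss here.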