I am trying to evaluate an algorithm that supports doctors when making a diagnosis.
I have recruited 10 doctors. I have 50 training examples.
Each doctor is randomly assigned 25 cases to review alone and 25 to review with the assistance of the algorithm.
Each doctor therefore has their own unique split of assisted and unassisted reads.
I have a data frame containing the 50 predictions for each of the 10 users, so it is 500 rows long.
If I filter the data frame to only the algorithm-assisted reads and compute the AUC (area under the ROC curve) over all 10 users pooled together (250 reads in total), I get an assisted AUC of 93%. If I repeat this process with bootstrapping to build a distribution, it is similarly centred around 93%.
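For reference, this is roughly how I compute the pooled AUC and its bootstrap distribution (a minimal sketch: the data frame, its column names, and the score model here are hypothetical stand-ins for my real data):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical stand-in for my real data frame: 10 users x 50 reads each,
# half of each user's reads flagged as algorithm-assisted.
df = pd.DataFrame({
    "user": np.repeat(np.arange(10), 50),
    "assisted": np.tile(np.r_[np.ones(25), np.zeros(25)].astype(bool), 10),
    "label": rng.integers(0, 2, 500),
})
df["score"] = df["label"] + rng.normal(0, 1, 500)  # noisy predictions

assisted = df[df["assisted"]]

# Pooled AUC over all 250 assisted reads at once.
pooled_auc = roc_auc_score(assisted["label"], assisted["score"])

# Bootstrap: resample the 250 assisted reads with replacement.
boot = []
for _ in range(1000):
    sample = assisted.sample(n=len(assisted), replace=True)
    if sample["label"].nunique() == 2:  # AUC needs both classes present
        boot.append(roc_auc_score(sample["label"], sample["score"]))

print(f"pooled AUC:     {pooled_auc:.3f}")
print(f"bootstrap mean: {np.mean(boot):.3f}")
```

As expected, the bootstrap distribution centres on the pooled estimate.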
However, if I compute each of the 10 users' AUCs individually and then average them, I get an overall AUC of 90%.
This does not make sense to me, since the AUC computed over the entire sample should surely be composed of these individual users' AUCs.
The only explanation I can think of is that each doctor had a unique set of reads, so computing each doctor's AUC separately and then averaging is somehow giving a different answer from computing the AUC over the entire pooled sample.
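To check whether the per-doctor splits alone can produce a gap like this, the effect can be reproduced in a small simulation (the user-specific score shift and all numbers here are assumptions for illustration, not my real data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

n_users, n_cases = 10, 25
per_user_data = []
for _ in range(n_users):
    # Each user reads a different set of cases, so the score scale
    # (and label mix) can differ from user to user; model that with
    # a user-specific additive shift on the scores.
    offset = rng.normal(0, 1)
    y = rng.integers(0, 2, n_cases)              # case labels
    s = y + offset + rng.normal(0, 1, n_cases)   # prediction scores
    per_user_data.append((y, s))

# Pooled AUC: concatenate everyone's reads and rank them together.
y_all = np.concatenate([y for y, _ in per_user_data])
s_all = np.concatenate([s for _, s in per_user_data])
pooled = roc_auc_score(y_all, s_all)

# Averaged per-user AUC: rank only within each user, then average.
per_user = np.mean([roc_auc_score(y, s) for y, s in per_user_data])

print(f"pooled AUC:        {pooled:.3f}")
print(f"mean per-user AUC: {per_user:.3f}")
```

The two numbers differ because the pooled AUC ranks one user's positives against another user's negatives, comparisons that never occur in any per-user AUC, and those cross-user comparisons are affected by between-user differences in score calibration and case mix.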
Why might this behavior be occurring?