I am trying to evaluate an algorithm that supports doctors when making a diagnosis.
I have recruited 10 doctors. I have 50 training examples.
Each doctor is randomly assigned 25 cases to review alone and 25 to review with the assistance of the algorithm.
Each doctor therefore has their own unique split of assisted and unassisted reads.
I have a data frame containing the 50 predictions for each of the 10 users, so it is 500 rows long.
If I filter the data frame to only the algorithm-assisted reads and compute the AUC (area under the ROC curve) over all 10 users pooled together (250 reads in total), I get an assisted AUC of 93%. If I repeat this process with bootstrapping to build a distribution, it is similarly centred around 93%.
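For reference, this is roughly how I compute the pooled AUC and its bootstrap distribution (a minimal sketch: the data frame, its column names, and the score model here are hypothetical stand-ins for my real data):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical stand-in for my real data frame: 10 users x 50 reads each,
# half of each user's reads flagged as algorithm-assisted.
df = pd.DataFrame({
    "user": np.repeat(np.arange(10), 50),
    "assisted": np.tile(np.r_[np.ones(25), np.zeros(25)].astype(bool), 10),
    "label": rng.integers(0, 2, 500),
})
df["score"] = df["label"] + rng.normal(0, 1, 500)  # noisy predictions

assisted = df[df["assisted"]]

# Pooled AUC over all 250 assisted reads at once.
pooled_auc = roc_auc_score(assisted["label"], assisted["score"])

# Bootstrap: resample the 250 assisted reads with replacement.
boot = []
for _ in range(1000):
    sample = assisted.sample(n=len(assisted), replace=True)
    if sample["label"].nunique() == 2:  # AUC needs both classes present
        boot.append(roc_auc_score(sample["label"], sample["score"]))

print(f"pooled AUC:     {pooled_auc:.3f}")
print(f"bootstrap mean: {np.mean(boot):.3f}")
```

As expected, the bootstrap distribution centres on the pooled estimate.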
However, if I compute each of the 10 users' AUCs individually and then average them, I get an overall AUC of 90%.
This does not make sense to me, since the AUC computed over the entire sample should surely be composed of these individual users' AUCs.
The only explanation I can think of is that each doctor had a unique set of reads, so computing each doctor's AUC separately and then averaging is somehow giving a different answer from computing the AUC over the entire pooled sample.
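To check whether the per-doctor splits alone can produce a gap like this, the effect can be reproduced in a small simulation (the user-specific score shift and all numbers here are assumptions for illustration, not my real data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

n_users, n_cases = 10, 25
per_user_data = []
for _ in range(n_users):
    # Each user reads a different set of cases, so the score scale
    # (and label mix) can differ from user to user; model that with
    # a user-specific additive shift on the scores.
    offset = rng.normal(0, 1)
    y = rng.integers(0, 2, n_cases)              # case labels
    s = y + offset + rng.normal(0, 1, n_cases)   # prediction scores
    per_user_data.append((y, s))

# Pooled AUC: concatenate everyone's reads and rank them together.
y_all = np.concatenate([y for y, _ in per_user_data])
s_all = np.concatenate([s for _, s in per_user_data])
pooled = roc_auc_score(y_all, s_all)

# Averaged per-user AUC: rank only within each user, then average.
per_user = np.mean([roc_auc_score(y, s) for y, s in per_user_data])

print(f"pooled AUC:        {pooled:.3f}")
print(f"mean per-user AUC: {per_user:.3f}")
```

The two numbers differ because the pooled AUC ranks one user's positives against another user's negatives, comparisons that never occur in any per-user AUC, and those cross-user comparisons are affected by between-user differences in score calibration and case mix.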
Why might this behavior be occurring?