
I know I can run a Mann-Whitney U test when comparing two samples of unequal size, but I am wondering whether the same holds when calculating the ROC AUC via the formula:

AUC = U / (n1*n2), where U is the U statistic, n1 is the number of positive examples, and n2 is the number of negative examples.

I am trying to compare test scores by disease status and, ideally, want to be able to say something about discrimination, beyond just association.

For example, I have 525 cases and 1770 controls. With a p-value of 0.5, I have no evidence of an association between the test scores and disease status. But with a U statistic of 471313.5, would it be valid to calculate

AUC = 471313.5 / (525*1770) = 0.5071977

and conclude that the test has poor discrimination for this disease?

I scanned through some papers and StackExchange posts but was unable to find much about the assumptions of the AUC/MWU relationship when it comes to sample size. It was only brought to my attention as a potential issue when I...consulted ChatGPT.

1 Answer


The relationship between the Wilcoxon test and the ROC AUC of a machine learning "classification" model is that the AUC is the Wilcoxon (Mann-Whitney) U statistic, comparing the predictions for one group with the predictions for the other group, divided by the product of the two group sizes. That relationship does not require the groups to be the same size. Consequently, when you make your claim about an AUC, you are acting as if the measurements are the predictions of some machine learning model, and an AUC near 0.5 says there is minimal separation between the two groups in terms of rank ordering.
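To make that relationship concrete, here is a minimal sketch in plain Python. The function names and the small `cases`/`controls` lists are hypothetical and only for illustration; U is computed directly from its pairwise definition (ties counted as one half), not through any particular library.

```python
def mann_whitney_u(pos, neg):
    """U for the positive group: count pairs with pos > neg, ties count 0.5."""
    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    return u


def auc_from_u(pos, neg):
    """AUC = U / (n1 * n2); the group sizes do not need to match."""
    return mann_whitney_u(pos, neg) / (len(pos) * len(neg))


# Hypothetical scores: 3 cases and 7 controls (deliberately unequal sizes).
cases = [3.1, 4.0, 5.2]
controls = [1.0, 2.5, 3.1, 3.5, 4.4, 6.0, 2.0]

u = mann_whitney_u(cases, controls)
auc = auc_from_u(cases, controls)
print(u, auc)

# The arithmetic from the question, using the U statistic reported there.
auc_question = 471313.5 / (525 * 1770)
print(round(auc_question, 7))
```

One caveat when taking U from software output: the two possible U statistics sum to n1*n2, so depending on which group's U is reported, the ratio may be the AUC or one minus the AUC.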

It would be reasonable to see this as evidence that a simple logistic regression (just an intercept and a coefficient on this one feature) will have minimal ability to discriminate between the two categories, sure. The predictions from that logistic regression would just be a monotonic transformation of the input values, and ROC curves are unaffected by increasing monotonic transformations, as they are based on ranks.
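A small sketch of that invariance, assuming made-up scores and hypothetical logistic regression coefficients (`intercept` and `slope` are not fitted to anything): the AUC of the raw values matches the AUC of the transformed values, because the ordering of the observations is unchanged.

```python
import math


def auc(pos, neg):
    """Pairwise AUC: share of (pos, neg) pairs with pos > neg, ties count 0.5."""
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


def logistic(x, intercept, slope):
    """Predicted probability from a one-feature logistic regression."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Hypothetical scores and hypothetical coefficients, for illustration only.
cases = [3.1, 4.0, 5.2]
controls = [1.0, 2.5, 3.1, 3.5, 4.4, 6.0, 2.0]
intercept, slope = -2.0, 0.8

auc_raw = auc(cases, controls)
auc_pred = auc(
    [logistic(x, intercept, slope) for x in cases],
    [logistic(x, intercept, slope) for x in controls],
)

# A negative slope reverses the ordering, which turns the AUC into 1 - AUC.
auc_flipped = auc(
    [logistic(x, intercept, -slope) for x in cases],
    [logistic(x, intercept, -slope) for x in controls],
)
print(auc_raw, auc_pred, auc_flipped)
```

So whatever coefficient the regression lands on, the best it can do with this single feature is the larger of the raw AUC and one minus the raw AUC.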

It is possible that a more complex model would be a very good discriminator, however, and the Wilcoxon test does not comment on this. For instance, it could be that one group has values that tend to be very low or very high, while the other group has values that are in the middle. A Wilcoxon test of these two groups would show minimal differences. However, running a logistic regression on this feature and its square would give much stronger ability to discriminate between the two groups. That is discussed in more detail here, and I ask for clarification about an expansion of that discussion here.
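Here is a rough simulation of that scenario. Everything in it is assumed for illustration: the group sizes, the normal distributions, and the seed. Instead of actually fitting a logistic regression on the feature and its square, it scores each observation by the squared value, which stands in for what such a model could pick up when the pattern is symmetric about zero.

```python
import random


def auc(pos, neg):
    """Pairwise AUC: share of (pos, neg) pairs with pos > neg, ties count 0.5."""
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Simulated data: cases sit at the extremes, controls sit in the middle.
cases = [rng.gauss(-2.0, 0.5) for _ in range(100)] + [
    rng.gauss(2.0, 0.5) for _ in range(100)
]
controls = [rng.gauss(0.0, 0.5) for _ in range(600)]

# AUC of the raw feature, equivalent to U / (n1 * n2) on the raw values.
auc_raw = auc(cases, controls)

# AUC after squaring, a stand-in for a model that uses the feature and its square.
auc_squared = auc([x * x for x in cases], [x * x for x in controls])

print("raw feature AUC:", auc_raw)
print("squared feature AUC:", auc_squared)
```

In a setup like this, the raw AUC should sit close to 0.5 while the squared score separates the groups well, which is the sense in which the Wilcoxon test alone does not rule out good discrimination from a more flexible model.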

Dave