
I have been learning the basic terminology for thinking about binary classification in the context of medical tests. The basic terms are shown in the table below.

|               | Condition positive | Condition negative |
|---------------|--------------------|--------------------|
| Test positive | TP                 | FP                 |
| Test negative | FN                 | TN                 |
| Total         | P                  | N                  |

This is the confusion matrix.

My issue is the following. Note that I assume $T = N + P$ is the size of the total population. Then the "accuracy" of the test is defined as

$$ACC = \frac{TP+TN}{T}$$

that is, the cases you diagnosed correctly divided by the total number of cases you ran the test on.

But many diseases are not prevalent, where prevalence is defined as

$$PREV = \frac{P}{T}$$

This means that $ACC$ can be very biased. Suppose our model is "assume no one has the disease" and 98% of the population doesn't have it; then our accuracy would be fabulous, because we would be right 98% of the time, yet we would have a true positive rate of 0.
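To make the arithmetic concrete, here is a minimal Python sketch of that scenario (the numbers and variable names are my own illustration):

```python
# Illustration: population of 1000 with 2% prevalence.
P, N = 20, 980        # actual positives and actual negatives
T = P + N             # total population

# Model: "assume no one has the disease" -> every prediction is negative.
TP, FP = 0, 0         # no positive predictions at all
TN, FN = N, P         # every negative is correct, every positive is missed

ACC = (TP + TN) / T   # 0.98 -- looks excellent
TPR = TP / P          # 0.0  -- but not a single diseased person is detected

print(f"ACC = {ACC:.2f}, TPR = {TPR:.2f}")
```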

Is there a measure for test accuracy that essentially weights the $TP$ and $TN$ by prevalence such that

$$ACC_{2} = \frac{TP\cdot w_{TP} + TN \cdot w_{TN}}{T}$$

where $w_{TP}$ and $w_{TN}$ are somehow determined by the prevalence of the disease?

In other words, I want an estimate of the overall accuracy that uses the prevalence as a weight, so that the low prevalence of a disease does not lead to a biased overall estimate of accuracy.

  • If the model predicts that no one has the disease and $98\%$ of people do not have the disease, how is the $98\%$ accuracy a biased measure of performance? – Dave Oct 28 '22 at 21:11
  • "Bias" has a very precise technical meaning in statistics: an estimator of an unknown quantity is biased if the estimate is systematically off, i.e., if the expected value of the estimate differs from the true value. Accuracy is thus not biased, because it does not estimate any parameter in the first place. Accuracy has major problems, especially but certainly not only in the case of "unbalanced" data. The solution is to use probabilistic predictions and proper [tag:scoring-rules]. – Stephan Kolassa Oct 28 '22 at 21:14
  • @StephanKolassa Yes, you are right. That is MUCH better said than the wording I chose. Unbalanced data is exactly the term I was looking for. I will need to think about the links you provided; not sure I understand fully after one read. – Stan Shunpike Oct 28 '22 at 21:40
  • I do not really like $T$ for Total as it is also used for True (similarly $P$ for Positive and Predicted). That being said, your link to Wikipedia offers what it calls "balanced accuracy" of $\frac{TPR+TNR}{2}$ which is your expression with $w_{TP}=\frac{T}{2P}$ and $w_{TN}=\frac{T}{2N}$. – Henry Oct 28 '22 at 22:37

2 Answers


You’re always allowed to compare your measure of error to that of a baseline model. In fact, I would argue that this is exactly what one of the most popular measures of model performance does: $R^2$.

Consequently, if you think your $2\%$ error rate is bad because a naïve baseline model that always predicts the majority class also gets an error rate of $2\%$, use the idea from $R^2$.

$$ R^2=1-\dfrac{ \text{Your model’s square loss} }{ \text{ Square loss of a baseline model } } $$

Do the analogous calculation and replace the numerator with your error rate and the denominator with the error rate of the baseline classifier. In your case, those values are equal, so you wind up with a model with $0$ performance: no better than baseline, which is the truth. A nice feature of this is that it will report that an improvement from $98\%$ accuracy to $99\%$ accuracy is a halving of the error rate, rather than a mere improvement of $1\%$.
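As a sketch of that calculation (the function name and the numbers are mine, chosen to match the question's example):

```python
def relative_skill(model_error, baseline_error):
    """R^2-style skill score for error rates:
    0 = no better than baseline, 1 = perfect, negative = worse."""
    return 1 - model_error / baseline_error

# Baseline: always predict the majority class under 2% prevalence,
# giving a 2% error rate.
baseline_error = 0.02

print(relative_skill(0.02, baseline_error))  # 0.0 -> no skill at all
print(relative_skill(0.01, baseline_error))  # 0.5 -> error rate halved
```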

Do note, however, the issues with accuracy that Stephan Kolassa mentioned in the comments. I will leave some of my favorite references for that topic, though I concede that there are situations where all you have are the predicted categories and have to use a measure like accuracy, rather than proper scoring rules.

This topic typically comes up in the context of class-imbalance like you have, but it does not have to.

Profusion of threads on imbalanced data - can we merge/deem canonical any?

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

https://www.fharrell.com/post/class-damage/

https://www.fharrell.com/post/classification/

https://stats.stackexchange.com/a/359936/247274

Proper scoring rule when there is a decision to make (e.g. spam vs ham email)

Why is it that if you undersample or oversample you have to calibrate your output probabilities?

https://twitter.com/f2harrell/status/1062424969366462473?lang=en

Dave

There is balanced accuracy, which is the mean of sensitivity and specificity. If you have the same number of people in the positive and negative classes, it is the same as accuracy; otherwise, it is essentially accuracy reweighted by the size of each group. The chance level is always 0.5 and the perfect score is always 1.
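A minimal sketch of the computation (the function is my own; scikit-learn's `sklearn.metrics.balanced_accuracy_score` computes the same quantity from label vectors):

```python
def balanced_accuracy(TP, FN, TN, FP):
    """Mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = TP / (TP + FN)  # recall on the positive class
    specificity = TN / (TN + FP)  # recall on the negative class
    return (sensitivity + specificity) / 2

# The question's "no one has the disease" model at 2% prevalence:
print(balanced_accuracy(TP=0, FN=20, TN=980, FP=0))  # 0.5 -> chance level
```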

rep_ho