
Accuracy, as a KPI for assessing binary classification models, has major drawbacks; see Why is accuracy not the best measure for assessing classification models? The exact same issues also plague the F1 score (indeed all Fβ scores), sensitivity, specificity and related alternatives.

Is there a standard academic article one can point to discussing these issues?

Why am I asking this? I am thinking of reviewing a paper and wanting the author to avoid these KPIs. Or alternatively, having submitted a paper, getting reviews that recommend these flawed KPIs, and needing a paper to point to in arguing why I won't follow these recommendations. Of course, I could point to the CV thread linked above, but unfortunately, CV is not always accorded the respect a peer-reviewed article gets.

I have looked through Frank Harrell's "Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules". This kind of material is exactly what I am envisaging. Is there something like this published somewhere?

Stephan Kolassa
  • Surely this issue is brought up in numerous textbooks on the subject of Machine Learning. Why not cite one of those? – Him Jan 31 '23 at 18:59
  • @Him I would expect many machine learning books to miss this idea, and that's why Stephan has to make this argument in the first place. Nonetheless, is Harrell's RMS book a valid resource? That isn't a peer-reviewed academic article, but it strikes me as a reputable source. – Dave Jan 31 '23 at 19:01
  • I would be surprised if any ML textbooks missed this, and if you find one that does, it is likely a bad one. The fact that a 3-free-variable confusion matrix (3 assuming everything is expressed in ratios) cannot be completely captured by a single variable should be pretty obvious. The relationship between the confusion matrix in a binary classification problem and the standard trinity of accuracy, recall, and precision (which 3 can be used to fully re-create the confusion matrix) is a basic ML topic, I feel. – Him Jan 31 '23 at 19:11
  • I'm pretty sure accuracy, sensitivity and specificity together also completely determine the confusion matrix. Not sure about F1, but I wouldn't be surprised if F1 and two others can recover the matrix as well. – Him Jan 31 '23 at 19:13
  • @Him Accuracy, recall, precision, confusion matrices, and $F_1$ ($F_{\beta}$) are exactly the types of performance metrics that Stephan wants to avoid. – Dave Jan 31 '23 at 19:13
  • @Dave, any performance metric in a binary classification problem will be a function of the confusion matrix. In other words, any metric is a function of accuracy, precision and recall. Individually, they (or any) metric will have pitfalls, but together, they completely define the performance. – Him Jan 31 '23 at 19:14
  • @Him What about log loss or Brier score? Those are the types of metrics for which Stephan wants to argue (though I admit that is not entirely clear from just the OP). – Dave Jan 31 '23 at 19:18
  • Note that F1 is the harmonic mean of precision and recall. You end up mitigating some of the problems with precision and recall, but you also lose some of the information of either. Ultimately, it is one degree of freedom in a three-degree-of-freedom system. You can't do better than accuracy (or precision or recall or sensitivity or F1 or or or) with a single metric. You can only do different. – Him Jan 31 '23 at 19:18
  • @Dave the Brier Score is also a function of the confusion matrix. I don't know the exact conversion on the top of my head, but this means that for binary classification problems, it is a function of accuracy, precision and recall. I suppose one might argue that the Brier Score is "easier to interpret", or has other psychological benefits, but it does not contain more information than the full set of {accuracy, precision and recall}. – Him Jan 31 '23 at 19:21
  • @Him: sorry, but your last comment is simply wrong. The Brier score is a function of probabilistic classifications, and the confusion matrix is (can be) a function of these probabilistic classifications plus a threshold. Thus, the Brier (or log) score is not just a function of the confusion matrix, or of accuracy/sensitivity/specificity. (A recurring theme in my answers here is that using a default threshold (like 0.5) is often a terrible idea.) – Stephan Kolassa Jan 31 '23 at 19:26
  • @StephanKolassa Brier Score – Him Jan 31 '23 at 19:29
  • @StephanKolassa it seems that your problem, then, isn't with accuracy as your question states, it is with thresholding. Possibly you will get better answers if you clarify this. – Him Jan 31 '23 at 19:31
  • @Him: precisely. The $f_{ti}$ are probabilities, accuracy etc. deal with hard 0-1 classifications. I agree that the underlying problem is that of hard classifications via thresholding. The problem is that people do not see that this is the underlying problem. Contrary to your earlier comments, I believe that most ML textbooks do not teach any of this, and solely discuss accuracy and friends. At least that is the impression I get from the almost daily questions here on CV that reveal zero understanding of this issue. – Stephan Kolassa Jan 31 '23 at 19:33
  • The $f_{ti}$ are the probabilities in the confusion matrix, which are exactly the inputs to "accuracy and friends". If your model necessitates thresholding, then this actually exacerbates the issue of representing performance with a single metric. My point here is that your assertion that there exist "better" metrics is fundamentally flawed: All possible metrics have flaws, and the degree to which these flaws are an issue is context dependent. Is Brier Score better to use than accuracy? This depends crucially on what you're expecting to use your predictions for. – Him Jan 31 '23 at 19:38
  • I.e. there is no shortage of opinions that the accuracy metric has flaws, but this abundance is directly correlated with the ubiquity of folks using accuracy as a metric. If Brier Score were used equally indiscriminately, folks would be up in arms about the shortcomings of the Brier Score. You simply can't represent performance with a single metric: it's not possible. Is a particular metric useful in a particular context? This is a meaningful question. – Him Jan 31 '23 at 19:41
  • @Him When you write about the probabilities in the confusion matrix, how do you calculate those? – Dave Jan 31 '23 at 19:42
  • @Dave agreed that my assertion about the $f_{ti}$ in the above comment is not entirely accurate. I cannot edit it at this point to retract it. – Him Jan 31 '23 at 19:48
  • @Him: I think we are discussing different things here. With Dave, I am confused as to where you see "probabilities in the confusion matrix". Also, there is a very simple way in which the Brier score is superior to the confusion matrix and KPIs derived therefrom: it's a proper scoring rule, i.e., optimizing it will lead us toward well-calibrated probabilistic predictions. Accuracy, in contrast, is not even a scoring rule, because it relies on hard classifications. – Stephan Kolassa Jan 31 '23 at 19:50
  • @Him Then please clarify your point. How do you get probabilities from the confusion matrix? What calculation do you mean? Is it something like the following to normalize for the number of classification attempts (sample size)?

    $$ \begin{pmatrix} 4 & 1 \\ 2 & 3 \end{pmatrix} \rightarrow \begin{pmatrix} 0.4 & 0.1 \\ 0.2 & 0.3 \end{pmatrix} $$

    – Dave Jan 31 '23 at 20:03
  • @Him: I found your point about being able to calculate the Brier or the log score from the confusion matrix, accuracy etc. a good one, so I posted a question and self-answer: Calculating the Brier or log score from the confusion matrix, or from accuracy, sensitivity, specificity, F1 score etc. If you disagree with my answer (short: you can't; see also the sketch after these comments), I would be interested in your thoughts. – Stephan Kolassa Feb 01 '23 at 08:19
  • @Him: in addition, re your point about ML textbooks teaching the issues around these KPIs, I took a look at Géron's Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow (2nd ed., 2019). He discusses evaluation on pp. 88-100 and goes into the "problem" of unbalanced data, but nowhere talks about the bias induced by hard classifications, and does not discuss probabilistic predictions and proper scoring rules at all, unless I have missed it. I would not be able to point to this book for the purposes I outlined in my question. – Stephan Kolassa Feb 01 '23 at 08:24
  • @Him Has the linked question addressed your issues? – Dave Feb 02 '23 at 01:40
  • @Dave my issue was that "Surely this issue is brought up in numerous textbooks on the subject of Machine Learning." However, upon looking into the matter in several (several) introductory and intermediate-level textbooks on machine learning, I saw that, in fact, most of them don't discuss how to measure the performance of a model at all. Often, they simply start measuring a thing, and the reader is expected to just assume that this metric is a performance metric of some sort. This is utterly astonishing to me, but nevertheless seems to be the case. – Him Feb 02 '23 at 02:05
  • @Dave I am coming from a statistics background, where even high-school level texts e.g. OpenStax discuss how to evaluate model performance and the meaning behind the involved metrics. I never would have believed that textbooks on machine learning would so completely neglect the subject except to witness it for myself. /shrug – Him Feb 02 '23 at 02:12
  • @StephanKolassa what is the "bias induced by hard classifications" - hard classifications may be part of the application, so you need to address that to meet the needs of the "client". If you want a good ML book, try those by Chris Bishop or Kevin Murphy, or David Barber, or David MacKay - all of which take a predominantly probabilistic approach. I could list dozens more if I were in my office and they were in front of me. – Dikran Marsupial Apr 01 '23 at 06:22
  • @DikranMarsupial: I mean that we can maximize accuracy by classifying everything that has a predicted probability greater than 0.5 as the target class, where I would argue that we should probably treat predicted probabilities of 0.6, 0.8 and 0.999 differently. A quick skim of your answer sounds reasonable - although it does not really seem to address my question for a reference, beyond pointers to authors, does it? Perhaps you could add in at least the books themselves? I will try to give your answer a closer reading, but am going to be a bit busy in the near future. – Stephan Kolassa Apr 01 '23 at 07:52
  • @StephanKolassa The point I am trying to make is that there ought not to be such a reference (other than text books) because accuracy and F1 have value in some applications. " I would argue that we should probably treat predicted probabilities of 0.6, 0.8 and 0.999 differently." in some applications, yes, in others, no. For instance using ML to target adverts - the system either shows an advert or it doesn't and it has to be automated to be practical. All metrics have advantages and disadvantages - I think it is a matter of "horses for courses". – Dikran Marsupial Apr 01 '23 at 08:04
  • What I would really like would be a reference for a tutorial paper that sets out the advantages and disadvantages of all common metrics and sets out advice on how to choose a set of metrics to cover the needs of typical applications. That would require collaboration, as it seems that individual statisticians tend to work on particular types of problem and perhaps have an overly narrow view on which metrics should be used. This is especially the case where machine learning methods are used in fully automated decision systems, where statisticians/domain experts are not part of the loop. – Dikran Marsupial Apr 01 '23 at 10:34
  • @DikranMarsupial: thank you, you make good points. I absolutely agree with your last comment, and would be happy about any reference that discusses the advantages and drawbacks of different evaluation KPIs - just something to point to rather than blindly use accuracy etc. – Stephan Kolassa Apr 11 '23 at 05:24
  • @DikranMarsupial, re your advert example: I think this is precisely the other way around. We are always constrained by budget, attention, screen size etc., so we do have to choose which of multiple "fitting" ads to show or run. If we predict a conversion rate > 0.5 for ten ads but can only show three, we still need to choose which ones we show, where both the predicted conversion probability and the likely payoff should enter. (And yes, I do still accept your larger point that accuracy may well be useful in the right circumstances.) – Stephan Kolassa Apr 11 '23 at 05:26
  • @StephanKolassa the point is that it is an example where a decision must be made, it is irrelevant whether this is done by thresholding a probability or whether it is made directly. The point is that where a decision must be made, we can't measure the performance of the system without a metric that depends on the decision (and therefore the threshold if you use a probabilistic classifier). In this case, the expected loss might be reasonable (which is essentially a weighted accuracy). – Dikran Marsupial Apr 11 '23 at 05:40
  • But for one where accuracy might be a better metric, there is handwritten digit/letter recognition for something like sorting mail by post code. Knowing how many digits you are getting right and how many you get wrong is a reasonable performance metric. " just something to point to rather than blindly use accuracy etc" this sums it up well - we shouldn't use anything blindly, but then again we shouldn't reject criteria blindly either - we should consider the needs of the application. – Dikran Marsupial Apr 11 '23 at 05:44
  • @StephanKolassa "any reference that discusses the advantages and drawbacks of different evaluation KPIs" perhaps the Cross-Validated community should write one, we would have the breadth of perspective needed? We could use new questions and answers on the SE to make a start? – Dikran Marsupial Apr 11 '23 at 07:14
  • I won’t and probably can’t even do it solo, but I’d collaborate on a journal article. – Dave Apr 11 '23 at 10:31
  • One other reasonable use of these metrics is when you are developing new classification algorithms. If you can make a classification algorithm that works with accuracy, it should be straightforward to make it work with different misclassification costs etc. that you need for real applications. It shouldn't be the only performance metric, but where you don't have a real application, just benchmarks, it is a reasonable thing to do. – Dikran Marsupial Apr 12 '23 at 09:57
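
To make the Brier-score exchange above concrete, here is a minimal sketch in Python (numpy and scikit-learn; the labels and probability vectors are invented purely for illustration). Two probabilistic classifiers yield the exact same confusion matrix once thresholded at 0.5, yet have very different Brier and log scores, so neither score can be recovered from the confusion matrix alone.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, brier_score_loss, log_loss

# Invented true binary labels
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Two probabilistic classifiers: one confident, one barely over the threshold
p_confident = np.array([0.99, 0.95, 0.90, 0.40, 0.05, 0.10, 0.02, 0.60])
p_hesitant  = np.array([0.55, 0.51, 0.52, 0.49, 0.45, 0.48, 0.46, 0.51])

for name, p in [("confident", p_confident), ("hesitant", p_hesitant)]:
    y_hard = (p >= 0.5).astype(int)            # hard classification at threshold 0.5
    print(name)
    print(confusion_matrix(y_true, y_hard))    # identical for both classifiers
    print("Brier score:", brier_score_loss(y_true, p))
    print("log loss:   ", log_loss(y_true, p))
```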

2 Answers


The main one that springs to mind is "Three myths about risk thresholds for prediction models" by Wynants et al. (2019), in which they argue strongly against using a "universally optimal threshold" without context. I liked that they used the term "dichotomania" too (in effect meaning "manically dichotomising continuous variables").

I like Peter Flach's work in the area of evaluating ML model performance too. I do not have a single definitive reference there, but something like his paper with Berrar, "Caveats and pitfalls of ROC analysis in clinical microarray research (and how to avoid them)" (2012), is a reasonable place to start. His "Precision-recall-gain curves: PR analysis done right" (2015) with Kull has been very thought-provoking too.

usεr11852

The KPI (Key Performance Indicator) depends on the requirements of the application. For some applications (i.e. those where a hard classification must be made and we know a priori that the misclassification costs are equal, e.g. some handwritten character recognition tasks), accuracy is a completely reasonable performance metric, and it would be a mistake to recommend avoiding it just because it has problems as well as advantages.

Similarly, for some applications (primarily information retrieval) where it is more natural to talk of the relative importance of precision and recall than of misclassification costs, $F_1$ or more generally $F_\beta$ may be appropriate, especially where we need to make a decision ("do I read this article, or don't I?").
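
As a minimal illustration of that trade-off (a sketch using scikit-learn; the counts below are invented), note that $\beta < 1$ weights precision more heavily and $\beta > 1$ weights recall more heavily:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Invented retrieval-style data: 1 = relevant document, 0 = irrelevant
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 88   # 6 TP, 4 FN, 2 FP, 88 TN

print("precision:", precision_score(y_true, y_pred))   # 6 / (6 + 2) = 0.75
print("recall:   ", recall_score(y_true, y_pred))      # 6 / (6 + 4) = 0.60
for beta in (0.5, 1, 2):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    print(f"F_{beta}:", fbeta_score(y_true, y_pred, beta=beta))
```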

An important consideration is whether we need to make a decision. We may well implement the system using a probabilistic classifier and then applying a threshold. However, if we need a decision, then the performance of the system depends on the setting of that threshold, so we should use a performance metric that depends on the threshold, as we need to include the effects of the threshold on the performance of the system.

The advice I would give is not to have a single KPI, but to have a range of performance metrics that provide information on different aspects of classifier performance. I quite often use accuracy (to measure the quality of the decisions), or equivalently the expected risk where misclassification costs are unequal, the area under the receiver operating characteristic curve (to measure the ranking of samples), and the cross-entropy (or similar) to measure the calibration of the probability estimates.
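
As a minimal sketch of that "range of metrics" idea (Python with scikit-learn; the labels and predicted probabilities are invented), one can report a threshold-dependent decision metric alongside threshold-free ranking and calibration metrics, all computed from the same probabilistic predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss

# Invented probabilistic predictions and true labels
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
p_hat  = np.array([0.92, 0.80, 0.55, 0.45, 0.30, 0.48, 0.10, 0.60, 0.70, 0.20])

threshold = 0.5                # the decision rule; ideally reflects misclassification costs
y_hard = (p_hat >= threshold).astype(int)

print("accuracy (quality of the decisions at this threshold):", accuracy_score(y_true, y_hard))
print("ROC AUC (ranking of samples, threshold-free):         ", roc_auc_score(y_true, p_hat))
print("log loss (calibration of the probability estimates):  ", log_loss(y_true, p_hat))
```

Where misclassification costs are unequal, the accuracy line would be replaced by the expected cost of the decisions made at the chosen threshold.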

Basically, our job as statisticians is to understand the advantages and disadvantages of performance metrics so that we can select the appropriate metric(s) for the needs of the application. All metrics have advantages and disadvantages, and we shouldn't reject any of them a priori because of their disadvantages if they have advantages or relevance for our application. I think the advantages and disadvantages are well covered in textbooks (even ML ones! ;o), so I would just use those.

Also, as I have said elsewhere, we should make a distinction between performance estimation and model selection. They are not the same problem, and sometimes we should have different metrics for each task.

Dikran Marsupial