
TL;DR: I am working with binary classification. I have several models whose out-of-the-box performance I want to compare. I read that accuracy is a poor metric and that the Brier score or log loss should be used instead. However, I also read that the Brier score should not be used to compare, say, logistic regression vs. random forest, and that it should mainly be used when tuning/changing the parameters of a single model. Is this statement true? Is it wrong to use the Brier score to compare the performance of different models/approaches?


Full background to my research question:

Hi all,

I have a dataset composed of two groups (disease type 1 vs. type 2) and 50 samples per group. For each sample, I have around 7000 features being measured. Importantly, identifying type 2 is key, and I am willing to "pay the price" of getting some type 1 as false positives.

My initial plan was to run feature selection and machine learning to classify these groups. After reading a bunch of stuff here, I realize that my approach may not be ideal for my dataset. For instance, ML with 100 samples is far from ideal. In addition, my dataset is 50/50 while the real-world prevalence of both disease types is 70/30; thus, any model I come up with will most likely underperform in the future.

I am aware of these limitations (and there are probably many more), but since the data is already in my hands right now, I wish I could "play" with it to see what I can get. I plan to run repeated k-fold cross-validation (10-fold with 10 repetitions). Inside each fold, I am performing mRMR (feature selection) and a few classification models. For example, logistic regression, random forest, SVM, XGBoost, and a few more. I want to compare the performance of each model and then spend more time optimizing the one that performed the best out of the box.

At first, I was going to compare logistic regression and the ML models using accuracy, but great posts by Frank Harrell, Stephan Kolassa, and others are changing my mind. Right now, I am planning to use the Brier score, at least in this initial stage where an overall screening is needed. However, I read that the Brier score should not be used to compare logistic regression vs. random forest, as they are two different models. It seemed like the Brier score should be used only for the same model under different parameters, for example, when evaluating the gains from hyperparameter tuning. How much of that is actually true?
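For concreteness, the plan above can be sketched roughly as follows. This is only an illustrative sketch using scikit-learn with synthetic data standing in for the real 100-sample dataset; mRMR is not in scikit-learn, so `SelectKBest` is a stand-in for the per-fold feature selection, and the model list is abbreviated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in: 100 samples, two balanced classes, many features
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(probability=True),  # needs probability=True to score with Brier
}
for name, clf in models.items():
    # Feature selection is refit inside every fold via the Pipeline,
    # so no information leaks from the held-out samples into the selection
    pipe = Pipeline([("select", SelectKBest(f_classif, k=20)), ("clf", clf)])
    # neg_brier_score is negated so that "higher is better"; flip the sign back
    scores = -cross_val_score(pipe, X, y, cv=cv, scoring="neg_brier_score")
    print(f"{name}: mean Brier score {scores.mean():.3f}")
```

Swapping `scoring="neg_brier_score"` for `scoring="neg_log_loss"` gives the log score on exactly the same folds, which makes it easy to check whether the two proper scoring rules rank the models the same way.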

  • Where did you read not to compare the Brier scores of different models? That makes no sense to me. Could that author have meant for in-sample assessments? // It might be best to edit that "log reg" to specify "logistic regression" instead of a regression on the logarithm of a value or a GLM with a logarithm link function. – Dave Mar 29 '23 at 19:19
  • People compare different models using proper scoring rules all the time, whether for classification or for numerical prediction. I have never heard that this should not be done, and that the models differ is not an argument - in a forecasting situation, you can absolutely compare ARIMA and RNNs using the MSE or the CRPS. So, with Dave, I would be interested to know where you read this particular piece of advice. – Stephan Kolassa Mar 29 '23 at 19:37
  • @StephanKolassa I wonder if the suggestion comes from using a predict method on the random forest that returns the category with the highest probability. If you then calculate the MSE between a vector of numbers representing the true categories and another representing the predicted categories, the calculation is not the Brier score (and not really a mean squared error, either). – Dave Mar 29 '23 at 20:05
  • I read it in a post on Medium. I understand that website might not be the most trustworthy source of information, but it was enough to raise a flag and lead me to double-check. In sum, it said that the Brier score could not be used to compare two different models because it depends on two components: how good the model is and how well it was optimized; hence, different models will be inherently different. The author then advised using the Brier score only once the model has been fixed and tuning is being performed. – Luiz Gustavo Mar 29 '23 at 23:31
  • Of course, there was no reference providing the basis for his statement, so I went to the literature to see if I could find anything close to this, and I found this publication: "The Brier score does not evaluate the clinical utility of diagnostic tests or prediction models" (Assel et al.; PMID: 31093548). These two pieces of information confused me a little about how and whether I should apply Brier scoring. – Luiz Gustavo Mar 29 '23 at 23:34
  • Could you please post the link? That sounds so outrageous that I’m wondering if there is a subtlety of the context. // I agree with the Assel article. Brier score alone does not tell you how useful a model is. You need to put it in the context of its use. – Dave Mar 29 '23 at 23:34
  • Here is the link: https://medium.com/datalab-log/voc%C3%AA-deve-calibrar-seu-modelo-desbalanceado-ou-n%C3%A3o-3f111160653a

    The original language is Portuguese, but an option to translate to English should pop up after a moment if you use Chrome. I searched for other posts by the author and found this book (already in English), where he hints at being careful when using the Brier score, but the statement is not as bold as in the Medium post. Link to the book: https://pibieta.github.io/imbalanced_learning/notebooks/Calibration.html?highlight=brier

    – Luiz Gustavo Mar 29 '23 at 23:58
  • I wonder if there is a translation error. I know I've seen some Brazilians on here; perhaps they can assess the original Portuguese version. @Firebug – Dave Mar 30 '23 at 00:00
  • Thank you for the link to the Assel et al. article, which is even freely available. I will try to take a look in the next few days, and perhaps also think about how their approach fares with the log score. Perhaps someone Portuguese-speaking can look at that Medium article. – Stephan Kolassa Mar 30 '23 at 07:19
  • I looked at the Assel et al. article, and it says what @Dave agrees on and where I concur: the Brier score alone will not tell you whether a classification-decision pipeline is useful. It will only evaluate the classification model as to its calibration. Whether the entire pipeline is more useful than another one needs to be seen in its context. – Stephan Kolassa Mar 31 '23 at 14:13
  • Incidentally, Andrew Vickers was very patient in an exchange of multiple emails, which finally led me to understand that there is a typo in the Assel et al. article: the NB should be defined in terms of TP and FP (see the earlier article by Vickers & Elkin, 2006), not TP and FP rates as written in the article - this makes a major difference and threw me off badly. – Stephan Kolassa Mar 31 '23 at 14:14
  • The conclusion of the Assel et al. article apparently is that the Brier score is not the end-all if the classification model is used in a specific decision context - hard 0-1 classifiers with specific sensitivities and specificities may yield better decisions. I have to admit that I find that rather unintuitive. Which means that I learned something over the last two days. Thank you! – Stephan Kolassa Mar 31 '23 at 14:44
  • Finally, as long as you do not have a specific decision context with clear costs of actions, I would still recommend you go with proper scoring rules. You may be interested in the resources in the scoring-rules tag wiki, especially the comparison between the Brier score and the log score (which I personally prefer). – Stephan Kolassa Mar 31 '23 at 14:46
  • @StephanKolassa That still does not make sense for why Brier score would be inappropriate for comparing a logistic regression to a random forest, does it? If you’re in a position where you have a handle on decisions based on model outputs, fine, evaluate them instead of the direct model outputs, but that seems rather independent of the particular model. – Dave Mar 31 '23 at 14:48
  • @Dave: I did think about adding something like this, but in light of the Assel et al. article, I'm not so sure any more... after all, they find that some model with a higher Brier score yields better decisions than the well-calibrated model (with a better Brier score). Doesn't that mean that the alternative model is "better" in some way, and that the Brier score was not able to identify the "better" model, thus is dubious? And if so, where does this come from? I think I'm confused. – Stephan Kolassa Mar 31 '23 at 14:52
  • @StephanKolassa This does not concern me terribly. Measures of performance disagree all the time; that's why we care to think about several. The Brier score would be quite boring if it always agreed with the log score. – Dave Mar 31 '23 at 14:55
  • Final thought: Andrew Vickers points out that my claim of a typo in Assel et al. is incorrect. There are different definitions of the TPR/FPR, either TPR=TP/P or TP/n (where n is the total number of observations), and similarly FPR=FP/N or FP/n. They used the latter, as in the preceding Vickers & Elkin (2006) paper - but Wikipedia gives the other definition, and it looked like an error to me. I learn something every day. – Stephan Kolassa Mar 31 '23 at 16:20
  • I am trying to wrap my head around this discussion, so bear with me if my comment is a bit far off. @StephanKolassa, when you said, "Whether the entire pipeline is more useful than another one needs to be seen in its context", what do you mean by context? From what I could understand, if I ran a repeated k-fold CV with three different models in each fold, I could compare the Brier score for each model at the end because they were applied in the same context. Is that correct? Also, thank you, Stephan and Dave, for taking the time to comment on this post! – Luiz Gustavo Apr 03 '23 at 20:30

1 Answer


The Brier score might not be the statistic of interest for a particular task. In that case, it would not be appropriate for comparing a logistic regression and a random forest, but that is because the Brier score simply is not the right value to calculate, not because of anything specific to how the probabilities it evaluates are produced or estimated.

However, if Brier score is what interests you, do calculate it. As long as the inputs to the score are appropriate (have an interpretation as probabilities, so not the log-odds output of a logistic regression or the predicted category that you can get by prediction methods in random forest software), go for it.

If there is an objection to doing this because random forests often give probability values that lack calibration and the Brier score will penalize this, that seems like a feature, not a bug, of Brier score. (Or maybe you don’t care about calibration, but then Brier score should not be your statistic of interest.)

If there is an objection to calculating the Brier score of a model because the model was not optimized well (mentioned in the comments), that seems like an admission that the model is not very good. If a model is making poor predictions (in terms of Brier score) because it was not optimized well, the key part of that to me is that the model is making poor predictions.
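To make the "appropriate inputs" point concrete, here is a minimal sketch (assuming scikit-learn and synthetic data; the variable names are illustrative) contrasting the correct use of predicted probabilities with the mistake of feeding hard labels into the same formula:

```python
# The Brier score must be computed from predicted probabilities
# (predict_proba), not from hard 0/1 class labels (predict).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Correct: probabilities of the positive class
proba = rf.predict_proba(X_te)[:, 1]
score_proba = brier_score_loss(y_te, proba)
print("Brier score on probabilities:", score_proba)

# Incorrect: hard labels -- mean squared error between 0/1 vectors is just
# the misclassification rate, not the Brier score, even though the formula
# looks the same
labels = rf.predict(X_te)
score_labels = brier_score_loss(y_te, labels)
print("'Brier score' on hard labels:", score_labels)
```

The second number is what you get if you run a random forest's default `predict` method through the squared-error formula, which may be the source of the confusion discussed in the comments above.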

Dave
  • In summary, if I have N methods classifying the same dataset, the Brier score is a valid way to compare their performance. This is a purely probabilistic approach and free of any "bias." Maybe a specific decision threshold would give slightly different results when evaluating performance compared to the Brier score, but that needs to be considered on a case-by-case basis. Did I understand correctly? – Luiz Gustavo Apr 03 '23 at 20:43
  • @LuizGustavo I do not see where any kind of threshold comes up in my answer. Could you please explain what you mean? – Dave Apr 03 '23 at 20:44
  • Hi @Dave. I am so sorry for the late reply; life has been quite hectic lately. As for the threshold, I actually took this from one of Stephan's comments, stating that the "Brier score is not the end-all if the classification model is used in a specific decision context - hard 0-1 classifiers with specific sensitivities and specificities may yield better decisions". This is why I assumed specific thresholds could still be useful in specific scenarios. – Luiz Gustavo Apr 19 '23 at 21:26