
Let's suppose I have two models that both indicate the presence of some phenomenon:

  • Model A: outputs only binary results, i.e., the phenomenon is present or not,
  • Model B: outputs class probabilities.

Of course, I could impose a decision rule on model B to derive binary results too, but I am looking for methods to compare the performance (whatever that means; precision, perhaps?) of both models, i.e., to determine which model's decisions incur a lower cost.

There are approximately 20 data points, each with expert-crafted ground truth. The experts indicated the presence of the phenomenon on a scale of 0-10 (where 0 indicates complete absence and 10 a strong manifestation of the phenomenon).

For model A, it was agreed beforehand that a ground truth of <= 5 means the phenomenon is not present. For model B, the probability of the most-likely class is scaled by a factor of 10, so that its deviation from the ground truth can be measured.
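For concreteness, the conversions I have in mind look roughly like this (a sketch only; all names and values below are placeholders, not my actual data):

```python
import numpy as np

# Placeholder expert scores on the 0-10 scale (illustrative values only).
ground_truth = np.array([2, 7, 9, 4, 6])

# Agreed rule for model A: ground truth <= 5 means the phenomenon is absent.
truth_binary = (ground_truth > 5).astype(int)

# Model B's probability for the most-likely class (placeholder values);
# scaling by 10 puts it on the same 0-10 scale as the expert scores.
prob_most_likely = np.array([0.15, 0.80, 0.95, 0.40, 0.55])
deviation = 10 * prob_most_likely - ground_truth
```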

The features used for crafting the ground truth are distinct from the features used in the models. Model A uses thresholds and indicator functions to indicate the phenomenon; model B is a regression model that outputs class probabilities. Would it be fair to compare both models with the Brier score, by pretending that model A outputs the probabilities 0/1? Looking at the definition of the Brier score, this seems appropriate to me.

Question 1: What can I actually compare between the two models? Precision? Accuracy? Something else?

Question 2: How can I make these comparisons between the two model types? Ideally, I could derive statements such as "one model outperforms the other" (whatever that could mean).

  • What's missing from your problem statement is a task. Presumably you want to choose between models A and B to serve some purpose. Once you describe the task, that in itself will suggest appropriate metric(s) for comparisons. Or perhaps it will highlight that one or the other model is not useful at all. – dipetkov Mar 20 '22 at 08:04
  • The task is to indicate the presence of the same phenomenon. The actual phenomenon is not of binary nature; it may be present to some degree only. Therefore, model B is by nature better suited to the task. However, the binary indication of model A is valuable too, given the phenomenon is present to some minimum degree. What I want to avoid, for example, is simply imposing a decision rule on model B and then comparing precision/recall/accuracy/etc. – user654123 Mar 20 '22 at 16:42
  • You would still have to make hard choices about what "to some minimum degree" means. This is closely related to the phenomenon you are studying which you know best. If you are not prepared to do that then model B is your only option. – dipetkov Mar 20 '22 at 16:54
  • It could be relevant to read CV posts about classification vs prediction and evaluating classification models. For example, you might start here. – dipetkov Mar 20 '22 at 17:10
  • Thanks for the link, I tried to find these linked posts as I read them a while ago and remembered that they kind of address my problem. So as I understand, it appears I would need a scoring rule for model A, and that requires a probabilistic output of this model. The problem is, model A really is a logical construct of three indicator functions, and therefore it cannot output class probabilities, just 1/0. Are my options then exhausted? – user654123 Mar 20 '22 at 18:15
  • At this point it seems that you just have two existing models, no experimental data on which to evaluate them & not enough clarity about when a measurement (taken in a lab?) indicates the phenomenon is present or not. So the options are not clear, at least to me. – dipetkov Mar 20 '22 at 18:38
  • I have approx. 20 data points, and expert-crafted ground truth for each. The experts indicated the presence of a phenomenon on a scale of 0-10. The features used for crafting the ground truth are distinct from the features used in the models. Model A uses thresholds and indicator functions to indicate the phenomenon, model B is a regression model that outputs class probabilities. Would it be fair to apply the Brier score to compare both models, by pretending model A outputs the probabilities 0/1? Looking at the definition of the Brier score, it seems appropriate to me. – user654123 Mar 20 '22 at 18:47
  • There is not much resemblance between this description and your original question! The fact that the ground truth itself is on a 0-10 scale is totally new! Why don't you update the question, with all relevant info, and start anew. And please elaborate why the experts measured the phenomenon on a scale from 0-10 and how you are planning to convert this ground truth to a binary decision, for evaluation purposes. Assuming that the goal is to determine presence or not of the phenomenon. – dipetkov Mar 20 '22 at 18:54
  • You are right, I updated the original question with all the details. – user654123 Mar 20 '22 at 19:02

1 Answer


The ground-truth data is labeled on a scale from 0 to 10. For convenience and without loss of information, you can rescale the experts' labels to the [0,1] range (divide them by 10). From now on let's assume both the predictions and the labels are on the same scale.

Since the labels themselves are proportions, not yes/no labels, the measure $$ \frac{1}{n} \sum_{i=1}^n (\text{prediction}_i - \text{truth}_i)^2 $$ is more correctly called mean squared error, not Brier score. You can use it to compare the performance of the two models and choose the one with the smaller MSE.
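A minimal sketch of this comparison, assuming you have the rescaled expert labels and both models' outputs for your roughly 20 points (the arrays below are placeholders, not real data):

```python
import numpy as np

# Expert labels rescaled from 0-10 to [0, 1] (placeholder values).
truth = np.array([2, 7, 9, 4, 6, 1, 8, 5]) / 10

# Model A's hard 0/1 indications, treated as predictions (placeholders).
pred_a = np.array([0, 1, 1, 0, 1, 0, 1, 0], dtype=float)

# Model B's predicted probability that the phenomenon is present (placeholders).
pred_b = np.array([0.10, 0.75, 0.90, 0.35, 0.60, 0.05, 0.85, 0.45])

def mse(pred, truth):
    """Mean squared error between predictions and rescaled expert labels."""
    return np.mean((pred - truth) ** 2)

print("MSE model A:", mse(pred_a, truth))
print("MSE model B:", mse(pred_b, truth))
```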

Since the experts themselves didn't make a hard decision about the presence or not of the phenomenon, just about the strength of its presence, binary classification metrics such as accuracy don't seem appropriate. And you can't compute them unless you go back to the experts and ask them to binarize their labels.

Update: Model A can achieve a smaller MSE by outputting fractional values instead of 0s and 1s. This is a bit contrived. (But so is using a binary classifier to describe a complex phenomenon that even experts rate on a scale.)

Suppose there are expert-assigned labels for some training data. (Don't use the 20 instances set aside for comparing models A and B, or you will give A an unrealistic advantage.) You apply model A to these training examples and you get the TPs, FNs, FPs, TNs (true positives, false negatives, etc.). For the comparison with model B, you can adjust model A to output the probability p0 = {average label of TNs & FNs} instead of 0 and the probability p1 = {average label of TPs & FPs} instead of 1; these values minimize its MSE on the training examples.
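A minimal sketch of that adjustment, assuming you have rescaled expert labels and model A's 0/1 outputs for the training examples (all values below are placeholders):

```python
import numpy as np

# Rescaled expert labels for the training examples (placeholders).
train_truth = np.array([0.1, 0.8, 0.9, 0.3, 0.6, 0.2, 0.7])

# Model A's hard 0/1 outputs on the same training examples (placeholders).
train_pred_a = np.array([0, 1, 1, 0, 1, 0, 1])

# Replace 0 with the average label among predicted negatives (TNs & FNs)
# and 1 with the average label among predicted positives (TPs & FPs);
# the group means minimize squared error within each group.
p0 = train_truth[train_pred_a == 0].mean()
p1 = train_truth[train_pred_a == 1].mean()

def adjust(pred_binary):
    """Map model A's 0/1 outputs to the calibrated values p0/p1."""
    return np.where(pred_binary == 1, p1, p0)
```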

dipetkov
  • Thanks for the discussion and your answer, and for helping me improve mine. I was wondering one thing, though: these scoring rules work if we assume model A outputs the class probabilities 0/1. However, why not assume, for example, 0.25/0.75 instead and leave some room for uncertainty? Under what circumstances would that be a fair adjustment? – user654123 Mar 20 '22 at 20:22
  • You make an interesting point. Model A will have a smaller error if it outputs probabilities. I've updated my answer. – dipetkov Mar 20 '22 at 23:44