Let's suppose I have two models that both indicate the presence of some phenomenon:
- Model A: outputs only binary results, i.e., the phenomenon is present or it is not,
- Model B: outputs class probabilities.
Of course, I could impose a decision rule on model B to derive binary results, too, but I am looking for methods to compare the performance (whatever that means, precision perhaps?) of both models, i.e., to determine which model's decisions incur a lower cost.
There are approximately 20 data points, with expert-crafted ground truth for each. The experts rated the presence of the phenomenon on a scale of 0-10, where 0 indicates complete absence and 10 a strong manifestation.
As for model A, it was agreed beforehand that a ground truth of <= 5 would mean the phenomenon is not present. As for model B, the probability of the most-likely class is scaled by a factor of 10, so that its deviation from the ground truth can be measured.
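To make these conventions concrete, here is a minimal sketch with hypothetical numbers standing in for my actual data (the array values and names are just placeholders):

```python
import numpy as np

# Hypothetical expert scores on the 0-10 scale (the real data has ~20 points)
ground_truth = np.array([2, 7, 9, 4, 6, 1])

# Agreed rule: a score <= 5 means the phenomenon is NOT present
y_true = (ground_truth > 5).astype(int)  # -> [0, 1, 1, 0, 1, 0]

# Hypothetical model B outputs: probability of the most-likely class,
# scaled by 10 so it lives on the same 0-10 scale as the expert scores
p_most_likely = np.array([0.9, 0.8, 0.95, 0.55, 0.6, 0.7])
deviation = np.abs(10 * p_most_likely - ground_truth)
```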
The features used for crafting the ground truth are distinct from the features used in the models. Model A uses thresholds and indicator functions to indicate the phenomenon, while model B is a regression model that outputs class probabilities. Would it be fair to apply the Brier score to compare the two models by pretending that model A outputs the degenerate probabilities 0 and 1? Looking at the definition of the Brier score, this seems appropriate to me.
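For illustration, here is a rough NumPy sketch of the Brier comparison I have in mind, again on hypothetical numbers, treating model A's hard decisions as degenerate 0/1 probabilities. (If I see it correctly, for a hard 0/1 forecaster the Brier score reduces to the misclassification rate, i.e., 1 - accuracy.)

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared difference between forecast probabilities and binary outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return np.mean((p_pred - y_true) ** 2)

# Hypothetical binary labels derived via the "ground truth > 5" rule
y_true = [0, 1, 1, 0, 1, 0]

# Model A: hard decisions, pretended to be probabilities of exactly 0 or 1
p_model_a = [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]

# Model B: predicted probability that the phenomenon is present
p_model_b = [0.10, 0.80, 0.95, 0.55, 0.60, 0.30]

print(brier_score(y_true, p_model_a))  # equals the fraction of misclassified points
print(brier_score(y_true, p_model_b))
```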
Question 1: What can I actually compare between the two models? Precision? Accuracy? Something else?
Question 2: How do I carry out these comparisons between the two model types? Ideally, I would like to derive statements such as "one model outperforms the other" (whatever "outperform" would mean here).