
Following a previously asked question on prediction intervals for a logistic regression classifier, I'm currently facing a conundrum.

I want to test a procedure to reverse-engineer the alteration of the spatial coordinates of my plot data. For each plot in my dataset, the procedure generates a series of candidate coordinates, which may or may not contain the real coordinates.

After this procedure I have two datasets: the real plot data/coordinates with, say, 2k observations (call it tr_data), and one with the estimated coordinates, est_data, where each plot.ID from tr_data has one or more entries (roughly 20k observations in total).

To keep things simple, I trained a Random Forest classifier on each of the two datasets. It is a multiclass problem with 7 classes and an imbalanced dataset, and I'm predicting class probabilities rather than hard 0/1 labels. I validate both models on an independent dataset of 500 observations. I'm not doing any feature selection or hyperparameter tuning.

OA of the model trained on tr_data is 0.72
OA of the model trained on est_data is 0.45
(I know there are better metrics than OA for probabilities, like logloss, but bear with me)

I could stop here, but to make the comparison more robust I'd like to compare prediction intervals (PIs) and probability coverage. For example, I expect wider PIs for the model trained on est_data than for the one trained on tr_data.

I created 100 bootstrapped samples from each training dataset, trained a model on each sample, and generated the 95% PI for each class on the validation set. See the plot below.

[Plot: bootstrapped 95% prediction intervals per class for the models trained on tr_data and est_data]
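
Roughly, the bootstrapping looks like the sketch below. Everything here is illustrative: make_classification stands in for my real plot data, X_train/y_train for the training set (tr_data or est_data), X_val/y_val for the 500-observation validation set, and I assume every class appears in each bootstrap sample so predict_proba always returns 7 columns.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.utils import resample

    # toy stand-ins for the real data
    X, y = make_classification(n_samples=2500, n_features=10, n_informative=6,
                               n_classes=7, random_state=0)
    X_train, y_train, X_val, y_val = X[:2000], y[:2000], X[2000:], y[2000:]

    n_boot, n_classes = 100, 7
    boot_probs = np.empty((n_boot, len(X_val), n_classes))

    for b in range(n_boot):
        # refit the classifier on a bootstrap resample of the training data
        X_b, y_b = resample(X_train, y_train, replace=True, random_state=b)
        rf = RandomForestClassifier(random_state=b).fit(X_b, y_b)
        boot_probs[b] = rf.predict_proba(X_val)  # assumes all 7 classes are in X_b

    # pointwise 95% interval for each validation observation and each class
    pi_lower = np.percentile(boot_probs, 2.5, axis=0)
    pi_upper = np.percentile(boot_probs, 97.5, axis=0)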

The thing is that for more than one class, and for both types of dataset (tr and est), the lower bound of the PI ends up at 0, while the upper bound in some cases is as high as 0.95. That's not wrong for PIs per se, but for probability coverage it means that far more than 95% of my observations fall inside the 95% PI. My questions then are:

  1. Am I computing the 95% PIs correctly? They seem insanely wide.
  2. Does it make sense to compute the prediction interval coverage probability (PICP) in this case? I ended up with <1% of observations outside the 95% PI. (A sketch of my coverage computation is below.)
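
A minimal sketch of that coverage (PICP) computation, continuing the illustrative code above:

    # 0/1 indicator matrix: y_onehot[i, k] == 1 if observation i belongs to class k
    y_onehot = np.eye(n_classes)[y_val]

    # an observation is "covered" for class k if its 0/1 indicator
    # falls inside that class's bootstrap interval
    inside = (y_onehot >= pi_lower) & (y_onehot <= pi_upper)
    picp_per_class = inside.mean(axis=0)  # coverage per class
    picp_overall = inside.mean()          # overall coverage (this is what exceeds 0.99)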
D.K.
  • (1) What is OA? (2) I don't quite understand what you are doing, especially in terms of having both classes and PIs. Are you calculating PIs for class membership probabilities? But then your observable outcome can only be 0 or 1. It does not look like PIs make a lot of sense for unobservables like membership probabilities, or at least it will be hard to evaluate them based on the actual observations. Or are you calculating PIs for observables between 0 and 1? – Stephan Kolassa Jan 23 '24 at 07:43
  • @StephanKolassa (1) Stands for overall accuracy, or simply accuracy, see here. (2) Yes, the PIs are calculated on class probabilities, while the observations are 0 and 1. What I want to show is that it is much more difficult for the estimated model than for the true model to correctly classify observations, hence why I wanted to show the differences in PIs (and also why I linked that question on logistic regression). – D.K. Jan 23 '24 at 14:34
  • Hm. The problem I see is that if you just want narrow PIs ("the PI is narrow, so the system is very confident here"), then this invites gaming. That is why one typically assesses PIs using the Winkler score, which includes terms for both sharpness and calibration (and is consistent for PIs). The problem here is that our observations don't match the PIs, so the Winkler score does not work. – Stephan Kolassa Jan 23 '24 at 14:56
  • @StephanKolassa That's a good point. How would you compare the two models (meaning, the 2 datasets) for this task to try to show that the estimated coordinate model is much worse? As I said, I'd usually rely on just model accuracy metrics, but I'd like to bring additional proofs. – D.K. Jan 24 '24 at 08:45
  • This is a hard problem. The standard way of evaluating probabilistic classifiers is of course via proper scoring rules, but those assess point predictions of class membership probabilities, not the prediction intervals (of unobservable probabilities) you are interested in. Given a large sample size, the Winkler score may actually work by relying on limits... but I don't know of any work in this direction. – Stephan Kolassa Jan 24 '24 at 09:24
  • @StephanKolassa Thanks for your contribution so far. My plan was to split the comparison into different steps, starting with performance metrics: OA and a pseudo R2 based on logloss (formulation: 1 - (model logloss / naive classifier logloss)) for overall model performance, and then balanced accuracy / F1 score / the logloss-based R2 to assess each class individually. PIs and probability coverage would have come later. Misclassification rate would still be important for my case, as the highest predicted probability would be the model response. Should I just stop at step 1? – D.K. Jan 24 '24 at 16:56
  • To be honest, I would dispense with all of OA, balanced accuracy, F1 and misclassification rate, because they all suffer from the exact same issues as accuracy. (I strongly suspect they also reward uncalibrated PIs.) Instead, stick with a proper scoring rule per above. R^2 is essentially the Brier Score. See also this thread. – Stephan Kolassa Jan 24 '24 at 17:03
  • @StephanKolassa Those are great resources and one of the reasons why I use probabilities for my classifiers (those will be predictions used to produce maps, so each probability class will have its own map and the user can check the probability values for each pixel). A better approach would then be to make a "fusion" of my steps 1 and 2: stick only to the Brier score and provide the one resulting from the 100 bootstrapped iterations, overall and per probability class, to see which class is easier/harder to map; a sketch of that computation is below. – D.K. Jan 24 '24 at 19:49
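
A sketch of that last step, continuing the illustrative code above (it only shows the computation, not my actual pipeline): the multiclass Brier score per bootstrap replicate, its per-class components, and the logloss-based pseudo R2 against a naive classifier that always predicts the training class frequencies.

    from sklearn.metrics import log_loss

    y_onehot = np.eye(n_classes)[y_val]
    labels = np.arange(n_classes)

    # per-class Brier components for each bootstrap replicate: shape (n_boot, n_classes)
    brier_per_class = ((boot_probs - y_onehot) ** 2).mean(axis=1)
    # multiclass Brier score per replicate (sum of the per-class components)
    brier_overall = brier_per_class.sum(axis=1)

    # pseudo R2 per replicate: 1 - (model logloss / naive classifier logloss),
    # where the naive classifier always predicts the training class frequencies
    class_freq = np.bincount(y_train, minlength=n_classes) / len(y_train)
    naive_probs = np.tile(class_freq, (len(y_val), 1))
    pseudo_r2 = np.array([
        1 - log_loss(y_val, p, labels=labels) / log_loss(y_val, naive_probs, labels=labels)
        for p in boot_probs
    ])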

0 Answers