
Assume we have a probabilistic forecast for a continuous variable and want to validate how good our estimate was. For that, we can use various scoring rules (e.g. the CRPS or the logarithmic score), or, if we obtained a prediction interval from our probabilistic forecast, we can assess it using the prediction interval coverage probability (PICP). For scoring rules, we normally assume that lower values indicate a better probabilistic prediction.
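As a minimal illustration (not part of the original question), here is a sketch of how the CRPS could be computed for a Gaussian forecast using its well-known closed-form expression (Gneiting & Raftery, 2007); all data and parameters below are made up for the example.

    import numpy as np
    from scipy.stats import norm

    def crps_gaussian(y, mu, sigma):
        # closed-form CRPS of a Gaussian forecast N(mu, sigma^2) against observation y
        # (Gneiting & Raftery, 2007)
        z = (y - mu) / sigma
        return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

    # toy forecasts from two hypothetical models for the same observations
    y = np.array([1.2, 0.7, 2.1, 1.5])
    crps_A = crps_gaussian(y, mu=np.array([1.0, 0.8, 2.0, 1.4]), sigma=0.5).mean()
    crps_B = crps_gaussian(y, mu=np.array([0.2, 1.9, 1.0, 2.6]), sigma=1.5).mean()
    print(crps_A, crps_B)  # lower mean CRPS indicates the better probabilistic forecast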

For example: if Model A returns $\mathrm{CRPS}_A = 0.5$ and Model B returns $\mathrm{CRPS}_B = 1.5$, we conclude that Model A is better than Model B since $\mathrm{CRPS}_A < \mathrm{CRPS}_B$. However, do we know whether $\mathrm{CRPS}_A$ indicates a "good" value per se? If I had obtained 0.5 without any reference such as Model B, is there any way to tell whether it was a "good" performance?

The PICP may help with that, since we can observe how much the PICP deviates from the nominal coverage probability of the PI. For example, if Model A gives me a coverage of 98% (PICP = 98%) for a 90% PI, I know it is a rather bad model, whereas a PICP of 90.2% is rather good. For that, I did not necessarily have to compare it with another Model B, since the PICP follows an intuitive logic.
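Again purely as an illustration (not from the original question), a minimal sketch of how the PICP could be computed from predicted interval bounds; the observations and bounds are hypothetical.

    import numpy as np

    def picp(y, lower, upper):
        # share of observations that fall inside their prediction intervals
        return np.mean((y >= lower) & (y <= upper))

    # hypothetical 90% prediction intervals for four observations
    y     = np.array([1.2, 0.7, 2.1, 1.5])
    lower = np.array([0.5, 0.1, 1.0, 0.9])
    upper = np.array([1.8, 1.4, 2.5, 2.2])
    print(picp(y, lower, upper))  # compare against the nominal level of 0.90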

However, the PICP also has some disadvantages (see here), so I am wondering whether there is any other metric that could be used for the validation of probabilistic forecasts and that is intuitive in its output. Maybe something comparable to $R^2$ or the concordance correlation coefficient (CCC) for point predictions.

I am curious to hear your suggestions!

edited by Glorfindel · asked by Jonas S
  • Welcome to Cross Validated! In what other setting would you know what constitutes “good” performance? Perhaps if you draw an analogy to a setting where you see this work, it will be easier to point you to something that will work for predicting probabilities. – Dave Sep 06 '22 at 18:49
  • Regarding PI coverage, take a look at the introductory example in Gneiting, Balabdaoui & Raftery (2007). It's given in terms of uniform PITs, but I believe you can reformulate it to prediction intervals and find that objectively suboptimal forecasts can yield PIs that have the "correct" coverage. ... – Stephan Kolassa Sep 06 '22 at 19:06
  • ... That said, I second @Dave. You essentially know how well your model performs, and you wonder whether this is "as good as possible". How to know that your machine learning problem is hopeless? is directly applicable - although it is posed in general terms (and my answer essentially looks at point forecast accuracy), the exact same logic applies to probabilistic forecasts and scoring rules. With the problem that it's even harder to "debug" probabilistic forecasts than point forecasts. – Stephan Kolassa Sep 06 '22 at 19:07
  • Hello everyone, thanks for your replies! To give a bit more background: I am aware that the PICP can give "correct" coverage when the upper and lower quantiles are skewed by the same factor. That is exactly why I am asking this question: in the field I am working in, the PICP is given as the only metric for validating uncertainty estimates based on a single model. I find this suboptimal and was looking for another metric that can be meaningful if you do not have a reference... – Jonas S Sep 06 '22 at 19:30
  • ... I know scoring rules exist and can be used for the validation of probabilistic forecasts, but they only become meaningful when you compare their value to a reference, like in the example I gave with Model A and Model B. However, if I only have a Model A and it cannot be compared to something else, how do I know whether a probabilistic forecast was "good"? – Jonas S Sep 06 '22 at 19:33
  • You mention $R^2$ as being satisfactory, but that does implicitly use a reference model: the constant model. You can do similar things with any proper scoring rule and any acceptable (problem-specific) simple benchmark model (see the sketch after these comments). – Chris Haug Sep 07 '22 at 00:37
  • @ChrisHaug That’s spot-on and really at the heart of my earlier comment. – Dave Sep 07 '22 at 00:44
  • Thank you for your suggestion. Just to clarify whether I understood you correctly: you suggest that I simply compare, say, Model A to a constant model using some scoring rule? Or can I do more based on that comparison? – Jonas S Sep 07 '22 at 08:50
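A minimal sketch (not from any of the commenters) of what such a benchmark-relative comparison might look like: a CRPS-based skill score against a constant "climatological" forecast, in the same spirit in which $R^2$ compares against the constant model. The crps_gaussian helper and the toy numbers repeat the hypothetical first snippet above; in practice the benchmark would be fitted on training data rather than on the evaluation observations.

    import numpy as np
    from scipy.stats import norm

    def crps_gaussian(y, mu, sigma):
        # closed-form CRPS of a Gaussian forecast (Gneiting & Raftery, 2007)
        z = (y - mu) / sigma
        return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

    # hypothetical observations and Model A forecasts (same toy numbers as above)
    y = np.array([1.2, 0.7, 2.1, 1.5])
    crps_A = crps_gaussian(y, mu=np.array([1.0, 0.8, 2.0, 1.4]), sigma=0.5).mean()

    # benchmark: a constant Gaussian fitted to the observations
    crps_bench = crps_gaussian(y, mu=y.mean(), sigma=y.std(ddof=1)).mean()

    # CRPS skill score: 1 = perfect, 0 = no better than the benchmark, < 0 = worse
    print(1 - crps_A / crps_bench)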

0 Answers