5

An alternative to point, interval and density forecasts/predictions would be "predictive highest density regions (pHDRs)", i.e., HDRs for the conditional density of a yet-unknown future observable.

A natural question would be how to evaluate a pHDR once we have observed the corresponding observable. This is analogous to point forecast error measures or prediction interval scores. (Note that the interval score cannot be applied to a one-dimensional pHDR, which may be the union of multiple intervals.)

Is there such a quality measure for pHDRs? The best I could think of is to test the coverage achieved against the nominal value, but this disregards the volume of the pHDR, which we want to be as small as possible.

Stephan Kolassa
  • 123,354

3 Answers

4

Maybe a variation of the Winkler score would work. Let the $100(1-\alpha)$% HDR be given by $R_\alpha$. Then the score could be $$s_\alpha + \frac{2}{\alpha}1(y\not\in R_\alpha)$$ where $s_\alpha$ is the total size of the region (i.e., the sum of the lengths of the sub-intervals).
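A minimal sketch of this score, assuming the $100(1-\alpha)$% HDR is supplied as a list of disjoint sub-intervals (that representation is my assumption, not part of the answer):

```python
# Hedged sketch: score = s_alpha + (2/alpha) * 1(y not in R_alpha),
# where the region is a list of disjoint (lower, upper) sub-intervals.

def hdr_score(region, y, alpha):
    """Total size of the region plus a 2/alpha penalty if y falls outside it."""
    size = sum(u - l for (l, u) in region)           # s_alpha
    covered = any(l <= y <= u for (l, u) in region)  # 1(y in R_alpha)
    return size + (0.0 if covered else 2.0 / alpha)

# Example: a bimodal 80% HDR consisting of two intervals
region = [(-2.0, -0.5), (0.5, 2.0)]
print(hdr_score(region, y=1.0, alpha=0.2))  # inside: 3.0 (just the size)
print(hdr_score(region, y=3.0, alpha=0.2))  # outside: 3.0 + 10.0 = 13.0
```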

However, note that the HDR is the smallest region with specified coverage by definition. So just checking the coverage is probably ok.

Rob Hyndman
  • 56,782
0

Is there such a quality measure for pHDRs?

A relative quality measure is the Bayes factor comparing two model specifications $\alpha_i$ and $\alpha_j$ for a single observation $Y_t$. Model weighting in the mixture-model context is based on the relative predictive densities of the observations $Y_t$:

$$ H_t\left(\alpha_i,\alpha_j\right) = \frac{p\left(Y_t | \alpha_i,D_{t-1}\right)}{p\left(Y_t | \alpha_j,D_{t-1}\right)} $$

See West, Mike, and Jeff Harrison. "Bayesian Forecasting and Dynamic Models" (1997), Ch. 12.2, Multi-Process Models: Class I.
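As a toy illustration of $H_t$, here is a sketch assuming both models happen to issue Gaussian one-step-ahead predictive densities; the predictive means and standard deviations are invented for illustration, not derived from any actual $D_{t-1}$:

```python
# Hedged sketch: Bayes factor H_t(alpha_i, alpha_j) as the ratio of
# predictive densities at the realized observation y_t. The Gaussian
# predictive parameters below are illustrative assumptions.
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_factor(y_t, pred_i, pred_j):
    """H_t = p(y_t | alpha_i, D_{t-1}) / p(y_t | alpha_j, D_{t-1})."""
    return normal_pdf(y_t, *pred_i) / normal_pdf(y_t, *pred_j)

# Model i predicts N(0, 1), model j predicts N(2, 1); we observe y_t = 0.5
h_t = bayes_factor(0.5, (0.0, 1.0), (2.0, 1.0))
print(h_t)  # ~2.718 (> 1), so this observation favours model i
```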


The above is slightly off from what you requested; it compares the predictive density functions of two models rather than providing a performance measure for the pHDR of a single model.

However, it is perhaps in the spirit of remaining open to better models, as discussed in Skilling, John. "Nested sampling for general Bayesian computation." (2006).

krkeane
  • 2,190
0

If you want to have a quality measure analogous to Winkler's interval score you mentioned, then this measure is probably very complex, or might not exist at all.

Forecast specification and the term 'prediction interval'

Firstly, I would argue that the term 'prediction interval' can be misleading in this setting. If we want to report an interval instead of a single value or a predictive distribution, then we have to specify what statistical property that interval represents. Simply saying that we want interval forecasts is like saying we want point forecasts, but not specifying whether they represent the mean, a quantile, or something else. We cannot do proper statistical forecast evaluation without this information.

A prediction interval is usually understood to be an interval in which a future observation will fall with a specified probability. However, predictions in an interval format could be specified in lots of ways, e.g. the values between the mean and the mode or between the 30%- and 90%-quantile of the conditional predictive density might be of interest. Naming all predictions of these quantities 'prediction intervals' could lead to confusion.

Winkler's interval score

If we have specified the type of interval forecast, then we need metrics for its evaluation. If our aim is to compute the expected loss of a collection of interval forecasts, then we should choose the scoring/loss function such that it is consistent for the type of interval which was predicted, i.e. the true interval should minimize the loss function in expectation. This is analogous to the consistency of the squared error for the mean: the expected squared error is minimized by the distribution mean only. Hence, if you don't have mean forecasts, you should not use the squared error. Consistent loss functions for point and interval forecasts can be seen as the counterpart of proper scoring rules for distributional forecasts.
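The consistency claim for the squared error can be checked numerically. The following sketch (with an arbitrary exponential example distribution and a hand-picked candidate set, both my own choices) shows that the candidate closest to the distribution mean attains the smallest average squared error:

```python
# Hedged sketch: among candidate point forecasts, the average squared
# error over a large sample is minimized near the distribution mean.
import random

random.seed(0)
# Exponential(1) has mean 1 and median ~0.69, so mean and median differ.
sample = [random.expovariate(1.0) for _ in range(100_000)]

def mean_sq_err(forecast):
    return sum((y - forecast) ** 2 for y in sample) / len(sample)

candidates = [0.5, 0.69, 1.0, 1.5]
best = min(candidates, key=mean_sq_err)
print(best)  # 1.0 -- the candidate at the true mean wins, not the median
```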

For $\alpha > 0$ the scoring/loss function $$ L([\ell,u] , y) = (u-\ell) + \frac{2}{\alpha}(\ell-y)1(y<\ell) + \frac{2}{\alpha}(y-u)1(y>u) $$ is often called Winkler's interval score. It compares the interval forecast $[\ell, u]$ to the observation $y$, and it is consistent for the interval defined by the $\frac{\alpha}{2}$- and $(1- \frac{\alpha}{2})$-quantiles. Consequently, it does not make sense to use $L$ for the evaluation of any sort of prediction interval (in the sense mentioned above) which does not meet this definition. Even if you are in a setting where the predicted highest density region is always an interval, evaluating it with Winkler's interval score is meaningless; it is like using the mean squared error to evaluate quantile forecasts.
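For concreteness, here is a direct implementation of this score, together with a small simulation suggesting that the central interval with the correct quantile endpoints attains a lower average score than a shifted interval of the same length; the $N(0,1)$ example and the specific candidate intervals are my own choices:

```python
# Hedged sketch: Winkler's interval score L([l, u], y) and a numerical
# check that the (alpha/2, 1 - alpha/2)-quantile interval scores best.
import random

def interval_score(l, u, y, alpha):
    """Interval length plus 2/alpha-weighted distance on a miss."""
    score = u - l
    if y < l:
        score += (2.0 / alpha) * (l - y)
    elif y > u:
        score += (2.0 / alpha) * (y - u)
    return score

random.seed(1)
ys = [random.gauss(0.0, 1.0) for _ in range(200_000)]

true_iv = (-1.2816, 1.2816)  # 10%- and 90%-quantiles of N(0, 1), alpha = 0.2
shifted = (-0.7184, 1.8448)  # same length, shifted to the right

def avg_score(iv, alpha=0.2):
    return sum(interval_score(iv[0], iv[1], y, alpha) for y in ys) / len(ys)

print(avg_score(true_iv) < avg_score(shifted))  # True: the true interval wins
```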

Loss functions for the highest density region

Let's assume for a moment that we are in a situation where the predictive highest density (p.h.d.) region is always an interval. A reasonable approach would then be to find a loss function which is consistent for this p.h.d. interval and use expected losses for evaluation. Unfortunately, a paper (see below) shows that under some regularity assumptions, there is no loss function which is consistent for the p.h.d. interval (called the 'shortest interval' therein).

Now if we cannot find a consistent loss function in this simplified setting, where the p.h.d. region is always an interval, then I doubt that it is possible to find one for the more general case. And if evaluation cannot be done via a loss function, then this suggests that the suitable quality measures for p.h.d. regions are rather complicated.

References: