
It is a forecasting problem. I need an evaluation metric that penalizes under-predictions more than over-predictions. I also want its range to lie in a certain interval (say 0-100), so that it is easier to compare different models. What are possible evaluation metrics / solutions for this?

I've tried MSE, $R^2$ score, MAPE, WMAPE, AMAPE, and SMAPE so far.

Richard Hardy

1 Answer


The evaluation function and the scaling are distinct issues in my mind. To me, scaling to $0$-$100$ is straightforward: compare to a reasonable baseline model. This is what the usual $R^2$ does by comparing the square loss of your model to the square loss of a baseline model that always predicts the overall mean (I argue here for using the in-sample mean, a stance supported by the statistics literature). For your time series problem, it might be reasonable to compare to a moving target, since you gain more information as the time series grows; this is discussed here with a reference to an article from the Review of Financial Studies. I simulate something like this here. Once you have the performance of your baseline model, you do a familiar calculation.

$$
1-\dfrac{\text{Performance of your model}}{\text{Performance of the baseline model}}
= \dfrac{\text{Performance of the baseline model}-\text{Performance of your model}}{\text{Performance of the baseline model}}
$$

(It might make more sense if the measure of performance is $0$ when the predictions exactly match the true values; I discuss such an issue here (look for "...annoyingly..."). Most measures of performance will give you this, e.g., square loss, absolute loss, and cross-entropy loss.)

A possible issue with an $R^2$-style comparison to a baseline is that it will be less than zero if the performance is worse than the baseline. That falls outside your desired range, but there is no limit to how bad the predictions can be, so I am not sure there should be a lower bound.
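To make the calculation concrete, here is a minimal sketch in Python/NumPy (the function names and toy numbers are my own illustration, not from the answer) of the scaled score using square loss and a baseline that always predicts the in-sample mean:

```python
import numpy as np

def square_loss(y_true, y_pred):
    """Mean squared error between forecasts and realized values."""
    return np.mean((y_true - y_pred) ** 2)

def scaled_score(y_true, y_pred, y_baseline, loss=square_loss):
    """R^2-style skill score: 1 - loss(model) / loss(baseline).

    Equals 1 for perfect predictions, 0 when the model only matches the
    baseline, and is negative when the model is worse than the baseline.
    """
    return 1.0 - loss(y_true, y_pred) / loss(y_true, y_baseline)

# Toy example: the baseline always predicts the in-sample (training) mean.
y_train = np.array([10.0, 12.0, 11.0, 13.0])
y_test = np.array([12.0, 14.0, 13.0])
y_model = np.array([11.5, 13.0, 14.0])           # your model's forecasts
y_base = np.full_like(y_test, y_train.mean())    # baseline forecasts

print(scaled_score(y_test, y_model, y_base))     # close to 1 is good
```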

To do this, you have to consider an appropriate way to measure the quality of your model and the baseline. Sum of squared deviations/residuals/errors is a popular choice, as is the sum of absolute deviations. However, you have mentioned an uneven penalty for missing high and missing low. Quantile loss and tilted square loss (not sure of a common name for it) might fit the bill, as they allow you to give different penalties for missing high by $\delta$ and missing low by $\delta$. Then you calculate the quantile or tilted square loss for your model and the baseline model, sticking those into the expression above to get your scaled score.
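As an illustration of such asymmetric losses, here is a sketch (the choice $\tau = 0.75$ and the exact weighting in the tilted square loss are my assumptions, since there is no standard name or form for the latter); either one can be computed for your model and for the baseline and plugged into the scaled score above:

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau=0.75):
    """Pinball/quantile loss. With tau > 0.5, under-predictions
    (y_true > y_pred) cost tau per unit of error, while over-predictions
    cost only 1 - tau per unit."""
    e = y_true - y_pred
    return np.mean(np.maximum(tau * e, (tau - 1.0) * e))

def tilted_square_loss(y_true, y_pred, tau=0.75):
    """One way to 'tilt' the square loss (my reading, not a standard form):
    weight squared errors by tau when the model predicts too low and by
    1 - tau when it predicts too high."""
    e = y_true - y_pred
    weights = np.where(e > 0, tau, 1.0 - tau)
    return np.mean(weights * e ** 2)

# Example: missing low by 1 hurts three times as much as missing high by 1.
y_true = np.array([12.0, 14.0, 13.0])
print(quantile_loss(y_true, y_true - 1.0))   # under-prediction: 0.75
print(quantile_loss(y_true, y_true + 1.0))   # over-prediction:  0.25
```

Both losses are zero when the predictions exactly match the true values, so the scaled score behaves as described above.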

I encourage readers to consider the comment by Stephan Kolassa about determining what you want to forecast, too, as that can influence whether you want quantile loss, tilted square loss, or something else.

Dave
  • Scaling is a good thing, and also used in forecasting a lot. However, it will not necessarily map into some predefined interval, like [0,100], or anything else. – Stephan Kolassa Mar 05 '24 at 16:54
  • @StephanKolassa I'm struggling to see how what I wrote could lead to a value outside of $\left(-\infty, 1\right]$. Do you have an example of non-negative $x$ and $y$ such that $\frac{y - x}{y}>1?$ (I think $1$ is the least upper bound.) Have I misinterpreted your comment? – Dave Mar 05 '24 at 20:42
  • As long as both "performances" are positive, I agree that your result will be no larger than 1. But the OP is asking for something between 0 and 100. Would you feed your result into something arctan-like, then scale and shift? Sure, you can do that and end up in the correct interval... it will just not be interpretable any more. – Stephan Kolassa Mar 05 '24 at 21:00