
When using a machine-learning model to predict a quantity for sub-populations of different sizes, I'd like to compare the error across sub-populations (e.g., future population per country over a group of countries).

I want to quantify the fit of the model. When looking at prediction errors, the absolute error is higher for larger countries, so it is not a good way to evaluate model fit across countries. A natural step would be to look at the percentage error. However, this results in larger errors for smaller countries, because the sample size is lower and the variance is higher.

Is there a good measure of error which is independent of both scale and sample size?

bbrame
  • "However, this results in larger errors for smaller countries because the sample size is lower and the variance is higher." This does not make sense to me. Could you please elaborate? Shouldn't it be the case that, if the sample size for such a country is low, then you lack precision in your estimates? Sure, it would be nice to tighten-up your estimates (it would be nice to get every prediction perfect every time), but your result seems to reflect the reality of your predictions. – Dave Feb 09 '23 at 00:47
  • @Dave you're probably right for this example. My actual situation involves loads of fertilizer. It's not just a sampling problem - there are fewer loads for smaller companies. So a smaller change in loads from year to year has a larger effect on the percentage change. – bbrame Feb 09 '23 at 13:32
  • It still seems like this is just the reality of your problem. If you miss by one load of fertilizer for a company that usually gets two, your miss really is a huge deal. If you miss by one load of fertilizer for a company that usually gets thousands, they might not notice. – Dave Feb 09 '23 at 13:43

1 Answer


You can scale error measures. Two ways are common:

  • Scaling by the actual level of the time series. For example, you can scale each absolute error by dividing it by the corresponding actual and then average; this is the Mean Absolute Percentage Error (MAPE). Alternatively, people also scale the sum of the absolute errors by the sum of the actuals, which can be interpreted as a weighted MAPE. One question here is whether to use historical sums (or means, which make more sense if your time series have different lengths).

  • Scaling by a benchmark error. For instance, you could scale each series' Mean Absolute Error (MAE) by the MAE achieved on that series by a random walk forecast, or by the historical average. The effect is to quantify how much your focal forecasting method improves over the benchmark. A variant is the Mean Absolute Scaled Error (MASE), which scales the MAE by the MAE achieved by the one-step-ahead naive forecast in-sample. Both kinds of scaling are sketched in code after this list.

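For concreteness, here is a minimal sketch of both kinds of scaling in Python with numpy. This is my own illustration; the function names and the toy numbers are made up, not taken from any particular library:

```python
import numpy as np

def mape(actuals, forecasts):
    # Mean Absolute Percentage Error: average each absolute error
    # divided by the corresponding actual. Undefined if any actual is zero.
    return np.mean(np.abs(actuals - forecasts) / np.abs(actuals))

def weighted_mape(actuals, forecasts):
    # "Weighted" MAPE: sum of absolute errors divided by sum of actuals,
    # so periods (or series) with larger actuals carry more weight.
    return np.sum(np.abs(actuals - forecasts)) / np.sum(np.abs(actuals))

def mase(actuals, forecasts, insample):
    # Mean Absolute Scaled Error: out-of-sample MAE scaled by the MAE of
    # the one-step-ahead naive (random walk) forecast on the in-sample data.
    naive_mae = np.mean(np.abs(np.diff(insample)))
    return np.mean(np.abs(actuals - forecasts)) / naive_mae

# Hypothetical example: a short history and a two-step holdout.
history   = np.array([100.0, 102.0, 98.0, 105.0, 103.0])
actuals   = np.array([107.0, 110.0])
forecasts = np.array([104.0, 112.0])

print(mape(actuals, forecasts))           # scale-free via division by actuals
print(weighted_mape(actuals, forecasts))  # scale-free, weighted by actuals
print(mase(actuals, forecasts, history))  # scale-free via a benchmark error
```

MAPE and weighted MAPE are scale-free because they divide by the actuals; MASE is scale-free because its numerator and denominator are both on the scale of the series.
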
You can also scale other error measures, like the Root Mean Squared Error (RMSE), in either of these two ways. What is important in any case:

  • Be transparent and precise about what you are doing, e.g., whether you take sums or means (and whether before or after dividing), especially if you have to communicate your errors, or if other people also calculate errors.
  • Note that series on smaller scales are often harder to forecast, simply because they have more relative noise.
  • Note that minimizing the MAPE rewards forecasts that are biased low (see the small simulation after this list). Similar effects occur if you minimize the MAE or a scaled variant thereof, but the effect is likely small if you are forecasting demographics, which presumably have nice symmetric conditional distributions.
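To see this bias concretely, here is a small self-contained simulation. It is my own illustration, not part of the original answer, and it assumes right-skewed, strictly positive actuals drawn from a lognormal distribution. The constant forecast that minimizes the MAPE lands well below the MAE-optimal forecast, which is the median:

```python
import numpy as np

rng = np.random.default_rng(1)
# Right-skewed, strictly positive "actuals"; the median of this lognormal is 1.
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# Grid-search constant forecasts and find the minimizer of each error measure.
candidates = np.linspace(0.05, 3.0, 600)
mape_values = [np.mean(np.abs(y - f) / y) for f in candidates]
mae_values  = [np.mean(np.abs(y - f)) for f in candidates]

print("MAPE-optimal forecast:", candidates[np.argmin(mape_values)])  # roughly 0.37: biased low
print("MAE-optimal forecast: ", candidates[np.argmin(mae_values)])   # roughly 1.0: the median
```

For roughly symmetric conditional distributions the two optima nearly coincide, which is why the effect should be minor for demographic series like the ones in the question.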

To be honest, I don't think there is a single "good" error measure. All have hidden pitfalls and drawbacks. It's important to understand which predictions your error measure rewards. See here.

Stephan Kolassa