
My case seems to me like it should be quite common, yet I cannot find any information about it.

The situation is as follows: there is a regression model, and for each predicted value, there are multiple true values. For example:

[Image: a table in which each row holds one predicted value and several corresponding true values]

What methods are there to evaluate the predictions?

Of course, I can evaluate predictions relative to the median value or the mean.

For example, AE would be: $$ AE_i = \mid\hat{y_i} - \frac{1}{3}({y_i}_1 + {y_i}_2 + {y_i}_3)\mid = \mid\hat{y_i} - \bar{y_i}\mid $$

where $\hat{y_i}$ is the i-th prediction, $\bar{y_i}$ is the mean of true values ${y_i}_1, {y_i}_2, {y_i}_3$.

However, errors relative to the mean or median do not reflect the variance of the true values. My idea is that if the prediction falls inside the range of the true values, the error should be significantly smaller compared to the error relative to the mean, or even equal to zero.

I can think of several metrics; a short code sketch of all three follows the list.

  1. Percentage of predictions falling inside the corresponding range of true values. This metric, however, does not reflect how bad the predictions that fall outside the ranges are.
  2. "Relative absolute error". The sum of distances between the prediction and the real values divided by the sum of distances between the real values and their mean. $$ e_i = \frac{\sum_j{\mid \hat{y_i} - {y_i}_j \mid}}{\sum_j{\mid{y_i}_j - \bar{y_i} \mid}} $$ where $\bar{y_i}$ is the mean of real values ${y_i}_1, {y_i}_2, {y_i}_3$.
  3. Absolute error relative to quantiles. For example, to evaluate predictions relative to quantiles 0.1 and 0.9:

$$ E_i = \begin{cases} {Q_i}_{0.1} - \hat{y_i}, & \text{if }\enspace \hat{y_i} < {Q_i}_{0.1} \\ 0, & \text{if }\enspace {Q_i}_{0.1} \le \hat{y_i} \le {Q_i}_{0.9} \\ \hat{y_i} - {Q_i}_{0.9}, & \text{if }\enspace {Q_i}_{0.9} < \hat{y_i} \end{cases} $$

where ${Q_i}_{0.1}$ and ${Q_i}_{0.9}$ are the corresponding quantiles of the true values ${y_i}_1, {y_i}_2, {y_i}_3, ...$
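A minimal NumPy sketch of all three metrics for a single prediction with its set of true values (the function and variable names here are mine, chosen only for illustration):

```python
import numpy as np

def inside_range(y_hat, y_true):
    # Metric 1 (per entity): 1.0 if the prediction falls inside [min, max] of the
    # true values, else 0.0; averaging over all entities gives the percentage inside.
    return float(y_true.min() <= y_hat <= y_true.max())

def relative_absolute_error(y_hat, y_true):
    # Metric 2: sum of |y_hat - y_ij| divided by the sum of |y_ij - mean(y_i)|.
    return np.abs(y_hat - y_true).sum() / np.abs(y_true - y_true.mean()).sum()

def quantile_band_error(y_hat, y_true, lo=0.1, hi=0.9):
    # Metric 3: distance to the nearest of the two quantiles, zero inside the band.
    q_lo, q_hi = np.quantile(y_true, [lo, hi])
    if y_hat < q_lo:
        return q_lo - y_hat
    if y_hat > q_hi:
        return y_hat - q_hi
    return 0.0

# One entity: a point prediction and several measured execution times (made-up numbers).
y_hat = 12.0
y_true = np.array([10.5, 11.2, 11.8, 12.9, 14.0, 19.5])  # right-skewed, with an outlier
print(inside_range(y_hat, y_true),
      relative_absolute_error(y_hat, y_true),
      quantile_band_error(y_hat, y_true))
```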

In my case, I have 10 to 90 true values for each prediction. Their distribution is not symmetrical (otherwise, errors relative to the median could be compared to the spread or range width of the true values).

Do you think any of the metrics 1.-3. could be useful? What are more common ways to evaluate predictions in such cases?

Some specifics of my case. I want to predict the computation time of a piece of code. The computation time naturally fluctuates, mostly towards larger values (longer computation times), and it also has outliers. Since the execution times vary, I make multiple measurements. Execution time depends on the hardware and some other parameters. I conduct similar sets of measurements on several machines, for different data sizes (varying one parameter that changes the size of the data used in the computations and thus the execution time) and for different code pieces (defined by a large set of parameters). Thus, for each combination of machine, piece of code and data size, I have multiple execution time measurements. Across machines, code pieces and data sizes, both the magnitude and the spread (variance) of the execution times differ. To mitigate uncertainty, I take more measurements when the variation in execution time is large.

I want to build a prediction model that, for a given machine, piece of code and data size, predicts the execution time. To evaluate the model's predictions, I want a metric that not only tells how far from the median time the predictions are, but also takes the spread of the real values (the measured execution times) into account, because if the spread is large, predictions farther away from the median can still be considered good predictions.

Peter
  • Where do the multiple true values come from? Can you treat them as separate observations? – shadowtalker Jul 15 '23 at 13:18
  • Yes, they are separate observations. – Peter Jul 15 '23 at 16:24
  • So is it like a repeated measures ANOVA where you have three observations for Peter, three for Paul, and three for Mary? – Dave Jul 15 '23 at 16:30
  • All observations are of the same nature. To be specific, they are the measured times of the same computations. Some random factors cause fluctuations in the computation time. – Peter Jul 15 '23 at 16:44
  • Do you have three different stopwatches timing a process (so to speak)? – Dave Jul 15 '23 at 17:05
  • Yes, you can think that I have multiple stopwatches. – Peter Jul 15 '23 at 18:04
  • Please give more detail by editing the question to elaborate on that. Depending on those details, I have several ideas, as will others. – Dave Jul 15 '23 at 18:41
  • The key point (that I will keep emphasizing) is how you wound up with multiple measurements. Depending on that answer, I can think of several ways to handle this, all of which might be inappropriate for your particular situation, so please flesh out the details. – Dave Jul 16 '23 at 00:41
  • @Peter if I understand you correctly, one way to think about this problem is that the "true value" is some unobservable quantity, and the observed quantities are random draws from a probability distribution centered around the unobservable true value. So one way to evaluate your model is to ask whether the distribution of outcomes that you estimate for each entity is a good fit for the distribution that you observe for each entity. – shadowtalker Jul 16 '23 at 12:10
  • That said, because you are also interested in prediction, you might want to ignore the "distribution" component and focus on ensuring that your individual point predictions just minimize some loss function and accept the truth that you cannot ever account for the random variation in generating your predictions. Then you might want to ensure your predictions have minimal error with respect to the observed average or median for each entity. – shadowtalker Jul 16 '23 at 12:14
  • Where "entity" in this case is some unique combination of calculation attributes (machine, data size, etc). – shadowtalker Jul 16 '23 at 12:18
  • @shadowtalker To clarify my case, I do point forecasts. I have one forecast value and multiple measured true values for each entity. In the table I provided for an illustration at the beginning of my post, each row corresponds to one "entity": a combination of machine, data size and code parameters. – Peter Jul 16 '23 at 12:32
  • I trained prediction models (one for each machine) and evaluated them on a range of data sizes and a large enough number of code variants. To evaluate, I used MAE and MAPE with respect to the median execution time. Now I want to go further and evaluate predictions with a metric that would also consider the range/variance of true values. This does not mean I want to predict their distribution parameters. – Peter Jul 16 '23 at 12:45

1 Answer


If you have only point predictions, choose a point forecast error measure that elicits the functional you are looking for, e.g., the MSE for the mean, or the MAE for the median, or a pinball loss for a quantile prediction. See here.
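To illustrate (a minimal sketch in NumPy, not taken from the linked reference), each of these losses can be computed directly against all observations of one entity; the loss you minimize determines which functional the point forecast is incentivized to hit:

```python
import numpy as np

def mse(y_hat, y_obs):
    # minimized in expectation by the mean of the observations
    return np.mean((np.asarray(y_obs) - y_hat) ** 2)

def mae(y_hat, y_obs):
    # minimized in expectation by the median of the observations
    return np.mean(np.abs(np.asarray(y_obs) - y_hat))

def pinball(y_hat, y_obs, alpha=0.9):
    # minimized in expectation by the alpha-quantile of the observations
    diff = np.asarray(y_obs) - y_hat
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))
```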

Interval predictions can be assessed using interval scores.
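For instance, the interval score of Gneiting & Raftery (2007) for a central $(1-\alpha)$ prediction interval $[l, u]$ penalizes both the width of the interval and observations falling outside it; a sketch:

```python
import numpy as np

def interval_score(lower, upper, y_obs, alpha=0.2):
    # Interval score for a central (1 - alpha) prediction interval [lower, upper];
    # lower scores are better.
    y = np.asarray(y_obs, dtype=float)
    width = upper - lower
    below = (2.0 / alpha) * np.maximum(lower - y, 0.0)
    above = (2.0 / alpha) * np.maximum(y - upper, 0.0)
    return np.mean(width + below + above)
```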

Full predictive densities can be evaluated using proper scoring rules. More information can be found here.
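As one example, the CRPS (a popular proper scoring rule) can be estimated directly from predictive samples; a sketch of the usual ensemble estimator $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$:

```python
import numpy as np

def crps_ensemble(samples, y):
    # Estimate the CRPS of a predictive distribution given by samples x_1..x_m
    # against one observation y: E|X - y| - 0.5 * E|X - X'|.
    x = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2
```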

However, if you use your predictions for some specific subsequent decision, the approach of using the best forecast (chosen by one of the error measures above) and then optimizing the decision conditional on the forecast is not guaranteed to yield the best final decision. If so, it might be best to directly assess a decision you make on your data.

Stephan Kolassa
  • Thank you for your informative answer. Unfortunately, I do not see how this can help in my case. I have a point forecast that I want to evaluate with respect to multiple true values. I want metrics that would reflect the distribution of real values. – Peter Jul 20 '23 at 01:24
  • That is my first paragraph. You will need to decide what a "good" point forecast would be. Do you want a point forecast that is incentivized to be the median of the observations? Use the MAE. Should it be the mean of the observations? Use the MSE. Should it be a specific quantile? Use a pinball loss. If you specifically want to address the spread of your outcomes, quantile predictions may be appropriate, or even multiple ones, i.e., an interval forecast. – Stephan Kolassa Jul 20 '23 at 07:17
  • Thanks again! I already have errors with respect to the median. Changing my model is not an option. Thus, I need a metric for evaluating point predictions that reflects the distribution of the real values. – Peter Jul 20 '23 at 12:34
  • I think we are talking past each other. Can you explain what you mean by an evaluation "that reflects the distribution of real values"? If you calculate the MAE for a single point prediction, that will be higher if your observations are spread out farther. (And the conditional median will still minimize the expected MAE. I suspect my point about this is not yet clear. You may find this helpful.) – Stephan Kolassa Jul 20 '23 at 13:22
  • Thanks again for providing the link. If I understand correctly, the paper's main idea is that there is no "best" solution to a point forecast when the real data has uncertainty; and a forecast minimizing one PFEM does not necessarily minimize other PFEMs. – Peter Jul 21 '23 at 01:52
  • My case, however, seems to be different in that I am not seeking the best predictions now. My task is to evaluate existing predictions in a way that makes it as clear as possible how good or bad they are (from a practical perspective). I am predicting a computation time (on a computer). I used AE to show how much the predictions deviate from the median of the real values, and APE to compare the errors to the scale of the real (median) computation time. – Peter Jul 21 '23 at 02:11
  • However, I want to go further and find a metric that would reflect the spread of the real values. The idea is that for a wider spread (higher uncertainty) precise prediction of the median value is not that important (or even possible); therefore, the larger the spread, the smaller the value of the metric should be. – Peter Jul 21 '23 at 02:11
  • And finally, I find that just comparing AE with the spread range is misleading because the distribution of real values is not symmetrical. Thus, even if AE is smaller than the spread (max-min) of the real values, predictions do not necessarily fall inside the range of real values. – Peter Jul 21 '23 at 02:18
  • OK, that helps. It does sound like you will need to experiment a bit with potential error measures to find out which one would be most valid. I would have recommended dividing the AE by some measure of dispersion (difference between max and min, some inter-quantile range, standard deviation), or a function of that, but it seems from your last comment that you don't want that. Maybe use a function of $q_{1-\alpha}-\hat{y}$ (the difference between a high quantile and the prediction) and $\hat{y}-q_\alpha$ (the difference between the prediction and a low quantile) to capture asymmetry? – Stephan Kolassa Jul 21 '23 at 05:58
  • Thank you for taking your time to think about my case! Your ideas are helpful! I also thought of using quantiles to measure AE. Do you think it would be right to assume that AE=0 if the prediction value falls inside the quantiles: $q_{a} \le \hat{y} \le q_{1-a}$? – Peter Jul 22 '23 at 02:22
  • That is certainly a possibility, if it works with what you want to use the predictions for. (I understand you can't change your algorithm, but this does suggest to me to model two quantiles, and perhaps afterwards take the midpoint of the two predicted quantiles.) – Stephan Kolassa Jul 22 '23 at 19:10