
I have the following data over time:

[figure: min/max/avg CPU usage plotted over time]

That is, data collected for a single variable, CPU usage, as its minimum, maximum, and average every 5 minutes (data granularity = 5 min), like the following data frame:

|    | timestamp           |   min cpu |     max cpu |     avg cpu |
|---:|:--------------------|----------:|------------:|------------:|
|  0 | 2017-01-01 00:00:00 |    715147 | 2.2233e+06  | 1.22957e+06 |
|  1 | 2017-01-01 00:05:00 |    700474 | 2.21239e+06 | 1.21132e+06 |
|  2 | 2017-01-01 00:10:00 |    705954 | 2.21306e+06 | 1.20663e+06 |
|  3 | 2017-01-01 00:15:00 |    688383 | 2.18757e+06 | 1.19037e+06 |
|  4 | 2017-01-01 00:20:00 |    688277 | 2.18368e+06 | 1.18099e+06 |

I sliced the data frame to work on a univariate time-series problem (a one-line sketch of this step follows the table):

|    | timestamp           |     avg cpu |
|---:|:--------------------|------------:|
|  0 | 2017-01-01 00:00:00 | 1.22957e+06 |
|  1 | 2017-01-01 00:05:00 | 1.21132e+06 |
|  2 | 2017-01-01 00:10:00 | 1.20663e+06 |
|  3 | 2017-01-01 00:15:00 | 1.19037e+06 |
|  4 | 2017-01-01 00:20:00 | 1.18099e+06 |
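
This slicing step is just column selection; a minimal sketch, assuming the full frame above is loaded as a pandas DataFrame named `df` (the name is my assumption):

# Univariate slice: keep the timestamp and the average CPU series only
# ==============================================================================
# assumes `df` is the min/max/avg frame shown above
data = df[['timestamp', 'avg cpu']].copy()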

I split the data and computed prediction intervals (PIs) with a regression model; a sketch of one possible way to do this follows the table below:

|                     |        pred |   lower_bound |   upper_bound |
|:--------------------|------------:|--------------:|--------------:|
| 2017-01-25 00:00:00 | 1.15232e+06 |   1.12482e+06 |   1.1874e+06  |
| 2017-01-25 00:05:00 | 1.14453e+06 |   1.10052e+06 |   1.18994e+06 |
| 2017-01-25 00:10:00 | 1.14033e+06 |   1.08739e+06 |   1.20795e+06 |
| 2017-01-25 00:15:00 | 1.13669e+06 |   1.0843e+06  |   1.20252e+06 |
| 2017-01-25 00:20:00 | 1.1271e+06  |   1.06837e+06 |   1.19865e+06 |
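
One possible way to produce such intervals is to fit one quantile regression per bound. The sketch below is only illustrative (the lag feature, the gradient-boosting model, the train/test split, and the 80% nominal level are my assumptions, not necessarily what was used above); it reuses the `data` frame from the previous sketch:

# Prediction intervals via quantile regression (illustrative sketch only)
# ==============================================================================
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

data = data.set_index('timestamp')
data['lag_1'] = data['avg cpu'].shift(1)    # simple lag feature as a stand-in regressor
data = data.dropna()

# hold out the last day for testing (288 steps of 5 minutes)
data_train, data_test = data.iloc[:-288], data.iloc[-288:]
X_train, y_train = data_train[['lag_1']], data_train['avg cpu']
X_test = data_test[['lag_1']]

# one model per quantile: median point forecast plus 10% / 90% bounds (80% interval)
models = {q: GradientBoostingRegressor(loss='quantile', alpha=q).fit(X_train, y_train)
          for q in (0.1, 0.5, 0.9)}

predictions = pd.DataFrame({
    'pred':        models[0.5].predict(X_test),
    'lower_bound': models[0.1].predict(X_test),
    'upper_bound': models[0.9].predict(X_test),
}, index=data_test.index)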

[figures: point predictions with lower and upper prediction-interval bounds over the test period]


Question:

Since I'm interested only in the upper_bound (the upper prediction limit; see the image from Jason on the relationship between prediction, actual value, and prediction interval), the framing of my problem changes from a PI problem to a simple target-prediction problem. Then:

1. Does that mean I can simply use normal target-prediction metrics ($MSE$, $MAE$, $MAPE$, $R^2$) instead of PI metrics ($PICP$, $PINC$, $ACE$, $MPIW$, $PINAW$, $PINRW$, $score$) to evaluate the regression/predictive model used, the way it is treated in the package (ref.)? For example:

# Prediction error of the point forecast
# ==============================================================================
from sklearn.metrics import mean_squared_error

error_mse = mean_squared_error(
                y_true = data_test['avg cpu'],
                y_pred = predictions.iloc[:, 0]   # 'pred' column (point forecast)
            )

print(f"Test error (MSE): {error_mse}")

In other words (regardless of how it is treated in the package):

2. Is it acceptable in academic work and papers to use the classic target-prediction metrics ($MSE$, etc.) instead of PI metrics ($PICP$, etc.) for evaluation when you are interested in only one target amongst upper/lower/target in a PI task? (It would be great if you could cite an example paper.)

Thanks in advance


Comments:

  • It is unclear what you're aiming for. Why are you only interested in the upper bound? Why do you produce interval forecasts if you are interested in a single value? – picky_porpoise Mar 09 '24 at 07:21
  • @picky_porpoise, please point out which of Qs 1, 2, or 3 are vague so that I can explain further. Why are you only interested in the upper bound? Because the upper bound gives us a better/fairer measurement for our production. Why do you produce interval forecasts if you are interested in a single value? Apart from the fact that data collection includes CPU max, CPU avg, and CPU min measurements, based on domain knowledge received from domain experts we only consider the avg CPU measurement for our PI task and focus on its upper-bound forecast for our future CPU consumption recommendation system. – Mario Mar 10 '24 at 12:24
  • The questions are not that vague, but without understanding the context they are hard to answer. Your comment mentions 'fair measurement' and 'domain knowledge', but this can mean many things. Can it be that you are interested in a quantile, i.e. what level of CPU usage is not exceeded with a certain high probability? – picky_porpoise Mar 10 '24 at 14:50
  • If you need distinct answers to multiple questions, they should be posted separately. – Dave Mar 10 '24 at 22:14
  • Aside from diverging into three questions, this post is incomprehensible on multiple levels. – Sextus Empiricus Mar 11 '24 at 11:39
  • "Is it fine in academics and papers that one treats the evaluation classic use of normal target prediction metrics (MSE, etc.) instead of PI metrics (PICP, etc.) for evaluation if you are interested in only one target amongst upper/lower/target in PI tasks?" It is difficult to assess how fine it is. What is the background? Optimization of MSE is a decent goal. Sure, if the final goal is prediction interval coverage probability (PICP) then this might be potentially targeted more directly. But it is unclear what the situation is. Sometimes MSE can be a better goal to optimize than PICP. – Sextus Empiricus Mar 11 '24 at 11:44
  • @picky_porpoise Can it be that you are interested in a quantile...? A prediction interval (PI) is a quantification of the uncertainty of a prediction. In this problem we are not interested in the classic forecast (target prediction); we are interested in a range (prediction interval), and we pick the upper bound for our recommendation system. ...what level of CPU usage is not exceeded with a certain high probability? Let's say with a confidence level of 80%; check the setup in my other post. – Mario Mar 11 '24 at 20:03
  • @SextusEmpiricus thanks for your input on Q1. What is the background? Cloud resource consumption. But it is unclear what the situation is. The situation is: computing resources are not always provisioned with the appropriate amount of resources (i.e., CPU, etc.). ...the final goal is prediction interval coverage probability (PICP): this post aims to find/compare PI-related evaluation metrics vs. classic single-target evaluation metrics when solving the range-forecast problem and picking the upper bound as the final target. – Mario Mar 11 '24 at 20:23
  • @SextusEmpiricus Sometimes MSE can be a better goal to optimize than PICP: by that, do you mean that regardless of whether it is a single-target or a range-forecast task, sometimes MSE does a better job? Is there any reference for this point? If you can elaborate further, I would appreciate it. – Mario Mar 11 '24 at 20:30
  • A related post is Could a mismatch between loss functions used for fitting vs. tuning parameter selection be justified? and there are several other posts discussing the idea that training of a model can be based on a different metric than the target performance metric used for validation of the model. I still urge you to narrow your question down to a single, more focused question. When you phrase questions in a way like "Is it fine in academics and papers that one treats...", it also becomes a matter of opinion. – Sextus Empiricus Mar 11 '24 at 21:01
  • @SextusEmpiricus I narrowed down the questions and updated the post. I checked the link you referred to, but I don't see much alignment with my question: as I understand it, it highlights *Reversal of the question*, focusing on the impact of the loss function (penalizing) on training and model selection. My question is about the *potential* of using PI-related versus classic single-target metric evaluations for partially-PI tasks (upper bound only). How comparable are these two evaluation approaches? – Mario Mar 11 '24 at 23:17
  • @Dave I narrowed down the questions and updated the post. Let me know if you have any input. – Mario Mar 12 '24 at 08:31
  • @picky_porpoise I narrowed down the questions and updated the post. Let me know if you have any input. – Mario Mar 12 '24 at 08:33
  • Just to clarify: your dataframe has three columns, min/max/avg cpu. Which one are you forecasting and calculating PIs for? (Also, why give all three in the post here in the first place? I wonder whether there are some differences in assumptions here.) – Stephan Kolassa Mar 21 '24 at 07:19
  • @StephanKolassa In the 2nd paragraph I have shown that I selected the avg cpu column for forecasting as well as for the PI. I explained the whole situation and the data I had from scratch for better understanding, since @picky_porpoise asked me for details. Let me know if you have questions. – Mario Mar 21 '24 at 11:06

1 Answer


Since I'm interested only in the upper_bound (upper prediction limit) ... does that mean I can simply use normal target-prediction metrics ($MSE$, $MAE$, $MAPE$, $R^2$) instead of PI metrics ($PICP$, $PINC$, $ACE$, $MPIW$, $PINAW$, $PINRW$, $score$) to evaluate the regression/predictive model used?

You can use those metrics (and especially for training as explained here: Could a mismatch between loss functions used for fitting vs. tuning parameter selection be justified?).

However, to evaluate the models you might want to use a cost function that is as close as possible to your target goal. If that goal is the accuracy of the upper bound of a prediction interval, then use that as the metric.
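
For instance, if the upper bound is meant to be the 90% quantile of the predictive distribution, its accuracy can be scored with the pinball (quantile) loss instead of MSE against the point forecast. A minimal sketch; the 0.90 level and the `data_test` / `predictions` objects are assumptions carried over from the question:

# Pinball (quantile) loss of the upper bound
# ==============================================================================
import numpy as np

def pinball_loss(y_true, y_quantile, alpha):
    # penalizes actuals above the predicted quantile with weight alpha,
    # and actuals below it with weight (1 - alpha)
    diff = y_true - y_quantile
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

loss_upper = pinball_loss(data_test['avg cpu'].to_numpy(),
                          predictions['upper_bound'].to_numpy(),
                          alpha=0.90)
print(f"Pinball loss of the upper bound (alpha = 0.90): {loss_upper:.1f}")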

Is it fine in academic papers to use the classic target-prediction metrics (MSE, etc.) instead of PI metrics (PICP, etc.) for evaluation if you are interested in only one target amongst upper/lower/target in PI tasks? (It would be great if you cite an example paper)

If the paper presents a study about PIs, then its text should be about PIs and not about MSE.