I'm working on a project on time series multi-step ahead forecasting in Python.
I have a time series, and I fit an ARMA model to it (statsmodels' SARIMAX class). I know that ARMA models, like many other models, when forecasting tomorrow's value output an estimate of the conditional expected value of the process for tomorrow, i.e. an estimate of the mean of the underlying process for tomorrow given its past values.
I also know that tomorrow's value is determined by past values plus tomorrow's shock (error), which comes from a Gaussian distribution with mean 0, like all the other errors (the errors are i.i.d.):
$\epsilon_t \sim \mathcal{N} (0, \sigma^2)$
When fitting the ARMA model on the training set, I estimate the parameters of the true model by maximizing the likelihood, and along with the estimated parameters I obtain their confidence intervals.
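For context, a minimal sketch of what I'm doing (the simulated series and the ARMA(1,1) order are just placeholders for my real data and model selection):

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.statespace.sarimax import SARIMAX

np.random.seed(0)
# simulate a stationary ARMA(1,1) series as a stand-in for my time series
y = ArmaProcess([1, -0.6], [1, 0.3]).generate_sample(nsample=500)

# fit an ARMA(1,1): SARIMAX with order=(p, d, q) and d=0
res = SARIMAX(y, order=(1, 0, 1)).fit(disp=False)

print(res.params)      # MLE point estimates (AR, MA coefficients and sigma^2)
print(res.bse)         # their standard errors
print(res.conf_int())  # the parameter confidence intervals I'm referring to
```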
Since my parameters have confidence intervals, I expect the forecasted mean for tomorrow to have its own confidence interval as well: I'm estimating the expected value of the process with uncertain parameters, so I can't be sure the estimated mean is the true mean of the process for tomorrow, hence a confidence interval.
I don't know the formula for calculating this confidence interval; the closest I can get is the brute-force approximation sketched below. But let's move on.
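(A sketch only, continuing from the fit above: draw parameters from their estimated asymptotic distribution, re-run the forecast for each draw, and look at the spread of the resulting mean forecasts. I'm not claiming this is the proper procedure, just what I would try.)

```python
import numpy as np

h, n_draws = 10, 1000
mean_hat = np.asarray(res.params)
cov_hat = np.asarray(res.cov_params())   # asymptotic covariance of the MLE
sig_idx = res.model.param_names.index("sigma2")
rng = np.random.default_rng(1)

mean_forecasts = []
for p in rng.multivariate_normal(mean_hat, cov_hat, size=n_draws):
    if p[sig_idx] <= 0:                  # discard draws with an invalid error variance
        continue
    try:
        # re-apply the drawn parameters to the same model and forecast h steps ahead
        mean_forecasts.append(res.model.filter(p).forecast(steps=h))
    except Exception:
        continue                         # skip draws the filter cannot handle
mean_forecasts = np.array(mean_forecasts)

# spread of the estimated conditional mean due to parameter uncertainty only
ci_lower, ci_upper = np.percentile(mean_forecasts, [2.5, 97.5], axis=0)
```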
Now I want to calculate the prediction interval, which is not the same thing as the confidence interval for the mean: the prediction interval should combine the uncertainty about the mean with the variance of the error term (although, again, I don't know the exact formula). I expected statsmodels to give me the prediction interval for the forecast, but the interval it reports in the forecast summary seems to be the confidence interval for the mean.
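For reference, this is how I'm reading the interval out of statsmodels (continuing from the fit above):

```python
fc = res.get_forecast(steps=10)
print(fc.predicted_mean)              # point forecasts (the estimated conditional mean)
print(fc.conf_int(alpha=0.05))        # the interval statsmodels reports
print(fc.summary_frame(alpha=0.05))   # mean, mean_se and interval bounds together
```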
However, as this GitHub issue reports:
In SARIMAX, we have not implemented a procedure to incorporate the uncertainty associated with estimating the parameters of the model. [...] Ultimately, the intervals produced by either SARIMAX (python) or Arima (R) don't fit either of the definitions above. In some sense they are more like the "Prediction interval" term, because they do take into account the uncertainty arising from the error term (unlike the "Confidence interval" as described above). But it is not an exact match because they don't take into account parameter estimation uncertainty.
So not only is the statsmodels interval incomplete, it's also misleading (since it looks like the CI for the mean).
At this point, I would like to calculate the true prediction intervals myself.
Looking online and in some books (e.g. https://otexts.com/fpp3/prediction-intervals.html) I see that the prediction interval is calculated from the estimated standard deviation (standard error) of the forecast distribution.
Every step of the forecast (in a multi-step-ahead setting) has its own estimated standard deviation. Fine. The book cited above says that the estimated standard deviation for tomorrow (one step ahead) is the RMSE of the past residuals, adjusted by a coefficient. But as said above, shouldn't this formula also take the confidence interval for the mean into account? Moreover, since the book only takes the errors into account, why calculate the RMSE of the past residuals if the errors are i.i.d. and their variance is known (by the Gaussian assumption)?
$Var(e_{t+1})=Var(\mathsf{X}_{t+1}-\mathsf{\hat{X}}_{t+1})=Var(\epsilon_{t+1})=\sigma^2$
Why doesn't the book use the variance of the error distribution?
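To make the comparison concrete (continuing from the fit above), here is what I mean: the RMSE of the in-sample residuals versus the square root of the sigma2 that SARIMAX estimates, which I expect to be essentially the same number (I'm assuming the sigma2 parameter is the relevant error-variance estimate):

```python
import numpy as np

resid = np.asarray(res.resid)
rmse = np.sqrt(np.mean(resid ** 2))   # RMSE of the past residuals (what the book uses)

# the error variance estimated by the model itself
sigma2_hat = np.asarray(res.params)[res.model.param_names.index("sigma2")]

# apart from initialization effects on the first few residuals,
# these should essentially coincide -- so why prefer one over the other?
print(rmse, np.sqrt(sigma2_hat))
```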
The book also says:
For multi-step forecasts, a more complicated method of calculation is required. These calculations assume that the residuals are uncorrelated.
A little after that, it explains how to create prediction intervals from bootstrapped past residuals. So is there no closed formula for multi-step-ahead prediction intervals? And why is the CI of the mean still not taken into account in the bootstrapping method?
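Just to show what I mean by the bootstrap approach, here is my own rough reconstruction of the book's idea for the ARMA fitted above (continuing from the earlier snippets): resample past residuals as the future shocks and simulate many future paths, while still treating the estimated parameters as exact, which is exactly the part that bothers me. This is a sketch, not the book's actual procedure:

```python
import numpy as np

h, n_paths = 10, 2000
# pull the estimated AR/MA coefficients out of the fitted SARIMAX by parameter name
# (non-seasonal ARMA assumed, as in the fit above)
names = res.model.param_names
params = np.asarray(res.params)
ar = np.array([p for n, p in zip(names, params) if n.startswith("ar.")])
ma = np.array([p for n, p in zip(names, params) if n.startswith("ma.")])
resid = np.asarray(res.resid)[max(len(ar), len(ma)):]   # drop initialization burn-in
rng = np.random.default_rng(2)

paths = np.empty((n_paths, h))
for i in range(n_paths):
    y_hist = list(y[-len(ar):]) if len(ar) else []       # last observed values
    e_hist = list(resid[-len(ma):]) if len(ma) else []   # last in-sample errors
    for t in range(h):
        e_new = rng.choice(resid)   # bootstrap a future shock from the past residuals
        y_new = (e_new
                 + sum(phi * v for phi, v in zip(ar, reversed(y_hist)))
                 + sum(theta * v for theta, v in zip(ma, reversed(e_hist))))
        y_hist.append(y_new)
        e_hist.append(e_new)
        paths[i, t] = y_new

# percentile-based prediction interval for each forecast step
lower, upper = np.percentile(paths, [2.5, 97.5], axis=0)
```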