I am dealing with intermittent time series data, i.e. mostly zeros. Here is the particular time series that is giving me trouble:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 60.0, 0.0, 0.0, 0.0, 0.0, 36.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 96.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24.0]
This series is fit with an “auto arima” model, which basically just picks the “optimal” ARIMA orders through a backtesting procedure. In this case, ARIMA(2, 0, 2) performed best historically (using one portion of the data as the training set and another as the test set).
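The exact selection procedure isn't shown, but as a rough sketch of that kind of train/test search (the holdout size, error metric, and search ranges below are my assumptions, not the original setup), something like this picks the order with the lowest holdout MAE:

import itertools
import numpy as np
import statsmodels.api as sm

def select_arima_order(y, p_max=2, q_max=2, test_size=6):
    # Hold out the last `test_size` points and score each (p, 0, q) candidate
    # by the mean absolute error of its out-of-sample forecast.
    train, test = y[:-test_size], y[-test_size:]
    best_order, best_mae = None, np.inf
    for p, q in itertools.product(range(p_max + 1), range(q_max + 1)):
        try:
            res = sm.tsa.SARIMAX(train, order=(p, 0, q), trend='c').fit(disp=False)
            pred = res.get_forecast(steps=test_size).predicted_mean
        except Exception:
            continue  # skip orders that fail to estimate
        mae = np.mean(np.abs(pred - test))
        if mae < best_mae:
            best_order, best_mae = (p, 0, q), mae
    return best_order, best_mae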
However, when I fit all of the data with an ARIMA(2, 0, 2) model via statsmodels' sm.tsa.SARIMAX, I get a flat forecast at an absurdly large value of about 8.278163e+14. Here is the code to reproduce the issue:
import statsmodels.api as sm
import numpy as np
y = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 60.0, 0.0, 0.0, 0.0, 0.0, 36.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 96.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24.0])
mod_kwargs = {'order': (2, 0, 2), 'trend': 'c', 'seasonal_order': None}  # ARIMA(2, 0, 2) with a constant, no seasonal component
model = sm.tsa.SARIMAX(endog=y, **mod_kwargs)
res = model.fit(disp=False)  # maximum-likelihood fit
forecast = res.get_forecast(steps=12).predicted_mean
print(forecast)  # flat at roughly 8.278163e+14
with the following versions: statsmodels==0.14.0, numpy==1.26.1.
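One thing worth checking before blaming the library: with trend='c', the implied unconditional mean of the fitted model is intercept / (1 - ar.L1 - ar.L2), so a near-unit-root AR estimate can make that mean, and hence the flat long-run forecast, enormous. A small diagnostic sketch, reusing res from above (the parameter slicing assumes SARIMAX's usual ordering of intercept, AR, MA, sigma2):

print(res.summary())  # fitted intercept, ar.L1, ar.L2, ma.L1, ma.L2, sigma2

# Implied unconditional mean and AR-polynomial roots; the roots must lie
# strictly outside the unit circle for a stationary, mean-reverting forecast.
intercept, phi = res.params[0], res.params[1:3]
print('implied mean:', intercept / (1 - phi.sum()))
print('AR root moduli:', np.abs(np.roots(np.r_[-phi[::-1], 1.0])))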
Does anyone know what could cause something like this? Is it possibly a bug in the statsmodels package itself, or could an ARIMA model genuinely return a result like this in very specific circumstances?
UPDATE: I just repeated this experiment in R and the results look perfectly reasonable. This makes me think there is a bug in the statsmodels package.
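As a further cross-check on the Python side, statsmodels also exposes the same model through statsmodels.tsa.arima.model.ARIMA; comparing its forecast on the same series might help localize whether the behaviour is specific to the SARIMAX call. This is a hedged cross-check, not a known fix:

from statsmodels.tsa.arima.model import ARIMA

# Same order and constant term as the SARIMAX fit above.
alt_res = ARIMA(y, order=(2, 0, 2), trend='c').fit()
print(alt_res.get_forecast(steps=12).predicted_mean)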
forecast::auto.arima() will give you an ARIMA(0,0,0) model with a nonzero mean - i.e., the simple historical mean as a forecast. If you do force R to fit and forecast an ARIMA(2,0,2) model, it gives you a more useful decay towards the overall mean. – Stephan Kolassa Nov 29 '23 at 07:34
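(For reference, the ARIMA(0,0,0)-with-constant forecast described in the comment is just the historical mean of the series, which is 6.0 here; a minimal Python equivalent of that baseline, using the y array from above:)

# ARIMA(0, 0, 0) with a constant forecasts the sample mean at every horizon.
mean_forecast = np.full(12, y.mean())  # an array of 6.0 for this series
print(mean_forecast)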