
In short, the goal is to forecast from 51 monthly observations of a KPI for project implementations, which I aggregated by sum from 463 observations covering about 4 years of data (May 2017 to July 2021). I used pmdarima's auto_arima() in Python to determine the orders p, d, q, and it returned ARIMA(0,1,1). The resulting predictions form a straight line, as visualized below.

enter image description here

In more detail, I am using data on the KPIs (performance) of project implementations, with Month-Year serving as the index of observations over about 4 years. There are 463 observations, 12 of which are null (no one attended the event, so the value is 0). Before preprocessing, the data looks like this:

enter image description here

I use the code below to set the dates as the index, impute the 0 values, and aggregate the observations by month, which reduces the 463 observations to 51, spanning May 2017 to July 2021. I made sure the frequency of the data was set to monthly after reading a related post about straight-line forecast results here.

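import numpy as np
# Sort chronologically and use the dates as the index
df = df.sort_values(by="Date").reset_index(drop=True).set_index("Date")
# Treat 0 (no attendance) as missing so it can be imputed later
df['CommParticipants'] = df['CommParticipants'].replace(0, np.nan)
# Aggregate to monthly totals; months with no data sum to 0, so mark them missing too
dfMonthly = df.resample('M').sum()
dfMonthly['CommParticipants'] = dfMonthly['CommParticipants'].replace(0, np.nan)
# Fill the missing months by time-based linear interpolation
dfMonthlyIntp = dfMonthly.interpolate(method='time').astype('int')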

Output:

enter image description here
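As a quick sanity check on the figure of 51 monthly observations, the number of calendar months from May 2017 through July 2021 can be counted directly. This is only a sketch; the helper name `months_inclusive` is made up for illustration and is not part of pandas.

```python
from datetime import date

def months_inclusive(start: date, end: date) -> int:
    """Count calendar months from start's month through end's month, inclusive."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1

# Hypothetical helper applied to the date range stated in the question
n_months = months_inclusive(date(2017, 5, 1), date(2021, 7, 1))
print(n_months)  # 51
```

If the resampled frame has a different number of rows, some months are probably missing from the index.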

More importantly, I fitted the ARIMA(0,1,1) selected by auto_arima() both on a training set (80% of the observations) and on the entire dataset. In both cases the result is as shown in the visualization above: the predictions are a flat line. What could be the problem?
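For reference, here is a minimal plain-Python sketch of how an ARIMA(0,1,1) model without a drift (constant) term produces its forecasts. It assumes the form y[t] = y[t-1] + e[t] + theta * e[t-1]. The series and the value of `theta` are made up for illustration and are not the fitted values from my model. Because expected future shocks are zero, every step after the first repeats the same value.

```python
def arima011_forecast(y, theta, steps):
    """Forecast ARIMA(0,1,1) without drift: y[t] = y[t-1] + e[t] + theta * e[t-1]."""
    # Recover one-step-ahead residuals recursively, starting from e = 0.
    e_prev = 0.0
    for t in range(1, len(y)):
        pred = y[t - 1] + theta * e_prev
        e_prev = y[t] - pred

    forecasts = []
    last, e = y[-1], e_prev
    for _ in range(steps):
        nxt = last + theta * e
        forecasts.append(nxt)
        last = nxt
        e = 0.0  # expected value of every future shock is zero
    return forecasts

# Hypothetical monthly totals and a hypothetical MA coefficient
series = [120, 135, 128, 150, 142, 160]
fc = arima011_forecast(series, theta=-0.6, steps=5)
print(fc)  # the same value repeated five times
```

So under these assumptions a flat line is what this model class yields by construction, not necessarily a sign of a coding error. A model with a drift term would give a sloped line instead, and a repeating pattern would need seasonal terms (for example `seasonal=True, m=12` in auto_arima()).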

In case it is helpful, I also used the ADF test, the ACF plot, and the PACF plot to determine the orders p, d, q without relying on auto_arima().

The ADF test indicated that the series is stationary after first differencing, with a p-value of about 1.04e-08. As I understand it, this supports the differencing order of 1 found by auto_arima(). I ran the test with the Python code below:

from statsmodels.tsa.stattools import adfuller

def adf_test(dataset):
    dftest = adfuller(dataset, autolag='AIC')
    print("1. ADF : ", dftest[0])
    print("2. P-Value : ", dftest[1])
    print("3. Num Of Lags : ", dftest[2])
    print("4. Num Of Observations Used For ADF Regression:", dftest[3])
    print("5. Critical Values :")
    for key, val in dftest[4].items():
        print("\t", key, ": ", val)

adf_test(first_diff.dropna())
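`first_diff` is not defined in the snippet above. As a minimal sketch of what first differencing computes, here is a plain-Python version on made-up numbers. The function name `first_difference` and the sample values are illustrative only.

```python
def first_difference(values):
    """Return the change from each observation to the next."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

# Hypothetical monthly totals, not the real data
monthly = [120, 135, 128, 150]
diffs = first_difference(monthly)
print(diffs)  # [15, -7, 22]
```

The differenced series is one observation shorter than the original. With pandas, `Series.diff()` keeps the original length and puts NaN in the first position, which is why `.dropna()` is needed before passing it to the ADF test.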

The plot of the actual dataset after first differencing is as follows:

enter image description here

The PACF plot of the first-differenced series is below. From my self-study, I understood that the lags falling outside the bands in this plot suggest an order of 2 for the AR term, but auto_arima() came up with 0. I also noticed that several lags of the differenced series have negative partial autocorrelations. I am unsure what this implies, but perhaps it is why AR(2) was not selected.

enter image description here

Lastly, the ACF plot below shows one significant lag, which to me confirms the MA order of 1 chosen by auto_arima().

enter image description here
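To make "significant lag" concrete, here is a rough plain-Python sketch of the sample autocorrelation and the approximate 95% white-noise band of 1.96 / sqrt(n). The function names and the sample differences are made up for illustration, and plotting libraries may compute their bands somewhat differently.

```python
import math

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

def white_noise_band(n):
    """Approximate 95% band for the autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)

# Hypothetical differenced values, not the real data
diffs_example = [15, -7, 22, -8, 18, -12, 9]
acf = sample_acf(diffs_example, max_lag=3)
band = white_noise_band(50)  # 51 months minus one lost to differencing
print(acf, band)
```

With only 50 differenced observations the band is wide (roughly ±0.28), so a lag that pokes only slightly outside it is weak evidence on its own.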

I also tried an ARIMA(2,1,1) based on my reading of the ADF test and the PACF and ACF plots. Its predictions are plotted below, but they do not seem to be any better than those of the model suggested by auto_arima(): they are also a straight, flat line.

The ultimate goal is to forecast 10 years into the future. The data provided for our college study is minimal, but I am hoping to at least come up with an accurate forecast model. Help would be greatly appreciated!

enter image description here
