
I'm using SARIMAX to predict a cost variable (Y) from its own past history. I chose SARIMAX because I also have additional predictor variables X that help predict Y. Right now I'm working on a synthetic dataset that I generated randomly, so there is no seasonality in it.

My goal is to build a framework that I will later use for the analysis of the real data, so I'm not interested in the final results here: on randomly generated data they will come out wrong for obvious reasons.

My starting dataset has 2000 observations and 9 variables. Each row is a daily subscription to the service, so the dataframe index runs from 1 January 2021 to 31 December 2023 and can contain duplicate dates.

The test dataset on which I want to make predictions, on the other hand, covers the year 2024, so it will have 365 observations.

How do I parameterize SARIMAX during training?

Code:

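import pmdarima as pm

# `data` is the training dataframe described above.
FEATURES = ['X1', 'X2', 'X3', 'X4', 'X5',
            'Lag1', 'Lag2']
TARGET = 'Y'

# Stepwise search over seasonal ARIMA orders with exogenous regressors.
# Recent pmdarima releases name the regressor argument `X`; older ones used `exogenous`.
SARIMAX_model = pm.auto_arima(data[TARGET], X=data[FEATURES],
                              start_p=1, start_q=1,
                              test='adf',               # ADF test to choose d
                              max_p=3, max_q=3, m=366,  # m is the seasonal period
                              start_P=0, seasonal=True,
                              d=None, D=1,              # d estimated, one seasonal difference
                              trace=False,
                              error_action='ignore',
                              suppress_warnings=True,
                              stepwise=True)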

I set m=366 because I have daily data. With this setting the training is very slow.
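A long seasonal period such as m=366 is what makes the search slow. One alternative that the comments below also mention is to encode the yearly cycle with trigonometric (Fourier) terms and hand them to the model as extra regressors. The following is only a minimal sketch of how such terms could be built: the helper `fourier_terms`, its `origin` and `order` parameters, and the period of 365.25 days are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

def fourier_terms(dates, origin, period=365.25, order=2):
    """Build sin/cos columns for a seasonal cycle of `period` days (hypothetical helper)."""
    dates = pd.DatetimeIndex(dates)
    # Days elapsed since a fixed origin, so train and test share the same phase.
    t = (dates - pd.Timestamp(origin)).days.to_numpy(dtype=float)
    cols = {}
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        cols[f"sin{k}"] = np.sin(angle)
        cols[f"cos{k}"] = np.cos(angle)
    return pd.DataFrame(cols, index=dates)

# One row per calendar day for the training and forecast periods.
train_days = pd.date_range("2021-01-01", "2023-12-31", freq="D")
test_days = pd.date_range("2024-01-01", "2024-12-31", freq="D")

seasonal_train = fourier_terms(train_days, origin="2021-01-01")
seasonal_test = fourier_terms(test_days, origin="2021-01-01")
```

Under this approach the columns would be appended to the other regressors and the model fitted with seasonal=False, so no seasonal differencing at lag 366 is needed. This assumes the series has been reduced to a single observation per day first.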

In conclusion, I'm not interested in the result itself but in creating a framework that I will use for my real data, where I will check whether there is seasonality or not.

Richard Hardy
  • To clarify: you are interested in yearly seasonality, not weekly, correct? Do you also anticipate weekly seasonality? In general, SARIMA struggles with "long" seasonality; it is usually better to either use trigonometric dummies, or STL decomposition of the residuals from a regression on your predictors. That said, what is your actual objective? – Stephan Kolassa Aug 08 '23 at 08:35
  • Also: "duplicate dates" does not sound like (S)ARIMA is appropriate. ARIMA expects a single observation per time bucket. Can you give some more information of what this data is? – Stephan Kolassa Aug 08 '23 at 08:38
  • @StephanKolassa

    My goal, in this phase, is to create a framework that I will use with real data to do my real analysis.

    In general, my goal is to predict the cost of a service in the next year (e.g. the cost of this service predicted for the year 2024 for this type of person with this age range equals Y).

    To make this prediction, I want to use the past values of my Y and my predictors. In addition, I want to analyze whether these predictors actually affect the prediction of my variable (for example, whether age is important for predicting the cost of this service). I have also used XGBoost.

    – Alessandro Pio Budetti Aug 08 '23 at 08:40
  • @StephanKolassa The variables are: age, income, cost of service (the variable to be predicted) and two lag features of the cost of service for previous years. The dataframe is indexed from January 1, 2021 to December 31, 2023, while the dataset I want to test and forecast on is the year 2024, made up of 365 observations (one per day). Here one could think, for example, of calculating the cost of the service per month (e.g. I predict that the cost of the service for January 2024, for this age and this income, is equal to 300). – Alessandro Pio Budetti Aug 08 '23 at 08:45
  • (S)ARIMA is not the tool you are looking for. It (along with most other "standard" forecasting algorithms) assumes you have a single observation per time bucket. However, you have multiple observations per day, along with multiple values of each predictor, and just one of these is the date. This question looks different at first glance, but the answer is precisely the same: use standard regression or ML methods, and use a transformation of time to model time dynamics. – Stephan Kolassa Aug 08 '23 at 10:24
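The suggestion in the last comment (a standard regression with a transformation of time, fitted on rows that may share a date) can be sketched as follows. Everything here is an illustrative assumption: the synthetic data, the column names `age` and `income_k` (income in thousands), the coefficients used to generate `Y`, and the choice of a linear trend plus one yearly sin/cos pair as the time transformation. Ordinary least squares via NumPy stands in for whatever regression or ML method is eventually used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ORIGIN = pd.Timestamp("2021-01-01")

# Hypothetical synthetic training data: 2000 rows, dates may repeat.
n = 2000
train_days = pd.date_range("2021-01-01", "2023-12-31", freq="D")
train = pd.DataFrame({
    "date": train_days[rng.integers(0, len(train_days), size=n)],
    "age": rng.integers(18, 80, size=n).astype(float),
    "income_k": rng.normal(30.0, 8.0, size=n),
})
train["Y"] = (100 + 2.0 * train["age"] + 1.0 * train["income_k"]
              + rng.normal(0, 10, size=n))

def design_matrix(frame):
    """Intercept, predictors, and time encoded as trend plus a yearly cycle."""
    t = (frame["date"] - ORIGIN).dt.days.to_numpy(dtype=float)
    angle = 2.0 * np.pi * t / 365.25
    return np.column_stack([
        np.ones(len(frame)),
        frame["age"].to_numpy(dtype=float),
        frame["income_k"].to_numpy(dtype=float),
        t / 365.25,          # linear trend, in years since ORIGIN
        np.sin(angle),       # yearly seasonality
        np.cos(angle),
    ])

# Ordinary least squares on all rows; duplicate dates are not a problem here.
X_train = design_matrix(train)
coef, *_ = np.linalg.lstsq(X_train, train["Y"].to_numpy(), rcond=None)

# Daily predictions for 2024 for one hypothetical profile, then monthly averages.
future = pd.DataFrame({
    "date": pd.date_range("2024-01-01", "2024-12-31", freq="D"),
    "age": 40.0,
    "income_k": 35.0,
})
daily_pred = pd.Series(design_matrix(future) @ coef,
                       index=pd.DatetimeIndex(future["date"]))
monthly_pred = daily_pred.resample("MS").mean()
```

Because each row carries its own date and predictor values, the model does not need one observation per time bucket, and the fitted coefficients give a first indication of which predictors matter.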
