33

I am interested in model selection in a time series setting. For concreteness, suppose I want to select an ARMA model from a pool of ARMA models with different lag orders. The ultimate intent is forecasting.

Model selection can be done by

  1. cross validation,
  2. use of information criteria (AIC, BIC),

among other methods.

Rob J. Hyndman provides a way to do cross validation for time series. For relatively small samples, the sample sizes used in cross validation may differ substantially from the original sample size. For example, if the original sample size is 200 observations, then one could think of starting cross validation by taking the first 101 observations and expanding the window to 102, 103, ..., 200 observations to obtain 100 cross-validation results. Clearly, a model that is reasonably parsimonious for 200 observations may be too large for 100 observations, and thus its validation error will be large. Thus cross validation is likely to systematically favour too-parsimonious models. This is an undesirable effect of the mismatch in sample sizes.
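To make the expanding-window scheme concrete, here is a minimal sketch in Python using statsmodels; the simulated series, the candidate orders and the starting window of half the sample are illustrative assumptions, not something implied by the question.

    # Expanding-window (time series) cross-validation for ARMA order selection.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def expanding_window_cv(y, order, start):
        """Mean squared one-step-ahead forecast error over an expanding window."""
        errors = []
        for t in range(start, len(y)):
            # Fit on observations 0..t-1, then forecast observation t.
            res = ARIMA(y[:t], order=order).fit()
            forecast = res.forecast(steps=1)[0]
            errors.append((y[t] - forecast) ** 2)
        return np.mean(errors)

    # Example: pick the ARMA(p, q) with the smallest cross-validated MSE.
    y = np.random.default_rng(0).standard_normal(200)   # placeholder series
    candidates = [(p, 0, q) for p in range(3) for q in range(3)]
    scores = {od: expanding_window_cv(y, od, start=len(y) // 2) for od in candidates}
    best_order = min(scores, key=scores.get)

Note that each candidate is re-estimated about a hundred times, and the earliest fits use only about half of the data, which is exactly the sample-size mismatch described above.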

An alternative to cross validation is using information criteria for model selection. Since I care about forecasting, I would use AIC. Even though AIC is asymptotically equivalent to minimizing the out-of-sample one-step forecast MSE for time series models (according to this post by Rob J. Hyndman), I doubt this is relevant here since the sample sizes I care about are not that large...

Question: should I choose AIC over time series cross validation for small/medium samples?

A few related questions can be found here, here and here.

Richard Hardy
  • 67,272
  • 1
    I would also imagine BIC is equivalent to a "longer" forecast (m-step ahead), given its link to leave-k-out cross validation. For 200 observations, though, it probably doesn't make much difference (penalty of 5p instead of 2p). – probabilityislogic Feb 25 '15 at 13:11
  • 1
    @CagdasOzgenc, I asked Rob J. Hyndman regarding whether cross validation is likely to systematically favour too-parsimonious models in the context given in the OP and got a confirmation, so that is quite encouraging. I mean, the idea I was trying to explain in the chat seems to be valid. – Richard Hardy Feb 26 '15 at 07:10
  • There is this question on this site about AIC/BIC vs. CV: https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other. – meh Jun 15 '18 at 20:37
  • 3
    I've spent a fair amount of time trying to understand AIC. The equivalence in that statement is based on numerous approximations that amount to versions of the CLT. I personally think this makes AIC very questionable for small samples. – meh Jun 15 '18 at 20:39
  • @aginensky, thanks. I think the interesting properties of CV are also asymptotic (aren't they?), so the question of whether to choose AIC or CV is still nontrivial (though I expect the bias towards simpler models might be so big in CV that AIC would be preferred; I wonder about the variances). – Richard Hardy Jun 16 '18 at 05:30
  • Agreed. It seems that any exact formula about the statistics of empirical observations amounts to the CLT at some point. A priori, the fact that CV is empirical is clear, whereas for AIC it isn't. To me, CV seems more like the MLE, but of course that can be bad for small data sets too. – meh Jun 17 '18 at 14:04
  • There are theoretical reasons for favoring AIC or BIC, since if one starts with likelihood and information theory, then a metric based on those has well-known statistical properties. But often one is dealing with a data set that is not so large. – Analyst Jun 15 '18 at 19:41
  • I think several papers by Clifford Hurvich and others address this problem in the context of different models. If I remember well, a variant called AICc was proposed to address shortcomings of AIC in small samples (small samples are a problem not only for cross-validation). These papers are dated from 1989 onwards, in the Journal of Time Series Analysis and Biometrika, I think. – F. Tusell Nov 30 '18 at 18:46
  • @F.Tusell, thank you for your insight. If I remember correctly, AICc is just a second-order asymptotic approximation as compared to AIC's first order. So it is just a more precise version of AIC, and that applies regardless of the sample size, but that mainly becomes important when the sample size is small. Just to say that even with AICc we are not getting away from the asymptotic justification for the method. – Richard Hardy Nov 30 '18 at 20:49
  • Maybe I am missing something in this thread, but wouldn't time series cross-validation already assume that you have a model selected and that you are trying to assess the accuracy of the forecasts it produces? – Isabella Ghement Dec 08 '18 at 18:13
  • 1
    @IsabellaGhement, why should it? There is no reason to restrict ourselves to this particular use of cross validation. This is not to say that cross validation cannot be used for model assessment, of course. – Richard Hardy Dec 08 '18 at 19:45
  • @RichardHardy. In the case where you wish to ensure that test folds always chronologically follow the training set and never precede it, you can still construct training and validation folds so that they are all the same size. You won't get as much reuse out of resampling, but it can be arranged. Would that allay your concern about CV systematically favouring too-parsimonious models? Then there is also the possibility of purging and embargoing the test folds as described in Lopez de Prado's 2018 book Advances in Financial Machine Learning. What's your view on the approach taken there? – OldSchool May 08 '20 at 03:06
  • @OldSchool, thank you for the interesting ideas. Could you give a reference for the first one, or maybe even write your own answer explaining it? I have not read the book you cite, so I do not have a view on it. It will be interesting to take a look if I can find the book. – Richard Hardy May 11 '20 at 06:25

4 Answers

7

Setting theoretical considerations aside, the Akaike Information Criterion is just the likelihood penalized by the degrees of freedom. It follows that AIC accounts for the uncertainty in the data (-2LL) and assumes that more parameters lead to a higher risk of overfitting (2k). Cross-validation just looks at the test-set performance of the model, with no further assumptions.
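In symbols, for a model with $k$ estimated parameters and maximized likelihood $\hat L$,

$$\mathrm{AIC} = 2k - 2\ln\hat L,$$

where $-2\ln\hat L$ is the "-2LL" fit term above and $2k$ is the complexity penalty.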

If you care mostly about making predictions and you can assume that the test set(s) will be reasonably similar to the real-world data, you should go for cross-validation. The possible problem is that when your data set is small, splitting it leaves you with small training and test sets. Less data for training is bad, and less data for the test set makes the cross-validation results more uncertain (see Varoquaux, 2018). If your test sample is insufficient, you may be forced to use AIC, but keep in mind what it measures and what assumptions it makes.

On the other hand, as already mentioned in the comments, AIC's guarantees are asymptotic, so they need not hold for small samples. Small samples may also be misleading about the uncertainty in the data.

Tim
  • 138,066
  • Thanks for your answer! Would you have any specific comment regarding the undesirable effect of the much smaller sample size in cross validation due to the time series nature of the data? – Richard Hardy Aug 02 '19 at 17:02
  • You mention this also in your question, and I do not see it as unavoidable. One could think of fitting a state space model and performing "leave-one-out" cross validation. The omitted value each time might be "predicted" using the Kalman smoother. Would this not be a form of cross validation with a sample size nearly that of the original set? – F. Tusell Apr 21 '20 at 11:34
  • @F.Tusell, not sure if you meant to address me here, but I did not get notified as you did not include my name preceded with @. Anyway, what you are proposing is an interesting idea. – Richard Hardy Feb 06 '21 at 08:01
5

Hm - if your ultimate goal is to predict, why do you intend to do model selection at all? As far as I know, it is well established both in the "traditional" statistical literature and the machine learning literature that model averaging is superior when it comes to prediction. Put simply, model averaging means that you estimate all plausible models, let them all predict and average their predictions weighted by their relative model evidence.

A useful reference to start is https://journals.sagepub.com/doi/10.1177/0049124104268644

They explain this quite simply and refer to the relevant literature.
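As a rough illustration of the idea, here is a minimal sketch of AIC-based forecast averaging using Akaike weights (one common choice of "relative model evidence", in the spirit of Burnham & Anderson); the AIC values and one-step forecasts are placeholders for whatever candidate models you actually estimate.

    # AIC-weighted forecast combination (Akaike weights).
    import numpy as np

    aics = np.array([512.3, 509.8, 511.1])       # AIC of each fitted model (illustrative)
    forecasts = np.array([101.2, 100.7, 101.9])  # one-step forecasts from the same models

    delta = aics - aics.min()                    # AIC differences relative to the best model
    weights = np.exp(-0.5 * delta)
    weights /= weights.sum()                     # Akaike weights sum to one

    combined_forecast = float(np.dot(weights, forecasts))

Models with clearly inferior AIC receive weights close to zero, so averaging of this kind also addresses, in a soft way, the wish to discard the poorest models raised in the comments below.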

Hope this helps.

  • 2
    Thanks, this is a good idea. Even so, it may make sense to discard the poorest models from the average, and for that I need estimates of predictive ability for individual models. – Richard Hardy May 11 '20 at 06:27
  • 2
    +1. Shameless piece of self-promotion: I looked at combining exponential smoothing methods for forecasting based on the kind of AIC-weighted combinations Burnham & Anderson propose (Kolassa, 2011, IJF). – Stephan Kolassa May 10 '21 at 08:26
0

My idea is: do both and see. AIC is straightforward to use: the smaller the AIC, the better the model. But one cannot rely on AIC alone and declare such a model the best. So, if you have a pool of ARIMA models, take each one, check its forecasts against the existing values, and see which model predicts the closest to the existing time series data. Secondly, check the AIC as well and, considering both, come to a good choice. There are no hard and fast rules. Just go for the model which predicts best.

  • 1
    Thank you for your answer! I am looking for a principled way to select between the different methods of model selection. While you are right that "there are no hard and fast rules", we need clear guidelines under hypothetical ideal conditions to assist us in messy real-world situations. So while I generally agree with your standpoint, I do not find your answer particularly helpful. – Richard Hardy Jan 30 '19 at 10:19
0

Hyndman & Athanasopoulos "Forecasting: Principles and Practice" (3rd edition) suggests AIC for short time series. Section 13.7 states:

However, with short series, there is not enough data to allow some observations to be withheld for testing purposes, and even time series cross validation can be difficult to apply. The AICc is particularly useful here, because it is a proxy for the one-step forecast out-of-sample MSE. Choosing the model with the minimum AICc value allows both the number of parameters and the amount of noise to be taken into account.
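For reference, the AICc mentioned in this quote and in the comments above is AIC with a small-sample correction; in its standard form (derived under Gaussian assumptions, following Hurvich and Tsai),

$$\mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1},$$

where $k$ is the number of estimated parameters and $n$ is the sample size. The correction term vanishes as $n \to \infty$, so AICc reverts to AIC in large samples, but it can change the ranking of candidate models noticeably when $n$ is small relative to $k$.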

Richard Hardy
  • 67,272