26

What's the best way to split time series data into train/test/validation sets, where the validation set would be used for hyperparameter tuning?

We have three years' worth of daily sales data, and our plan is to use 2015-2016 as the training data, then randomly sample 10 weeks from the 2017 data for the validation set and another 10 weeks from the 2017 data for the test set. We'll then do a walk-forward on each of the days in the validation and test sets.

meraxes
  • 769

3 Answers

24

You should use a split based on time to avoid look-ahead bias: train, validation, and test, in that order through time.

The test set should be the most recent part of the data. You need to simulate the production situation, where after training a model you evaluate it on data that arrives after the model was created. The random sampling you propose for the validation and test sets is therefore not a good idea.
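For the data in the question, a purely chronological split could look like the minimal sketch below; the date boundaries, column name, and dummy values are illustrative assumptions, not something prescribed in this answer.

```python
# A minimal sketch of a purely chronological split, assuming the daily
# sales data sit in a pandas DataFrame indexed by date. The 2015-2017
# boundaries mirror the question; the column name and values are dummies.
import pandas as pd

dates = pd.date_range("2015-01-01", "2017-12-31", freq="D")
df = pd.DataFrame({"sales": range(len(dates))}, index=dates)

# Train on the oldest data, validate on the next block, test on the newest.
train = df.loc["2015-01-01":"2016-12-31"]   # model fitting
val   = df.loc["2017-01-01":"2017-06-30"]   # hyperparameter tuning
test  = df.loc["2017-07-01":"2017-12-31"]   # final, untouched evaluation

print(len(train), len(val), len(test))
```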

wind
  • 428
  • Note that there's a downside to ordering the training and validation sets strictly in time: the validation set then sits between them, so there is a gap between the training and testing sets, meaning the relationship between the training and validation sets is not the same as the relationship between the training and testing sets. An alternative is to take the testing set as the latest x samples and then randomly sample the training and validation sets from the remaining data. – Fijoy Vadakkumpadan May 03 '23 at 18:06
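A small sketch of the alternative described in the comment above, assuming the daily data sit in a pandas DataFrame; the test-set length and split ratio are arbitrary placeholders.

```python
# Keep the latest samples as the test set, then randomly split the
# remaining rows into training and validation sets. Sizes are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

dates = pd.date_range("2015-01-01", "2017-12-31", freq="D")
df = pd.DataFrame({"sales": range(len(dates))}, index=dates)

test = df.iloc[-70:]                          # latest x samples (here ~10 weeks)
rest = df.iloc[:-70]
train, val = train_test_split(rest, test_size=0.2, random_state=0)

print(len(train), len(val), len(test))
```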
17

I think the most complete way to leverage your time-series data for training/validation/testing/prediction is this:

[Figure: timeline of training (T) and validation (V) windows - cross-validation to select the best model, a refit of the selected model on all data, and a walk-forward check of the whole procedure through history]

Is the picture self-explanatory? If not, please comment and I will add more text...
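Since the image itself is not reproduced here, the following is a rough sketch of the procedure as described in the comments below: time-ordered cross-validation to pick the best model, then refitting the winner on the full history before forecasting. The candidate models, features, and data are placeholders, not part of the original answer.

```python
# Time-ordered cross-validation for model selection, then a final refit
# of the winning model on all available data. Everything here is a toy.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

X = np.arange(1000).reshape(-1, 1)           # e.g. lagged features, ordered by time
y = np.sin(X.ravel() / 50) + np.random.default_rng(0).normal(0, 0.1, 1000)

candidates = {"ridge": Ridge(), "forest": RandomForestRegressor(n_estimators=50)}
cv = TimeSeriesSplit(n_splits=5)             # expanding training window over time

scores = {}
for name, model in candidates.items():
    fold_errors = []
    for train_idx, val_idx in cv.split(X):   # validation fold is always later in time
        model.fit(X[train_idx], y[train_idx])
        fold_errors.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))
    scores[name] = np.mean(fold_errors)

best_name = min(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)  # refit the winner on all data
print(best_name, scores)
```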

elemolotiv
  • 1,278
  • 3
    Where does this picture come from? – manu190466 Feb 18 '21 at 14:42
  • I found your chart very useful, as I'm struggling with the same topic. 1. I'm not sure I understand the second step of the chart: what do the V and T squares mean? Shouldn't V always be later than T? 2. Are we saying that, after selecting the best model, we forecast tomorrow by fitting that model's parameters on the full dataset? Thanks! Gioele – Gioele Feb 18 '21 at 14:17
  • 1
    @Gioele 1. T = training samples, V = validation samples. 2. Yes, you try to reuse the historical data as much as possible: first to select the best model and then to fit the best model. – elemolotiv Feb 19 '21 at 22:30
  • 2
    @manu190466 It's my personal attempt to combine the available approaches: cross-validation to select the best model today, a backtest to fit the selected model, and a walk-forward analysis to check whether the whole idea would have worked in history, day after day until today. – elemolotiv Feb 19 '21 at 22:33
  • Hm... I think there is a bit of an issue in this graph: it assumes we have decided on the best model beforehand. If we know which model we want to use, this graph is fine and we just report its performance; but if we use this graph/schematic to compare models and then report the performance of the best one, it has an optimism issue. That is because the model performance is reported on the same data used to select the best model, so we have "overfitted" that dataset. – usεr11852 Oct 20 '22 at 15:07
  • In a "walk-forward" model, you would not split the validation set from the training set but from the testing set, if at all, since you want to predict the future, not the past. See the answer below. See the same image, but as a "walk-forward" model, tried here. – questionto42 Jan 06 '23 at 17:47
8

"walk-forward"

In the following, "validation set" has been replaced with "testing set" to align with the naming in this Q&A.

Instead of creating only one training/testing split, you could create several such splits.

The first training set could be, say, six months of data (the first half of 2015), and its testing set would then be the next three months (July-September 2015). The second training set is the combination of the first training and testing sets, and its testing set is the next three months (October-December 2015). And so on.
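A minimal sketch of this expanding-window walk-forward, assuming daily data in a pandas DataFrame; the six-month initial window and three-month steps mirror the example above, while the lag feature and linear model are placeholders.

```python
# Expanding training window followed by a fixed-length testing window
# that rolls forward through time. Data, feature, and model are toys.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

dates = pd.date_range("2015-01-01", "2017-12-31", freq="D")
df = pd.DataFrame({"sales": range(len(dates))}, index=dates)
df["lag_7"] = df["sales"].shift(7)           # toy feature: sales one week earlier
df = df.dropna()

train_end = pd.Timestamp("2015-06-30")       # first training window: Jan-Jun 2015
step = pd.DateOffset(months=3)               # each testing window: the next 3 months
errors = []

while train_end + step <= df.index[-1]:
    train = df.loc[:train_end]
    test = df.loc[train_end + pd.Timedelta(days=1): train_end + step]
    model = LinearRegression().fit(train[["lag_7"]], train["sales"])
    preds = model.predict(test[["lag_7"]])
    errors.append(mean_absolute_error(test["sales"], preds))
    train_end = train_end + step             # the old testing window joins the training set

print(errors)                                # one score per walk-forward step
```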

This walk over time resembles k-fold cross-validation, where each new training set is the previous training and validation sets put together. The difference is that here the growth happens by walking forward through time to check more than one prediction, and it is the training and testing sets that get merged, not the training and validation sets. If you walk through time with more than one prediction, you get validation by default: the metrics of those predictions can be compared against each other.

This is the walk-forward model, see a comment below. The model image mixes up the testing set with the validation set. Normally this naming issue does not matter, but here it does, since it conflicts with the naming used in the rest of this Q&A. If you then carve an additional k-fold validation out of the testing set, you have the three sets that the question asks for. And yes, you do not need that validation set if you do a "walk-forward" with enough steps.

Thus, even though this model only needs training and testing sets, it can still answer the question that asks for three sets, since the validation set can be seen as being replaced by the walk-forward itself. It also still allows a small k-fold validation split off from the testing set, so you could see it as 3+1 sets in the end.

questionto42
  • 314