
Regarding the question of whether K-fold cross-validation is applicable to time series data, I've read some answers on Stack Exchange as well as blog posts from other sources, and they state that we cannot apply K-fold cross-validation to time series modeling (e.g. Don't Use K-fold Validation for Time Series Forecasting, Using k-fold cross-validation for time-series model selection).

But in the Tabular Playground Series - Jul 2021 competition on Kaggle, I found that some senior participants do apply this approach (e.g. stacked model, TPS-Jul-XGBoost Regressor optimized with Hyperopt, etc.), and some even set KFold(n_splits=self.n_folds, shuffle=True).

So I'm a little confused: is their approach justified? Thanks.

References:

Time series cross validation

ah bon
  • 143

1 Answer


These posts refer specifically to time series forecasting - building predictive models that exploit the trends and cycles in the historical data. Applying K-fold cross-validation there risks (a) breaking the temporal relationships by introducing gaps in the time series and (b) building models where future information leaks into the predictions.
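As a rough illustration (my own toy sketch, not taken from the posts you cite), sklearn's KFold and TimeSeriesSplit show the difference on a series of ten time steps: shuffled K-fold mixes later steps into the training fold for earlier test steps, while TimeSeriesSplit keeps every training fold strictly before its test fold.

```python
# Minimal sketch: contrasting shuffled K-fold with an ordered,
# forecasting-style split on a toy time series of 10 steps.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # hypothetical consecutive time steps

# Shuffled K-fold: training folds freely mix later time steps with earlier
# ones, so a forecasting model would effectively "see the future".
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("KFold           train:", train_idx, "test:", test_idx)

# TimeSeriesSplit: each training fold ends strictly before its test fold,
# preserving the temporal order a forecasting model relies on.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("TimeSeriesSplit train:", train_idx, "test:", test_idx)
```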

Neither of the two Kaggle notebooks you reference is doing time series forecasting, even though the data could be considered a (multivariate) time series. They treat each time step as an independent instance and ignore any time dependency between the instances - in other words, as regular tabular data - so there's no problem using K-fold cross-validation in this case.
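For that tabular framing, the usual shuffled K-fold recipe is fine. A minimal sketch (with a made-up regression dataset standing in for the competition data, and a plain Ridge model rather than the stacked/XGBoost pipelines in those notebooks):

```python
# Minimal sketch: when each row is treated as an independent tabular
# instance, an ordinary shuffled K-fold CV estimate is valid.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in for the competition's features/target.
X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
```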

Lynn
  • 1,707
  • Thank you for answering my question. From your point of view: If I want to build a model to predict the Nasdaq index or stock prices, could I use k-fold cross-validation? – ah bon Nov 01 '22 at 01:12
  • @ahbon - I know very little about forecasting models, so I can't really help you with that question. – Lynn Nov 01 '22 at 11:47