Should I randomly shuffle train and test datasets?

Question

Usually we randomly shuffle train and test datasets for a machine learning problem. However, some sources say that for financial problems we should split data into train and test in chronological order without any random shuffling.

http://stats.stackexchange.com/questions/10162/how-to-apply-neural-network-to-time-series-forecasting — amdopt, Apr 05 '17 at 12:56

score 4 · Accepted Answer · answered Apr 10 '17 at 18:29

Here's the kind of problem you can run into with financial data if you select the in-sample/out-of-sample split randomly, instead of chronologically:

Suppose you're building a model to predict stock returns, and you have data on the daily returns for every S&P500 stock over several years.

Suppose stock returns are reasonably modelled by a one-factor model (so a stock's return is pretty close to its beta times the S&P500 return). This is a simplification, but it's accurate enough for this example.
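
Written out, the one-factor assumption looks roughly like this. The symbols are my own shorthand rather than anything standard to this thread: $r_{i,t}$ is the return of stock $i$ on day $t$, $\beta_i$ is its beta, $r_{m,t}$ is the S&P500 return on day $t$, and $\varepsilon_{i,t}$ is a small stock-specific residual.

$$r_{i,t} = \beta_i \, r_{m,t} + \varepsilon_{i,t}$$

Averaging over the days in a sample, and assuming the residual averages out to roughly zero, a stock's mean return is about $\beta_i$ times the mean market return over those days. So the sign of the relationship between beta and return is set by whether the market rose or fell in that sample.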

Now split the data randomly, and train your favorite model to predict next-day returns as a function of beta, plus whatever else you think is relevant. If the S&P500 index has gone up over your sample period, then the model will favor the highest beta stocks. If the S&P500 went down, then your model will favor the lowest beta stocks. This will look great "out-of-sample", but only because you've implicitly told your model how the S&P500 performed over the entire sample period, so it already knows the right answer for your out-of-sample data.

The problem goes away if you split chronologically.
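
To make this concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the data are synthetic, the market path is contrived (up 0.1% a day for the first 70% of days, down 0.1% a day afterwards), and the "model" is just a least-squares line of return on beta. The helper name `beta_slope` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_stocks, n_days = 50, 500
cut = n_days * 7 // 10  # first 70% of days form the chronological training period

# Hypothetical one-factor world: return = beta * market return + small noise.
beta = rng.uniform(0.5, 1.5, size=n_stocks)
# Contrived market path: up 0.1% a day before the cut, down 0.1% a day after it.
market = np.where(np.arange(n_days) < cut, 0.001, -0.001)
noise = rng.normal(0.0, 0.001, size=(n_days, n_stocks))
returns = market[:, None] * beta[None, :] + noise

# Flatten the panel to one row per (day, stock).
day = np.repeat(np.arange(n_days), n_stocks)
x = np.tile(beta, n_days)
y = returns.ravel()


def beta_slope(mask):
    """Slope of a least-squares line of return on beta over the selected rows."""
    slope, _intercept = np.polyfit(x[mask], y[mask], 1)
    return slope


# Random split: test rows are drawn from the same days as the training rows.
is_test_random = rng.random(x.size) < 0.3
random_train = beta_slope(~is_test_random)
random_test = beta_slope(is_test_random)

# Chronological split: every test row is later than every training row.
is_test_chrono = day >= cut
chrono_train = beta_slope(~is_test_chrono)
chrono_test = beta_slope(is_test_chrono)

print(f"random split: train slope {random_train:+.5f}, test slope {random_test:+.5f}")
print(f"chrono split: train slope {chrono_train:+.5f}, test slope {chrono_test:+.5f}")
```

With the random split, the train and test slopes should both come out positive, because both sets sample the same days and so share the same average market return. That makes "favor high beta" look validated. With the chronological split, the training slope should still be positive but the test slope should flip negative, which is the honest result: the model only learned how the market behaved during the training period.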
