
According to both Wikipedia and this blog post, k-fold cross validation seems to require that you shuffle the data. I have two questions:

Firstly, why? Suppose you have a time sequence of training examples where, on average, an individual example is more similar to examples that are temporally close to it than to examples that are temporally far away (for instance, a sequence of video frames, if that's what your input data consists of). Doesn't it make sense to omit shuffling the data before splitting it up? If you do shuffle, then for each validation example you are almost certain to have some training example with a high degree of similarity to it (which wouldn't be the case if you didn't shuffle), and that kind of defeats the purpose of doing validation in the first place, since you want to see how your model would perform on new data, which would most likely be significantly less similar to anything you had trained it on.

Secondly, if you don't shuffle the dataset before splitting it, can you no longer call the validation method k-fold cross validation? And if so, what do you call it instead?


2 Answers


Yes and no.

  1. If you don't shuffle the data, it's no longer $k$-fold cross-validation. Using $k$-fold CV is really only justifiable if we can reasonably treat the dataset as a random sample from the population that you want to generalize to. Then we can use $k$-fold CV to estimate the average performance of models like yours across new random training sets. See Bates et al. (2023), "Cross-Validation: What Does It Estimate and How Well Does It Do It?":

...it estimates the average [performance] of models fit on other unseen training sets [of size $n \times (k-1)/k$] drawn from the same population

If you don't shuffle, CV's training sets aren't random, and your CV performance estimates might be biased by however the dataset's rows were ordered. So I don't think there is a separate name for "$k$-fold CV but without shuffling," because it isn't really a useful technique.

  2. If your data are meaningfully ordered by time, not a random sample, then you should use one of the CV variants specifically designed for time series data. See for example the answers to this post, which also explain why you should not just do the "$k$-fold CV but without shuffling" proposed in your question. A better simple approach, mentioned in those answers, is what's sometimes called "rolling-origin-calibration evaluation": see for example p. 8 of Yi et al. (2018) or this section of Rob Hyndman's book Forecasting: Principles and Practice, 2nd ed. (A short code sketch contrasting this with ordinary shuffled $k$-fold CV follows this list.)

(On the Wikipedia page you cited earlier, Wikipedia refers to "rolling cross validation," citing a paywalled article that distinguishes "rolling-origin-calibration" CV, "rolling-window" CV, and a few other variants. Although that article is paywalled, the Yi et al. (2018) article on arXiv is freely available and illustrates these variants.)

These are all still variants of CV; they are just not the same as the default version of $k$-fold CV.

  3. If your data were sampled in some other structured way -- not a simple random sample but also not a time series -- then you should modify CV to ensure that your training sets mimic the sampling design of interest. Examples include grouped data, complex survey designs, spatial sampling, etc. See section 2 of this answer to another post for details and examples.
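To make the contrast between points 1 and 2 concrete, here is a minimal sketch using scikit-learn (the toy array and the variable names are my own, not taken from the references above): KFold with shuffling treats the rows as exchangeable, whereas TimeSeriesSplit only ever validates on observations that come after the training data, in the spirit of rolling-origin evaluation.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 observations in temporal order

# Default k-fold CV: shuffling makes each fold approximately a random sample.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("shuffled k-fold test fold:", test_idx)

# Rolling-origin style split: each test fold lies strictly after its training set,
# so the model is only ever evaluated on "future" observations.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("train through index", train_idx[-1], "-> test fold:", test_idx)
```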
civilstat
  • Thanks for your answer. It makes sense that "k-fold CV without shuffling" can be problematic, because every fold but the last one will be validated with a model that has been trained on both prior and posterior data, while the resulting network that you will eventually use on some real data will only have been trained on data prior to the data you will use it on. – HelloGoodbye Oct 20 '23 at 13:48
  • For that reason, I guess it makes sense to train each fold model only on the data that is prior to its associated validation data, as described in the question you linked to and referred to as forward chaining. Is this what Wikipedia calls "rolling cross validation"? It doesn't describe how the technique works, and the only reference is behind a paywall. – HelloGoodbye Oct 20 '23 at 13:50
  • Also, what do you mean by "meaningfully ordered by time, not a random sample"? Note that even though my data is technically a time series, each time point constitutes one training example, so I don't treat my data as time series data when feeding it to the network, but because it is still a time series, each example still bears greater resemblance with other examples that are temporally close than with data points that are temporally farther away. – HelloGoodbye Oct 20 '23 at 14:16
  • By "meaningfully ordered by time" I had time series in mind. It sounds like your data is time series too, so I'm not sure what you mean by "I don't treat my data as time series data" -- do you just mean that your predictive model doesn't account for the time series nature of the data? – civilstat Oct 20 '23 at 19:00
  • I'll add a note about "rolling cross validation" to the answer! – civilstat Oct 20 '23 at 19:01
  • Yes, I mean that my predictive model doesn't account for the time series nature of the data. That is, each training example just consists of a single time point. – HelloGoodbye Oct 20 '23 at 19:40
  • Could you please link to the Yi et al. (2018) article on arXiv? I'm not sure which article you're referring to. – HelloGoodbye Oct 20 '23 at 19:45
  • Two observations: first, if time is important, it's better to model time nonlinearly and not split on time. Second, if N is not huge, a single 10-fold CV may be too noisy, and 100 repeats of 10-fold CV may be necessary. For each repeat you select random tenths afresh. – Frank Harrell Oct 20 '23 at 20:09
  • @HelloGoodbye The Yi article is linked in my answer, right before the link to Hyndman's book. Here is the link again, in case it is not working for you: https://arxiv.org/pdf/1812.07699.pdf – civilstat Oct 20 '23 at 23:34
  • Ah, I didn't see that you had added it; thank you! – HelloGoodbye Oct 21 '23 at 23:26
  • Considering that we did use it, and I’m not sure whether we will do a rolling cross validation instead because we have already analyzed our results rather extensively, what can you call the method? Is it a form of cross validation? (Of course we will describe exactly what we did.) – HelloGoodbye Oct 23 '23 at 07:47
  • See @cbeleites's answer -- if I understand correctly, you could call it block-wise cross validation. However, if the analysis you already did wasn't appropriate and rolling CV would be better, I strongly encourage you to do the better analysis! – civilstat Oct 23 '23 at 15:25

@civilstat's answer already explains why you should use dedicated splitting techniques for time series data.

This answer is for data that we do not think of as time series, i.e., data that may nevertheless have an unwanted correlation with time, such as detector aging/drift.

In chemometrics, variants of cross validation which do not shuffle are known (but AFAIK rarely used), e.g.

  • venetian blinds cross validation: assigning case $i$ to fold $(i \bmod k) + 1$

    The idea here is basically a stratified cross validation. As with any other stratification, it is fine to do this for fixed factors/outcome variables, but doing it for a random factor (which measurement order typically is) would make the error estimate optimistically biased.

  • block-wise cross validation: using contiguous blocks for the folds

    We do have a general recommendation to shuffle (randomize) the order in which measurements are done, so one may argue that this measurement shuffling can also serve for the cross validation. (Both fold assignments are sketched in code after this list.)
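A minimal sketch of both fold assignments in plain NumPy (the helper names are my own; the fold labels here are 0-indexed, unlike the 1-indexed $(i \bmod k) + 1$ above):

```python
import numpy as np

def venetian_blinds_folds(n, k):
    # case i goes to fold i mod k, so the folds interleave through the measurement sequence
    return np.arange(n) % k

def blockwise_folds(n, k):
    # contiguous blocks of roughly n/k consecutive cases form the folds
    return np.repeat(np.arange(k), int(np.ceil(n / k)))[:n]

n, k = 12, 3
print(venetian_blinds_folds(n, k))  # [0 1 2 0 1 2 0 1 2 0 1 2]
print(blockwise_folds(n, k))        # [0 0 0 0 1 1 1 1 2 2 2 2]
```

Either label vector can then be passed to, e.g., scikit-learn's PredefinedSplit to drive the actual cross validation loop.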

However: the reason behind the recommendation to shuffle measurement order is that we typically need to think about detector drift, i.e. slow systematic changes of the measurements over time. In addition, we often need to consider and check for the possibility of contamination leading to neighboring measurements being more similar to each other. Both can be detected with an appropriate design for the calibration samples, typically by randomization of the measurement order.

So, if the measurements didn't follow an appropriate design, even the best randomization for the cross validation cannot break a correlation that is already present in the data, and the error estimate will be optimistically biased.

Also, I recommend still randomizing the CV:

  • so we can properly check the CV predictions for drift effects
  • IMHO k-fold CV should be repeated in order to check stability. For this we need different splits anyway (a short sketch follows).
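As a concrete illustration of such repeated, re-randomized k-fold CV, here is a minimal sketch (assuming scikit-learn; the synthetic dataset and the ridge model are placeholders, not recommendations):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder data and model; in practice use your own.
X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

# 20 repeats of 10-fold CV, each repeat drawing a fresh random split.
rkf = RepeatedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=rkf)
print(f"mean R^2 = {scores.mean():.3f}, spread across splits = {scores.std():.3f}")
```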

update

Does that [randomization] really make sense if it means that for each fold, I will have training examples that look almost identical to each of the validation examples?

I'd say it's the other way round: if randomization causes overly similar cases to end up in the training and test subsets, something is wrong with your CV set-up at a higher level. And then, not randomizing (shuffling) the order is not an appropriate solution either, since you will still have neighboring examples ending up in the train and test subsets. The resulting error may be smaller, but appropriate solutions exist, so use them.

Remember: the basic requirement is to split so that train and test subsets are statistically independent. From a stats point of view, this independence must be satisfied for all random factors in your modeling.

For data with repeated/multiple measurements of the same physical sample or the same patient, this may be done by splitting into training vs testing patients (rather than measurements). For your example case of image sequences, it may mean excluding chunks of data at the boundary between training and testing.
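Both ideas can be implemented with standard tooling. A minimal sketch (assuming scikit-learn; the patient labels, the number of frames, and the gap width are purely illustrative, and the gap argument of TimeSeriesSplit requires a reasonably recent scikit-learn): a group-wise split keeps all measurements of a patient together, and a gap between the training and test blocks keeps near-duplicate neighbors from straddling the split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# (a) Repeated measurements per patient: split by patient, not by measurement.
X = np.arange(12).reshape(-1, 1)                   # 12 measurements
patients = np.repeat(["p1", "p2", "p3", "p4"], 3)  # 3 measurements per patient
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, groups=patients):
    print("held-out measurements (one patient):", test_idx)

# (b) Ordered frames: leave a gap between the training block and the test block
# so that temporally neighboring (near-identical) frames never straddle the split.
frames = np.arange(100).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=4, gap=5).split(frames):
    print("train ends at frame", train_idx[-1], "; test starts at frame", test_idx[0])
```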

OTOH, if the similarity is due to a fixed factor (an influencing factor that you want the model to use to make better predictions), you don't need statistical independence and may even stratify your train and test sets. E.g., if I know detector temperature to have a systematic effect on the measurements from which I predict some analyte concentration, I may split so that the folds have approximately equal coverage of the concentration as well as the temperature ranges.
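A minimal sketch of such a stratified split (assuming scikit-learn; the simulated temperature values and the quartile binning are my own illustrative choices): bin the fixed factor and stratify the folds on the bin labels, so that every fold covers the full range.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 60
temperature = rng.uniform(20, 40, size=n)                     # fixed influencing factor
X = np.column_stack([temperature, rng.normal(size=(n, 5))])   # measurements

# Bin the covariate (here: quartiles) and stratify on the bin labels.
temp_bins = np.digitize(temperature, np.quantile(temperature, [0.25, 0.5, 0.75]))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, temp_bins):
    print("test-fold temperature range:",
          round(temperature[test_idx].min(), 1), "to", round(temperature[test_idx].max(), 1))
```

To also balance the concentration range, one could bin the concentration as well and stratify on the combination of the two bin labels.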

Whether such influencing factors are present and whether they are random or fixed is very application specific, so you'll need to decide this as part of your modeling and then set up your CV splitting accordingly.

  • By randomizing the CV, do you mean shuffling the data before splitting it up? Does that really make sense if it means that for each fold, I will have training examples that look almost identical to each of the validation examples? What is validation good for then? – HelloGoodbye Oct 23 '23 at 11:28
  • Yes, I use randomization where you use shuffle: the important point is that the original order is replaced by a random (as opposed to arbitrary) one. It is typically our best approximation to what a new case coming in for prediction looks like. – cbeleites unhappy with SX Oct 24 '23 at 11:22
  • 1/2: I disagree that randomizing time series data would give the best approximation for new cases. Like I said, if you for example randomize the frames in video data (if that's your data and each frame is an individual training example), for every validation example, you will almost be guaranteed that you have at least one training example that looks very much like that validation example, significantly increasing the probability that the network will perform well on that validation example. – HelloGoodbye Oct 24 '23 at 14:50
  • 2/2: This will not be the case when you then go out and use your trained network on new data from recordings that you don't have in your dataset. So the difference between your training data and your validation data will be much smaller than the difference between your training data and new data. In that case, validation will tell you much more accurately how your network performs on new data if you don't randomize your data, because then the difference between your training data and your validation data will be much more similar to the difference between your training data and new data. – HelloGoodbye Oct 24 '23 at 17:46
  • Randomization and proper splitting to get independent train/test subsets are not tied together, not even in the case of your time series example. As I wrote, you need to exclude sufficiently long chunks/sequences of the data from both testing and training in order to get statistically independent subsets (which means standard CV is never suitable for your data). But that is true whether you randomize or split blockwise. In other words, once you implement proper splitting, there is no reason not to draw suitably spaced points in time randomly into training or testing. – cbeleites unhappy with SX Nov 01 '23 at 08:50
  • So, IMHO splitting blockwise may be less wrong than randomization without accounting for the temporal factor, but since correct splitting is possible, there's no need here to go for a less wrong technique. And by the way, you need to determine the details of the temporal correlation in your data anyway, because you have to account for this correlation also in the interpretation of your test results: e.g. the effective sample size is (much) lower than the number of images, and correctly recognized consecutive images carry less information about generalization than correct predictions that are spaced far apart. – cbeleites unhappy with SX Nov 01 '23 at 09:00