Is data leakage from time series autocorrelation actual data leakage?

Question

That's the question: Is data leakage from time series autocorrelation actual data leakage?

To explain it with an example (I will break it into numbered points to give more structure to my questions):

Suppose we have daily training data from 2012 through 2016 and test data from 2017, and we see that the autocorrelation of the features (including the dependent variable) persists for up to A days.

1. Wouldn't that amount to data leakage? The last A days of the training data carry information about the test data, and that is the definition of data leakage, right?
2. However, if we are on 31st Dec 2018 and want to predict the following A days of 2019 with all our X features available up to 31st Dec, wouldn't it be perfectly fine to have that data leakage, since the information would be available when the model is used in production?
3. Data leakage is a big problem because, after we train and test a model that suffers from data leakage, its performance collapses once we put it in production. However, if the leaked information is available at the time the model goes into production and we can use it (as stated in (2)), why would we want to remove it rather than use it?

As additional context, these questions came up after reading Chapter 7, "Cross-Validation in Finance", of "Advances in Financial Machine Learning" by Marcos Lopez de Prado, where he proposes removing that window of A days (he calls this "purging"). Here is a plot of his proposal:
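To make the purging idea concrete, here is a minimal sketch that follows only the description in this question: drop the last A days of training data before the test period starts. The helper names `daily_dates` and `purged_split` and the value `A = 10` are assumptions made for illustration; this is not the book's implementation.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Return every calendar day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]


def purged_split(dates, test_start, purge_days):
    """Chronological split that drops the last `purge_days` training days.

    Hypothetical helper: training keeps only days strictly before
    (test_start - purge_days); the test set starts at test_start.
    """
    cutoff = test_start - timedelta(days=purge_days)
    train = [d for d in dates if d < cutoff]
    test = [d for d in dates if d >= test_start]
    return train, test


A = 10  # assumed autocorrelation horizon in days
dates = daily_dates(date(2012, 1, 1), date(2017, 12, 31))
train, test = purged_split(dates, date(2017, 1, 1), A)

print("last training day:", train[-1])
print("first test day:", test[0])
print("train size:", len(train), "test size:", len(test))
```

Setting `purge_days=0` gives the plain chronological split, so the two variants can be compared on the same data.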

And here you can find similar topic-related questions I have found:

Tim · Answer 1 · 2022-09-02T11:47:06.160

Yes, it can be data leakage. Below you can find a trivial example. Imagine that you have the data below. The variable on the $y$ axis rises exponentially with $x$, but we decide to use a linear model to model it. Let's split the data into three buckets (colored on the plot). You will get different results if you take the points from the first two buckets as the training set and the last bucket as the test set (red line) than if you pick the first and third buckets as the training set and the middle bucket as the test set (blue line). The second split obviously leaks data from the future.

To prevent this, we usually split time-series data in time, so that the past data is used for training and the future data is used for testing.
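The three-bucket example can be reproduced with a small sketch. The growth rate 0.1, the 30 points, and the bucket size of 10 are assumptions chosen for illustration, not values taken from the original plot, and `fit_line` and `mse` are hypothetical helper names.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed data: exponential growth observed at 30 time steps.
xs = list(range(30))
ys = [math.exp(0.1 * x) for x in xs]

# Chronological split: train on buckets 1 and 2, test on bucket 3.
slope_future, icpt_future = fit_line(xs[:20], ys[:20])
mse_future = mse(xs[20:], ys[20:], slope_future, icpt_future)

# Leaky split: train on buckets 1 and 3, test on the middle bucket.
slope_middle, icpt_middle = fit_line(xs[:10] + xs[20:], ys[:10] + ys[20:])
mse_middle = mse(xs[10:20], ys[10:20], slope_middle, icpt_middle)

print("test MSE, chronological split:", mse_future)
print("test MSE, leaky split:", mse_middle)
```

Under these assumptions the leaky split reports a much smaller test error than the chronological one, because the model has already seen the region that lies beyond the test bucket.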

score 0 · Answer 2 · answered Sep 02 '22 at 12:13

The answer depends on how you obtain the autocorrelation estimates.

Case 1: Autocorrelation estimated from only training data. In this case, there cannot be any data leakage because the autocorrelation function is only derived from information that is available during training. It does not matter if, e.g., the autocorrelation function "indicates" that the next 10 values will be similar to the last training data point, because this "indication" is only based on information that was available during training, and may in fact be totally wrong.

For example, the time series below has high autocorrelation and we can "leverage" the autocorrelation information of the training part to come up with various predictions. However, these predictions have nothing to do with the test data, as seen below.

It turns out that the autocorrelation function tells us which data are likely to come next, given the training data but not which data really come next. There is no information leakage.
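A minimal sketch of Case 1, assuming a simulated AR(1) series with coefficient 0.9 (the series, the 400/100 split, and the helper name `sample_acf` are all illustrative assumptions). The autocorrelation is estimated from the training part only, and the forecast is built from that estimate without ever reading the test values.

```python
import random


def sample_acf(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    return [
        sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# Assumed data-generating process: AR(1) with coefficient 0.9.
random.seed(0)
series = [0.0]
for _ in range(499):
    series.append(0.9 * series[-1] + random.gauss(0.0, 1.0))

train, test = series[:400], series[400:]

# Case 1: the autocorrelation is estimated from the training data only.
acf_train = sample_acf(train, max_lag=10)

# Simple AR(1)-style forecast using only training information:
# decay from the last training value towards the training mean.
train_mean = sum(train) / len(train)
phi_hat = acf_train[1]
forecast = [
    train_mean + (phi_hat ** h) * (train[-1] - train_mean)
    for h in range(1, len(test) + 1)
]

print("lag-1 autocorrelation (train only):", acf_train[1])
print("first forecast vs first test value:", forecast[0], test[0])
```

Passing `series` instead of `train` to `sample_acf` would be the Case 2 situation, where the estimate already depends on the test values.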

Case 2: Autocorrelation estimated from all data (or across several train/test splits). In this case there certainly is an information leak. Since the autocorrelation function is based on information derived from the test set, we can plug in the last few data points of the training set to obtain optimal estimates for the test set. For example, we could obtain the following prediction (using overfitting) which is a clear case of data leakage.
