That's the question: Is data leakage from time series autocorrelation actual data leakage?
To explain it with an example (I will separate the example in numbers to give more structure to the questions I have):
If we have daily train data from 2012 until 2016, test data 2017, and we see that autocorrelation of features (including the dependent variable) continues until A days.
Wouldn't that be incurring in data leakage? Because the last
A daysfrom train data will contain information from the test data and that is the definition of data leakage, right?However, If we are at 31th Dec 2018 and we will predict the following
A daysfrom 2019 with all our X features available until 31th Dec, wouldn't be perfect to have that data leakage because the information would be available at the time of using the model in production?Data leakage is a big problem because after we train and test a model incurring in data leakage, when we put the model in production its performance will be on the floor. However, if the data leakage information is available at the time to put the model in production and we can use it (as stated in
(2)), why would we be looking forward to removing it and not using it?
As a piece of additional information, I got all these questions after reading "Chapter 7 - Cross-validation in Finance" from "Advances in Financial Machine Learning - Marcos Lopez de Prado". Where he proposes removing the window of A days (or "purging", which is how he calls it). Here is a plot of his proposal:
And here you can find similar topic-related questions I have found:




