I have a prediction task, and I'm deciding how to sample my data and train a model without look-ahead bias.
Given a time series $Z$, my task is to build a simple predictor of size $m$ (think of a causal autoregression $AR(m)$, or anything else) that predicts the immediate next value in the time series. I'd then like to build the data matrix $\textbf{X}$ and the ground-truth output vector $\textbf{y}$, and solve the system $\textbf{X} \textbf{w} = \textbf{y}$, where $\textbf{w}$ is the vector of model parameters.
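To make the setup concrete, here is a minimal sketch of how I would build $\textbf{X}$ and $\textbf{y}$ with NumPy. The helper name `build_lagged_dataset` and the toy series are only illustrative, and I am assuming half-open windows, so row $i$ of $\textbf{X}$ is $Z[i:i+m]$ and its label is $Z[i+m]$. The system is solved in the least-squares sense, since it is generally overdetermined.

```python
import numpy as np

def build_lagged_dataset(Z, m):
    """Return (X, y) where row i of X is Z[i:i+m] and y[i] is Z[i+m]."""
    Z = np.asarray(Z, dtype=float)
    if len(Z) <= m:
        raise ValueError("need more than m observations")
    # All windows of length m; the last window has no next value, so drop it.
    X = np.lib.stride_tricks.sliding_window_view(Z, m)[:-1]
    y = Z[m:]
    return X, y

Z = np.arange(10, dtype=float)  # toy series, purely illustrative
X, y = build_lagged_dataset(Z, m=5)

# Least-squares solution of X w = y (one weight per lag).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```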
Here are my questions:
- Is it OK to have a sample (an $(x, y)$ pair) whose input ($x$-part) overlaps with the label ($y$-part) of another one? For example, let's say $m=5$ and assume one of my training samples (i.e., one of $\textbf X$'s rows) is $x_1 = Z[0:5]$ (that is, $Z[0], \dots, Z[4]$) with the corresponding label $y_1 = Z[5]$. Wouldn't it introduce look-ahead bias into my model if I have another sample like $x_2 = Z[3:8]$ with $y_2 = Z[8]$, whose input includes the true label of the first sample?
- Is this considered look-ahead bias?
- How can I avoid this to ensure my model is not informed by the labels at all?
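The overlap I am asking about can be seen by listing only the indices involved. This is a small sketch under the same assumed convention (inputs $Z[i:i+m]$, label $Z[i+m]$), with a hypothetical series length `n = 10`; it prints, for each sample, which of its input positions are also used as a label by some other sample.

```python
m = 5
n = 10  # hypothetical series length

# Each sample is (input indices, label index) into Z.
samples = [(list(range(i, i + m)), i + m) for i in range(n - m)]
label_idx = {label for _, label in samples}

for inputs, label in samples:
    # Input positions that serve as the label of another sample.
    reused = sorted(label_idx.intersection(inputs))
    print(f"inputs={inputs} label={label} inputs_that_are_labels={reused}")
```

With these settings, the first sample's inputs contain no labels, while every later sample's inputs contain the labels of the samples before it.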
As pointed out in the comments, this is a classic textbook problem and is likely discussed in many references. I would appreciate it if you could also share these texts with me and future readers.