Do I have look-ahead bias?

Question

I have a prediction task at hand, and I'm deciding on how to sample my data and train a model with no look-ahead bias.

Given a time series $Z$, my task is to build a simple predictor of size $m$ (think of a causal autoregression $AR(m)$, or anything else), that predicts the immediate next value in the time series. I'd like to then build the data matrix $\textbf{X}$ and the ground truth output vector $\textbf {y}$, and solve the system $\textbf{X} \textbf{w} = \textbf{y}$, in which $\textbf w$ are my model parameters as a vector.

Here are my questions:

Is it OK to have a sample (an $(x, y)$ pair) whose input ($x$-part) overlaps with the label ($y$-part) of another one? For example, let's say $m=5$ and assume one of my training samples (i.e., one of $\textbf X$'s rows) is $x_1 = Z[0:5]$ with the corresponding label $y_1 = Z[6]$. Wouldn't it introduce look-ahead bias into my model if I have another sample like $x_2 = Z[3:8]$ with $y_2 = Z[9]$, whose input includes the true label of the first sample?
Is it considered the look-ahead bias?
How can I avoid this to ensure my model is not informed by the labels at all?

As pointed out in the comments, this is a classic textbook problem and is likely discussed in many references. I would appreciate it if you could also share these texts with me and future readers.

It's unclear to me what you mean by "another sample". Are you simply training your model on rolling window data? That would not be a problem. Or are these predictors? Please clarify. — Stephan Kolassa, Mar 28 '23 at 06:35
I edited my question. Hope it's more clear now. @StephanKolassa — arash, Mar 28 '23 at 07:24
OK, thanks, I voted to reopen. This is not lookahead bias at all, you are completely fine. That is a textbook case of a rolling window used to fit a model. — Stephan Kolassa, Mar 28 '23 at 07:29
@StephanKolassa Thanks! It's still not fully clear to me why I'm fine. Do you have any recommendations for a textbook to look into? — arash, Mar 28 '23 at 07:40
I just looked at my usual sources but unfortunately didn't find anything explicit about this. But it's how you fit, e.g., ML models with rolling windows. Also, see Tim's answer. — Stephan Kolassa, Mar 28 '23 at 07:46

score 2 · Accepted Answer · answered Mar 28 '23 at 07:43

Is it OK to have another sample whose input ($x$-part) overlaps with the label of the previous sample $y_1$ (for example $x_2 = Z[3:8]$ with $y_2 = Z[9]$)?

This is what usually happens when you fit a time-series model that predicts the present given the past.
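To make the overlap concrete, here is a minimal sketch of how rolling-window samples are usually built. The helper name `make_windows` and the toy series are assumptions for illustration, and the sketch uses Python slicing, so the label of a window `z[t:t + m]` is `z[t + m]` (the indexing in the question may follow a different convention).

```python
def make_windows(z, m):
    """Return (X, y): each row of X holds m consecutive values,
    and the matching entry of y is the value right after them."""
    n = len(z) - m
    X = [z[t:t + m] for t in range(n)]
    y = [z[t + m] for t in range(n)]
    return X, y


m = 5
z = list(range(10, 20))  # toy series: 10, 11, ..., 19
X, y = make_windows(z, m)

# The label of row 0 appears among the inputs of row 1.
# That overlap is expected with a rolling window.
label_reused_as_input = y[0] in X[1]

# What matters for look-ahead: within each row,
# every input is older than the label.
for t in range(len(X)):
    latest_input_time = t + m - 1
    label_time = t + m
    assert latest_input_time < label_time
```

Each row only ever sees values that precede its own label, which is why reusing an earlier label as a later input does not leak future information.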

Is it considered the look-ahead bias?

Not really. It would be if you had a model that used $Z[4:9]$ to predict $Z[3]$. In that case you could not use the model to make a forecast, because doing so would require knowing the future. But here you predict the future from the past, so you don't look ahead.

Another example of this kind of bias would be a feature like "weekly average": to make a prediction on Monday, you would use the average for the whole week, which is already known in the training data. That could give overly optimistic training-time metrics, while the model would not be usable for forecasting because you would not know the weekly average for the future. The same applies to any other feature calculated using data "from the future", but again, it does not apply to using only historical data to make a prediction.
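The difference between a leaky feature and a safe one can be sketched as follows. Both function names, the 7-step "week", and the toy data are assumptions made for illustration only.

```python
def leaky_weekly_mean(z, t, week=7):
    """Mean of the whole week containing index t.
    Includes values after t, so it looks ahead."""
    start = (t // week) * week
    values = z[start:start + week]
    return sum(values) / len(values)


def trailing_mean(z, t, week=7):
    """Mean of up to `week` values strictly before index t.
    Uses history only."""
    past = z[max(0, t - week):t]
    return sum(past) / len(past) if past else None


z = list(range(14))  # two toy "weeks" of daily values
t = 7                # first day of the second week

# Change a value that lies in the future relative to t.
z_changed = list(z)
z_changed[13] = 1000

# The leaky feature reacts to the future value; the trailing one does not.
leaky_depends_on_future = leaky_weekly_mean(z, t) != leaky_weekly_mean(z_changed, t)
trailing_depends_on_future = trailing_mean(z, t) != trailing_mean(z_changed, t)
```

A quick check like this (perturb a future value, see whether the feature at time `t` changes) is one way to test whether a feature could be computed at forecast time.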

How can I avoid this to ensure my model is not informed by the labels at all?

Say that you have ten years of data, from 2012 to 2022. If you trained on 2012-2020 to predict 2021-2022, but then used that model to make a forecast for 2023, you would not be training on the most up-to-date data, which is likely the most relevant for predicting the near future. Time-series models try to use all the data as efficiently as possible to avoid problems like this.
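One common pattern that is consistent with this, sketched here under assumptions, is to hold out the latest data only to estimate the error and then refit on everything before forecasting. The AR(1) model without intercept, the helper `fit_ar1`, and the toy series are illustrative choices, not something prescribed above.

```python
def fit_ar1(z):
    """Least-squares slope w for z[t] ~ w * z[t-1] (no intercept)."""
    x, y = z[:-1], z[1:]
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


z = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]  # toy series
split = 6  # hypothetical cut between "earlier" and "latest" data

# Step 1: estimate out-of-sample error by fitting on the earlier part only.
w_eval = fit_ar1(z[:split])
holdout_errors = [abs(z[t] - w_eval * z[t - 1]) for t in range(split, len(z))]

# Step 2: refit on all data so the forecast uses the most recent observations.
w_final = fit_ar1(z)
forecast = w_final * z[-1]
```

The held-out fit is used only to judge accuracy; the model that produces the actual forecast is trained on the full series, including the most recent values.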

If you want to train a model that does something like "predict tomorrow given today", you can't avoid this overlap, and there is also no reason why you would need to avoid it.
