Why can I not use k-fold cross-validation with an AR process?
Suppose I have the following random process:
set.seed(1)  # fix the seed so the example is reproducible
X <- rep(1, 100)
for (t in 3:100)
  X[t] <- rnorm(1, 0.4 * X[t - 1] + 0.6 * X[t - 2])  # innovations have sd = 1
d <- data.frame(X = X[3:100], X.1 = X[2:99], X.2 = X[1:98])
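As an aside, the same lagged design matrix can be built with base R's `embed()`; a minimal sketch (the `set.seed(1)` call is an assumption, just for reproducibility):

```r
set.seed(1)
X <- rep(1, 100)
for (t in 3:100)
  X[t] <- rnorm(1, 0.4 * X[t - 1] + 0.6 * X[t - 2])

# embed(X, 3) returns rows (X[t], X[t-1], X[t-2]) for t = 3..100
E <- embed(X, 3)
d2 <- data.frame(X = E[, 1], X.1 = E[, 2], X.2 = E[, 3])

# identical to the hand-rolled construction
d <- data.frame(X = X[3:100], X.1 = X[2:99], X.2 = X[1:98])
all.equal(d, d2)  # TRUE
```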
# I now have the following AR(1) and AR(2) models, estimated by OLS:
m1 <- function(x) lm(X ~ X.1, x)
m2 <- function(x) lm(X ~ X.1 + X.2, x)
par(mfrow = c(1, 1))
plot(X)
lines(3:100, predict(m1(d)), col = 'red')
lines(3:100, predict(m2(d)), col = 'green')
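For what it's worth, the hand-rolled OLS fit can be cross-checked against base R's `ar.ols()`, which also estimates AR coefficients by least squares, so the slopes should be close; a sketch (the seed is an assumption):

```r
set.seed(1)
X <- rep(1, 100)
for (t in 3:100)
  X[t] <- rnorm(1, 0.4 * X[t - 1] + 0.6 * X[t - 2])
d <- data.frame(X = X[3:100], X.1 = X[2:99], X.2 = X[1:98])

# AR(2) by explicit OLS on the lagged design matrix
fit.lm <- lm(X ~ X.1 + X.2, d)

# the same autoregression via ar.ols (order fixed at 2, no AIC search)
fit.ar <- ar.ols(X, aic = FALSE, order.max = 2)

coef(fit.lm)[c("X.1", "X.2")]  # OLS slopes
fit.ar$ar                      # ar.ols coefficients
```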
# Let us find the better model by cross-validating with both k-fold and a growing window:
par(mfrow=c(1,2))
# k-fold
d. <- d[sample(nrow(d)), ]  # shuffle the rows
k <- 10
e <- NULL
for (i in seq(1, nrow(d.) - k + 1, by = k)) {
  ts <- d.[i:(i + k - 1), ]    # test fold
  tr <- d.[-(i:(i + k - 1)), ] # remaining rows for training
  e1 <- sum((predict(m1(tr), ts) - ts$X)^2)
  e2 <- sum((predict(m2(tr), ts) - ts$X)^2)
  e <- rbind(e, data.frame(e1, e2))
}
boxplot(e$e1,e$e2,ylim=c(0,30))
# growing window
step <- 10
e <- NULL
for (t in seq(step, nrow(d) - step, by = step)) {
  tr <- d[1:t, ]                 # everything up to "time" t
  ts <- d[(t + 1):(t + step), ]  # the next window as test set
  e1 <- sum((predict(m1(tr), ts) - ts$X)^2)
  e2 <- sum((predict(m2(tr), ts) - ts$X)^2)
  e <- rbind(e, data.frame(e1, e2))
}
boxplot(e$e1,e$e2,ylim=c(0,30))
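Whichever scheme is used, a per-fold paired comparison is more informative than the two marginal boxplots, since fold difficulty varies; a small sketch, where the toy `e` below stands in for the fold errors produced by either loop above (the values are made up):

```r
# toy fold errors standing in for the `e` computed above (made-up values)
e <- data.frame(e1 = c(12, 9, 15, 11), e2 = c(8, 7, 14, 9))

# mean squared error per model, and the per-fold differences
colMeans(e)
diffs <- e$e1 - e$e2

# a paired t-test on fold differences accounts for fold-to-fold variation
t.test(diffs)
```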
Some people seem to be under the impression that you can never use k-fold cross-validation for time-series models. Doesn't that depend on the underlying estimator? Is there anything wrong with my code?
What is the precise first-principles reasoning from which it follows that a time-series model must be cross-validated with the index order kept intact?
EDIT: I think the real question is: in which cases can I create new variables such as $X_1(t):=X(t-1)$ and then use a temporally ignorant validation method like k-fold?
I suspect the exception is models such as MA ones, where previous errors are needed (but I'm not sure). In any case, the blanket rule many students are taught (that for time series you must use a temporally aware validation method) seems wrong, because you can keep the time structure intact by building lagged variables and then use k-fold or whatever. Right?
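One way to make the "in which cases" concrete: shuffling rows is only harmless if, after conditioning on the lags in the design matrix, the remaining errors are independent across rows. That can be probed on the fitted residuals; a sketch using a Ljung–Box test on the AR(2) fit from above (the seed and the choice of `lag = 10` are assumptions):

```r
set.seed(1)
X <- rep(1, 100)
for (t in 3:100)
  X[t] <- rnorm(1, 0.4 * X[t - 1] + 0.6 * X[t - 2])
d <- data.frame(X = X[3:100], X.1 = X[2:99], X.2 = X[1:98])

fit2 <- lm(X ~ X.1 + X.2, d)

# Ljung-Box test: a large p-value means no evidence of leftover
# autocorrelation in the residuals, i.e. the rows are approximately
# exchangeable and reshuffling them for k-fold is defensible
Box.test(residuals(fit2), lag = 10, type = "Ljung-Box")
```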

