I'm experimenting with a standard GARCH(1,1) model with non-normal innovations $$\epsilon_t = \sqrt{h_t} z_t$$ $$h_t = \omega + \alpha \epsilon_{t-1}^2 + \beta h_{t-1}$$ where $E[z_t] = 0$, $E[z_t^2] = 1$, but $z_t$ could be non-normal.
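To make the setup concrete, here is a minimal sketch of the recursion in Python/NumPy. The standardized Student-t choice for $z_t$ and the parameter names are placeholders, not the actual distribution I'm working with:

```python
import numpy as np

def simulate_garch_path(omega, alpha, beta, n, nu=8.0, rng=None):
    """One path of eps_t = sqrt(h_t) z_t with h_t = omega + alpha*eps_{t-1}^2 + beta*h_{t-1}."""
    rng = np.random.default_rng() if rng is None else rng
    # Student-t rescaled so that E[z_t] = 0 and E[z_t^2] = 1 (any unit-variance choice works)
    z = rng.standard_t(nu, size=n) * np.sqrt((nu - 2.0) / nu)
    eps = np.empty(n)
    h = np.empty(n)
    h[0] = omega / (1.0 - alpha - beta)   # start at the unconditional expectation of h_t
    eps[0] = np.sqrt(h[0]) * z[0]
    for t in range(1, n):
        h[t] = omega + alpha * eps[t - 1] ** 2 + beta * h[t - 1]
        eps[t] = np.sqrt(h[t]) * z[t]
    return eps, h
```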
I've worked through Bai, Russell, and Tiao's 2003 paper "Kurtosis of GARCH and stochastic volatility models with non-normal innovations", which provides a helpful theorem (2.1) on the various moments of the variables in the GARCH process. I took the time to verify these results on my own, so I'm fairly sure they're correct.
To test the moments, I generate an ensemble of 200,000 samples, where each sample is an array of $N$ realizations of $\epsilon_t$. So I end up with a 200,000 by $N$ array. This allows me to quickly compare various sample moments with their theoretical values. For example, I can take the mean of a column (i.e., across the ensemble at a fixed $t$), which should be near the theoretical $E[\epsilon_t] = 0$. This particular check works without issues.
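In code, the ensemble is something like the sketch below (again with a placeholder standardized Student-t for $z_t$ and arbitrary parameter values); the printed column mean should come out close to zero:

```python
import numpy as np

def simulate_garch_ensemble(omega, alpha, beta, n_paths, n_steps, nu=8.0, seed=0):
    """(n_paths, n_steps) arrays of eps_t and h_t, vectorised across paths."""
    rng = np.random.default_rng(seed)
    # Standardised Student-t innovations: E[z_t] = 0, E[z_t^2] = 1
    z = rng.standard_t(nu, size=(n_paths, n_steps)) * np.sqrt((nu - 2.0) / nu)
    eps = np.empty((n_paths, n_steps))
    h = np.empty((n_paths, n_steps))
    h[:, 0] = omega / (1.0 - alpha - beta)   # h_0 = mu, the unconditional expectation
    eps[:, 0] = np.sqrt(h[:, 0]) * z[:, 0]
    for t in range(1, n_steps):
        h[:, t] = omega + alpha * eps[:, t - 1] ** 2 + beta * h[:, t - 1]
        eps[:, t] = np.sqrt(h[:, t]) * z[:, t]
    return eps, h

eps, h = simulate_garch_ensemble(0.1, 0.1, 0.8, n_paths=200_000, n_steps=50)
print(eps[:, 10].mean())   # column mean across the ensemble, should be near E[eps_t] = 0
```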
However, for some of the moments, something is happening that I can't formally explain. For example, consider the first autocovariance of the squared realizations $$\gamma_1 = E[(\epsilon_t^2 - \mu)(\epsilon_{t-1}^2 - \mu)]$$ where I've defined $\mu = E[\epsilon_t^2] = \omega / (1 - \phi)$ and $\phi = \alpha + \beta$. Using the ARMA representation of $\epsilon_t^2$ (eq. 2 in Bai et al) along with some tricks from Chapter 3 of Hamilton's "Time Series Analysis" (his 3.3.18 in particular), I was able to derive $$\gamma_1 = \alpha \frac{1 - \phi^2 + \alpha \phi}{1 - \phi^2} Var(u_t)$$ where $u_t = \epsilon_t^2 - h_t$. Since I later used this result to derive Bai et al's (15), I'm fairly confident it's correct.
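For anyone checking the algebra, the intermediate step is: with $u_t = \epsilon_t^2 - h_t$, the squared process has the ARMA(1,1) form $\epsilon_t^2 = \omega + \phi \epsilon_{t-1}^2 + u_t - \beta u_{t-1}$, with AR coefficient $\phi$ and MA coefficient $-\beta$, so the standard ARMA(1,1) first-autocovariance formula gives $$\gamma_1 = \frac{(1 - \beta\phi)(\phi - \beta)}{1 - \phi^2} Var(u_t) = \alpha \frac{1 - \phi^2 + \alpha\phi}{1 - \phi^2} Var(u_t),$$ using $\phi - \beta = \alpha$ and $1 - \beta\phi = 1 - (\phi - \alpha)\phi = 1 - \phi^2 + \alpha\phi$.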
But when I check this via simulation, comparing the sample covariance of adjacent squared columns (i.e., of $\epsilon_t^2$ and $\epsilon_{t-1}^2$ across the ensemble) to this theoretical $\gamma_1$, I only get an approximately correct result if I use columns near the end of the array. If I instead use some of the first few columns, the sample covariance can be less than half the theoretical $\gamma_1$.
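Concretely, the check looks roughly like this, using the arrays from the hypothetical `simulate_garch_ensemble` sketch above; rather than relying on a closed form for $Var(u_t)$, I estimate it from a late (near-stationary) column via $u_t = \epsilon_t^2 - h_t$:

```python
import numpy as np

def adjacent_cov_check(eps, h, alpha, beta, t):
    """Compare the ensemble covariance Cov(eps_t^2, eps_{t+1}^2) with the theoretical gamma_1."""
    phi = alpha + beta
    # Estimate Var(u_t) from the last column, where h_t should be close to stationary
    u_late = eps[:, -1] ** 2 - h[:, -1]
    gamma1_theory = alpha * (1.0 - phi**2 + alpha * phi) / (1.0 - phi**2) * u_late.var()
    # Sample covariance of adjacent squared columns, taken across the ensemble
    gamma1_sample = np.cov(eps[:, t] ** 2, eps[:, t + 1] ** 2)[0, 1]
    return gamma1_sample, gamma1_theory

print(adjacent_cov_check(eps, h, alpha=0.1, beta=0.8, t=1))    # early columns: well below theory
print(adjacent_cov_check(eps, h, alpha=0.1, beta=0.8, t=45))   # late columns: roughly matches
```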
I suspect this has to do with how I'm initializing the simulation. Following this discussion, I'm setting the initial value of $h$ to $\mu$, its unconditional expectation. I wonder if there's a more informed choice I can make; unfortunately, the literature seems to be quite sparse on this topic. In "Small sample properties of GARCH estimates and persistence", Hwang and Pereira simulate a GARCH process and mention that "the first 1000 observations are not used to avoid any problems from initial values". But what are these problems, formally? Is there any way I can avoid having to generate 1000 extra observations per path, only to discard them?
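For reference, the burn-in practice they describe amounts to something like the following (reusing the hypothetical `simulate_garch_ensemble` from the sketch above), which is exactly the overhead I'd like to avoid:

```python
burn_in = 1000
n_keep = 50
# Simulate burn_in extra steps per path and throw them away, as in Hwang and Pereira,
# so that the retained columns start close to the stationary distribution of h_t.
# (Fewer paths here only to keep the memory cost of the extra columns manageable.)
eps_full, h_full = simulate_garch_ensemble(0.1, 0.1, 0.8,
                                           n_paths=20_000, n_steps=n_keep + burn_in)
eps, h = eps_full[:, burn_in:], h_full[:, burn_in:]
```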