
Duplicates disclaimer: I know about the question

Time series forecasting using Gaussian Process regression

but this is not a duplicate, because that question is only concerned with modifications to the covariance function, while I argue that actually the noise term has to be modified too.

Question: in time series models, the usual assumption of nonparametric regression (iid data) fails, because the residuals are autocorrelated. Now, Gaussian Process regression is a form of nonparametric regression, because the underlying model is, assuming we have an iid sample $D=\{(t_i,y_i)\}_{i=1}^N$:

$$y_i = \mathcal{GP}(t_i)+\epsilon_i,\ i=1,\dots,N$$

where $\mathcal{GP}(t)$ is a Gaussian Process with a mean function $\mu(t)$ (usually assumed to be 0) and a covariance function $k(t,t')$, while $\epsilon_i\sim\mathcal{N}(0,\sigma^2)$. We then use Bayesian inference to compute the posterior predictive distribution, given the sample $D$.
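
A minimal sketch of this setup, assuming scikit-learn's `GaussianProcessRegressor` with an RBF kernel plus a `WhiteKernel` for the noise term (the library, kernel and data are just illustrative choices):

```python
# Sketch of GP regression with an explicit noise term, y_i = GP(t_i) + eps_i.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)[:, None]                      # observation times t_i
y = np.sin(t).ravel() + 0.2 * rng.standard_normal(50)    # y_i = GP(t_i) + eps_i

# k(t, t') plus a noise term (the WhiteKernel plays the role of sigma^2)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.04)
gp = GaussianProcessRegressor(kernel=kernel).fit(t, y)

t_star = np.linspace(0, 15, 200)[:, None]                # test times
mean, std = gp.predict(t_star, return_std=True)          # posterior predictive
```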

However, in time series data, the sample is not iid. Thus:

  1. how do we justify the use of such a model?
  2. since the mean function for time series forecasting with GPs is usually assumed to be zero, when I compute a forecast sufficiently far into the future, I expect it to revert to the mean of the data. This seems a particularly poor choice, because I would like to be able (in principle) to forecast as far into the future as I want, and have the model get the overall time trend right, with just an increase in the prediction uncertainty (see the case below with an ARIMA model, and the sketch after the question):

[Figure: example ARIMA forecast that follows the overall trend of the series, with prediction intervals that widen the further the forecast extends into the future]

how is this taken care of, when using GPs for time series forecasting?
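
To make point 2 concrete, here is a small sketch (same assumed scikit-learn setup as above, with arbitrary trending data) showing that a zero-mean GP forecast reverts to the training mean far from the data, while its predictive standard deviation saturates instead of growing with the horizon:

```python
# Sketch: far from the data, a zero-mean GP forecast reverts to the (prior) mean
# and its predictive uncertainty saturates, unlike an ARIMA forecast.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 100)[:, None]
y = 0.5 * t.ravel() + 0.3 * rng.standard_normal(100)     # data with a clear linear trend

kernel = RBF(length_scale=2.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, y)

near, far = np.array([[11.0]]), np.array([[100.0]])
for t_star in (near, far):
    mean, std = gp.predict(t_star, return_std=True)
    print(t_star.item(), mean.item(), std.item())
# At t = 100 the mean is roughly the average of y (about 2.5), not the
# extrapolated trend (about 50), and the standard deviation stops growing.
```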

DeltaIV
  • The key is that if you knew the underlying GP process, then we could subtract this off the observed values and all that would be left is independent residuals. Without knowing the underlying GP process, the observations are strongly correlated through the GP process. – Cliff AB Nov 20 '18 at 18:10
  • I am confused by the comment on 1): Let us assume that for every $t$ you have at most one example of some $y$ at $t$. The only assumption is that for every finite subset $t_1, ..., t_n$ the values $y_{t_1}, ..., y_{t_n}$ come from a multivariate Gaussian. However, you can never test that because you essentially have only one example... The situation is different from supervised learning in which you get a lot of examples $y = f(x)$ for the same 'rule' $f$ while here you get only one data point for every new $t$... What do you mean by 'the sample needs to be iid'? – Fabian Werner Nov 20 '18 at 18:58
  • @FabianWerner in "supervised learning" (regression) you don't get any example $y=f(x)$. You only get examples $y=f(x)+\epsilon$, otherwise it would be an interpolation (or, more generally, function approximation) problem, not a statistical one. Exactly the same happens here, with two key differences: 1) you can get (and you do get) new examples only by waiting more time, which means that test points will have $t_{test}>t_{train}$ 2) the errors are correlated. Concerning iid, do you not know what the acronym means (in which case you cannot usefully contribute here), or are you questioning my usage of the terminology for this problem? – DeltaIV Nov 21 '18 at 08:46
  • Maybe we are talking about different things here... ok, given that you get examples for $y = f(x) + \epsilon$: whenever you get a new example, you get a new example for the same distribution, while in the GP setting you change the distribution every time you get a new example (you only keep the weird total distribution incorporating all infinitely many random variables...) but ok, you don't want me here, I'll shut up :-) – Fabian Werner Nov 21 '18 at 08:59
  • I didn't say I don't want you there. I have nothing personally against you (I don't even know who you are). I just said that, if you don't know what iid means, you cannot contribute to the discussion, since that is the main part of my question. If you know the definition of a random sample, or of iid observations, and you're objecting to my usage of the terminology, that's absolutely fine: just, please explain your objection, because I believe I used the definition appropriately, even if somewhat informally. – DeltaIV Nov 21 '18 at 09:31
  • @CliffAB very interesting. If I understand you correctly, you're saying that, when using a GP model, we are automatically assuming that observations are correlated. I was actually talking about correlation of errors, not of observations, but in the usual iid setting it doesn't make a difference because, if observations are correlated, then even after subtracting the conditional mean, the resulting errors will still be correlated. It sounds convincing. Will you turn it into an answer? – DeltaIV Nov 21 '18 at 10:02
  • I am jumping into the discussion without quite being familiar with the context, but your statement "if observations are correlated, then even subtracting the conditional mean, the resulting errors will still be correlated" is not convincing to me. Say, you have $x_t$ generated by an AR(1) process: $x_t=\varphi_1 x_{t-1}+\varepsilon_t$ with $\varepsilon_t\sim i.i.d(0,\sigma_{\varepsilon}^2)$. Once you subtract the conditional mean $\varphi_1 x_{t-1}$ from $x_t$, what is left is $\varepsilon_t$ and that is i.i.d. – Richard Hardy Nov 21 '18 at 12:58
  • I am totally questioning 1: A GP is a collection $(Y_t)_{t \in T}$ of (possibly infinitely many) random variables such that every finite slice of them follows a multivariate Gaussian distribution. When we decide whether or not data was drawn independently from a rv $X$, we observe much data that was sampled by "the same" (aka iid copies of the same) random variable $X$. However, what you have is one single sample for $Y_1$, one single sample for $Y_2$, ... i.e. the data never repeats for the same variable but rather every new example is an example for a different random variable... – Fabian Werner Nov 21 '18 at 14:34
  • @FabianWerner you are totally misunderstanding: the iid samples are from the joint distribution of the two variables $t$ and $x$. In the regression setting we have multiple copies of "the same" (aka iid copies of the same) random vector $(\mathbf{x},y)$. It looks like you're not familiar with the usual setting of statistical regression. – DeltaIV Nov 21 '18 at 15:14
  • @RichardHardy if $x_t$ is a random variable, its mean conditional on $t$ is a real number $\mathbb{E}[x_t\vert t]=\mu(t)$, not another random variable. We're probably using the term "conditional mean" in a different way. However this is tangential to the actual question: can GPs be used for forecasting? Yes or no? If yes, then isn't the iid assumption in conflict with the usual assumptions for time series forecasting? – DeltaIV Nov 21 '18 at 15:20
  • We are simply conditioning on different things: I am conditioning on the past values of $x_t$ (thus its lags) while you are conditioning on time $t$. So both you and I make sense in our own contexts. I have never worked on GPs, so I cannot answer your actual question, though. – Richard Hardy Nov 21 '18 at 15:34
  • @RichardHardy thanks for the clarification: I thought $x_{t-1}$ was another random variable, but instead in this context it's a realization of a random variable (a bit like conditioning on seen data in Bayesian Inference). +1 (to the comments :-) – DeltaIV Nov 21 '18 at 19:10
  • @DeltaIV What do you mean by 'in time series data, the sample is not iid'? If it is because of correlation, then isn't this covered by the covariance function $k(t,t')$? – Sextus Empiricus Nov 26 '18 at 10:38
  • @MartijnWeterings well, the usual assumption in time series is that errors aren't independent, isn't it? For example, the signal in a financial time series can be approximated (very crudely) as some trend + brown noise. To get to iid errors, we use differencing in, e.g., ARIMA models. Anyway this is only part of the question. Even if you prove that the introduction of the covariance function plays the same role as differencing in nonseasonal models, there is still the problem of forecasting beyond the range of historical data (point 2). ARIMA models can forecast well beyond this range without reverting to the mean of the historical data. Also, and this is perhaps even more important, the uncertainty of an ARIMA forecast will keep increasing the further in the future you predict. Instead, the predictive uncertainty of a GP is bounded. – DeltaIV Nov 26 '18 at 11:22
  • I guess that the independence of the errors is dependent on how you define them (pun intended). For ARIMA, I believe, you also assume independent and identically distributed errors $\epsilon_t$, but for some measurement at a particular time you may have multiple errors, at different times, affecting it. Possibly this is what 'creates' the correlation and why a different viewpoint of the 'errors' might consider them related/correlated/dependent. – Sextus Empiricus Nov 27 '18 at 11:28
  • I do not know much about forecasting and especially not about forecasting with Gaussian Processes. But I imagine that there will be certain special kernels involved that allow you to project the trend outside the domain of the sample (like in the link in your question, where the periodic kernel is used to project the period outside the observed range). If you construct some confidence/credible interval, then wouldn't this interval increase in size for predictions further away from the measurement points? – Sextus Empiricus Nov 27 '18 at 11:34
  • @MartijnWeterings Re: 2; I'm not sure. Numerical experiments seem to indicate that this is not the case, i.e., the prediction interval is bounded for $\vert\vert\mathbf{x}^*\vert\vert\to\infty$ . However, since the expression of the posterior variance contains the inverse of a matrix whose norm is not necessarily bounded away from zero, it's not so easy to perform an asymptotic analysis of the width of the prediction interval for a GP. If you can do that, by all means feel free to submit an answer. – DeltaIV Nov 29 '18 at 09:21
  • @MartijnWeterings I stand corrected. The prediction interval doesn't increase indefinitely in width for $\vert\vert\mathbf{x}^*\vert\vert\to\infty$, because the only matrix which is inverted doesn't depend on $\mathbf{x}^*$. Here is the variance of the posterior predictive distribution: $k(\mathbf{x}^*,\mathbf{x}^*) +\sigma^2_n-\mathbf{k}(\mathbf{x}^*, \mathbf{X})^T[K(\mathbf{X},\mathbf{X})+\sigma^2_nI]^{-1}\mathbf{k}(\mathbf{x}^*, \mathbf{X})$. Since $k(\mathbf{x}^*,\mathbf{x}^*)$ is bounded for periodic kernels, it's pretty clear that the prediction interval is bounded. – DeltaIV Nov 29 '18 at 09:30
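
A small numerical sketch of that last point, implementing the predictive-variance formula above with a hand-rolled periodic kernel (the kernel parameters and training inputs are arbitrary illustrative choices):

```python
# Sketch: the posterior predictive variance from the comment above,
#   k(x*, x*) + sigma_n^2 - k(x*, X)^T [K(X, X) + sigma_n^2 I]^{-1} k(x*, X),
# stays bounded for a (bounded) periodic kernel, however far x* is from the data.
import numpy as np

def periodic_kernel(a, b, period=1.0, ell=1.0):
    # k(x, x') = exp(-2 sin^2(pi |x - x'| / period) / ell^2), bounded above by 1
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 30))                        # training inputs
sigma2_n = 0.1                                            # noise variance sigma_n^2
K_inv = np.linalg.inv(periodic_kernel(X, X) + sigma2_n * np.eye(len(X)))

for x_star in (6.0, 60.0, 6000.0):
    xs = np.array([x_star])
    k_star = periodic_kernel(xs, X).ravel()               # k(x*, X)
    var = periodic_kernel(xs, xs)[0, 0] + sigma2_n - k_star @ K_inv @ k_star
    print(x_star, var)   # never exceeds k(x*, x*) + sigma_n^2 = 1.1
```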