5

I'm studying a book about Gaussian processes in machine learning, and I do not know how to compute the predictive distribution exactly. This question, Understanding the predictive distribution in gaussian linear regression, says it is like $(1)$, but I found in Bishop's Pattern Recognition and Machine Learning that the predictive distribution is like $(2)$:
\begin{align} f_* \mid x_*,\, X,\, y \quad &\sim\quad \mathcal{N}\!\left(\sigma_n^{-2} x_*^T A^{-1} X y,\; x_*^T A^{-1} x_*\right) \tag{1} \\ f_* \mid x_*,\, X,\, y \quad &\sim\quad \mathcal{N}\!\left(\sigma_n^{-2} x_*^T A^{-1} X y,\; \sigma_n^2 I + x_*^T A^{-1} x_*\right) \tag{2} \end{align} (That is, with a different variance.) What happened? Why are they not equal?

Skullgreymon
  • 140
  • 9
  • Looks like a minor difference, just notation. – Tomas Aug 09 '19 at 20:42
  • @Curious there's actually a meaningful difference; see below – user20160 Aug 10 '19 at 16:40
  • Thanks! Skullgreymon, please include a full reference to Bishop's book. Anyway, this is an interesting question and answer! – Tomas Aug 10 '19 at 17:43
  • Very good question, and I bumped into the same thing myself. I'm reading Gaussian Processes for Machine Learning by Rasmussen et al., https://gaussianprocess.org/gpml/chapters/ chapter 2, page 12. The noise variance term is missing from the posterior predictive variance there, even though my own derivation included it. I spent some time looking for where I had made a mistake, but it seems my derivation was correct all along; the book appears to leave the noise variance out of the posterior predictive variance. – jjepsuomi Jun 15 '22 at 21:38

2 Answers

5

Short answer

Equations (1) and (2) are different because they give the posterior predictive distribution for different quantities. (1) is for the noiseless linear function output, whereas (2) is for the noisy observed output.

Long answer

Recall that the model is:

$$y_i = w^T x_i + \epsilon_i \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2_n)$$

Each observed output $y_i$ is given by a linear function of the input $x_i$ plus i.i.d. Gaussian noise. The equations listed in the question assume a Gaussian prior on the coefficients $w$, and treat the noise variance $\sigma^2_n$ as a fixed parameter.
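To make the setup concrete, here is a minimal NumPy sketch of the generative model (the shapes, variable names, and the inputs-as-columns convention are my own illustrative choices, loosely following GPML's notation):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 3           # number of training points, input dimension
sigma_n = 0.5          # noise standard deviation (treated as a fixed parameter)
Sigma_p = np.eye(d)    # prior covariance of the weights, w ~ N(0, Sigma_p)

w_true = rng.multivariate_normal(np.zeros(d), Sigma_p)  # weights drawn from the prior
X = rng.normal(size=(d, n))                             # inputs stored as columns of X
f = X.T @ w_true                                        # noiseless latent outputs f_i = w^T x_i
y = f + sigma_n * rng.normal(size=n)                    # noisy observations y_i = f_i + eps_i
```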

Notice that we could also write:

$$y_i = f_i + \epsilon_i$$

where $f_i = w^T x_i$ is the noiseless output of the linear function. This is a latent variable, since it's not directly observed. Rather, we observe the noisy output $y_i$.

Now, suppose we've fit the model to training data $(X,y)$, and want to predict the output for a new set of inputs $x_*$. The posterior of the noiseless function outputs $f_*$ is the Gaussian distribution in equation (1). The derivation is described in chapter 2 of Gaussian Processes for Machine Learning (Rasmussen & Williams, 2006), and summarized here.
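As a sketch of equation (1) for a single test input (reusing `X`, `y`, `sigma_n`, and `Sigma_p` from the illustration above, with $A = \sigma_n^{-2} X X^T + \Sigma_p^{-1}$ as defined in GPML chapter 2):

```python
# Posterior over the noiseless function value f_* at a new input x_*
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)  # A = sigma_n^{-2} X X^T + Sigma_p^{-1}
A_inv = np.linalg.inv(A)

x_star = rng.normal(size=d)                        # a new test input

mean_f = (x_star @ A_inv @ X @ y) / sigma_n**2     # sigma_n^{-2} x_*^T A^{-1} X y
var_f = x_star @ A_inv @ x_star                    # x_*^T A^{-1} x_*
```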

The posterior of the noisy observed outputs $y_*$ is the Gaussian distribution in equation (2) (but there may be a typo; the variable should be called $y_*$, not $f_*$). Notice that (2) is identical to (1), with the exception that $\sigma^2_n I$ has been added to the covariance matrix. This follows from the fact that the noisy observed outputs are produced by adding independent Gaussian noise to the noiseless function outputs (with mean zero and variance $\sigma^2_n$).
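Continuing the same sketch, the step from (1) to (2) is just adding the noise variance to the predictive variance; the mean is unchanged:

```python
# Posterior over the noisy observation y_*: same mean as (1), noise variance added
mean_y = mean_f
var_y = var_f + sigma_n**2
```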

user20160
  • 32,439
  • 3
  • 76
  • 112
  • Exactly, thank you very much. I was confusing the latent function with the noisy observed variable; I thought they were the same. – Skullgreymon Aug 10 '19 at 16:41
  • @Skullgreymon I get the impression that more sources refer to (2) as "the posterior predictive distribution" when describing Bayesian linear regression. So the confusion is quite understandable. GPML might be somewhat unique in emphasizing a latent function in this context. I suppose this makes sense given its focus on Gaussian processes rather than linear regression. – user20160 Aug 10 '19 at 16:50
-1

It is the residual noise (error) term. You can refer to the Distribution Theory: Normal Regression Models part of https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/lecture-notes/MIT18_655S16_LecNote19.pdf

You may also want to check out the concept of heteroscedasticity and the assumptions of linear regression.

  • 1
    I do not understand your point. I think this is not about heteroscedasticity; both approaches assume homoscedasticity. And I don't think it makes sense for the error term not to appear in the variance of the predictive distribution. – Skullgreymon Oct 23 '18 at 08:51