
In chapter 2 of [Gaussian Process], equations (2.22)–(2.24) give the predictive equations for Gaussian process regression, shown below. My question is how to derive f|X,y, the posterior distribution of f given the observed data. The book does not seem to give the derivation.

[Image: equations (2.22)–(2.24) from the book, the predictive mean and covariance for Gaussian process regression]

user3125

1 Answer


$\newcommand{\0}{\mathbf 0}$$\newcommand{\y}{\mathbf y}$$\newcommand{\f}{\mathbf f}$$\newcommand{\e}{\varepsilon}$If we have $y = f(x) + \e$ for $\e\sim\mathcal N(\0, \sigma^2 I)$ and $f \sim \mathcal {GP}(0, k)$ (I'm using a mean function of $0$ here), then from the GP prior we have $\f \mid x\sim \mathcal N(\0, K)$, and for the observed vector $\y \mid \f, x\sim \mathcal N\left(\f, \sigma^2 I\right)$.
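To make this setup concrete, here is a minimal NumPy sketch of the generative model; the squared-exponential kernel, the inputs, and the noise level are illustrative assumptions on my part, not choices from the question or the book:

```python
import numpy as np

# Illustrative squared-exponential kernel (an assumption; any PSD kernel works)
def kern(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 6)                  # training inputs (arbitrary)
K = kern(x, x) + 1e-10 * np.eye(len(x))    # tiny jitter for numerical stability
sigma = 0.3                                # noise standard deviation (arbitrary)

f = rng.multivariate_normal(np.zeros(len(x)), K)   # f | x ~ N(0, K)
y = f + sigma * rng.standard_normal(len(x))        # y | f, x ~ N(f, sigma^2 I)
```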

By Bayes' rule we have $$ \pi(\f\mid x, \y) \propto p(\y \mid \f, x)\, \pi(\f \mid x). $$ This is a Gaussian likelihood with a Gaussian prior, so since the Gaussian is conjugate to itself we know $\f \mid \y, x$ is also Gaussian. We can multiply the two densities and work out the mean and covariance of the posterior.

$$ p(\y \mid \f, x)\, \pi(\f \mid x) \propto \exp\left(-\frac 1{2\sigma^2}\|\y-\f\|^2 - \frac 12 \f^TK^{-1}\f\right). $$ Working on the term inside the exponential and factoring out $-\frac 12$, we have $$ \frac 1{\sigma^2}\|\y-\f\|^2 + \f^TK^{-1}\f = \frac 1{\sigma^2}\y^T\y - \frac 2{\sigma^2}\y^T\f + \frac 1{\sigma^2}\f^T\f + \f^TK^{-1}\f \\ = \f^T\left( K^{-1} + \sigma^{-2}I\right)\f - 2\sigma^{-2}\f^T\y + \sigma^{-2}\y^T\y. $$ Letting $A = K^{-1} + \sigma^{-2}I$, we can complete the square in $\f$ to get $$ \f^TA\f - 2\sigma^{-2}\f^T\y = (\f - \sigma^{-2}A^{-1}\y)^TA(\f - \sigma^{-2}A^{-1}\y) - \sigma^{-4}\y^TA^{-1}\y. $$ Since we know the posterior is Gaussian, we can read off the mean and covariance to get $$ \f\mid\y, x \sim \mathcal N\left(\sigma^{-2}\left( K^{-1} + \sigma^{-2}I\right)^{-1}\y,\ \left(K^{-1} + \sigma^{-2}I\right)^{-1}\right). $$

This can be rearranged in several ways if there is a different form you prefer. For example, using the identity $\left(K^{-1} + \sigma^{-2}I\right)^{-1} = \sigma^2 K\left(K + \sigma^2 I\right)^{-1}$, the posterior mean becomes $K\left(K + \sigma^2 I\right)^{-1}\y$ and the covariance becomes $\sigma^2 K\left(K + \sigma^2 I\right)^{-1} = K - K\left(K + \sigma^2 I\right)^{-1}K$, which is the form of the book's predictive equations evaluated at the training inputs.
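As a numerical sanity check, here is a minimal NumPy sketch (again with an illustrative squared-exponential kernel and arbitrary parameter values) confirming that the completed-square posterior agrees with the rearranged form:

```python
import numpy as np

# Illustrative squared-exponential kernel, as in the sketch above
def kern(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(0)
n = 6
x = np.linspace(-4, 4, n)
K = kern(x, x) + 1e-10 * np.eye(n)   # tiny jitter for numerical stability
sigma2 = 0.09                        # noise variance sigma^2 (arbitrary)
y = rng.multivariate_normal(np.zeros(n), K + sigma2 * np.eye(n))
I = np.eye(n)

# Posterior from completing the square: N(sigma^{-2} A^{-1} y, A^{-1})
A = np.linalg.inv(K) + I / sigma2
mean_sq = np.linalg.solve(A, y) / sigma2
cov_sq = np.linalg.inv(A)

# Rearranged forms: K (K + sigma^2 I)^{-1} y and K - K (K + sigma^2 I)^{-1} K
mean_re = K @ np.linalg.solve(K + sigma2 * I, y)
cov_re = K - K @ np.linalg.solve(K + sigma2 * I, K)

assert np.allclose(mean_sq, mean_re)
assert np.allclose(cov_sq, cov_re)
```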

jld
  • How do you work out these equations in the noise-free case? I have seen the Gaussian process posterior predictive distribution discussed in the absence of any noise as well. But with this approach of using the likelihood to get the posterior (not the posterior predictive distribution), how does that work if there is no noise? In that scenario the likelihood is deterministic, with value 1 or 0, right? – user179156 Sep 05 '20 at 23:41
  • Also curious: what is the intuition behind, and the relevance or importance of, the posterior you have derived? Usually the posterior predictive distribution is the important and useful one, and it has also confused me that almost all online references about GPs that I have read do not use or derive the posterior you worked out here. – user179156 Sep 05 '20 at 23:42
  • It would be helpful if you could derive the posterior predictive distribution using a similar application of Bayes' rule. Most derivations I have seen work out the joint prior and then divide by the marginal to get the conditional (or posterior). – user179156 Sep 06 '20 at 21:23
  • @user179156 if there's no noise you're probably not interested in the posterior distribution of $\mathbf f$ itself. I'd guess you're more likely looking at the distribution of a new point $f_*$ given $\mathbf f$, and for that I'd use the fact that ${\mathbf f \choose f_*}$ is jointly normal and then we can use the conditional distribution of a multivariate normal to get $f_*\mid\mathbf f$ (see the sketch below). I think that's more direct and insightful than Bayes for the posterior at a new point – jld Sep 08 '20 at 10:39
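For reference, a minimal sketch of the joint-normal conditioning described in that last comment, covering the noise-free case; the kernel, training inputs, and test inputs are illustrative assumptions:

```python
import numpy as np

# Illustrative squared-exponential kernel, as above
def kern(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 6)                   # training inputs (arbitrary)
xs = np.array([0.5, 2.5])                   # new inputs (arbitrary)
Kxx = kern(x, x) + 1e-10 * np.eye(len(x))   # tiny jitter so Kxx is invertible
f = rng.multivariate_normal(np.zeros(len(x)), Kxx)  # noise-free observations

# (f, f_*) is jointly normal, so condition on f:
# f_* | f ~ N(K_{*x} K_{xx}^{-1} f, K_{**} - K_{*x} K_{xx}^{-1} K_{x*})
Ksx = kern(xs, x)
Kss = kern(xs, xs)
mean = Ksx @ np.linalg.solve(Kxx, f)
cov = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)
```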