Predictive Distribution in Gaussian Process Derivation

Question

In Gaussian Processes for Machine Learning (Rasmussen and Williams), p. 11, we are given the following predictive distribution:

$$p\left(f_{*} | \mathbf{x}_{*}, X, \mathbf{y}\right)=\int p\left(f_{*} | \mathbf{x}_{*}, \mathbf{w}\right) p(\mathbf{w} | X, \mathbf{y}) d \mathbf{w}$$

We are also given the definition: $f_{*} \triangleq f\left(\mathbf{x}_{*}\right)$ where $f(\mathrm{x})=\mathrm{x}^{\top} \mathrm{w}$ and $y=f(\mathbf{x})+\varepsilon$.

I understand that one may compute the distribution shown above by doing the following:

$$E[f_{*}] = \mathbf{x}_{*}^{\top}E[\mathbf{w}]$$

$$V[f_{*}] = \mathbf{x}_{*}^{\top}V[\mathbf{w}]\mathbf{x}_{*}$$

We then plug in the expressions for the mean and covariance of our posterior over $\mathbf{w}$. Essentially, we are computing the mean and variance of a linear function of a Gaussian random vector, since we have already established that the posterior distribution over $\mathbf{w}$ is Gaussian.
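A quick numerical sketch of this step, using only the Python standard library. The posterior mean `mu`, the Cholesky factor `L` (so that the covariance is `L L^T`), and the test input `x_star` are made-up values for illustration, not numbers from the book. It draws samples of $\mathbf{w}$, forms $f_{*} = \mathbf{x}_{*}^{\top}\mathbf{w}$, and compares the sample mean and variance with the two formulas above.

```python
import random

def sample_f_star(mu, L, x_star, rng):
    """Draw w = mu + L z with z ~ N(0, I), then return f_* = x_*^T w."""
    n = len(mu)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w = [mu[i] + sum(L[i][j] * z[j] for j in range(n)) for i in range(n)]
    return sum(x_i * w_i for x_i, w_i in zip(x_star, w))

# Hypothetical 2D posterior over w (illustrative values only).
mu = [1.0, -0.5]
L = [[1.0, 0.0],
     [0.5, 0.5]]  # lower-triangular Cholesky factor of Sigma
n = len(mu)
Sigma = [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
         for i in range(n)]
x_star = [2.0, 3.0]  # hypothetical test input

# Closed-form moments: E[f_*] = x_*^T mu and V[f_*] = x_*^T Sigma x_*.
analytic_mean = sum(x * m for x, m in zip(x_star, mu))
analytic_var = sum(x_star[i] * Sigma[i][j] * x_star[j]
                   for i in range(n) for j in range(n))

# Monte Carlo estimate of the same two moments.
rng = random.Random(0)
draws = [sample_f_star(mu, L, x_star, rng) for _ in range(100_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (len(draws) - 1)

print("analytic mean/var:   ", analytic_mean, analytic_var)
print("Monte Carlo mean/var:", mc_mean, mc_var)
```

The two printed pairs should agree up to sampling error. This only checks the moments; it does not by itself show that $f_{*}$ is Gaussian.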

However, I have a problem with the integral shown above. I understand how to arrive at this expression (sum rule, product rule, and conditional independence). However, the $p\left(f_{*} | \mathbf{x}_{*}, \mathbf{w}\right)$ term is bothering me.

I have seen derivations where this term is replaced with $p\left(y_{*} | \mathbf{x}_{*}, \mathbf{w}\right)$. This makes more sense to me as, due to the Gaussian noise, this term would be a Gaussian distribution and the integral of the product of two Gaussian distributions is a Gaussian distribution (provided proper normalisation). However, wouldn't $p\left(f_{*} | \mathbf{x}_{*}, \mathbf{w}\right)$ be a delta function at $f_{*} = \mathrm{x}^{\top}_{*} \mathrm{w}$? Am I missing something here?

Any advice would be greatly welcomed! Cheers!

I think you should add the self study tag. – Michael R. Chernick, May 26 '19 at 03:45
@MichaelChernick Just added it, thanks for the suggestion. – JKB, May 26 '19 at 15:08

rileyx · Answer 1 · 2023-05-12T07:04:56.100

Actually it is a delta function. We're basically finding the pdf of $f(\mathbf{x}_*, \mathbf{w})$ given the derived posterior pdf for $\mathbf{w}$. For example see here. So

$$ p\left(f_{*} | \mathbf{x}_{*}, \mathbf{w}\right) = \delta(f_{*} - \mathbf{x}_*^T \mathbf{w}) $$

Our integral becomes

$$ \begin{align} p\left(f_{*} | \mathbf{x}_{*}, X, \mathbf{y}\right) &= \int p\left(f_{*} | \mathbf{x}_{*}, \mathbf{w}\right) p(\mathbf{w} | X, \mathbf{y}) d \mathbf{w} \\ &= \int \delta(f_{*} - \mathbf{x}_*^T \mathbf{w}) \mathcal{N}(\mathbf{w} \mid \mu, \Sigma) d \mathbf{w} \\ &= \ldots \\ &= \mathcal{N}(f_{*} \mid \mathbf{x}_{*}^T \mu, \mathbf{x}_{*}^T \Sigma \mathbf{x}_{*}) \end{align} $$

where $\mu$ and $\Sigma$ are the mean and covariance of the posterior over $\mathbf{w}$. The integral is easy to take in the 1D case as a sanity check. The multivariate case is a bit more annoying, but basically the delta function constrains one dimension of $\mathbf{w}$, and the rest is doable by completing the square to get something of the form

$$ p\left(f_{*} | \mathbf{x}_{*}, X, \mathbf{y}\right) = \mathcal{N}(f_{*}) \int d\mathbf{u} \mathcal{N}(\mathbf{u}) $$

where $\mathbf{u}$ is $(n-1)$-dimensional, with $n$ the dimension of $\mathbf{w}$, and the integral manifestly just goes to 1.
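For reference, here is a sketch of the 1D sanity check. It assumes a scalar weight $w$ with posterior $\mathcal{N}(w \mid \mu, \sigma^2)$ and a scalar input $x_* \neq 0$; the symbol $\sigma^2$ is notation introduced just for this special case. Using the scaling property $\delta(f_* - x_* w) = \delta(w - f_*/x_*) / |x_*|$:

$$ \begin{align} \int \delta(f_{*} - x_* w)\, \mathcal{N}(w \mid \mu, \sigma^2)\, dw &= \frac{1}{|x_*|}\, \mathcal{N}\left(\frac{f_*}{x_*} \,\middle|\, \mu, \sigma^2\right) \\ &= \frac{1}{\sqrt{2 \pi x_*^2 \sigma^2}} \exp\left(-\frac{(f_* - x_* \mu)^2}{2 x_*^2 \sigma^2}\right) \\ &= \mathcal{N}(f_{*} \mid x_* \mu,\, x_*^2 \sigma^2) \end{align} $$

This agrees with the general result when $\Sigma = \sigma^2$, since then $\mathbf{x}_{*}^T \Sigma \mathbf{x}_{*} = x_*^2 \sigma^2$.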

How do you get the first equality? – Rémy Hosseinkhan Boucher, Oct 27 '23 at 18:09
