Understanding the predictive distribution in Gaussian linear regression

Question

I'm reading through the Gaussian Process book http://www.gaussianprocess.org/gpml/chapters/RW2.pdf and there's one section here I don't quite understand (page 11). The author says:

"the predictive distribution is given by averaging the output of all possible linear models wrt the Gaussian posterior"

$$ \begin{aligned} p(f_*|x_*,X,y) &= \int p(f_*|x_*,w)~p(w|X,y)~dw \\ &=\mathcal N\left(\frac{1}{\sigma_n^2}x^T_*A^{-1}Xy,~x^T_*A^{-1}x_*\right) \end{aligned} $$

What does this mean, exactly? I understand that the purpose of using Gaussians is to be able to calculate uncertainty for a prediction, but I'm unclear how the "averaging of the output" doesn't end up with just a mean value of the weight. And how were the parameters for the mean and covariance derived?

Type "conjugate bayesian gaussian" in Google to find some information about the derivation. – Stéphane Laurent, Nov 26 '12 at 10:30

jerad · Answer 1 · 2012-11-26T09:30:18.657

The posterior predictive distribution is a weighted average over your hypothesis space, where each hypothesis is weighted by its posterior probability. In Bayesian analysis, beliefs are expressed as entire distributions rather than point estimates. In your example, you have a posterior distribution over all possible weights. The fully Bayesian way to make a prediction is to marginalize out the weights by integrating over your posterior. What gets averaged is the prediction $f_*$ that each weight vector makes, not the weights themselves, so the result is a full distribution over $f_*$ rather than a single mean weight. There are alternatives, such as taking the MAP value, which is the most probable weight value under your posterior; however, that is not strictly Bayesian. It might help to review a smaller, more introductory Bayesian inference problem (e.g. inferring the mean of a Gaussian with known variance) to get your head around the concept, because it is fundamental to the entire approach.
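The "averaging over all linear models" can be checked numerically. The following is a minimal sketch, not the book's code: it assumes the book's conventions ($X$ is $D \times n$, $A = \sigma_n^{-2}XX^T + \Sigma_p^{-1}$), and the data, the prior $\Sigma_p = I$, the test point, and all function names are made up for illustration. It draws many weight vectors from the posterior, computes each model's output at $x_*$, and compares the spread of those outputs with the closed-form mean and variance.

```python
import numpy as np

def posterior(X, y, sigma_n, Sigma_p):
    """Posterior mean and covariance of w, with X of shape (D, n)."""
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X @ y / sigma_n**2
    return w_bar, A_inv

def predictive_closed_form(x_star, w_bar, A_inv):
    """Mean and variance of f_* from the closed-form expression."""
    return x_star @ w_bar, x_star @ A_inv @ x_star

def predictive_monte_carlo(x_star, w_bar, A_inv, n_samples, rng):
    """Sample w from the posterior and look at the outputs x_*^T w."""
    W = rng.multivariate_normal(w_bar, A_inv, size=n_samples)
    f = W @ x_star  # one prediction per sampled linear model
    return f.mean(), f.var()

# Synthetic data (assumed values, for illustration only).
rng = np.random.default_rng(0)
D, n, sigma_n = 3, 50, 0.5
X = rng.normal(size=(D, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + sigma_n * rng.normal(size=n)

w_bar, A_inv = posterior(X, y, sigma_n, np.eye(D))
x_star = np.array([0.3, -1.0, 2.0])

mean_cf, var_cf = predictive_closed_form(x_star, w_bar, A_inv)
mean_mc, var_mc = predictive_monte_carlo(x_star, w_bar, A_inv, 200_000, rng)
print("closed form:", mean_cf, var_cf)
print("monte carlo:", mean_mc, var_mc)
```

The two pairs of numbers should agree up to sampling noise. The Monte Carlo variance is nonzero because the sampled models disagree about $f_*$, which is exactly the uncertainty the predictive distribution reports.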

score 2 · Answer 2 · answered Mar 13 '17 at 17:45

The proof is simple. Ignore the integral for a bit and check the definitions of the expectation (mean) and the covariance (variance), both taken under the posterior $p(w|X,y)$, which has mean $\bar{w}$ and covariance $A^{-1}$:

$f_{*}=x_{*}^Tw \implies \mathbb{E}[f_{*}]=x_{*}^T\mathbb{E}[w] \implies \mu_{f_{*}}=x_{*}^T\bar{w}=\frac{1}{\sigma_n^2} x_{*}^TA^{-1}Xy$
In the same sense: $\mathbb{E}[(f_{*}-\mu_{f_{*}})(f_{*}-\mu_{f_{*}})^T]=\mathbb{E}[x_{*}^T(w-\bar{w})(w-\bar{w})^Tx_{*}]=x_{*}^T\mathbb{E}[(w-\bar{w})(w-\bar{w})^T]x_{*}=x_{*}^TA^{-1}x_{*}$.

I stumbled across the same point when reading the book.

Pantelis

still one has to show the distribution is a gaussian with this method... – Rémy Hosseinkhan Boucher Oct 27 '23 at 17:19

score 2 · Answer 3 · answered May 02 '17 at 15:13

This is my understanding, which does not follow the marginalization argument. I hope it is correct.

$f_* = x_*^Tw$ (This is equation 2.1 in the book)

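Therefore, given $x_*$, $X$ and $y$, the random variable $f_*$ is a linear function of $w$, and the relevant distribution of $w$ is its posterior:

$p(w|x_*,X,y) = p(w|X,y)$

(remember that $w$ is independent of $x_*$)

Therefore, $f_*|x_*,X,y$ is distributed as $x_*^Tw$ with $w$ drawn from $p(w|X,y)$, which is a linear transformation of a multivariate normal distribution (MVD). This is a statement about random variables: the density of $f_*$ is not literally $x_*^T$ times the density of $w$.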

We know that $w|X,y\sim\mathcal{N}(\frac{1}{\sigma_n^2}A^{-1}Xy,A^{-1})$ (This is equation 2.8 in the book)

Hence, $f_*|x_*,X,y\sim\mathcal{N}(\frac{1}{\sigma_n^2}x_*^TA^{-1}Xy,x_*^TA^{-1}x_*)$ (see the general theory below)

Here are more details on the linear transformation of an MVD:

Given $x\sim\mathcal{N}(\mu_x, \Sigma_x)$ and $y=Zx$,

$y\sim\mathcal{N}(Z\mu_x, Z\Sigma_xZ^T)$
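The linear-transformation rule can be sanity-checked by sampling. This is a small sketch with arbitrary example values for $\mu_x$, $\Sigma_x$ and $Z$ (they are assumptions, not taken from the book), and `linear_transform_gaussian` is a hypothetical helper name. It compares the mean and covariance given by the rule with the empirical mean and covariance of transformed samples.

```python
import numpy as np

def linear_transform_gaussian(mu, Sigma, Z):
    """Mean and covariance of y = Z x when x ~ N(mu, Sigma)."""
    return Z @ mu, Z @ Sigma @ Z.T

# Arbitrary example values (assumed for illustration).
mu_x = np.array([1.0, -1.0, 2.0])
Sigma_x = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])
Z = np.array([[1.0, 0.0, 2.0],
              [0.5, -1.0, 0.0]])

mu_y, Sigma_y = linear_transform_gaussian(mu_x, Sigma_x, Z)

# Empirical check: sample x, apply y = Z x to every sample.
rng = np.random.default_rng(1)
x_samples = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y_samples = x_samples @ Z.T
emp_mean = y_samples.mean(axis=0)
emp_cov = np.cov(y_samples, rowvar=False)

print("rule:     ", mu_y, Sigma_y, sep="\n")
print("empirical:", emp_mean, emp_cov, sep="\n")
```

Setting $Z = x_*^T$ (a single row) and using the posterior of $w$ in place of $x$ recovers the predictive mean and variance from the question. Matching moments alone does not prove that $y$ is Gaussian; that part comes from the theory, for instance via characteristic functions.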

score 1 · Answer 4 · answered Mar 31 '22 at 09:06

The general idea of the "predictive distribution" has been expressed in @jerad's answer. In short, Bayesians regard every unknown quantity as random, and therefore seek its full distribution, which preserves all the information. (Of course, you can always derive a point estimator from it as the last step, for example by the maximum vote principle.)

Returning to the formula, there are actually two equations:

Marginal equation: $$ \begin{aligned} p\left(f_{*} \mid x_{*}, X, y\right) &=\int p\left(f_{*} \mid x_{*}, w\right) p(w \mid X, y) d w \end{aligned} $$

Proof: By marginalizing over $w$, $$ \begin{aligned} p\left(f_{*} \mid x_{*}, X, y\right) &=\int p\left(f_{*} \mid x_{*}, w, X, y\right) p(w \mid x_{*}, X, y) d w \end{aligned} $$

where, by conditional independence, $p(f_{*} \mid x_{*}, w, X, y) = p(f_{*} \mid x_{*}, w)$ and $p(w \mid x_{*}, X, y) = p(w \mid X, y)$

Derivation of normal distribution

Notice that $f_{*} = w^T x_{*}$, so $f_{*} \mid x_{*},w$ is actually deterministic, i.e. $p(f_{*} \mid x_{*},w) = \delta(f_{*} - w^T x_{*})$. Integrating any distribution against a delta function is just an evaluation. Or you can just refer to @Pantelis' answer, which is simpler.
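To spell out the delta-function step, here is a sketch that writes $\bar{w} = \frac{1}{\sigma_n^2}A^{-1}Xy$ for the posterior mean (equation 2.8 in the book) and assumes $x_* \neq 0$:

$$ \begin{aligned} p\left(f_{*} \mid x_{*}, X, y\right) &=\int \delta\left(f_{*} - x_{*}^T w\right) \mathcal N\left(w \mid \bar{w}, A^{-1}\right) d w \\ &=\mathcal N\left(f_{*} \mid x_{*}^T\bar{w},~x_{*}^TA^{-1}x_{*}\right) \end{aligned} $$

The first line is, by definition, the density of the random variable $x_*^T w$ when $w \sim \mathcal N(\bar{w}, A^{-1})$. The second line follows because a linear function of a Gaussian vector is again Gaussian, with mean $x_*^T\bar{w}$ and variance $x_*^TA^{-1}x_*$. This is also the missing piece pointed out in the comment on @Pantelis' answer: computing the two moments alone does not show that the distribution is Gaussian.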
