
There are a few questions that have asked about this before here and here, but I seem to be missing a step.

$$ \begin{aligned} p(f_*|x_*,X,y) &= \int p(f_*,w|x_*,X,y)~dw \quad \text{(marginalise $w$)}\\ &= \int p(f_*|x_*,X,y,w)p(w|x_*,X,y)~dw \quad \text{(chain rule)}\\ &= \int p(f_*|x_*,w)p(w|X,y)~dw \quad \text{($f_* \mathrel{\unicode{x2AEB}} X, y$ given $w$ and $w \mathrel{\unicode{x2AEB}} x_*$)}\\ \end{aligned} $$

Now, I don't understand what distribution $p(f_*|x_*,w)$ is. Isn't it just a constant when both $\{x_*, w\}$ are given? There seems to be some step I'm missing after substituting $f_* = x_*^Tw$ and then solving the integral:

$$ \begin{aligned} p(f_*|x_*,X,y) &= \int p(f_*|x_*,w)p(w|X,y)~dw\\ &= \int p(x_*^Tw|x_*,w)p(w|X,y)~dw \quad \text{($f_* = x_*^Tw$)}\\ &= \textbf{what goes here?}\\ &= x_*^T\mathcal{N}\left(\frac{1}{\sigma_n^2} A^{-1}Xy, A^{-1}\right) \quad \text{(is this right?)}\\ &= \mathcal{N}\left(\frac{1}{\sigma_n^2} x^T_*A^{-1}Xy, x_*^TA^{-1}x_*\right) \end{aligned} $$
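For what it's worth, here is a quick numerical sanity check of that last line. It assumes the Rasmussen & Williams setup, where $X$ is $d \times n$ (inputs as columns), the prior is $w \sim \mathcal{N}(0, \Sigma_p)$, and $A = \sigma_n^{-2}XX^T + \Sigma_p^{-1}$; the toy data and variable names are mine. It compares the candidate mean and variance against Monte Carlo samples of $w$ from the posterior pushed through $x_*^Tw$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: d features, n observations (X is d x n, inputs as columns).
d, n = 2, 20
sigma_n = 0.3                      # observation noise std
Sigma_p = np.eye(d)                # prior covariance of w
X = rng.normal(size=(d, n))
w_true = rng.normal(size=d)
y = X.T @ w_true + sigma_n * rng.normal(size=n)

# Posterior over w: p(w | X, y) = N(w_bar, A^{-1}),
# with A = sigma_n^{-2} X X^T + Sigma_p^{-1} and w_bar = sigma_n^{-2} A^{-1} X y.
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_bar = A_inv @ X @ y / sigma_n**2

# Candidate analytic predictive for a test input x_*.
x_star = np.array([1.0, -0.5])
mean_analytic = x_star @ w_bar
var_analytic = x_star @ A_inv @ x_star

# Monte Carlo version of the integral: sample w ~ p(w|X,y), push through x_*^T w.
w_samples = rng.multivariate_normal(w_bar, A_inv, size=100_000)
f_star_samples = w_samples @ x_star

print(mean_analytic, f_star_samples.mean())   # should be close
print(var_analytic, f_star_samples.var())     # should be close
```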


1 Answer


So I ended up coding it from scratch to see if I could figure it out; here's the notebook.

Basically, I was getting confused between the actual model prediction $x_*^Tw$ and the probability density of a predicted value, $p(f_*|x_*,w)$, which is actually a degenerate distribution that puts all of its probability on $f_* = x_*^Tw$ when $\{x_*,w\}$ are given.
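In other words (if you write that degenerate density as a Dirac delta, which I believe is the standard way to make it precise), the missing step is just a linear transformation of the Gaussian posterior over $w$:

$$ \begin{aligned} p(f_*|x_*,X,y) &= \int \delta(f_* - x_*^Tw)\,p(w|X,y)~dw\\ &= \text{density of } x_*^Tw \text{ where } w \sim \mathcal{N}\!\left(\tfrac{1}{\sigma_n^2}A^{-1}Xy,\; A^{-1}\right)\\ &= \mathcal{N}\!\left(\tfrac{1}{\sigma_n^2}x_*^TA^{-1}Xy,\; x_*^TA^{-1}x_*\right) \end{aligned} $$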

The way I solved it in the notebook (very inefficiently) was to create a giant 3D array of $x_* \times f_* \times w$, and then create a binary mask along the $w$ dimension for wherever $x_*^Tw$ equalled $f_*$. I did this by bucketing the result using a grid approximation.

Then I used the mask to multiply another 3D array of the same size, but with $p(w|X,y)$ filled in. This essentially gives me $p(f_*|x_*,w)p(w|X,y)$ (the mask times the densities), and then I can sum over the $w$ dimension to get an $x_* \times f_*$ array of probabilities. This is not a joint distribution; rather, each row (one value of $x_*$) is Gaussian distributed over the entire range of $f_*$.
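For anyone curious, here is a rough, self-contained sketch of that grid idea. It is not the notebook's actual code: I use a 1-D toy problem (scalar $w$) so the grids stay small, and the variable names, grid sizes, and data are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D problem: scalar weight w, scalar inputs, made-up data.
sigma_n, sigma_p = 0.3, 1.0
x = rng.uniform(-2, 2, size=15)
w_true = 0.8
y = w_true * x + sigma_n * rng.normal(size=15)

# Grids for test inputs x_*, predictions f_*, and the weight w.
x_star = np.linspace(-2, 2, 50)          # (Nx,)
f_star = np.linspace(-5, 5, 200)         # (Nf,)
w_grid = np.linspace(-2, 2, 400)         # (Nw,)
dw = w_grid[1] - w_grid[0]

# Grid-approximated posterior p(w | X, y): prior x likelihood, normalised.
log_prior = -0.5 * (w_grid / sigma_p) ** 2
log_lik = -0.5 * ((y[None, :] - w_grid[:, None] * x[None, :]) / sigma_n) ** 2
log_post = log_prior + log_lik.sum(axis=1)
post_w = np.exp(log_post - log_post.max())
post_w /= post_w.sum() * dw              # (Nw,) density on the w grid

# Binary mask: 1 where x_* * w lands in a given f_* bucket, 0 elsewhere.
preds = x_star[:, None] * w_grid[None, :]                     # (Nx, Nw)
bucket = np.digitize(preds, f_star) - 1                       # bucket index per (x_*, w)
mask = np.zeros((len(x_star), len(f_star), len(w_grid)))
ix, iw = np.meshgrid(np.arange(len(x_star)), np.arange(len(w_grid)), indexing="ij")
valid = (bucket >= 0) & (bucket < len(f_star))
mask[ix[valid], bucket[valid], iw[valid]] = 1.0

# p(f_*|x_*,w) p(w|X,y), summed over the w dimension: probability mass per
# f_* bucket for each x_* (each row sums to roughly 1).
pred_mass = (mask * post_w[None, None, :]).sum(axis=2) * dw   # (Nx, Nf)
```

Each row of `pred_mass` can then be compared against the analytic Gaussian predictive for that $x_*$.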

I'd be happy to hear if I've done something wrong or if it can be improved, but the results look pretty similar, so it can't be too far off! :)
