
I just started with Machine Learning and the statistics behind it, so I am trying to understand as many of the derivations as possible when I see formulae or resulting variables. Today I stumbled upon Ridge Regression and I am stuck on how Bishop derives the following result. I will also try to lay out my current level of knowledge.

So far, my understanding is that we can describe some output variable $t$ with

$t = h(x,w) + \epsilon$

where $h(x,w)$ is a linear combination of basis functions, for example the simple polynomial of degree $d$:

$h(x,w) = w_0 + w_1x + w_2x^2 + \dots + w_dx^d$, which can be written as $h(x,w) = w^T\phi(x)$ with $\phi(x) = (1, x, x^2, \dots, x^d)^T$.
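
To make this concrete for myself, here is a small sketch of how I picture $w^T\phi(x)$ being evaluated for a whole set of inputs (the function name and the toy numbers are my own, not from the book):

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Stack phi(x_n) = (1, x_n, x_n^2, ..., x_n^degree) row by row
    into an N x (degree + 1) matrix."""
    return np.vstack([x**j for j in range(degree + 1)]).T

x = np.array([0.0, 0.5, 1.0, 1.5])
Phi = polynomial_design_matrix(x, degree=3)  # shape (4, 4): one row phi(x_n) per input

w = np.array([1.0, -2.0, 0.5, 0.1])          # some weight vector
h = Phi @ w                                  # h(x_n, w) = w^T phi(x_n) for every n
```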

We could now assume that the error term $\epsilon$ follows a Gaussian distribution, and thus we can model $t$ with a Gaussian likelihood:

$p(t|x,w,\beta) = N(t|h(x,w),\beta^{-1})$, where $\beta = \frac{1}{\sigma^2}$ is the noise precision.
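
Written out for a whole dataset $\{(x_n, t_n)\}_{n=1}^N$, I take this to mean (assuming i.i.d. noise; please correct me if this step is already off):

$$p(\mathbf{t}\mid \mathbf{x}, w, \beta) = \prod_{n=1}^N N\!\left(t_n \mid w^T\phi(x_n), \beta^{-1}\right)$$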

This is roughly the understanding I have going into the next part (and please correct me if I am wrong about something!). Now, Bishop introduces a prior to obtain a posterior distribution, which is more robust to overfitting. For simplicity he uses an isotropic Gaussian prior:

$p(w|\alpha) = N(w|0, \alpha^{-1}I)$

I think $\alpha$ is a precision parameter, and precision is defined as $\frac{1}{\sigma^2}$; this would explain the $\alpha^{-1}$. We use the identity matrix $I$ because we want a diagonal (isotropic) covariance? I am not sure about either statement.
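
If that reading is right, then taking the log of this prior should already hint at the ridge penalty (this is my own reasoning, not a step from the book):

$$\ln p(w\mid\alpha) = -\frac{\alpha}{2} w^T w + \text{const},$$

i.e. the prior contributes exactly the quadratic weight penalty that ridge regression adds to the sum-of-squares error.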

Nevertheless, he claims that from the corresponding posterior distribution (which should be proportional to the likelihood multiplied by the prior above) we arrive at the posterior mean $m_N$ and covariance $S_N$:

$m_N = \beta S_N \Phi^T t$

$S_N^{-1} = \alpha I + \beta \Phi^T \Phi$

and I would love to know how we get from the likelihood multiplied by the prior to that result. Again, sorry for any mistakes. Lastly, the whole thing can be found on page 153.
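
For what it's worth, here is a small numerical check I put together (entirely my own sketch, with made-up data and parameter values): it plugs the two formulas above into numpy and confirms that $m_N$ coincides with the ridge regression solution with regularisation parameter $\lambda = \alpha/\beta$, which is what I expect the derivation to end up showing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: noisy targets from a sine, just to have something concrete.
N, degree = 20, 3
x = np.linspace(-1.0, 1.0, N)
t = np.sin(np.pi * x) + rng.normal(scale=0.1, size=N)

# Design matrix Phi with rows phi(x_n) = (1, x_n, ..., x_n^degree)
Phi = np.vstack([x**j for j in range(degree + 1)]).T

alpha = 2.0            # prior precision (my choice)
beta = 1.0 / 0.1**2    # noise precision, beta = 1 / sigma^2

# Bishop's posterior mean and inverse covariance
S_N_inv = alpha * np.eye(degree + 1) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)

# Ridge regression solution with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(m_N, w_ridge))  # True: the posterior mean is the ridge estimate
```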

kklaw
  • What is $\Phi$? – gunes May 28 '22 at 13:55
  • Basically $\Phi = X$, also sometimes called the design matrix, I think. It contains the basis function values for each $x_n \in X$ – kklaw May 28 '22 at 14:00
  • I guess he is maximising the posterior distribution with respect to the parameters, i.e. by differentiating (the log of) the posterior (same maxima whether log or not, but an easier calculation). – seanv507 May 28 '22 at 14:32
  • Indeed, I think so too, but I would just like to see the whole derivation, since I could not come up with it and the different parameters are also confusing – kklaw May 28 '22 at 14:33

0 Answers