Ridge regression, used to prevent overfitting, penalizes the linear regression coefficients $w_j$ when they grow too large. It is the solution to the problem
$$\arg\max_\textbf w \sum_{i=1}^N \ln \mathcal N(y_i|w_0+\textbf w^T\textbf x_i, \sigma^2)+\sum_{j=1}^D\ln\mathcal N(w_j|0,\tau^2)$$
(note the offset term $w_0$ is not regularized, since it only affects the height of the function, not its complexity).
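For concreteness, here is a minimal numerical sketch of this objective, assuming NumPy/SciPy and illustrative names (`X` an $N\times D$ array, `y` a length-$N$ array); nothing here is from the book itself.

```python
import numpy as np
from scipy.stats import norm

def log_posterior(w0, w, X, y, sigma, tau):
    # Gaussian log-likelihood of each y_i around the linear prediction w0 + w^T x_i
    loglik = norm.logpdf(y, loc=w0 + X @ w, scale=sigma).sum()
    # Gaussian prior on the slope coefficients only; w0 is deliberately unpenalized
    logprior = norm.logpdf(w, loc=0.0, scale=tau).sum()
    return loglik + logprior
```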
Expanding, I find that we want to maximize with respect to $\textbf w$
$$\begin{split}-\frac 1{2\sigma^2}\sum_{i=1}^N (y_i-w_0-\textbf w^T\textbf x_i)^2-\frac 1{2\tau^2}\sum _{j=1}^Dw_j^2&=-\frac 1{2\sigma^2}(y-w_0\textbf 1-\textbf X\textbf w)^T(y-w_0\textbf 1-\textbf X\textbf w)\\ &-\frac 1{2\tau^2}\textbf w^T\textbf w\end{split}$$
which, after multiplying through by $2\sigma^2$ and dropping the constant term $(y-w_0\textbf 1)^T(y-w_0\textbf 1)$, is the same as minimizing
$$-2(y-w_0\textbf 1)^T\textbf X\textbf w+\textbf w^T\textbf X^T\textbf X\textbf w+\lambda\textbf w^TI_D\textbf w$$
where $\lambda=\sigma^2/\tau^2$.
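As a quick sanity check on that expansion (a sketch only; `X`, `y`, `w0` and `lam` are arbitrary synthetic values), the full penalized residual sum of squares and the expanded expression should differ by a constant that does not depend on $\textbf w$, so they share the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w0, lam = 0.5, 2.0

def penalized_rss(w):
    # (y - w0*1 - Xw)^T (y - w0*1 - Xw) + lam * w^T w
    r = y - w0 - X @ w
    return r @ r + lam * w @ w

def expanded(w):
    # expanded form with the constant (y - w0*1)^T (y - w0*1) dropped
    return -2 * (y - w0) @ X @ w + w @ X.T @ X @ w + lam * w @ w

# the difference should be the dropped constant, the same for every w
for w in rng.normal(size=(3, D)):
    print(penalized_rss(w) - expanded(w))
```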
However, the Murphy book includes a factor of $\frac 1N$, as in minimizing $$J(\textbf w)=\frac 1N\sum_{i=1}^N (y_i-(w_0+\textbf w^T\textbf x_i))^2+\lambda \|\textbf w\|^2_2$$
which my derivation does not produce. Taking the derivative of my expression and setting it to $0$, we get
$$-2\textbf X^T(y-w_0\textbf 1)+2\textbf X^T\textbf X\textbf w+2\lambda I_D\textbf w=0$$
or
$$\textbf w=(\textbf X^T\textbf X+\lambda I_D)^{-1}\textbf X^T(y-w_0\textbf 1)$$
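The stationarity condition can also be checked numerically (again only a sketch with synthetic data, treating $w_0$ as fixed): the gradient above should vanish at the closed-form estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w0, lam = 1.0, 0.7

# closed-form estimate from the derivation above
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ (y - w0))

# gradient of the penalized objective; should be (numerically) zero at w_hat
grad = -2 * X.T @ (y - w0) + 2 * X.T @ X @ w_hat + 2 * lam * w_hat
print(np.allclose(grad, 0.0))  # expect True
```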
However, in the textbook the ridge regression coefficients are
$$\hat {\textbf w}_{ridge}=(\lambda \textbf I_D+\textbf X^T\textbf X)^{-1}\textbf X^T\textbf y$$
leaving out the intercept. I would like to be as precise as possible in these derivations. Is there a reason for the $\frac 1N$ term in $J(\textbf w)$, and why is $w_0\textbf 1$ left out of the final estimate? Which version is correct?