Ridge regression, used to prevent overfitting, penalizes the linear regression coefficients $w_j$ when they grow too large. It is the solution to the problem
$$\arg\max_\textbf w \sum_{i=1}^N \ln \mathcal N(y_i|w_0+\textbf w^T\textbf x_i, \sigma^2)+\sum_{j=1}^D\ln\mathcal N(w_j|0,\tau^2)$$
(note the offset term $w_0$ is not regularized, since it only affects the height of the function, not its complexity).
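For concreteness, here is a minimal numerical sketch of this objective, assuming NumPy/SciPy and illustrative names (`X` an $N\times D$ array, `y` a length-$N$ array); nothing here is from the book itself.

```python
import numpy as np
from scipy.stats import norm

def log_posterior(w0, w, X, y, sigma, tau):
    # Gaussian log-likelihood of each y_i around the linear prediction w0 + w^T x_i
    loglik = norm.logpdf(y, loc=w0 + X @ w, scale=sigma).sum()
    # Gaussian prior on the slope coefficients only; w0 is deliberately unpenalized
    logprior = norm.logpdf(w, loc=0.0, scale=tau).sum()
    return loglik + logprior
```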
Expanding, I find that we want to maximize with respect to $\textbf w$
$$\begin{split}-\frac 1{2\sigma^2}\sum_{i=1}^N (y_i-w_0-\textbf w^T\textbf x_i)^2-\frac 1{2\tau^2}\sum _{j=1}^Dw_j^2&=-\frac 1{2\sigma^2}(y-w_0\textbf 1-\textbf X\textbf w)^T(y-w_0\textbf 1-\textbf X\textbf w)\\ &-\frac 1{2\tau^2}\textbf w^T\textbf w\end{split}$$
which, after multiplying through by $2\sigma^2$ and dropping the constant term $(y-w_0\textbf 1)^T(y-w_0\textbf 1)$, is the same as minimizing
$$-2(y-w_0\textbf 1)^T\textbf X\textbf w+\textbf w^T\textbf X^T\textbf X\textbf w+\lambda\textbf w^TI_D\textbf w$$
where $\lambda=\sigma^2/\tau^2$.
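As a quick sanity check on that expansion (a sketch only; `X`, `y`, `w0` and `lam` are arbitrary synthetic values), the full penalized residual sum of squares and the expanded expression should differ by a constant that does not depend on $\textbf w$, so they share the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w0, lam = 0.5, 2.0

def penalized_rss(w):
    # (y - w0*1 - Xw)^T (y - w0*1 - Xw) + lam * w^T w
    r = y - w0 - X @ w
    return r @ r + lam * w @ w

def expanded(w):
    # expanded form with the constant (y - w0*1)^T (y - w0*1) dropped
    return -2 * (y - w0) @ X @ w + w @ X.T @ X @ w + lam * w @ w

# the difference should be the dropped constant, the same for every w
for w in rng.normal(size=(3, D)):
    print(penalized_rss(w) - expanded(w))
```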
However, the Murphy book includes a factor of $\frac 1N$, as in minimizing $$J(\textbf w)=\frac 1N\sum_{i=1}^N (y_i-(w_0+\textbf w^T\textbf x_i))^2+\lambda \|\textbf w\|^2_2$$
which my derivation does not produce. Taking the derivative of my expression and setting it to $0$, we get
$$-2\textbf X^T(y-w_0\textbf 1)+2\textbf X^T\textbf X\textbf w+2\lambda I_D\textbf w=0$$
or
$$\textbf w=(\textbf X^T\textbf X+\lambda I_D)^{-1}\textbf X^T(y-w_0\textbf 1)$$
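The stationarity condition can also be checked numerically (again only a sketch with synthetic data, treating $w_0$ as fixed): the gradient above should vanish at the closed-form estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w0, lam = 1.0, 0.7

# closed-form estimate from the derivation above
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ (y - w0))

# gradient of the penalized objective; should be (numerically) zero at w_hat
grad = -2 * X.T @ (y - w0) + 2 * X.T @ X @ w_hat + 2 * lam * w_hat
print(np.allclose(grad, 0.0))  # expect True
```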
However, in the textbook the ridge regression coefficients are
$$\hat {\textbf w}_{ridge}=(\lambda \textbf I_D+\textbf X^T\textbf X)^{-1}\textbf X^T\textbf y$$
leaving out the intercept. I would like to be as precise as possible in these derivations. Is there a reason for the $\frac 1N$ term in $J(\textbf w)$, and why is $w_0\textbf 1$ left out of the final estimate? Which version is correct?