
When reading The Elements of Statistical Learning, in Section 5.5, "Automatic Selection of the Smoothing Parameters", equation (5.26) gives a LOOCV formula for selecting the corresponding $\lambda$.

The background here is fitting a smoothing spline to the data $(y_i, x_i),\ i = 1,2,\dots,N$. To select the smoothing parameter, we use the LOOCV criterion

\begin{equation} \text{CV}(\hat{f}_{\lambda}) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{f}_{\lambda}^{(-i)}(x_i)\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i - \hat{f}_{\lambda}(x_i)}{1-S_{\lambda}(i,i)}\right)^2 \end{equation}

where $\hat{f}_{\lambda}^{(-i)}$ denotes the fit obtained when the model is estimated without $(y_i, x_i)$, and $S_{\lambda}(i,i)$ is the $i$th diagonal element of the smoothing matrix $\mathbf{S}_{\lambda}$ computed from the whole data set, with $\mathbf{S}_{\lambda} = (\mathbf{I} + \lambda \mathbf{K})^{-1} = \mathbf{N}(\mathbf{N}^T\mathbf{N} + \lambda \Omega)^{-1} \mathbf{N}^T$.

(The notation is consistent with Chapter 5 of The Elements of Statistical Learning.)
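As a sanity check, the second equality can be verified numerically for any linear smoother of the form $\mathbf{N}(\mathbf{N}^T\mathbf{N} + \lambda \Omega)^{-1}\mathbf{N}^T y$. The sketch below uses an illustrative polynomial basis and second-difference penalty (hypothetical choices, not the natural-spline basis of the book) and compares brute-force leave-one-out residuals with the shortcut on the right-hand side.

```python
import numpy as np

# Numerical check of the LOOCV shortcut (5.26) for a linear smoother of the
# form f_hat = N (N^T N + lambda * Omega)^{-1} N^T y. The basis and penalty
# below are illustrative stand-ins, not the natural cubic spline basis.
rng = np.random.default_rng(0)
n, p, lam = 40, 8, 0.7

x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

N = np.vander(x, p, increasing=True)      # basis matrix (n x p), assumption
D = np.diff(np.eye(p), n=2, axis=0)       # second differences on coefficients
Omega = D.T @ D                           # penalty matrix (p x p), assumption

A = N.T @ N + lam * Omega
S = N @ np.linalg.solve(A, N.T)           # smoother matrix S_lambda
f_hat = S @ y

# Brute-force leave-one-out fits: drop row i from the least-squares part,
# keep the basis and penalty fixed, then predict at x_i.
loo_resid = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    A_i = N[mask].T @ N[mask] + lam * Omega
    beta_i = np.linalg.solve(A_i, N[mask].T @ y[mask])
    loo_resid[i] = y[i] - N[i] @ beta_i

shortcut = (y - f_hat) / (1 - np.diag(S))
print(np.allclose(loo_resid, shortcut))   # True: both sides of (5.26) agree
```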

I wonder how to prove the second equality, so I searched online and found the post cited below helpful. But when I tried to carry the argument over to the smoothing matrix, I ran into difficulties and could not finish the proof.

I can only prove that

\begin{equation} (\mathbf{N}_{(t)}^T\mathbf{N}_{(t)} + \lambda \Omega)^{-1} \mathbf{N}_t^T = (\mathbf{N}^T\mathbf{N} + \lambda \Omega)^{-1} \mathbf{N}_t^T \left(\frac{1}{1-S_{\lambda}(t,t)}\right) \end{equation}

where the subscript $(t)$ denotes the matrix with the $t$th row removed, while the subscript $t$ denotes the $t$th row alone.
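This partial identity does hold numerically; the sketch below (again with an illustrative basis and penalty, which are arbitrary choices) checks it for a single left-out index.

```python
import numpy as np

# Numerical check of the partial identity above; N and Omega are illustrative.
rng = np.random.default_rng(1)
n, p, lam, t = 30, 6, 0.5, 7

N = np.vander(np.sort(rng.uniform(0, 1, n)), p, increasing=True)
D = np.diff(np.eye(p), n=2, axis=0)
Omega = D.T @ D

A = N.T @ N + lam * Omega
S_tt = N[t] @ np.linalg.solve(A, N[t])    # S_lambda(t, t)

mask = np.arange(n) != t
A_t = N[mask].T @ N[mask] + lam * Omega   # leave out the t-th row

lhs = np.linalg.solve(A_t, N[t])          # (N_(t)^T N_(t) + lam Omega)^{-1} N_t^T
rhs = np.linalg.solve(A, N[t]) / (1 - S_tt)
print(np.allclose(lhs, rhs))              # True
```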

Could anyone help finish the proof, or prove the formula by some other method? I'd be grateful.

Helpful post:

Clarinetist, "Proof of LOOCV formula", Cross Validated (version: 2015-08-04), https://stats.stackexchange.com/q/164223 (author profile: https://stats.stackexchange.com/users/46427/clarinetist).

Li Xin

1 Answer


I do not know if you are still interested in this, but what follows might help you prove the results you mention in the question (...I hope I got everything right...the notation is slippery ^_^).

Here I will use slightly more general notation, since the results you mention hold not only for the smoothing-spline case.

  • We can consider a ridge regression problem $\displaystyle \min_{\beta} S(\beta; \lambda) = \| y - X \beta \|^{2} + \lambda \|\beta\|^{2}$ (if $\lambda = 0$ you get the same results as the answer you cite).
  • The LOOCV criterion in this case tells us to select the regularization parameter that minimizes $\text{LOOCV}(\lambda) = \displaystyle \sum_{i=1}^{N}\big(y_{i} - X_{i} \hat{\beta}_{-i}(\lambda)\big)^{2}$
  • We define $X_{i}$ as the $i$th row of the model matrix $X$
  • We define $\hat{\beta}_{-i}(\lambda) = (X_{-i}^{\top} X_{-i} + \lambda I)^{-1} X_{-i}^{\top} y_{-i}$ as the coefficients estimated when the $i$th observation is left out
  • We define $X_{-i}$ as the model matrix with the $i$th row left out (an analogous definition holds for $y_{-i}$)

To derive the result you are interested in, we need to find an 'analytic' expression for $\hat{\beta}_{-i}$. The proof proceeds more or less as follows (I will skip some details):

  • The inverse appearing in $\hat{\beta}_{-i}$ can be shown to be equal to (this is an application of the Woodbury identity) $$ (X_{-i}^{\top} X_{-i} + \lambda I)^{-1} = (X^{\top} X + \lambda I)^{-1} + (X^{\top} X + \lambda I)^{-1} X_{i}^{\top} [1-H_{ii}(\lambda)]^{-1} X_{i}(X^{\top} X + \lambda I)^{-1} $$ where $H_{ii}(\lambda) = X_{i}(X^{\top} X + \lambda I)^{-1} X_{i}^{\top}$

  • Plugging the result above into the equation for the LOO estimator, after some algebra, we get: $$ \hat{\beta}_{-i}(\lambda) = \hat{\beta}(\lambda) - (X^{\top} X + \lambda I)^{-1} X_{i}^{\top} [1-H_{ii}(\lambda)]^{-1} (y_{i} - X_{i} \hat{\beta}(\lambda)) $$

  • Plugging this result into the equation for the LOOCV criterion, after some algebra, we notice that $$ y_{i} - X_{i} \hat{\beta}_{-i}(\lambda) = [1-H_{ii}(\lambda)]^{-1} (y_{i} - X_{i}\hat{\beta}(\lambda)) $$ and hence the result we are looking for follows: $$ \text{LOOCV}(\lambda) = \sum_{i=1}^{N} \frac{(y_{i} - X_{i}\hat{\beta}(\lambda))^{2}}{(1 - H_{ii}(\lambda))^{2}} $$
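If it is useful, here is a minimal numerical sketch of the final identity in the ridge case; the data, dimensions, and $\lambda$ are arbitrary illustrative choices, and brute-force leave-one-out refits are compared with the shortcut residuals.

```python
import numpy as np

# Check: y_i - X_i beta_hat_{-i} = (y_i - X_i beta_hat) / (1 - H_ii(lambda)),
# so LOOCV(lambda) can be computed from a single ridge fit.
rng = np.random.default_rng(42)
n, p, lam = 50, 5, 2.0

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

A = X.T @ X + lam * np.eye(p)
beta_hat = np.linalg.solve(A, X.T @ y)
H_diag = np.einsum('ij,ji->i', X, np.linalg.solve(A, X.T))   # H_ii(lambda)

# Brute-force leave-one-out ridge refits
loo_resid = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(p),
                          X[mask].T @ y[mask])
    loo_resid[i] = y[i] - X[i] @ b_i

shortcut = (y - X @ beta_hat) / (1 - H_diag)
print(np.allclose(loo_resid, shortcut))   # True; LOOCV(lambda) = sum(shortcut**2)
```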

Again, I skip some details here (mainly tedious algebra), but these are the lines you could follow to prove the results you are interested in. I hope this helps.

Gi_F.
  • Thank you! There's only a small typo in the formula using the Woodbury identity. I think there should be an $X_i$ after $[1-H_{ii}(\lambda)]^{-1}$. – Li Xin Jan 03 '21 at 13:15
  • Again, there's another small typo in the last formula; I think it should be $\sum_{i=1}^{N} [1-H_{ii}(\lambda)]^{-2}(y_i - X_i\hat{\beta}(\lambda))^2$. – Li Xin Jan 03 '21 at 13:38