Suppose we know our data came from a linear model, and that the model is well specified. Assume our evaluation criterion is the L2 loss between the predictions and the true values. Normally we would use least squares as the estimation procedure. I am wondering whether there exist other methods that can beat least squares in terms of L2 loss?
Does this answer your question? A universal measure of the accuracy of linear regression models. My "EDIT 2" gives a counterexample with real data. – Dave Jul 15 '22 at 18:34
1 Answer
Yes, in general this is the idea behind the area of regularized regression. It is actually quite instructive to see how this works in the simplest case.
Assume for simplicity that the independent variable $x$ is one-dimensional and distributed according to some known distribution $x\sim p(x)$. Assume also that the dependent variable satisfies $y_i=x_i\beta+\epsilon_i$ for some $\beta$ and i.i.d. normal noise $\epsilon_i$ with variance $\sigma^2$. For notational convenience, define $G:=\sum_i x_i^2$.
Given $n$ observations, the estimated ridge coefficient is given by $\hat{\beta}={\frac {\sum_i x_iy_i}{\lambda+G}}=\beta{\frac {G}{\lambda+G}}+{\frac {\sum_i x_i\epsilon_i}{\lambda+G}}$, where $\lambda$ is the regularization strength. And the test error (empirical risk) is given by
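To make this concrete, here is a minimal NumPy sketch of the closed-form 1-D estimator (my own illustration, not part of the answer; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, sigma, n = 2.0, 1.0, 50
x = rng.normal(size=n)                          # x ~ p(x), here standard normal
y = beta * x + rng.normal(scale=sigma, size=n)  # y_i = x_i * beta + eps_i

G = np.sum(x**2)                                # G := sum_i x_i^2

def ridge_beta(lam):
    """Closed-form 1-D ridge estimator: sum_i x_i y_i / (lam + G)."""
    return np.sum(x * y) / (lam + G)

print(ridge_beta(0.0))   # lambda = 0 recovers the OLS estimate
print(ridge_beta(0.5))   # lambda > 0 shrinks the estimate toward zero
```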
$$MSE(y)=E_{x\sim p(x)}(\hat{\beta}x-\beta x)^2=(\beta-\hat{\beta})^2 V_p$$
where $V_p$ is the second moment of the distribution $p$. WLOG, we may assume that $V_p=1$.
Note that $MSE(y)$ depends on the actual observed values $y_i$; averaging over the sample noise, we obtain the average estimation error of the ridge estimator:
\begin{eqnarray*} MSE&=&E_{\epsilon}MSE(y)\\ &=&E_{\epsilon} (\beta-\hat{\beta})^2\\ &=&E_{\epsilon} \left(\beta \left(1-{\frac {G}{\lambda+G}}\right)-{\frac {\sum_i x_i\epsilon_i}{\lambda+G}}\right)^2\\ &=&\beta^2\left(1-{\frac {G}{\lambda+G}}\right)^2- 2\beta\left(1-{\frac {G}{\lambda+G}}\right)E_{\epsilon}\left[{\frac {\sum_i x_i\epsilon_i}{\lambda+G}}\right]+ E_{\epsilon} \left({\frac {\sum_i x_i\epsilon_i}{\lambda+G}}\right)^2\\ & = & \beta^2\left(1-{\frac {G}{\lambda+G}}\right)^2+E_{\epsilon} \left({\frac {\sum_i x_i\epsilon_i}{\lambda+G}}\right)^2\qquad \text{(the cross term vanishes since } E\epsilon_i=0\text{)}\\ & = & \beta^2\left(1-{\frac {G}{\lambda+G}}\right)^2+\sigma^2\sum_i \left({\frac {x_i}{\lambda+G}}\right)^2\qquad \text{(by independence of the } \epsilon_i\text{)}\\ & = & \beta^2\left(1-{\frac {G}{\lambda+G}}\right)^2+\sigma^2{\frac {G}{(\lambda+G)^2}}\\ & = & \beta^2-2\beta^2{\frac {G}{\lambda+G}}+{\frac {\beta^2G^2+\sigma^2 G}{(\lambda+G)^2}} \end{eqnarray*}
By elementary calculus (spelled out below), the minimizer of this function is given by $$\lambda_{min}=\sigma^2/\beta^2,$$ with the corresponding error given by
$$MSE_{\lambda=\lambda_{min}}= {\frac {\sigma^2\beta^2}{\sigma^2+G\beta^2}}$$
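To spell out the calculus step (my expansion, writing $u:=\lambda+G$): the last line of the derivation becomes $MSE(u)=\beta^2-2\beta^2 G/u+(\beta^2G^2+\sigma^2G)/u^2$, whose derivative is
$$\frac{d\,MSE}{du}=\frac{2\beta^2G}{u^2}-\frac{2(\beta^2G^2+\sigma^2G)}{u^3}=\frac{2G}{u^3}\left(\beta^2u-\beta^2G-\sigma^2\right).$$
This vanishes at $u=G+\sigma^2/\beta^2$, i.e. $\lambda_{min}=\sigma^2/\beta^2$ (a minimum, since the derivative changes sign from negative to positive there); substituting back gives the displayed error.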
By contrast, the MSE for OLS is given by taking $\lambda=0$, i.e. $$MSE_{OLS}=MSE_{\lambda=0}=\sigma^2/G$$
Therefore, unless there is no observation noise ($\sigma=0$), the optimal $\lambda$ will be strictly positive, and the corresponding ridge estimator will attain a strictly lower MSE than the OLS estimator.
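As a sanity check, here is a small Monte Carlo simulation (my own sketch, not part of the derivation; it draws $x\sim N(0,1)$ so that $V_p=1$, and uses the oracle value $\lambda=\sigma^2/\beta^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
beta, sigma, n, trials = 1.0, 2.0, 20, 100_000
lam_opt = sigma**2 / beta**2                 # oracle regularization strength

mse_ols = mse_ridge = 0.0
for _ in range(trials):
    x = rng.normal(size=n)
    y = beta * x + rng.normal(scale=sigma, size=n)
    G = np.sum(x**2)
    b_ols = np.sum(x * y) / G                # lambda = 0
    b_ridge = np.sum(x * y) / (lam_opt + G)  # lambda = sigma^2 / beta^2
    mse_ols += (b_ols - beta) ** 2
    mse_ridge += (b_ridge - beta) ** 2

# With V_p = 1, the prediction MSE equals the squared estimation error.
print("OLS  :", mse_ols / trials)            # averages sigma^2 / G over draws of x
print("ridge:", mse_ridge / trials)          # strictly smaller
```

Since the analytical comparison holds conditionally on every draw of $x$, it also holds on average over $x$, which is what the simulation reports.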
Of course, in practice the above formula for $\lambda$ is not very helpful, since it requires knowing $\beta$ (and $\sigma$), and $\beta$ is precisely what we are trying to estimate in the first place. However, there are practical techniques for choosing $\lambda$, such as cross-validation.
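For instance, a minimal sketch using scikit-learn's RidgeCV (the data-generating values and the grid of candidate $\lambda$'s are arbitrary choices of mine):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
beta, sigma, n = 1.0, 2.0, 20
x = rng.normal(size=(n, 1))                  # design matrix with one feature
y = beta * x.ravel() + rng.normal(scale=sigma, size=n)

# RidgeCV picks lambda by (leave-one-out) cross-validation over a
# user-supplied grid; no knowledge of beta or sigma is required.
model = RidgeCV(alphas=np.logspace(-3, 3, 50), fit_intercept=False).fit(x, y)
print("selected lambda:", model.alpha_)
print("ridge estimate :", model.coef_[0])
```

Note that scikit-learn calls the regularization strength `alpha`; with `fit_intercept=False` its penalized objective matches the parametrization used above.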
@Dave No, I don't think so; I actually found that a bit surprising. – Simon Segert Jul 16 '22 at 00:51