
The claim is that, for a regression task, the conditional regression function $f(x) = \mathbb{E}[\mathbf{Y}\mid\mathbf{X}=x]$ minimizes the L2 loss $\mathbb{E}\big[(\mathbf{Y} - f(\mathbf{X}))^2\big]$ over all functions $f$. I can see why it's true when the distribution is normal. But why is it true in general?

Valentin Iovene

1 Answer


To find $\arg\min_\theta \mathbb{E}\big[(\mathbf{Y} - f(\mathbf{X}\mid\theta))^2\big]$ you get the first order condition $$\frac{\partial}{\partial\theta}\,\mathbb{E}\big[(\mathbf{Y} - f(\mathbf{X}\mid\theta))^2\big]=0,$$ where the differentiation is with respect to the parameters $\theta$ of your model $f(x\mid\theta)$. Differentiating under the expectation: $$\mathbb{E}\left[\frac{\partial}{\partial\theta}\big(\mathbf{Y} - f(\mathbf{X}\mid\theta)\big)^2\right]= -2\,\mathbb{E}\left[\big(\mathbf{Y} - f(\mathbf{X}\mid\theta)\big)\,\frac{\partial}{\partial\theta}f(\mathbf{X}\mid\theta)\right].$$ Note that $\partial f(\mathbf{X}\mid\theta)/\partial\theta$ is itself a function of $\mathbf{X}$, so it cannot be pulled outside the expectation. Instead, condition on $\mathbf{X}$ (tower property): $$-2\,\mathbb{E}\left[\mathbb{E}\big[\mathbf{Y} - f(\mathbf{X}\mid\theta)\,\big|\,\mathbf{X}\big]\,\frac{\partial}{\partial\theta}f(\mathbf{X}\mid\theta)\right].$$ This is guaranteed to vanish, whatever $\partial f/\partial\theta$ is, when $$\mathbb{E}\big[\mathbf{Y} - f(\mathbf{X}\mid\theta)\,\big|\,\mathbf{X}\big]=0, \qquad\text{i.e.}\qquad f(\mathbf{X}\mid\theta)= \mathbb{E}[\mathbf{Y}\mid\mathbf{X}].$$ More directly, since $f$ is unrestricted in your question, the loss decouples over $x$: minimizing $\mathbb{E}\big[(\mathbf{Y}-c)^2\mid\mathbf{X}=x\big]$ over the number $c=f(x)$ gives the first order condition $\mathbb{E}[-2(\mathbf{Y}-c)\mid\mathbf{X}=x]=0$, hence $c=\mathbb{E}[\mathbf{Y}\mid\mathbf{X}=x]$, and the second derivative $2>0$ shows it is a minimum. No normality is used anywhere; only a finite second moment of $\mathbf{Y}$.
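As an empirical sanity check, here is a minimal simulation sketch; the data-generating process (a hypothetical $\mathbb{E}[\mathbf{Y}\mid\mathbf{X}=x]=x^2$ with deliberately non-normal noise) is assumed purely for illustration. The conditional mean should attain the smallest empirical L2 loss among the candidate predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X ~ Uniform(0, 2), Y = X**2 + mean-zero exponential noise,
# so E[Y | X = x] = x**2 and the noise is clearly non-normal.
n = 200_000
x = rng.uniform(0.0, 2.0, size=n)
y = x**2 + (rng.exponential(scale=1.0, size=n) - 1.0)

def mse(pred):
    """Empirical L2 loss, i.e. the sample average of (Y - f(X))^2."""
    return np.mean((y - pred) ** 2)

# The conditional mean should beat any other predictor, biased or linear.
print("conditional mean  x**2       :", mse(x**2))
print("biased predictor  x**2 + 0.5 :", mse(x**2 + 0.5))
print("linear predictor  2x - 0.5   :", mse(2 * x - 0.5))
```

On this setup the conditional mean's loss is roughly the noise variance (about 1), while the other predictors come out strictly larger, as the derivation above predicts.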

Aksakal