Elements of Statistical Learning - Statistical Decision Theory : step 2.10 to 2.11 minimization of EPE

Question

Questions about that section of the book (see image below) have already been asked. They cover how to get from 2.9 to 2.11 (see here) and from 2.12 to 2.13 (see here), but they do not cover how to get from 2.11 to 2.12!

2.11 can be written as

\begin{align*} & E_{X}E_{Y|X}([Y - f(X)]^2| X = x) \\ &= \int_x \left( E_{Y|X}([Y - f(X)]^2|X = x) \right) p(x)dx\\ &= \int_x\left( \int_y [y - f(x)]^2p(y|x)dy \right)p(x)dx \end{align*}

They just say "we see that it is sufficient to minimize EPE point-wise" and drop the integration over x and THEN do a point-wise minimization.

But how come they can just drop the whole $x$ integration? Why is it sufficient to minimize 2.11 point-wise?

You want to minimise $\text{E}_X \text{E}_{Y|X}([Y-f(X)]^2|X)$. All that $(2.12)$ says is that this happens when $\text{E}_{Y|X}([Y-f(X)]^2|X)$ is minimised for each $X$ and then $(2.13)$ says this is when $f(x) = \text{E}(Y|X=x)$ — Henry, Aug 14 '22 at 15:03
You mean minimized for each $x$? I get that for one fixed $x$, $f(x)=E(Y \mid X=x)$ minimizes 2.11. So you are saying that because the choice of $x$ is arbitrary, $f$ minimizes 2.11 in general? — newandlost, Aug 14 '22 at 15:17
OK, so $\text{E}_{Y\mid X=x}([Y-f(X)]^2\mid X=x)$ is minimised for that $x$ when $f(x) = \text{E}(Y\mid X=x)$ and so $\text{E}_{X}\text{E}_{Y\mid X}([Y-f(X)]^2\mid X)$ is minimised when $f(X) = \text{E}(Y\mid X)$ — Henry, Aug 14 '22 at 21:31

score 2 · Accepted Answer · answered Aug 15 '22 at 03:21

Focusing on the step of interest: $E_X\left[E_{Y|X} \left( [Y-f(X)]^2 | X \right) \right]$ can also be written $E_X\left[E_{Y|X} \left( [Y-f(x)]^2 | X=x \right) \right]$. By conditioning on $X=x$, $f(x)$ becomes a constant with respect to $Y|X$. At each value of $X$ in $\{x_1, x_2, x_3, \dots\}$, if we can find values $f(x_1), f(x_2), f(x_3), \dots$ such that the desired quantity is minimized, then we have determined $f$ point-wise, and possibly an $f(x)$ that can be expressed analytically, as we normally do with a function.

We also recognize that, to minimize $E_Z(g(Z))$, it is sufficient to minimize $g(z)$ at each $z$. There is nothing else that can be changed in the $E_Z()$: we must integrate over the entire domain, and we take the p.d.f. of $Z$ as given.
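One way to spell this out is the inequality below. It is only a sketch: $g_f$ and $f^*$ are shorthand introduced here, not notation from the book, and it assumes that $f$ is unrestricted, so that $f(x)$ can be chosen separately at each $x$.

\begin{align*}
g_f(x) &:= E_{Y|X}\left([Y - f(x)]^2 \mid X = x\right), \qquad f^*(x) := \arg\min_{c}\, E_{Y|X}\left([Y - c]^2 \mid X = x\right) \\
g_f(x) &\ge g_{f^*}(x) \quad \text{for every } x \text{ and every } f \\
EPE(f) &= \int_x g_f(x)\, p(x)\, dx \;\ge\; \int_x g_{f^*}(x)\, p(x)\, dx = EPE(f^*) \quad \text{since } p(x) \ge 0
\end{align*}

So no $f$ can achieve a smaller integral than the function assembled from the point-wise minimizers, which is why the integration over $x$ can be set aside while minimizing.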

Putting it all together,

$$f: \min_f EPE(f) = \min_f E_X \left[ E_{Y|X} \left([Y-f(x)]^2|X=x\right) \right]$$

$EPE(f)$ is minimized when $E_{Y|X} \left([Y-f(x)]^2|X=x\right)$ is minimized.

$E_{Y|X} \left([Y-f(x)]^2|X=x\right)$ is minimized at $f(x) = E_{Y|X}(Y|X=x)$.
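If it helps, the argument can be checked numerically on a small example. The sketch below is not from the book: the joint distribution (`p_x`, `p_y_given_x`) is made up for illustration, $X$ is taken to be discrete so the integral over $x$ becomes a sum, and the candidate values of $f(x)$ are restricted to a coarse grid. It compares minimizing at each $x$ separately with a brute-force search over all functions on the grid.

```python
from itertools import product

# Hypothetical toy distribution (an assumption for illustration only):
# p_x[x] = P(X = x), p_y_given_x[x][y] = P(Y = y | X = x)
p_x = {0: 0.5, 1: 0.3, 2: 0.2}
p_y_given_x = {
    0: {0.0: 0.5, 2.0: 0.5},
    1: {1.0: 0.25, 5.0: 0.75},
    2: {-1.0: 0.4, 4.0: 0.6},
}


def inner_risk(x, c):
    """E[(Y - c)^2 | X = x]: the inner expectation with f(x) held at the constant c."""
    return sum(p * (y - c) ** 2 for y, p in p_y_given_x[x].items())


def epe(f):
    """EPE(f) = sum_x p(x) * E[(Y - f(x))^2 | X = x], with f given as a dict x -> f(x)."""
    return sum(p_x[x] * inner_risk(x, f[x]) for x in p_x)


def conditional_mean(x):
    """E[Y | X = x]."""
    return sum(p * y for y, p in p_y_given_x[x].items())


# Candidate values for f(x): -2.0, -1.75, ..., 6.0
grid = [i / 4 for i in range(-8, 25)]

# (a) Point-wise: minimize the inner expectation separately at each x.
best_pointwise = {x: min(grid, key=lambda c, x=x: inner_risk(x, c)) for x in p_x}

# (b) Jointly: brute-force search over every function on the grid, minimizing EPE.
best_joint = min(
    (dict(zip(p_x, values)) for values in product(grid, repeat=len(p_x))),
    key=epe,
)

# The conditional mean at each x, for comparison.
f_star = {x: conditional_mean(x) for x in p_x}

print("point-wise minimizer:", best_pointwise)
print("joint minimizer:     ", best_joint)
print("conditional means:   ", f_star)
```

In this toy setting the three dictionaries should agree: the joint search has nothing to gain from trading off error between different values of $x$, because each $f(x)$ only enters the term for its own $x$.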
