3

With reference to the Expected Prediction Error derivation on page 18, Section 2.4, of The Elements of Statistical Learning:

I have been able to follow the derivation up to equation 2.11, but I am struggling to understand steps 2.12 and 2.13.

My understanding:

  • We intend to minimize the EPE; as $E_X$ is constant, we focus on minimizing $E_{Y|X}([Y-f(X)]^2 \mid X)$.
  • Doubt in step 2.12: to minimize the EPE, the difference between $Y$ (the actual value) and $f(X)$ (the predicted value) should be minimized. However, equation 2.12 minimizes over $c$.
  • Please guide me on understanding this: my understanding is that with a small $c$, $[Y-c]^2$ will become larger.
  • Additionally, I could not figure out the development of 2.13 from 2.12.

Please correct me wherever my assumptions and/or understanding are incorrect.

P.S.: I studied Probability (Tsitsiklis) and Linear Algebra (David C. Lay) before moving to ESL.

Santo
  • Please make an effort to make your post self-contained. – Christoph Hanck Jun 20 '17 at 09:37
  • added screen-shot for reference - as suggested by Christoph Hanck – Santo Jun 20 '17 at 10:01
  • See https://stats.stackexchange.com/questions/262837/elements-of-statistical-learning-expected-prediction-error-derivation – Christoph Hanck Jun 20 '17 at 14:08
  • And https://stats.stackexchange.com/questions/92180/expected-prediction-error-derivation/102662#102662 – Christoph Hanck Jun 20 '17 at 14:08
  • Thanks! I think I've got it, from the above links and https://stats.stackexchange.com/questions/92180/expected-prediction-error-derivation/102662#102662 – Santo Jun 20 '17 at 16:41
  • (2.12) is not "minimizing $c$" – Juho Kokkala Jun 21 '17 at 16:58
  • Christoph/Juho and members, my final understanding is that $\operatorname{argmin}_c$ in 2.12 (ref. above) means: "determine $c$ so that $[Y-c]^2$ is minimized." Further, I'm still not sure about 2.13; I assumed that $E([Y-c]^2 \mid X=x)$ in 2.12 is a function of $Y$, and so is replaced by $E(Y \mid X=x)$ in 2.13. Please correct my understanding wherever it is faulty; please bear with me, as I am opening these books after about two decades. – Santo Jun 22 '17 at 03:25
  • I don't get how to get from 2.10 to 2.11! $E_X$ is not a constant that can be dropped; see your own question and answer here: https://stats.stackexchange.com/questions/92180/expected-prediction-error-derivation/102662#102662 – newandlost Aug 14 '22 at 14:12

3 Answers

3

Let $H$ be any set of functions of $x$. Then, for each $h\in H$, $\int h(x)\,dx \ge \int \inf_{g\in H} g(x)\,dx$. Sometimes, as in the current situation, the function $\lambda$, given by $\lambda(x)=\inf_{g\in H}g(x)$, is already in $H$, in which case the least value of the integral of $h$, as $h$ varies over $H$, is given by taking $h=\lambda$. We will define $H$ in the current situation, then compute $\lambda$ and then check that $\lambda\in H$.

Here $H$ is the set of all functions $h$ given by $h(x) = E_{Y|X}((Y-f(x))^2 \mid X=x)$, where $f$ is some measurable function of $x$. Measurability imposes no restriction on the possible values of $c=f(x)$, so $$ \lambda(x) = \inf_{h\in H} h(x) = \inf_c E_{Y|X}((Y-c)^2 \mid X=x),$$ which is the minimum of a quadratic function of $c$. We write this quadratic function as $A - 2Bc + c^2$, where $A$ and $B$ are independent of $c$. This has its minimum when $c = B = E_{Y|X}(Y \mid X=x)$. So we certainly cannot do any better than defining $f(x) = c = E_{Y|X}(Y \mid X=x)$. Since $f$ is a measurable function, we do have $\lambda\in H$, and this choice of $f$ achieves the required minimum in 2.11.
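To make the quadratic explicit: expanding $(Y-c)^2$ and taking the conditional expectation gives $A = E_{Y|X}(Y^2 \mid X=x)$ and $B = E_{Y|X}(Y \mid X=x)$, and completing the square yields $$E_{Y|X}((Y-c)^2 \mid X=x) = A - 2Bc + c^2 = (c-B)^2 + (A - B^2).$$ This is smallest exactly at $c=B$, and the leftover $A-B^2$ is the conditional variance $\operatorname{Var}(Y \mid X=x)$, the part that no choice of $c$ can remove.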

  • (+1) Totally unrelated but I have to ask based on your location: Are you "really, really good" in Geometry? – usεr11852 Jul 02 '17 at 00:26
  • @usεr11852 Not that good any more (too old) – David Epstein Jul 02 '17 at 11:31
  • I still don't fully understand it: assuming $E_{Y|X}([Y-c]^2 \mid X=x) = \int (y^2+c^2-2yc)\,p(y\mid x)\,dy$, we can get the minimum by taking the derivative, setting it to zero, and solving for $c$, which yields $c = \frac{E_{Y|X}(Y \mid X=x)}{\int p(y\mid x)\,dy}$. Where is my mistake? – newandlost Aug 14 '22 at 13:53
  • I still don't understand how to get from 2.11 to 2.12; $E_X$ is not a constant, since, if I understand correctly, it is the expectation of everything that follows to the right of that symbol. I mean, expression 2.11 is $\int \big(\int (y-f(x))^2\, p(y\mid x)\,dy\big)\, p(x)\,dx$, so how can we simply drop the integration over $x$? – newandlost Aug 14 '22 at 14:01
0

The integral of a probability density function is always 1, so the integral of $p(y \mid x)$ with respect to $y$ equals 1. Your formula is not a mistake; it gives the same answer as the text. However, your discussion is a bit too special: in general (for example, probabilities when rolling a die) there is no density $p(y \mid x)$, and you have to argue without assuming such a function exists.
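Concretely, once the denominator is recognized as $1$, the formula from the comment reduces to $$c = \frac{\int y\,p(y\mid x)\,dy}{\int p(y\mid x)\,dy} = \int y\,p(y\mid x)\,dy = E_{Y|X}(Y \mid X=x),$$ which is exactly (2.13).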

0

Here is my attempt, although it is 5+ years late.

By conditioning on $X$, we can write the EPE as $$\mathrm{EPE}(f) = E_X\, E_{Y|X}\big([Y-f(X)]^2 \mid X\big), \tag{2.11}$$

due to the law of total expectation (iterated expectations), a topic covered in your Math Stats I.

and we see that it suffices to minimize the EPE pointwise: $$f(x) = \operatorname*{argmin}_c\, E_{Y|X}\big([Y-c]^2 \mid X=x\big), \tag{2.12}$$

because for all values of $x$ the inner expectation is non-negative, so minimizing the inner expectation at each value of $x$ minimizes the outer integral of all the inner expectations, i.e., the outer expectation. And (2.12) is just that.
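In symbols, assuming (as in the comments above) that a conditional density $p(y \mid x)$ exists: $$\mathrm{EPE}(f) = \int\!\Big(\int (y-f(x))^2\, p(y\mid x)\,dy\Big)\, p(x)\,dx \;\ge\; \int\!\Big(\inf_c \int (y-c)^2\, p(y\mid x)\,dy\Big)\, p(x)\,dx,$$ with equality when $f(x)$ attains the inner infimum at every $x$; that is exactly the pointwise criterion (2.12).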

The solution is $$f(x) = E(Y \mid X=x). \tag{2.13}$$

I bet you have seen this argument over and over again, but here it is yet again:

$E_{Y|X}([Y-c]^2|X=x)$

$=E_{Y|X}\big([Y-E(Y\mid X=x)+E(Y\mid X=x)-c]^2 \mid X=x\big)$

And then you expand out the square and get two squared terms and a cross term; the cross term turns out to be zero (it is written out below). What is left are the two squared terms: one is the conditional variance of $Y$ (there is no way to reduce this term), and the other is $[E_{Y|X}(Y \mid X=x)-c]^2$, which is minimized by taking $c = E_{Y|X}(Y \mid X=x)$. So the solution of (2.12) is (2.13).
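Writing $m(x)$ for $E_{Y|X}(Y \mid X=x)$ to keep the algebra readable, the full expansion is $$E_{Y|X}\big([Y-c]^2 \mid X=x\big) = \underbrace{E_{Y|X}\big([Y-m(x)]^2 \mid X=x\big)}_{\text{conditional variance of } Y} + \underbrace{2\,[m(x)-c]\;E_{Y|X}\big(Y-m(x) \mid X=x\big)}_{=\,0} + [m(x)-c]^2,$$ where the cross term vanishes because $E_{Y|X}(Y-m(x) \mid X=x) = m(x) - m(x) = 0$, and the final term is zero precisely when $c = m(x)$.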

hyena