
I'm reading the first chapter of *Understanding Machine Learning: From Theory to Algorithms*, and it says:

Let $H_B$ be the set of "bad" hypotheses, that is

$H_B=\left\{h \in H : L_{(D,f)}(h) > \epsilon \right\}$ ($\epsilon$ is the accuracy parameter)

Let

$M=\left\{S|_x : \exists \ h \in H_B, L_S(h) =0 \right\}$

be the set of misleading samples: namely, for every $S|_x \in M$, there is a "bad" hypothesis, $h \in H_B$, that looks like a "good" hypothesis on $S|_x$. Now, recall that we would like to bound the probability of the event $L_{(D,f)}(h_S) > \epsilon$. But, since the realizability assumption implies that $L_S(h_S)=0$, it follows that the event $L_{(D,f)}(h_S) > \epsilon$ can only happen if for some $h \in H_B$ we have $L_S(h) = 0$. In other words, this event will only happen if our sample is in the set of misleading samples $M$. Formally, we have shown that

$\left\{S|_x : L_{(D,f)}(h_S) > \epsilon \right\} \subseteq M$

I'm so confused about this conclusion. Can someone please explain this to me? Thanks for your time!

Cypherius

2 Answers

5

Let's formulate the problem in a clearer way.

ASSUMPTIONS:

  • $h_S \in \underset{h\in\mathcal{H}}{\arg\min} \, L_{S}(h) \qquad$ (definition of ERM)
  • $\exists h^{*} \in \mathcal{H}: L_{(\mathcal{D}, f)}(h^{*}) = 0 \qquad$ (realizability assumption)
  • $\mathcal{H}_B = \{ h \in \mathcal{H}: L_{(\mathcal{D}, f)}(h) > \epsilon \}$
  • $M = \{ S\vert_x: \exists h\in \mathcal{H}_{B}, \, L_{S}(h) = 0 \}$

PROVE: $\{S\vert_x: L_{(\mathcal{D}, f)}\left(h_{S} \right) > \epsilon \} \subseteq M$

PROOF: To prove a set A is a subset of set B, or $A \subseteq B$, we need to prove that every element in set A is in set B. Here, given the definition of $\mathcal{H}_{B}$, we can rewrite set M as

$M = \{ S\vert_{x}: \exists h \in \mathcal{H}, \, L_{(\mathcal{D}, f)}(h) > \epsilon, \, L_{S}(h) = 0 \}$.

Thus, a sample $S\vert_x$ belongs to $M$ exactly when some hypothesis $h$ satisfies the three conditions listed in the definition of $M$.

For every element $S\vert_x$ of the set on the left-hand side, the hypothesis $h = h_S$ already satisfies two of these conditions:

  • $h_{S} \in \mathcal{H}$
  • $L_{(\mathcal{D}, f)}\left(h_{S} \right) > \epsilon$.

Hence, proving $L_{S}(h_{S}) = 0$ would complete the proof. This can be done by employing the definition of ERM and the realizability assumption.

From the realizability assumption, combined with the fact that $S$ is sampled from $\mathcal{D}$ and labeled by $f$ ($h^{*}$ agrees with $f$ with probability $1$ over $\mathcal{D}$, so with probability $1$ it labels every point of $S$ correctly):

$L_{(\mathcal{D}, f)}(h^{*}) = 0 \implies L_{S}(h^{*}) = 0$.

From the definition of ERM:

$L_{S}(h_S) \le L_{S}(h^{*}) = 0 \quad \implies L_{S}(h_S) = 0$.
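Putting the pieces together, the claimed containment is just this chain of implications (a restatement of the steps above, in the same notation):

$$L_{(\mathcal{D}, f)}(h_S) > \epsilon \;\Longrightarrow\; h_S \in \mathcal{H}_B \ \text{ and } \ L_S(h_S) = 0 \;\Longrightarrow\; S\vert_x \in M.$$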

Cuong
  • I don't understand your assertion: $$L_{(\mathcal{D}, f)}(h^{*}) = 0 \implies L_{S}(h^{*}) = 0$$ For example, if $D$ is the Lebesgue measure it doesn't work, because in that case $f$ and $h^{*}$ could be equal almost everywhere on $\mathbb{R}^{d} - S$. – CechMS Mar 20 '20 at 08:54
  • @CechMS This implication is a part of the definition 2.1 (page 17) in the same textbook. – Cuong Jun 19 '20 at 06:36
  • @CechMS, since $S \sim \mathcal{D}^m$, any hypothesis $h^{*}$ that satisfies $L_{(\mathcal{D}, f)}(h^{*}) = 0$ also satisfies $L_{S}(h^{*}) = 0$ with probability $1$. – sayan Nov 05 '20 at 15:58
1

I actually got stuck on this for a bit too ...

Remember how $h_S$ is defined: $$h_S \in \underset{h\in H}{\mathrm{argmin}} \, L_S(h)$$ The realizability assumption tells you there is a perfect $h^*$ (a very strong assumption). This $h^*$ has $L_S(h^*)=0$ with probability $1$.

So when $h_S$ is chosen from the argmin, its empirical risk $L_S(h_S)$ is necessarily $0$ with probability $1$. (This is not the same as the loss being $0$ on every possible sample.)
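Spelled out, since $h^* \in H$ competes in the minimization:

$$L_S(h_S) = \min_{h \in H} L_S(h) \le L_S(h^*) = 0.$$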

  1. We want to bound the probability that the ERM output $h_S$ is a bad predictor
  2. The realizability assumption tells us that $L_S(h^*)=0$, and hence $L_S(h_S)=0$
  3. If $\exists h \in H_B$ s.t. $L_S(h)=0$, then $h \in \underset{h'\in H}{\mathrm{argmin}} \, L_S(h')$, i.e. ERM may return this bad hypothesis

From there it follows that the set of samples we want to bound is contained in $M$: if $h_S$ is bad, then $h_S$ itself is a hypothesis in $H_B$ with zero empirical loss, so the sample is misleading. Note this isn't a constructive proof - it doesn't tell you how to find such an $h$; it ultimately gives an upper bound on the probability of drawing a sample that creates the conditions for a bad predictor. The book doesn't really talk about measurability, but there was a footnote suppressing those details and $H$ is finite, so I'm assuming everything works!
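To see the containment concretely, here is a small self-contained simulation (my own toy setup, not from the book): a uniform distribution on a finite domain, a class of threshold hypotheses containing the target (so realizability holds), and a check over many random samples that whenever the ERM hypothesis has true error above $\epsilon$, the sample really is misleading, i.e. some $h \in H_B$ has zero empirical error.

```python
import random

random.seed(0)
domain = list(range(10))

def f(x):                      # target labeling function (f is in H below, so realizable)
    return int(x >= 5)

# hypothesis class: threshold functions on the domain; t = 5 reproduces f exactly
H = [lambda x, t=t: int(x >= t) for t in range(11)]

def true_error(h):             # L_{(D,f)}(h) under the uniform distribution on domain
    return sum(h(x) != f(x) for x in domain) / len(domain)

def emp_error(h, S):           # L_S(h), the empirical error on sample S
    return sum(h(x) != y for x, y in S) / len(S)

eps, m, trials = 0.15, 5, 10_000
H_B = [h for h in H if true_error(h) > eps]       # the "bad" hypotheses

for _ in range(trials):
    xs = [random.choice(domain) for _ in range(m)]
    S = [(x, f(x)) for x in xs]                   # labels come from f (realizability)
    h_S = min(H, key=lambda h: emp_error(h, S))   # one ERM hypothesis
    if true_error(h_S) > eps:                     # the "bad event" for ERM
        # the containment says the sample must then be misleading
        assert any(emp_error(h, S) == 0 for h in H_B)

print("containment held on all", trials, "trials")
```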