
Can somebody explain to me why this classifier gives a loss equal to zero? I don't get the example.

[Image: excerpt of the source text being discussed]

Stephen

1 Answer


The basic setup of this problem is that we draw $n$ data points from our underlying distribution $\mathcal{D}$, where $(x_i, y_i)$ is the $i$th data point: $x_i$ is the vector of features of the $i$th sample and $y_i$ is its associated label. The point of the passage is that minimizing the empirical risk seems like an intuitive way to produce a "good" classifier. However, $h_S$ is an example of where that intuition can go wrong.

Indeed, $h_S$ just memorizes the training data: if the input to $h_S$ is $x_i$, the feature vector of the $i$th data point, then the output is $y_i$, which is always the correct label. So the empirical risk of $h_S$ is always zero, since it outputs the correct label on every training point by design. However, it's not hard to see why this is not a good classifier: on unseen data it always predicts 0, without any regard for the data!
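
To make this concrete, here is a minimal Python sketch of such a memorizing predictor. The function names and toy data are my own, not from the passage: the point is only that the empirical risk on the memorized sample is exactly zero, while every unseen input gets label 0.

```python
# Minimal sketch of the memorizing predictor h_S described above.
# Function names and toy data are illustrative, not from the passage.

def memorizing_predictor(training_set):
    """Build h_S: output the stored label y_i whenever the input matches a
    training point x_i, and output 0 for every other input."""
    lookup = {tuple(x): y for x, y in training_set}
    return lambda x: lookup.get(tuple(x), 0)

def empirical_risk(h, sample):
    """Average 0-1 loss of h over a finite sample."""
    return sum(h(x) != y for x, y in sample) / len(sample)

# Toy training set S of (feature vector, label) pairs.
S = [((0.2, 0.7), 1), ((0.9, 0.1), 0), ((0.5, 0.5), 1)]
h_S = memorizing_predictor(S)

print(empirical_risk(h_S, S))  # 0.0 -- h_S reproduces every memorized label
print(h_S((0.3, 0.3)))         # 0   -- any point not in S gets label 0
```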

  • Good answer! Could you expand with a numerical example, please? Like, plugging the values into $h_S$? – Stephen Dec 21 '20 at 06:49
  • Where is that "1/2" coming from in $L_{\mathcal{D}}(h_S) = \dfrac{1}{2}$? – Stephen Dec 21 '20 at 06:55
  • I'll use a more canonical classification example. Let's say that we are trying to predict whether an email is spam ($y_i = 1$) or not ($y_i = 0$) based on the number of words in the email, $x_i = \text{number of words in email } i$. Say I only have two data points, $(92, 1)$ and $(133, 0)$. Then $h_S(92) = 1$ and $h_S(x) = 0$ for all $x \neq 92$ (see the sketch after these comments). – user303375 Dec 21 '20 at 06:57
  • $L_{\mathcal{D}}(h_S)$ is the "true" loss of $h_S$, not just the loss on our training set. Read the section on area in the passage; it is very clear. – user303375 Dec 21 '20 at 07:02
  • That means the error is zero, since $(x=92, y=1)$, and the same applies to $(x=133, y=0)$: the label for the latter is 0, thus not spam. This is the training sequence. Why is it 0 otherwise? – Stephen Dec 21 '20 at 07:06
  • By definition of $h_S$. – user303375 Dec 21 '20 at 07:08
  • We've got a total of 18 instances; for training we use 6 and we are left with 12 (the test set). I'm a little confused as to where the 1/2 is coming from :( – Stephen Dec 21 '20 at 07:12
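
To tie the comment thread together: $L_S(h_S)$ is the loss measured on the memorized training sample (always zero here), while $L_{\mathcal{D}}(h_S)$ is the loss over the whole distribution; per the comments, the $\frac{1}{2}$ in the passage is that second quantity for the passage's particular distribution, which is not reproduced here. Below is a small sketch of the two-email spam example; only the word counts and labels come from the comment, everything else is illustrative.

```python
# The two training emails from the comment thread: x = word count,
# y = 1 for spam, 0 for not spam. Only these values come from the comment;
# the rest of the sketch is illustrative.
training = [(92, 1), (133, 0)]
lookup = dict(training)

def h_S(word_count):
    """Memorizing predictor: stored label if this word count was seen in
    training, otherwise 0 ("not spam")."""
    return lookup.get(word_count, 0)

# Empirical risk L_S(h_S): both training labels are reproduced, so it is 0.
L_S = sum(h_S(x) != y for x, y in training) / len(training)
print(L_S)                # 0.0

# On unseen word counts h_S always answers "not spam", so every new spam
# email is misclassified. The true loss L_D(h_S) therefore depends on how
# much of the distribution is actually spam -- something the empirical risk
# of 0 tells us nothing about.
print(h_S(57), h_S(400))  # 0 0
```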