I think your intuition is correct. The hinge loss penalizes being "more wrong" without bound: the further a misclassified point lies from the decision boundary, the larger the penalty.
An extremely simplified example
Consider trying to learn the following function:
$$
y = \begin{cases}
1 & x \in \left( - \infty, 0 \right) \\
0 & x \in \left[ 0, 1 \right) \\
1 & x \in \left[ 1, \infty \right) \\
\end{cases}
$$
And consider the simplest linear classifier, parameterized by decision threshold $c$:
$$
\hat{y} = \begin{cases}
0 & x < c \\
1 & x \ge c
\end{cases}
$$
We happen to know that it's impossible to learn the true function with this classifier, but it's illustrative to consider what happens if we try anyway. Remember that, in the real world, we usually do not know the true data-generating process. At best, we might have a theoretical model with a closed parametric form, but there's no guarantee that the model is correct.
To simplify even further, consider a training set that consists of (in R):
x <- c( -0.75, -0.25, 0.25, 0.75, 1.25, 1.75 )
y <- c( 1, 1, 0, 0, 1, 1 )
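As a quick sanity check (f_true below is just a hypothetical one-line transcription of the piecewise target function defined above), these labels are exactly what the target function assigns:
f_true <- function(x) ifelse(x < 0 | x >= 1, 1, 0)  # transcription of the target function
all(f_true(x) == y)  # TRUE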
With 0-1 loss, we can see that the optimal $c^*$ is anything in $\left( - \infty , -0.75 \right] \cup \left( 0.75, 1.25 \right]$, resulting in a minimized loss of 2: any threshold in that range misclassifies exactly two points (either the two $y = 0$ points or the two leftmost $y = 1$ points), and no threshold does better. Let's choose $c^* = 1.25$.
For class labels 0 and 1 we can set up a simplified hinge loss as
$$
L^{\text{hinge}} \left( y, x; c \right) = \begin{cases}
0 & y = 0, x < c \\
0 & y = 1, x \ge c \\
| c - x | & \text{otherwise}
\end{cases}
$$
That is, the loss is 0 for a correct classification, but the loss of misclassification is equal to the distance between the threshold $c$ and the data $x$.
Using the same $c = 1.25$, only the two leftmost points are misclassified, so the hinge loss for our training set is $\left| 1.25 - (-0.75) \right| + \left| 1.25 - (-0.25) \right| = 2 + 1.5 = 3.5$. However, this is not the optimal $c^*$ for hinge loss, which you can easily check by comparing to another value like $c = 0.5$ (loss of 2.25).
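If you'd rather not trust my arithmetic, a few lines of R (reusing the x and y vectors above) reproduce both numbers:
# Misclassified points at the 0-1-optimal threshold c* = 1.25
miss <- (y == 0 & x >= 1.25) | (y == 1 & x < 1.25)
sum(miss)                   # 0-1 loss: 2
sum(abs(1.25 - x)[miss])    # hinge loss: 2 + 1.5 = 3.5

# The same hinge computation at c = 0.5
miss_05 <- (y == 0 & x >= 0.5) | (y == 1 & x < 0.5)
sum(abs(0.5 - x)[miss_05])  # 1.25 + 0.75 + 0.25 = 2.25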
Visualizing the loss functions on our sample data
Because this is an extremely simple one-dimensional problem, we can easily visualize the "loss landscape" to see what's going on:
# 0-1 loss: count the misclassified points for a given threshold c
l_01 <- function(c) {
  sum(
    ifelse(
      (y == 0 & x < c) | (y == 1 & x >= c),  # correctly classified?
      0,
      1
    )
  )
}
# Simplified hinge loss: each misclassified point contributes its distance to the threshold
l_hinge <- function(c) {
  sum(
    ifelse(
      (y == 0 & x < c) | (y == 1 & x >= c),  # correctly classified?
      0,
      abs(c - x)
    )
  )
}
# Evaluate both losses on a grid of candidate thresholds
# (the grid is named c_grid so it doesn't mask base R's c() function)
c_grid <- seq(-1, 2, by = 0.1)
loss_01 <- sapply(c_grid, l_01)
loss_hinge <- sapply(c_grid, l_hinge)
plot(c_grid, loss_01, type = "l", col = "blue", ylab = "loss", xlab = "c",
     ylim = range(loss_01, loss_hinge))  # widen the y-axis so the hinge curve isn't clipped
lines(c_grid, loss_hinge, col = "red")
legend(
  "topleft", legend = c("0-1", "hinge"), col = c("blue", "red"), lwd = 1
)

The optimal classification thresholds don't line up at all: the thresholds that minimize the 0-1 loss are nowhere near the region that minimizes the hinge loss.
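You can also read the disagreement off the grid directly rather than off the plot. This is just a check of the picture above, reusing the loss_01 and loss_hinge vectors computed there; the small tolerance absorbs floating-point noise in the hinge sums:
c_grid[loss_01 == min(loss_01)]               # grid points inside the two 0-1-optimal intervals
c_grid[loss_hinge <= min(loss_hinge) + 1e-9]  # grid points inside the hinge-optimal interval [-0.25, 0.25]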
The trade-off between correct classification and error magnitude
And if we plot the actual data along with the optimal threshold regions, we can see precisely the effect you predicted in your question:
# Plot the data with the 0-1-optimal and hinge-optimal threshold regions shaded
plot(0, type = "n", xlim = c(-1.0, 2.0), ylim = c(-0.01, 1.01),
     xlab = "x", ylab = "y", main = "Data with optimal regions")
blue_alpha <- adjustcolor("blue", alpha.f = 0.25)
red_alpha <- adjustcolor("red", alpha.f = 0.25)
rect(0.75, -0.05, 1.25, 1.05, col = blue_alpha)   # 0-1-optimal thresholds (the region containing c* = 1.25)
rect(-0.25, -0.05, 0.25, 1.05, col = red_alpha)   # hinge-optimal thresholds
points(x, y, cex = 2.5, pch = 21, bg = "gray")
legend(
  "left", legend = c("0-1", "hinge"), col = c(blue_alpha, red_alpha), lwd = 8
)

The optimal threshold for 0-1 loss places some misclassified points very far from the classification threshold, which results in a suboptimal hinge loss.
The optimal threshold for hinge loss misclassifies more points, which results in a suboptimal 0-1 loss, in exchange for reducing the total distance between the threshold and the misclassified points. That is, it prefers to put the classification boundary as close as possible to the points it gets wrong. In this particular case, that effect is strong enough to completely overwhelm the benefit of actually classifying the data correctly.
In fact, in this particular example, the optimal threshold for hinge loss is pessimal for 0-1 loss! That is, the optimal range of thresholds under hinge loss gives you the worst possible range of thresholds under 0-1 loss. (That was not intentional when I set up the problem, but I think it works out well to illustrate the difference in behavior.)
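To put a number on "pessimal", evaluate both losses at a hinge-optimal threshold (say $c = 0$) and at our 0-1-optimal $c^* = 1.25$:
l_01(0); l_hinge(0)        # 4 misclassifications, hinge loss 2
l_01(1.25); l_hinge(1.25)  # 2 misclassifications, hinge loss 3.5
max(loss_01)               # 4: no threshold on the grid does worse than the hinge-optimal one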
This of course is a very, very simplified example. But I think it helps to isolate the behavior of the hinge loss from all other considerations.
A very important caveat
This analysis hinges (!) on the highly simplified optimization problem that I set up.
In the traditional setup of the SVM, the hinge loss is not applied to a bare decision threshold the way I did above. The classical optimization problem is to find the maximum-margin separating hyperplane (or our best attempt at one, if the data are not separable), and it is posed as margin maximization subject to classification constraints, with slack variables absorbing the violations.
The hinge loss arises when that soft-margin SVM is reinterpreted as an L2-regularized linear model: the slack penalty turns out to be exactly the hinge loss. Therefore, while I think this example is illustrative of the specific effect in question (trading off correct classification for smaller errors), it does not illustrate the behavior of a maximum-margin classifier in general, nor does it illustrate the SVM-based intuition for how the hinge loss is derived. For a succinct writeup of that topic, see Cornell CS4780 lecture 9.
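For reference, here is one standard way to write the soft-margin primal and its equivalent regularized-loss form, which is where the hinge loss appears (note that the labels are $y_i \in \{-1, +1\}$ here, as is conventional for SVMs, rather than the 0/1 labels of my toy example):
$$
\min_{w, b, \xi} \; \frac{1}{2} \left\| w \right\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i , \quad \xi_i \ge 0
$$
$$
\iff \quad \min_{w, b} \; \frac{1}{2} \left\| w \right\|^2 + C \sum_i \max \left( 0, \, 1 - y_i \left( w^\top x_i + b \right) \right)
$$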