22

Hinge loss can be defined as $\max(0, 1-y_i\mathbf{w}^T\mathbf{x}_i)$ and log loss as $\log(1 + \exp(-y_i\mathbf{w}^T\mathbf{x}_i))$.
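As a minimal sketch of these two definitions (my own NumPy implementation, assuming labels $y_i \in \{-1, +1\}$ and a toy dataset):

```python
import numpy as np

def hinge_loss(w, X, y):
    # mean of max(0, 1 - y_i * w^T x_i), with y_i in {-1, +1}
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins))

def log_loss(w, X, y):
    # mean of log(1 + exp(-y_i * w^T x_i)), with y_i in {-1, +1}
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

# toy data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([1, -1, 1, 1, -1])
w = np.array([0.5, -0.3])
print(hinge_loss(w, X, y), log_loss(w, X, y))
```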

I have the following questions:

  1. Are there any disadvantages of hinge loss (e.g. sensitivity to outliers, as mentioned in this article)?

  2. What are the differences, advantages, and disadvantages of one compared to the other?

user570593

3 Answers

33

Logarithmic loss minimization leads to well-behaved probabilistic outputs.

Hinge loss leads to some (not guaranteed) sparsity in the dual, but it doesn't help with probability estimation. Instead, it penalizes misclassifications (that's why it's so useful for determining margins): a decreasing hinge loss comes with fewer misclassifications across the margin.

So, summarizing:

  • Logarithmic loss ideally leads to better probability estimation at the cost of not actually optimizing for accuracy

  • Hinge loss ideally leads to better accuracy and some sparsity at the cost of not actually estimating probabilities

In ideal scenarios, each method would excel in its own domain (accuracy vs. probability estimation). However, due to the No Free Lunch theorem, it is not possible to know a priori whether the model choice is optimal.
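To make the contrast concrete, here is a rough scikit-learn sketch (the synthetic dataset and model settings are mine, just for illustration): a log-loss model exposes class probabilities directly, while a hinge-loss SVM yields margins and a sparse set of support vectors.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Log loss: the fitted model outputs class probabilities directly.
logreg = LogisticRegression().fit(X, y)
print(logreg.predict_proba(X[:3]))    # each row sums to 1

# Hinge loss: only the support vectors define the solution (dual sparsity),
# and there is no native probability output without extra calibration.
svm = SVC(kernel="linear").fit(X, y)
print(svm.n_support_)                 # number of support vectors per class
print(svm.decision_function(X[:3]))   # signed margins, not probabilities
```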

Firebug
  • +1. Minimizing logistic loss corresponds to maximizing binomial likelihood. Minimizing squared-error loss corresponds to maximizing Gaussian likelihood (it's just OLS regression; for 2-class classification it's actually equivalent to LDA). Do you know if minimizing hinge loss corresponds to maximizing some other likelihood? I.e. is there any probabilistic model corresponding to the hinge loss? – amoeba Mar 28 '18 at 15:51
  • @amoeba It's an interesting question, but SVMs are inherently not based on statistical modelling. Having said that, check this answer by Glen_b. The whole thread is about it, but for the epsilon-insensitive hinge instead. – Firebug Mar 28 '18 at 16:03
  • @Firebug if you have a good illustrative example of "Hinge loss leads to better accuracy", it would be a good answer for my question here https://stats.stackexchange.com/questions/568821/is-there-a-good-illustrative-example-where-the-hinge-loss-svm-gives-a-higher-a (and would be much appreciated!) – Dikran Marsupial Mar 23 '22 at 17:24
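As a quick sketch of the likelihood correspondence raised in the first comment (standard textbook algebra, not something stated in the thread): with labels $y_i \in \{-1, +1\}$ and $\sigma(z) = 1/(1 + e^{-z})$, logistic regression models $P(y_i \mid \mathbf{x}_i) = \sigma(y_i\mathbf{w}^T\mathbf{x}_i)$, so the negative log-likelihood is

$$-\log L(\mathbf{w}) = -\sum_i \log \sigma(y_i\mathbf{w}^T\mathbf{x}_i) = \sum_i \log\left(1 + \exp(-y_i\mathbf{w}^T\mathbf{x}_i)\right),$$

i.e. exactly the log loss from the question summed over the data, so minimizing it maximizes the binomial likelihood. The hinge loss has no comparably standard likelihood interpretation, which is the point of the follow-up comment.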
6

@Firebug had a good answer (+1). In fact, I had a similar question here.

What are the impacts of choosing different loss functions in classification to approximate 0-1 loss

I just want to add one more big advantage of logistic loss: its probabilistic interpretation. An example can be found here.

Specifically, logistic regression is a classical model in the statistics literature. (See What does the name "Logistic Regression" mean? for the naming.) There are many important concepts related to logistic loss, such as maximum likelihood estimation, likelihood ratio tests, and assumptions about the binomial distribution. Here are some related discussions, with a small illustrative sketch after the links.

Likelihood ratio test in R

Why isn't Logistic Regression called Logistic Classification?

Is there i.i.d. assumption on logistic regression?

Difference between logit and probit models
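As a small sketch of that statistical machinery (assuming statsmodels and SciPy; the synthetic data and coefficients are mine, purely for illustration): fit logistic regression by maximum likelihood, then compare nested models with a likelihood ratio test.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p = 1 / (1 + np.exp(-0.8 * x1))             # x2 has no true effect
y = rng.binomial(1, p)

X_full = sm.add_constant(np.column_stack([x1, x2]))
X_reduced = sm.add_constant(x1)

full = sm.Logit(y, X_full).fit(disp=0)      # maximizes the binomial log-likelihood
reduced = sm.Logit(y, X_reduced).fit(disp=0)

lr_stat = 2 * (full.llf - reduced.llf)      # likelihood ratio statistic
print(lr_stat, chi2.sf(lr_stat, df=1))      # p-value for dropping x2
```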

Haitao Du
1

Since @hxd1011 added an advantage of cross entropy, I'll add one drawback of it.

Cross entropy error is one of many distance measures between probability distributions, but one drawback of it is that distributions with long tails can be modeled poorly, with too much weight given to unlikely events.
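A tiny numeric sketch of that effect (the numbers are arbitrary, just for illustration): a single rare event to which the model assigns a very small probability dominates the average cross entropy.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0])              # last label is the rare event
p_pred = np.array([0.9, 0.9, 0.9, 0.9, 0.999])  # model almost rules that event out

# binary cross entropy, averaged over the five points
ce = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print(ce)  # the single -log(1 - 0.999) ~ 6.9 term swamps the four -log(0.9) ~ 0.11 terms
```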

aerin