Vladimir Vapnik wrote:
“When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need but not a more general one.”
Vapnik wrote this in the context of transductive learning, but the principle also applies to statistical pattern recognition. If we only need a hard classification, we should aim to determine the decision boundary directly, rather than estimating the posterior probability of class membership and then thresholding it at 0.5 (or at some other threshold determined by the misclassification costs). In principle, we may get better generalisation in terms of accuracy by following this advice, as it focusses on the needs of the application and avoids compromises that improve the fit to aspects of the data that are irrelevant to the application at the expense of aspects that are relevant.
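To make the distinction concrete, here is a minimal sketch of the two routes to a hard classification, using scikit-learn on synthetic data (the dataset and parameter values are purely illustrative, not part of the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative synthetic data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Route 1: hinge loss fits the decision boundary directly.
svm = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
y_svm = svm.predict(X_te)

# Route 2: log loss estimates P(y = 1 | x), which is then thresholded at 0.5.
lr = LogisticRegression(C=1.0).fit(X_tr, y_tr)
y_lr = (lr.predict_proba(X_te)[:, 1] > 0.5).astype(int)

print("SVM accuracy:", np.mean(y_svm == y_te))
print("LR accuracy: ", np.mean(y_lr == y_te))
```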
The performance of the SVM on real-world problems suggests that there may be some merit in this, but it is not easy to find a clear example where it is demonstrably the case. Can anyone suggest an example where an SVM will give better performance than an equivalent logistic regression model (i.e. kernels and regularisation should be available to the logistic regression model as well, so that we are comparing only the loss functions, rather than other aspects of the models)? Preferably the example should demonstrate why this is the case.
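For the avoidance of doubt, the like-for-like comparison I have in mind looks something like the following sketch: the same precomputed RBF kernel and the same style of regularisation for both models, so only the loss differs. Note that fitting logistic regression on the columns of the kernel matrix is a common approximation to kernel logistic regression via the representer theorem, though the L2 penalty is then on the dual coefficients rather than the exact RKHS norm:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gamma, C = 1.0, 1.0  # illustrative values; both would be tuned in practice
K_tr = rbf_kernel(X_tr, X_tr, gamma=gamma)  # train-vs-train kernel matrix
K_te = rbf_kernel(X_te, X_tr, gamma=gamma)  # test-vs-train kernel matrix

# Same kernel and regularisation parameter; only the loss differs.
svm = SVC(kernel="precomputed", C=C).fit(K_tr, y_tr)  # hinge loss
klr = LogisticRegression(C=C).fit(K_tr, y_tr)         # log loss

print("kernel SVM accuracy:", svm.score(K_te, y_te))
print("kernel LR accuracy: ", klr.score(K_te, y_te))
```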
Obviously the no-free-lunch theorem suggests that no classifier is going to be superior for all datasets, and the theoretical justification for the SVM is largely a worst-case analysis, so the SVM may be better on pathologically difficult problems, but not necessarily in the average case.
Personally, I prefer kernel logistic regression over the support vector machine, largely because the hyper-parameter tuning is much easier, and I suspect the two will give broadly similar results. However, I have an open mind for cases where the needs of the application mean that the limitations of the SVM (e.g. having to know the misclassification costs a priori) are not a problem.
Please can we avoid discussions of whether accuracy is a good performance metric. It has issues, but there are some applications where accuracy is the relevant statistic of interest, and there have already been good discussions of this topic.
Note that the answer here is the sort of thing I am looking for, but unfortunately it isn't correct (see my comment on the answer).