9

For a multi-class classification problem, we can either use $K$ binary logistic classifiers or a single softmax regression classifier. How do we choose between the two?

IMHO, the $K$ binary logistic classifiers amount to the one-vs-all scheme for multi-class problems, whereas the softmax classifier handles multiple classes inherently. Why should I prefer one over the other?

avocado
  • one thing to consider is the data: are labels mutually exclusive (softmax would probably fit better) or not (e.g., {animal, dog, cat}, here you might want to assign a single example to multiple labels) – Alex Kreimer Nov 11 '19 at 15:00
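The distinction raised in the comment can be sketched in a few lines of plain Python: mutually exclusive classes call for a softmax (one winner, probabilities sum to 1), while non-exclusive labels such as {animal, dog, cat} call for an independent sigmoid per label. The scores below are illustrative, not from any real model:

```python
import math

def softmax(scores):
    """Mutually exclusive classes: probabilities compete and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def independent_sigmoids(scores):
    """Non-exclusive labels: each label gets its own probability in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

scores = [2.0, 1.0, 0.5]                      # e.g. scores for {animal, dog, cat}
p_exclusive = softmax(scores)                 # sums to 1: exactly one class applies
p_multilabel = independent_sigmoids(scores)   # may sum to more than 1
```

Note that `p_multilabel` can assign high probability to several labels at once (here both "animal" and "dog"), which a softmax cannot do.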

1 Answer

4

The softmax function gives a proper probability for each of the possible classes:
$$ P(y=j|x,\{w_k\}_{k=1...K}) = \frac{e^{x^\top w_j}}{\sum_{k=1}^K e^{x^\top w_k}} $$

This is nice if you want to interpret your classification problem in a probabilistic setting. Benefits of using the probabilistic formulation include being able to place priors on the parameters and obtaining a posterior distribution over classes.
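As a concrete illustration of the formula above, here is a minimal pure-Python sketch that turns the linear scores $x^\top w_k$ into a proper distribution over classes (the input and weights are made up for illustration):

```python
import math

def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def class_probabilities(x, weights):
    """P(y=j | x) = exp(x.w_j) / sum_k exp(x.w_k), one probability per class."""
    scores = [dot(x, w) for w in weights]
    m = max(scores)  # stability shift; it cancels in the ratio
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, -0.5]                                    # a single input vector
weights = [[0.2, 0.4], [1.0, -1.0], [-0.3, 0.1]]   # one w_k per class, K = 3
probs = class_probabilities(x, weights)
```

The output is a genuine probability vector, which is what makes priors and posterior class distributions straightforward in this formulation.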

That said, you can imagine a really good classifier that isn't of this form. Perhaps it is of a form that is generally difficult to express probabilistically (e.g., the SVM; see discussions of multi-class SVMs for details). If such a complicated classifier works well for you on a given task, perhaps you don't want to use the [potentially weaker] softmax classifier. In such a setting there may not be a natural all-way output, so you have to settle for repeated one-vs-others classification schemes.

One more counterpoint...you could also augment the expressive power of the softmax-style approach by changing the input to the exponential. For example, it would be straightforward to replace each linear component $x^\top w_j$ with a quadratic expression $x^\top w_j + x^\top A_j x$. Other such augmentations are conceivable.
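Continuing the answer's pure-Python sketch, the quadratic augmentation replaces each linear score with $x^\top w_j + x^\top A_j x$ before the softmax; the weights and matrices below are illustrative placeholders:

```python
import math

def quadratic_score(x, w, A):
    """x.w_j + x^T A_j x : a quadratic replacement for the linear score."""
    n = len(x)
    linear = sum(xi * wi for xi, wi in zip(x, w))
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return linear + quad

def softmax(scores):
    m = max(scores)  # stability shift
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0]
ws = [[0.5, -0.2], [0.1, 0.3]]           # per-class linear weights w_j
As = [[[0.1, 0.0], [0.0, -0.1]],         # per-class quadratic terms A_j
      [[0.0, 0.05], [0.05, 0.0]]]
quad_probs = softmax([quadratic_score(x, w, A) for w, A in zip(ws, As)])
```

Only the score changes; the softmax on top still guarantees a valid probability distribution, which is the sense in which such augmentations extend the model's expressive power for free.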

Josh
It is important to call things by their proper names. This is just the multinomial (polytomous) logistic model. Joint optimization using maximum likelihood will give the same parameter estimates as running separate binary logistic models, though perhaps not the same standard errors. – Frank Harrell Nov 14 '15 at 15:08
  • @FrankHarrell Care to expand on that here? – Dave Oct 23 '23 at 11:09
  • Not sure what to expand on. If you look up multinomial (polytomous) logistic regression you’ll see the mathematical form of the model which is the same as your expression. Logistic regression is old, but not forgotten. – Frank Harrell Oct 23 '23 at 12:17
  • @FrankHarrell Wouldn't multinomial logistic regression require the predicted probabilities to sum to one? That's not required in what is being discussed at the link, possibly to the detriment of the modeling (but possibly not). – Dave Oct 23 '23 at 18:32
  • The formula given at the top of the post here embodies the probabilities summing to 1.0. – Frank Harrell Oct 23 '23 at 19:15
  • @FrankHarrell Right, and that would not be required if you model as multiple "this or not" binary models instead of one "this or that or the other thing" multinomial model, the former of which seems to be the k binary logistic regression models mentioned here and discussed at the link. – Dave Oct 23 '23 at 19:18
The k models that allow one to duplicate the multinomial model's coefficients are those binary models predicting one category at a time leaving out all others. If you have a true multivariable problem with non-mutually exclusive choices you'll need a choice model from econometrics or another multivariate logistic model. But the softmax function at the top is for the mutually exclusive category case. – Frank Harrell Oct 23 '23 at 20:58
  • @FrankHarrell I mean a situation where we use a choice model, as if multiple choices were possible, even though the choices are mutually exclusive. You seem to be saying that multiple logistic regressions (a possible choice model that ignores possible correlations between choices, such as frequently picking ketchup with fries but not with ice cream) will give the same predictions as the multinomial model. Is that what you mean? – Dave Oct 23 '23 at 21:15
  • That's the exact opposite of what I said. First of all, multiple choice = non-mutually exclusive. You can use a series of binary models in two ways: (1) to mimic the multinomial model (except with wrong standard errors, I believe), or (2) to model non-mutually exclusive choices that doesn't mimic the non-applicable multinomial model and that doesn't try to put things together into a unified model. In general look to multivariate binary models for unified non-mutually-exclusive choice modeling. – Frank Harrell Oct 24 '23 at 14:42
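One small, verifiable piece of the identification point discussed in this thread: with $K=2$ classes, the softmax at the top of the answer collapses exactly to a single binary logistic model on the score difference, which is one way to see that the two-class multinomial model is over-parameterized. A minimal sketch with illustrative scores:

```python
import math

def softmax_p1(s0, s1):
    """P(y=1) under a two-class softmax over scores (s0, s1)."""
    m = max(s0, s1)  # stability shift
    e0, e1 = math.exp(s0 - m), math.exp(s1 - m)
    return e1 / (e0 + e1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Dividing numerator and denominator by exp(s0) shows
# P(y=1) = 1 / (1 + exp(-(s1 - s0))), i.e. a binary logistic
# model whose weight vector is the *difference* w_1 - w_0.
s0, s1 = 0.4, 1.7
p_two_class = softmax_p1(s0, s1)
p_binary = sigmoid(s1 - s0)
```

The two probabilities agree to machine precision, so only the difference of the per-class weight vectors is identified in the two-class case.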