One reason to prefer $k$ binary classifiers is that they can be used for multi-label classification.
Here your likelihood function reads like this:
$L=\prod_{i=1}^n \prod_{k} \prod_{j_k \in \{0,1\}} p_{k,j_k}(x_i)^{\delta_{y_{i,k},j_k}}$
where the index $i$ runs over the samples, the index $k$ runs over the labels, $j_k$ indicates the binary outcome 0 or 1, $\delta_{a,b}$ is the Kronecker delta, and $y_{i,k}\in \{0,1\}$ denotes the multi-hot encoded labels of sample $i$; see https://en.m.wikipedia.org/wiki/Multi-label_classification
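To make the formula concrete, here is a minimal NumPy sketch (the variable names `p` and `y` are illustrative, not from any particular library): since $j_k \in \{0,1\}$, the inner product over $j_k$ collapses to the familiar binary cross-entropy term per label.

```python
import numpy as np

# Illustrative sketch of the multi-label log-likelihood above.
# p[i, k] = predicted probability that label k is active for sample i,
#           i.e. p_{k,1}(x_i); the complement p_{k,0}(x_i) = 1 - p[i, k].
# y[i, k] = multi-hot encoded ground truth, y_{i,k} in {0, 1}.
p = np.array([[0.9, 0.2],
              [0.3, 0.8]])
y = np.array([[1, 0],
              [0, 1]])

# The Kronecker delta picks p[i, k] where y[i, k] == 1 and
# 1 - p[i, k] where y[i, k] == 0, so log L reduces to:
log_L = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_L)  # sum of per-sample, per-label binary cross-entropy terms
```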
For multi-class classification, only one class may be assigned, and the likelihood is
$L = \prod_{i=1}^n P(Y_i=y_i) = \prod_{i=1}^n \left( \prod_{k=1}^K P(Y_i=k)^{\delta_{k,y_i}} \right)$, compare https://en.m.wikipedia.org/wiki/Multinomial_logistic_regression.
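For comparison, a minimal sketch of the multi-class log-likelihood (again with illustrative names, assuming `P` holds per-class probabilities that sum to 1 per row): the Kronecker delta $\delta_{k,y_i}$ simply selects the predicted probability of the true class for each sample.

```python
import numpy as np

# P[i, k] = predicted probability that sample i belongs to class k;
# each row sums to 1 (class exclusivity).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
y = np.array([0, 2])  # true class index y_i per sample

# The exponent delta_{k, y_i} zeroes out every factor except the
# predicted probability of the true class, so:
log_L = np.sum(np.log(P[np.arange(len(y)), y]))
print(log_L)  # log(0.7) + log(0.6)
```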
Multi-label classification is (in a sense) a generalization of multi-class classification.
If you already know that the classes are exclusive, then the multi-label setting would allow you too much flexibility (the event takes place either in city A or in city B, but not both). A multi-label model might output (0.7, 0.8), while a multi-class model would have to trade off, outputting e.g. (0.3, 0.7) due to the constraint of class exclusivity and the normalization of the output.
PS: in a neural net for multi-class classification, the last activation function would typically be a softmax (which makes sure that the predicted class probabilities sum to 1 for each sample). For the setting with $k$ binary classifiers, you could simply change that to an (elementwise) sigmoid function.
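As a plain-NumPy sketch of that last point (not tied to any particular framework; the logits `z` are made up for illustration): pushing the same logits through a softmax versus an elementwise sigmoid shows the trade-off described above.

```python
import numpy as np

z = np.array([0.85, 1.39])  # illustrative logits for two classes/labels

# Multi-class head: softmax forces the probabilities to sum to 1.
softmax = np.exp(z) / np.sum(np.exp(z))
print(softmax)   # roughly [0.37, 0.63] -- the classes trade off

# Multi-label head: elementwise sigmoid, each label scored independently.
sigmoid = 1 / (1 + np.exp(-z))
print(sigmoid)   # roughly [0.70, 0.80] -- both can be high at once
```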