I just realized I have not given this much thought. In a classification task, an argmax is applied after the softmax to pick the most likely class. So how does backpropagation go through that operation?
Indeed, as Tim mentioned, typically we would use a softmax rather than an argmax. Though I should mention that it is possible to compute a subgradient of the "hard" maximum function, which acts like the identity function w.r.t. the maximizing input and like the zero function for the other inputs (almost everywhere). – John Madden Feb 25 '23 at 18:04
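For concreteness, here is a minimal NumPy sketch of the subgradient described in that comment (the function names and example values are made up for illustration):

```python
import numpy as np

def hard_max(z):
    """'Hard' maximum: returns the largest entry of z."""
    return np.max(z)

def hard_max_subgradient(z):
    """A subgradient of max(z): identity w.r.t. the maximizing input,
    zero for all other inputs. Valid almost everywhere; at ties, any
    convex combination of the maximizers is also a subgradient."""
    g = np.zeros_like(z, dtype=float)
    g[np.argmax(z)] = 1.0
    return g

z = np.array([0.2, 1.5, -0.3])
print(hard_max(z))              # 1.5
print(hard_max_subgradient(z))  # [0. 1. 0.]
```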
1 Answer
It does not. Almost always, the model is trained to predict a real-valued score (e.g. a probability) by minimizing some smooth loss function. During training, such a model never makes hard classifications, so it never uses argmax. The final layer of most classification neural networks is a softmax rather than a hardmax. You make the hard classification only after training, using the scores returned by the model.
See also the related thread Why is accuracy not the best measure for assessing classification models?, which discusses problems with accuracy as a metric; many of the points apply to other non-smooth metrics as well. Training the model to minimize such a metric would actually be harder.
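As a toy illustration of this workflow (a small softmax-regression sketch in NumPy; the data, learning rate, and iteration count are made up for the example): the smooth cross-entropy loss is minimized during training, and argmax appears only afterwards, at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with some signal: 3 classes, 5 features (all made up for the example).
X = rng.normal(size=(200, 5))
true_W = rng.normal(size=(5, 3))
y = np.argmax(X @ true_W, axis=1)

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Training: minimize the smooth cross-entropy of the softmax scores.
# Note there is no argmax anywhere in this loop.
W = np.zeros((5, 3))
for _ in range(1000):
    grad = softmax(X @ W)                 # predicted class probabilities
    grad[np.arange(len(y)), y] -= 1.0     # d(cross-entropy)/d(logits) = p - onehot(y)
    W -= 0.5 * X.T @ grad / len(y)

# Hard classification only after training: apply argmax to the scores.
predictions = np.argmax(X @ W, axis=1)
print("training accuracy:", (predictions == y).mean())
```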
– Tim
When computing categorical cross-entropy loss, the negative classes are ignored. Isn't that in some sense doing argmax? – Sam Feb 25 '23 at 09:08
@Sam Where is the argmax in that? It literally does not consider the magnitude of the outputs. – Firebug Feb 25 '23 at 13:41
OK, I get it. It only considers the magnitude of the probability of the relevant class, not which class. But it does care about the 'magnitude of the outputs'. – Sam Feb 25 '23 at 14:40
@Sam No, cross-entropy very much cares about which particular output should be maximized! The negative classes do get involved through softmax, where the score ("logit") of the target class has to compete with the rest, which is expressed in the denominator. Fortunately, the Jacobian turns out to be pretty nice :-) – dedObed Feb 25 '23 at 17:31
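To make this comment concrete (a small numerical check with made-up logits, not from the thread itself): for cross-entropy on top of softmax, the gradient with respect to the logits works out to softmax(z) minus the one-hot target, so every negative class receives gradient through the softmax denominator.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, target):
    return -np.log(softmax(z)[target])

z = np.array([2.0, -1.0, 0.5])  # arbitrary example logits
target = 0

# Analytic gradient w.r.t. the logits: softmax(z) - onehot(target).
analytic = softmax(z)
analytic[target] -= 1.0

# Central finite-difference check of the same gradient.
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], target) -
     cross_entropy(z - eps * np.eye(3)[i], target)) / (2 * eps)
    for i in range(3)
])

print(analytic)  # nonzero for every class, not just the target
print(numeric)   # matches the analytic gradient
```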