In TensorFlow, the documentation for SparseCategoricalCrossentropy states that using from_logits=True, and therefore excluding the softmax operation from the last model layer, is more numerically stable for the loss calculation.
Why is this the case?
First of all, there is a good discussion of whether you should worry about numerical stability at all: check this answer, but in general you most likely do not need to care about it.
To answer your question "Why is this the case?", let's take a look at the source code:
def sparse_categorical_crossentropy(target, output, from_logits=False, axis=-1):
    """ ...
    """
    ...
    # Note: tf.nn.sparse_softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output)
    ...
You can see that if from_logits is False, the output value is clipped to the range [epsilon, 1 - epsilon] before the log is taken.
That means that once a predicted probability moves outside these bounds, the loss (and its gradient) no longer reacts to further changes in that value.
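When from_logits=True, the raw logits are handed to tf.nn.sparse_softmax_cross_entropy_with_logits instead, which effectively computes log softmax(z)_k = z_k - logsumexp(z) in one fused step, so no tiny probability is ever materialized and no clipping is needed. A minimal sketch of the idea (TF 2.x API; the values are chosen only to force float32 underflow):

import tensorflow as tf

logits = tf.constant([[60.0, 0.0, -60.0]])  # true class gets a vanishingly small probability
target = 2

# Naive path: softmax first, then log. The probability of class 2 underflows
# to 0.0 in float32, so the log blows up (this is why Keras clips).
probs = tf.nn.softmax(logits)
naive_loss = -tf.math.log(probs[0, target])          # inf

# Stable path (what the fused op does conceptually): stay in log space,
# log softmax(z)_k = z_k - logsumexp(z).
log_probs = logits - tf.math.reduce_logsumexp(logits, axis=-1, keepdims=True)
stable_loss = -log_probs[0, target]                  # ~120.0

print(naive_loss.numpy(), stable_loss.numpy())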
However, to my knowledge it is a fairly exotic situation in which this really matters.
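If you do want to see the effect with the Keras losses themselves, here is a small comparison (a sketch for TF 2.x; the exact numbers depend on the Keras version and the backend epsilon, which defaults to 1e-7):

import tensorflow as tf

scce_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
scce_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

y_true = [0]
logits = tf.constant([[-20.0, 10.0, 10.0]])  # true class has probability ~5e-14
probs = tf.nn.softmax(logits)

# Computed directly from logits: the "true" cross-entropy, roughly 30.7
print(scce_from_logits(y_true, logits).numpy())
# Computed from probabilities: ~5e-14 is clipped up to epsilon first,
# so the reported loss saturates at roughly -log(1e-7) ~ 16.1
print(scce_from_probs(y_true, probs).numpy())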