In TensorFlow, the documentation for SparseCategoricalCrossentropy states that using from_logits=True, and therefore excluding the softmax operation from the last model layer, is more numerically stable for the loss calculation.
Why is this the case?
First of all, there is a good discussion of whether you should worry about numerical stability at all: check this answer, but in general you most likely do not need to care about it.
To answer your question "Why is this the case?", let's take a look at the source code:
def sparse_categorical_crossentropy(target, output, from_logits=False, axis=-1):
    """ ...
    """
    ...
    # Note: tf.nn.sparse_softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output)
    ...
You can see that if from_logits is False, the output value is clipped to the range [epsilon, 1 - epsilon] before the log is taken.
That means that once a predicted probability moves outside these bounds, the loss (and its gradient) no longer reacts to further changes in that value.
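When from_logits=True, the raw logits are handed to tf.nn.sparse_softmax_cross_entropy_with_logits instead, which effectively computes log softmax(z)_k = z_k - logsumexp(z) in one fused step, so no tiny probability is ever materialized and no clipping is needed. A minimal sketch of the idea (TF 2.x API; the values are chosen only to force float32 underflow):

import tensorflow as tf

logits = tf.constant([[60.0, 0.0, -60.0]])  # true class gets a vanishingly small probability
target = 2

# Naive path: softmax first, then log. The probability of class 2 underflows
# to 0.0 in float32, so the log blows up (this is why Keras clips).
probs = tf.nn.softmax(logits)
naive_loss = -tf.math.log(probs[0, target])          # inf

# Stable path (what the fused op does conceptually): stay in log space,
# log softmax(z)_k = z_k - logsumexp(z).
log_probs = logits - tf.math.reduce_logsumexp(logits, axis=-1, keepdims=True)
stable_loss = -log_probs[0, target]                  # ~120.0

print(naive_loss.numpy(), stable_loss.numpy())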
However, to my knowledge it is a fairly exotic situation in which this really matters.
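If you do want to see the effect with the Keras losses themselves, here is a small comparison (a sketch for TF 2.x; the exact numbers depend on the Keras version and the backend epsilon, which defaults to 1e-7):

import tensorflow as tf

scce_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
scce_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

y_true = [0]
logits = tf.constant([[-20.0, 10.0, 10.0]])  # true class has probability ~5e-14
probs = tf.nn.softmax(logits)

# Computed directly from logits: the "true" cross-entropy, roughly 30.7
print(scce_from_logits(y_true, logits).numpy())
# Computed from probabilities: ~5e-14 is clipped up to epsilon first,
# so the reported loss saturates at roughly -log(1e-7) ~ 16.1
print(scce_from_probs(y_true, probs).numpy())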