I'm curious why the Negative Log Likelihood (NLL) loss is used for classification tasks in PyTorch (see here). The negative log likelihood is a much more general notion than a measure of error in a classification problem.
Yes, the negative log likelihood of a Categorical distribution can be minimized (with respect to some parameters) to perform maximum likelihood estimation, but it is not reserved for the Categorical distribution.
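For concreteness, here is a small sketch (with made-up logits and labels) showing that what PyTorch calls the NLL loss is exactly the batch-averaged negative log likelihood of a Categorical distribution:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

# Made-up logits for 3 examples over 5 classes, and made-up class labels.
logits = torch.randn(3, 5)
targets = torch.tensor([1, 0, 4])

# What PyTorch calls the NLL loss: it expects log-probabilities as input.
loss_pytorch = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# The same number, written as the negative log likelihood of a
# Categorical distribution, averaged over the batch.
loss_manual = -Categorical(logits=logits).log_prob(targets).mean()

print(torch.allclose(loss_pytorch, loss_manual))  # True
```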
The negative log likelihood is a function we can write down for any distribution. For example, we can also minimize the negative log likelihood of a Gaussian distribution in a simple regression problem, which is again just maximum likelihood estimation.
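As a sketch of that regression case (the toy data, the fixed unit noise scale, and the learning rate are all assumptions for illustration):

```python
import torch
from torch.distributions import Normal

# Toy 1-D regression data (assumed for illustration).
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 3.0 * x + 0.5 + 0.1 * torch.randn_like(x)

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)

for _ in range(500):
    optimizer.zero_grad()
    mean = x * w + b
    # Negative log likelihood of the data under a Gaussian with a fixed
    # unit scale; minimizing it is maximum likelihood (and here coincides
    # with ordinary least squares).
    nll = -Normal(mean, 1.0).log_prob(y).mean()
    nll.backward()
    optimizer.step()
```

After training, `w` and `b` should end up near the true values used to generate the data, even though the objective is again a "negative log likelihood".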
Is there any reason PyTorch decided to do this? Is "negative log likelihood" a term that is commonly abused to refer to a classification objective?