
I was looking at the Informer model implemented in HuggingFace and found that it is trained with a negative log-likelihood (NLL) loss even though it is a model for a regression task. How can NLL loss be used for regression? I thought it was a loss used only for classification.

    Minimizing the negative log-likelihood function is equivalent to maximum likelihood estimation. Classification or regression is irrelevant here. – mhdadk Apr 05 '23 at 14:40

1 Answer


As I discuss here, "negative log likelihood" seems to be slang in some circles for the log loss in classification problems, since minimizing that log loss is equivalent to maximizing the binomial log-likelihood.

However, maximum likelihood estimation is a general idea in statistics. For instance, minimizing the square loss, as is done in OLS linear regression, is equivalent to maximum likelihood estimation of the regression parameters under a Gaussian likelihood (that is, i.i.d. Gaussian error terms in the linear regression). Other loss functions correspond to maximum likelihood estimation, too: minimizing absolute loss, for example, is equivalent to maximizing a Laplace likelihood. Indeed, if you approach a regression problem via maximum likelihood estimation of the parameters, that is equivalent to estimating the regression parameters by minimizing the negative log-likelihood. If $L(\theta)$ is the likelihood function, then the following is true.

$$ \underset{\theta}{\arg\max} \{L(\theta)\} = \underset{\theta}{\arg\min} \{-\log\left(L(\theta)\right)\} $$

(The left side is the set of values of the parameter $\theta$, possibly a vector, that maximize the likelihood function, and the right side is the set of values of that same parameter that minimize the negative log-likelihood.)
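
To make this concrete, here is a minimal numerical sketch (my own illustration, not anything from the Informer code): fitting a linear regression by minimizing the Gaussian negative log-likelihood with a generic optimizer recovers the same coefficients as the closed-form least-squares solution. The simulated data and the use of `scipy.optimize.minimize` are assumptions of the example, not part of the question's setup.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # intercept + one feature
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=200)

def gaussian_nll(params):
    # Negative log-likelihood of y given X under i.i.d. Gaussian errors,
    # dropping the additive constant (n/2) * log(2*pi).
    beta, log_sigma = params[:2], params[2]
    sigma = np.exp(log_sigma)          # log-scale parameterization keeps sigma positive
    resid = y - X @ beta
    return 0.5 * np.sum(resid ** 2) / sigma ** 2 + len(y) * np.log(sigma)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]          # closed-form least squares
beta_mle = minimize(gaussian_nll, x0=np.zeros(3)).x[:2]  # numerical NLL minimization

print(beta_ols)  # both are close to the true coefficients [1.0, 2.0]
print(beta_mle)  # and agree with each other up to optimizer tolerance
```

Note that for any fixed $\sigma$, the NLL above is just a rescaled sum of squared residuals plus a constant, which is why the minimizing coefficients coincide with the least-squares ones.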

This holds even outside of regression and supervised learning. You can use the same idea to estimate a variance or any other parameter you want (assuming you are happy to use maximum likelihood estimation), as in the sketch below.
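
For example, here is a hedged sketch of the same idea outside regression: the variance of a Gaussian sample estimated by numerically minimizing the negative log-likelihood matches the closed-form maximum likelihood estimate (the mean squared deviation). The data and helper function are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=500)

def nll_variance(var):
    # Gaussian negative log-likelihood as a function of the variance only,
    # with the sample mean plugged in and additive constants dropped.
    return 0.5 * (len(x) * np.log(var) + np.sum((x - x.mean()) ** 2) / var)

var_mle = minimize_scalar(nll_variance, bounds=(1e-6, 50.0), method="bounded").x

print(var_mle)            # numerically close to ...
print(np.var(x, ddof=0))  # ... the closed-form MLE (mean squared deviation)
```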

Your reference seems to be using correct statistical terminology, and this is exactly the point being made in the comment above.
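
I have not checked exactly how the HuggingFace Informer computes its loss, but as a hedged sketch of the general pattern, PyTorch's `torch.nn.GaussianNLLLoss` lets a regression network predict a mean and a variance for each target and be trained by minimizing the Gaussian negative log-likelihood. The toy network and dummy data below are my own illustration, not the HuggingFace code.

```python
import torch
import torch.nn as nn

# Toy network that outputs a predicted mean and a raw (unconstrained) variance.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
nll = nn.GaussianNLLLoss()

x = torch.randn(64, 10)   # dummy batch of features
y = torch.randn(64, 1)    # dummy regression targets

out = model(x)
mean = out[:, :1]
var = nn.functional.softplus(out[:, 1:]) + 1e-6   # variance must be positive

loss = nll(mean, y, var)  # Gaussian NLL used directly as the regression loss
loss.backward()
```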

– Dave