3

When we train models, we often use the log of outputs in $(0,1)$ in our cost functions. Does anyone have a source that explains whether this improves training?

For example, suppose the cost is $f(\theta)=\sum_i \delta_{y_i=1}\log\hat{y}_i$.

I think I read somewhere that stretching $(0,1)$ to $(-\infty,0)$ helps learning.

Does anyone have a reference?

Neil G
  • 15,219
Alex
  • 351

2 Answers

6

Remember that, in the notation of the question: $\sum_i \delta_{y_i = 1}\log \hat{y}_i = \sum_i \log\left( \hat{y}_i^{\delta_{y_i = 1}}\right) = \log\left(\prod_i \hat{y}_i^{\delta_{y_i = 1}}\right)$

In short, if you don't take the log, you are multiplying a lot of numbers between 0 and 1 with each other. Doing that leaves you with very, very small numbers, so small that computers will have trouble working with them accurately.
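A quick numerical sketch of the failure mode (this assumes NumPy; the 1000 random probabilities are arbitrary, just enough to trigger underflow):

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.99, size=1000)  # 1000 predicted probabilities in (0, 1)

product = np.prod(probs)         # multiply them directly
log_sum = np.sum(np.log(probs))  # the same quantity, computed in log space

print(product)  # 0.0 -- the true value underflows below the smallest
                # positive double, about 5e-324
print(log_sum)  # here roughly -960: a comfortably representable number
```

Taking logs turns the long product into a sum, which is exactly the "stretching $(0,1)$ to $(-\infty,0)$" that the question mentions.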

Maarten Buis
  • 21,005
  • I am curious about this answer, and upvoted it. However, I don't currently understand. If the targets are one-hot, the product has only one non-unity term. That's not a lot of numbers between 0 and 1 being multiplied; it's just 1s and a single fraction. – Alex Sep 16 '22 at 23:44
  • However, |log(min double precision)| << max double precision, which gives credit to your idea. – Alex Sep 19 '22 at 02:45
5

There is a lot going on here, but this point may be useful: log loss is a "proper loss" or "proper scoring rule".

Setup: suppose the classifier output $p$ is a probability distribution over $\mathcal{Y}$, the set of possible labels. That is, rather than guessing a single label, our model outputs a posterior distribution over them. Suppose the true Bayes-optimal distribution is $p'$. Then we want a loss $\ell(p,y)$ such that setting $p=p'$ minimizes the expected loss. This is the definition of "proper".

Log loss is proper: $E_{y \sim p'}\, \ell(p,y) = - \sum_y p'(y) \log p(y)$, and you can check that the minimizing choice is $p=p'$: the expected loss exceeds the entropy of $p'$ by exactly $\mathrm{KL}(p' \,\|\, p) \ge 0$, which vanishes only at $p=p'$. (The expected loss expression is known as cross-entropy.)
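If it helps to see this numerically, here is a minimal check (a sketch assuming NumPy; `p_true` is an arbitrary example distribution over three labels): scanning a grid of candidate distributions, the expected log loss is smallest at $p = p'$ itself.

```python
import numpy as np

p_true = np.array([0.6, 0.3, 0.1])  # the Bayes-optimal distribution p'

def expected_log_loss(p, p_true):
    """E_{y ~ p'}[-log p(y)], i.e. the cross-entropy H(p', p)."""
    return -np.sum(p_true * np.log(p))

# Brute-force scan over a grid of candidate distributions on 3 labels.
grid = np.linspace(0.01, 0.98, 98)  # probabilities 0.01, 0.02, ..., 0.98
best_loss, best_p = np.inf, None
for a in grid:
    for b in grid:
        c = 1.0 - a - b
        if c <= 0.0:
            continue  # not a valid distribution
        loss = expected_log_loss(np.array([a, b, c]), p_true)
        if loss < best_loss:
            best_loss, best_p = loss, np.array([a, b, c])

print(best_p)  # ~[0.6, 0.3, 0.1]: the honest report p = p' wins
```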

It would not be proper to, for example, use the loss $\ell(p,y) = -p(y)$. In this case the expected loss is $-\sum_y p'(y)\, p(y)$, and you can check that it is minimized not by $p=p'$ but by the delta distribution on the mode of $p'$.
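The same sort of check works here (again just a sketch, reusing the example $p'$ from above): reporting honestly is strictly worse than putting all mass on the mode.

```python
import numpy as np

p_true = np.array([0.6, 0.3, 0.1])  # the true distribution p'

def expected_linear_loss(p, p_true):
    """E_{y ~ p'}[-p(y)] = -sum_y p'(y) p(y)."""
    return -np.sum(p_true * p)

honest = expected_linear_loss(p_true, p_true)  # report p' truthfully
mode = expected_linear_loss(np.array([1.0, 0.0, 0.0]), p_true)  # delta on the mode

print(honest)  # ~ -0.46
print(mode)    # -0.60 -- strictly smaller, so honest reporting is not optimal
```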

In fact, log loss is (up to affine transformations) the only proper loss of the form $\ell(p,y) = f(p(y))$, i.e., the only proper loss that depends only on the probability assigned to the observed label $y$ and not on the rest of the distribution.

Resource: https://stats.stackexchange.com/a/493949/70612

usul
  • 884
  • This is helpful; can you give a reference that presents loss this way? – Alex Sep 19 '22 at 02:54
  • 1
    @Alex I found a great crossvalidated answer which I've linked at the bottom. – usul Sep 19 '22 at 15:24
  • Thanks for that link. I am also wondering about the formulation of loss in terms of a distribution and a class, like the l(p,y) you show. This looks similar to a Bayesian decision theory formulation I saw, which computes expected loss with the loss given in terms of a parameter estimate and an action. Do you have a reference for a loss formulation like the one you present? I think this is related to the term you mentioned, "Bayes-optimal". – Alex Sep 19 '22 at 17:53