3

When we train models, we often use the log of outputs in $(0,1)$ in our cost functions. Does anyone have a source that explains whether this improves training?

For example, suppose the cost is $f(\theta)=\sum_i \delta_{y_i=1}\log\hat{y}_i$.

I think I read somewhere that stretching $(0,1)$ to $(-\infty,0)$ helps learning.

Does anyone have a reference?

Neil G
  • 15,219
Alex
  • 351

2 Answers

6

Remember that, in the notation of the question: $\sum_i \delta_{y_i = 1}\log \hat{y}_i = \sum_i \log\left( \hat{y}_i^{\delta_{y_i = 1}}\right) = \log\left(\prod_i \hat{y}_i^{\delta_{y_i = 1}}\right)$

In short, if you don't take the log, you are multiplying a lot of numbers between 0 and 1 with each other. Doing that leaves you with very, very small numbers, so small that computers will have trouble working with them accurately.
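A quick numerical sketch of the failure mode (this assumes NumPy; the 1000 random probabilities are arbitrary, just enough to trigger underflow):

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 0.99, size=1000)  # 1000 predicted probabilities in (0, 1)

product = np.prod(probs)         # multiply them directly
log_sum = np.sum(np.log(probs))  # the same quantity, computed in log space

print(product)  # 0.0 -- the true value underflows below the smallest
                # positive double, about 5e-324
print(log_sum)  # here roughly -960: a comfortably representable number
```

Taking logs turns the long product into a sum, which is exactly the "stretching $(0,1)$ to $(-\infty,0)$" that the question mentions.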

Maarten Buis
  • 21,005
  • I am curious about this answer, and upvoted it. However, I don't currently understand. If the targets are one-hot, the product has only one non-unity term. That's not a lot of numbers between 0 and 1 being multiplied; it's just 1s and a single fraction. – Alex Sep 16 '22 at 23:44
  • However, |log(min double precision)| << max double precision, which gives credit to your idea. – Alex Sep 19 '22 at 02:45
5

There is a lot going on here, but this point may be useful: log loss is a "proper loss" or "proper scoring rule".

Setup: suppose the classifier output $p$ is a probability distribution over $\mathcal{Y}$, the set of possible labels. That is, rather than guessing a single label, our model outputs a posterior distribution over them. Suppose the true Bayes-optimal distribution is $p'$. Then we want a loss $\ell(p,y)$ such that setting $p=p'$ minimizes the expected loss. This is the definition of "proper".

Log loss is proper: $E_{y \sim p'}\, \ell(p,y) = - \sum_y p'(y) \log p(y)$, and you can check that the minimizing choice is $p=p'$: the expected loss exceeds the entropy of $p'$ by exactly $\mathrm{KL}(p' \,\|\, p) \ge 0$, which vanishes only at $p=p'$. (The expected loss expression is known as cross-entropy.)
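If it helps to see this numerically, here is a minimal check (a sketch assuming NumPy; `p_true` is an arbitrary example distribution over three labels): scanning a grid of candidate distributions, the expected log loss is smallest at $p = p'$ itself.

```python
import numpy as np

p_true = np.array([0.6, 0.3, 0.1])  # the Bayes-optimal distribution p'

def expected_log_loss(p, p_true):
    """E_{y ~ p'}[-log p(y)], i.e. the cross-entropy H(p', p)."""
    return -np.sum(p_true * np.log(p))

# Brute-force scan over a grid of candidate distributions on 3 labels.
grid = np.linspace(0.01, 0.98, 98)  # probabilities 0.01, 0.02, ..., 0.98
best_loss, best_p = np.inf, None
for a in grid:
    for b in grid:
        c = 1.0 - a - b
        if c <= 0.0:
            continue  # not a valid distribution
        loss = expected_log_loss(np.array([a, b, c]), p_true)
        if loss < best_loss:
            best_loss, best_p = loss, np.array([a, b, c])

print(best_p)  # ~[0.6, 0.3, 0.1]: the honest report p = p' wins
```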

It would not be proper to, for example, use the loss $\ell(p,y) = -p(y)$. In this case the expected loss is $-\sum_y p'(y)\, p(y)$, and you can check that it is minimized not by $p=p'$ but by the delta distribution on the mode of $p'$.
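The same sort of check works here (again just a sketch, reusing the example $p'$ from above): reporting honestly is strictly worse than putting all mass on the mode.

```python
import numpy as np

p_true = np.array([0.6, 0.3, 0.1])  # the true distribution p'

def expected_linear_loss(p, p_true):
    """E_{y ~ p'}[-p(y)] = -sum_y p'(y) p(y)."""
    return -np.sum(p_true * p)

honest = expected_linear_loss(p_true, p_true)  # report p' truthfully
mode = expected_linear_loss(np.array([1.0, 0.0, 0.0]), p_true)  # delta on the mode

print(honest)  # ~ -0.46
print(mode)    # -0.60 -- strictly smaller, so honest reporting is not optimal
```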

In fact, log loss is (up to affine transformations) the only proper loss of the form $\ell(p,y) = f(p(y))$, i.e., the only proper loss that depends only on the probability assigned to the observed label $y$ and not on the rest of the distribution.

Resource: https://stats.stackexchange.com/a/493949/70612

usul
  • 884
  • This is helpful; can you give a reference that presents loss this way? – Alex Sep 19 '22 at 02:54
  • 1
    @Alex I found a great crossvalidated answer which I've linked at the bottom. – usul Sep 19 '22 at 15:24
  • Thanks for that link. I am also wondering about the formulation of loss in terms of a distribution and a class, like the l(p,y) you show. This looks similar to a Bayesian decision theory formulation I saw, which computes expected loss with the loss given in terms of a parameter estimate and an action. Do you have a reference for a loss formulation like the one you present? I think this is related to the term you mentioned, "Bayes-optimal". – Alex Sep 19 '22 at 17:53