
During my studies I have come across what seem to be two different definitions of the logistic loss function. Please see the pictures attached.

  • What do I make of these two different definitions?
  • Which one should I use?

Am I even wrong to say that they are different? Please use plain English. Thanks!

[Attached images: the two loss-function definitions referenced in the question]

nwaldo
    Maximum likelihood estimation is the traditional way of estimating the parameters of a logistic regression, and it is equivalent to using the binary cross-entropy loss. There are many different types of loss functions, however, even for the same model. Linear regression can use square loss (ordinary least squares) or could use the absolute values of the residuals instead of their squares. These give different results, and you may value the properties of one over the other, but neither is “right” or “wrong”. – Dave Apr 25 '20 at 20:31
  • What does your second formula have to do with the logistic function? I don't see it anywhere, not even in disguise. Since the two formulas have no apparent connection or relationship, how are we supposed to answer the question about choosing one over the other?? – whuber Apr 25 '20 at 21:23

1 Answer


Quick Take: it turns out that the two are equivalent, so it does not matter which you use as long as you are clear about what the terms mean and what numbers you input into the equations.

Let's break down what the terms mean in each equation.

$$ \text{Logistic Loss}\\ \dfrac{1}{N}\overset{N}{\underset{i=1}{\sum}} \log\left(1 + \exp(-y_i w^Tx_i)\right) $$

(This is the full "logistic loss": the equation given in the question is each observation's contribution to the loss, and the total loss is the mean of those contributions.)

$N$ is the sample size.

$y_i\in\{-1,+1\}$ is the $i$th true value.

$w^T$ is the transposed parameter vector estimate of the logistic regression.

$x_i$ is the $i$th feature vector (your vector of predictors).

Note that $w^Tx_i$ is the predicted value of the logistic regression on the log-odds scale (so before applying the inverse link function to convert to probability). After all, a generalized linear model is $g(\mathbb E[y\vert X=x_i])=w^Tx_i$.

Therefore, the logistic loss will be useful if you have coded your categories as $\pm1$. The predicted values you input into the loss function along with these $\pm1$-coded categories are the log-odds.
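As a minimal sketch (with made-up numbers, just to illustrate what goes in), the logistic loss takes the $\pm1$-coded labels together with the log-odds $w^Tx_i$:

y_pm    <- c(-1, 1, 1)                 # hypothetical labels coded as -1/+1
logodds <- c(-2.0, 0.5, 1.5)           # hypothetical w^T x_i values (log-odds)
log(1 + exp(-y_pm * logodds))          # each observation's contribution to the loss
mean(log(1 + exp(-y_pm * logodds)))    # the full logistic loss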

$$ \text{Log Loss}\\ -\dfrac{1}{N}\overset{N}{\underset{i=1}{\sum}}\left[ y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i)) \right] $$

$N$ is the sample size.

$y_i\in\{0, 1\}$ is the $i$th true value.

$p(y_i)$ is the predicted probability that observation $i$ belongs to category $1$. This is the predicted value of the logistic regression on the probability scale, obtained by applying the inverse of the log-odds link function to the linear predictor of the logistic regression.

$$ p(y_i) = \dfrac{1}{ 1 + \exp(-w^Tx_i) }\\ \Big\Updownarrow\\ w^Tx_i = \log\left( \dfrac{ p(y_i) }{ 1 - p(y_i) } \right) $$

This "log" form of the loss function makes sense when the categories are coded as $0$ and $1$ instead of $\pm1$ and when you have predicted probabilities.

That you can convert easily between the $\{0,1\}$ and $\{-1,+1\}$ categorical encodings and between the log-odds and probabilities means that you are free to use whichever you like. Just keep track of what goes into which equation. For instance, do not mix together the predicted log-odds and $\{0,1\}$ encoding.

If you want to use log-odds and $\{-1,+1\}$ encoding, use the "logistic" form of the loss function. If you want to use probability and $\{0,1\}$ encoding, use the "log" form of the loss function.
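For the curious, a quick case-by-case check of why the two forms agree, writing $z_i = w^Tx_i$ and $p(y_i) = 1/(1+\exp(-z_i))$:

$$ \begin{aligned} y_i = 1 \text{ (coded } +1\text{):}\quad & -\log(p(y_i)) = \log\left(1 + \exp(-z_i)\right) = \log\left(1 + \exp(-(+1)z_i)\right)\\ y_i = 0 \text{ (coded } -1\text{):}\quad & -\log(1 - p(y_i)) = \log\left(1 + \exp(z_i)\right) = \log\left(1 + \exp(-(-1)z_i)\right) \end{aligned} $$

Either way, the per-observation log loss equals $\log\left(1 + \exp(-y_i z_i)\right)$ with the $\pm1$-coded label, so averaging over the observations gives the same number as the logistic loss.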

EDIT

A simulation is not a proof, but it was reassuring to see that, in the simulation below, which evaluates both loss functions over a grid of more than $25{,}000$ candidate parameter values for the logistic regression, the two give identical loss values when each is passed the appropriate arguments.

set.seed(2023)
library(ggplot2)

# Simulate data from a known logistic model
N <- 100
x <- runif(N, 0, 1)
z <- 4*x - 2                  # true log-odds
p <- 1/(1 + exp(-z))          # true probabilities
y01  <- rbinom(N, 1, p)       # labels coded as 0/1
y_pm <- 2 * y01 - 1           # the same labels coded as -1/+1

# Grid of candidate intercepts and slopes to evaluate
b0s <- seq(-4, 0, 0.025)
b1s <- seq(2, 6, 0.025)
log_losses <- logistic_losses <- rep(NA, length(b0s) * length(b1s))
# "Log" (binary cross-entropy) loss: expects probabilities and 0/1 labels
log_loss <- function(p, y){
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# "Logistic" loss: expects log-odds and -1/+1 labels
logistic_loss <- function(logodds, y){
  mean(log(1 + exp(-y * logodds)))
}

# Evaluate both losses for every (intercept, slope) combination
counter <- 1
for (i in 1:length(b0s)){
  print(i)                     # progress indicator
  intercept <- b0s[i]
  for (j in 1:length(b1s)){
    slope <- b1s[j]
    log_odds <- intercept + slope*x
    probability <- 1/(1 + exp(-log_odds))
    log_losses[counter]      <- log_loss(probability, y01)
    logistic_losses[counter] <- logistic_loss(log_odds, y_pm)
    counter <- counter + 1
  }
}

# Regress one loss on the other and plot them against each other
L <- lm(log_losses ~ logistic_losses)
d <- data.frame(
  log_loss = log_losses,
  logistic_loss = logistic_losses
)
ggplot(d, aes(x = logistic_loss, y = log_loss)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0)
summary(L)

Coefficients:
                  Estimate Std. Error    t value Pr(>|t|)    
(Intercept)     -5.384e-15  3.348e-17 -1.608e+02   <2e-16 ***
logistic_losses  1.000e+00  4.422e-17  2.261e+16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.189e-15 on 25919 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 5.114e+32 on 1 and 25919 DF,  p-value: < 2.2e-16

Compare the two loss values

Indeed, the differences between the two calculations are all on the order of $10^{-16}$ or smaller.

summary(abs(logistic_losses - log_losses))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 0.000e+00 0.000e+00 2.115e-17 0.000e+00 6.661e-16 
Dave