
For better or for worse, some people shoehorn binary $y$ variables into an ordinary least squares linear regression.

$$ \mathbb E[Y\vert X]=\hat y=X\beta $$

If we encode the $y_i$ as either $0$ or $1$, we can make this work: the usual OLS estimate $\hat\beta_{OLS}=(X^TX)^{-1}X^Ty$ can be computed as always. However, its statistical properties are lacking.

  1. We can predict impossible probability values like $-0.2$ and $+1.4$ (the quick sketch after this list illustrates this).

  2. The OLS solution corresponds to maximum likelihood estimation for a Gaussian likelihood, even though we know the likelihood to be binomial.
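
For concreteness, here is a quick sketch of that first problem with some toy data (any simulated binary response will do):

# Toy illustration: OLS on a 0/1 response can produce fitted "probabilities" outside [0, 1]
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(1 + 3*x))   # 0/1 response from a logistic model
ols <- lm(y ~ x)                       # linear probability model via OLS
range(fitted(ols))                     # typically extends below 0 and/or above 1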

If, however, we keep the linear model of the probability but, instead of square loss, minimize cross-entropy loss (equivalent to maximum likelihood estimation under the correct binomial likelihood), what happens?

$$ \text{Cross-entropy loss:}\qquad L(y,\hat y)=-\dfrac{1}{N}\sum_{i=1}^N\bigg[ y_i\log(\hat y_i)+ (1-y_i)\log(1-\hat y_i) \bigg] $$
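
To make the question concrete, here is a rough sketch of what I mean, using a generic optimizer on toy data; $\hat y$ is taken straight from the linear predictor, so the logs can hit invalid values (expect "NaNs produced" warnings), which is exactly the worry discussed below.

# Rough sketch of the idea: keep yhat = X %*% beta linear, minimize cross-entropy directly.
# Toy data; whenever yhat leaves (0, 1), log() returns NaN.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + 2*x))   # 0/1 response
X <- cbind(1, x)                     # design matrix with an intercept column

crossentropy <- function(beta, X, y) {
  yhat <- X %*% beta                 # linear predictor used directly as a probability
  -mean(y * log(yhat) + (1 - y) * log(1 - yhat))
}

# Start where every yhat is inside (0, 1): intercept = mean(y), slope = 0
fit <- optim(c(mean(y), 0), crossentropy, X = X, y = y)
fit$par                              # may or may not be sensible; that is the question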

NEW INFO

As an update, R’s glm function is unhappy when I use a binomial family and an identity link function (which I believe is exactly the model I have in mind). This could just be the particular numerical method used by the function, but it strikes me as a point against the idea of minimizing a different loss function in a linear probability model.

NEWER INFO

I find evidence against this idea (which I don't claim is a good idea, just an idea) in this quick R simulation (which takes code from a simulation I found on Cross Validated and have used many times).

set.seed(2022)
N <- 1000
x1 <- rnorm(N)           # some continuous variables 
x2 <- rnorm(N)
z <- 1 + 2*x1 + 3*x2        # linear combination with a bias
pr <- 1/(1+exp(-z))         # pass through an inv-logit function
y <- rbinom(N, 1, pr)      # bernoulli response variable

L1 <- glm(y ~ x1 + x2, family = binomial)                            # logistic regression
L2 <- lm(y ~ x1 + x2)                                                # linear probability model via OLS
L3 <- glm(y ~ x1 + x2, family = binomial(link = "identity"))         # linear probability model via binomial ML

The logistic regression in L1 compiles. The linear probability model in L2, estimated via ordinary least squares, compiles. The linear probability model estimated by minimizing crossentropy loss in L3 gives the error message "Error: no valid set of coefficients has been found: please supply starting values", which I have not been able to resolve by tacking a start argument onto the glm call. This might just be the particular numerical method used in this function, but this sure seems like a strike against my idea to minimize the crossentropy loss but keep the linear model.
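
For completeness, the kind of start I tacked on looked roughly like this (starting values taken from the OLS fit; it did not resolve the error for me):

# One attempt: supply the OLS coefficients as starting values (this still failed for me)
L3 <- glm(y ~ x1 + x2, family = binomial(link = "identity"),
          start = coef(L2))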

However, I still wonder whether this is an issue of the numerical optimization choking on these illegal $\log$ values, or whether there is something theoretically wrong with $\hat\beta = \underset{\beta\in\mathbb R^p}{\text{argmin}}\{L(y,\hat y)\}.$

(Maybe the right way to write the $\text{argmin}$ would be $\underset{\beta\in S}{\text{argmin}}\{L(y,\hat y)\}$ for $S=\{\beta\in\mathbb R^p\vert L(y,\hat y)\in\mathbb R\}$.)

Dave
  • Here are some posts about binomial regression with identity link function, it can work well: https://stats.stackexchange.com/questions/198439/are-there-any-reasons-to-use-the-identity-link-in-logistic-regression-or-any-ot, https://stats.stackexchange.com/questions/139917/r-binomial-family-with-identity-link, https://stats.stackexchange.com/questions/471374/generalized-linear-model-and-identity-link-whats-its-benefit, – kjetil b halvorsen Sep 22 '22 at 17:29
  • @kjetilbhalvorsen Maybe it’s my R version, but the code at some of those links looks like what I ran and had fail to compile. Weird. – Dave Sep 22 '22 at 17:32
  • Weird ... but those posts are old and R might have changed. You might investigate and if necessary ask at R-help – kjetil b halvorsen Sep 22 '22 at 17:34
  • @kjetilbhalvorsen Just seeing that it once worked convinces me that the statistics side of this model is fine, no matter what the software implementation requires. Thanks for the links! – Dave Sep 22 '22 at 17:36

3 Answers


I anticipate there is a real problem with this approach because there is no way to enforce that $L$ can be computed when, as you mention, $\hat{y}$ is outside the unit interval. In those cases, the summand cannot be computed.
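
A one-line illustration of that failure mode (made-up values):

# The summand is undefined as soon as a prediction leaves the unit interval
yhat <- c(0.3, -0.2, 1.4)
log(yhat)       # NaN for -0.2 ("NaNs produced" warning)
log(1 - yhat)   # NaN for 1.4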

  • +1 I’ve wondered about that, but I also wonder if that would force the in-sample predictions to be valid probabilities. – Dave Sep 22 '22 at 04:47
  • glm.fit does that by step-halving if the Fisher scoring algorithm gives in-sample predictions outside (0,1). See https://github.com/SurajGupta/r-source/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/glm.R#L288 (and the result of binomial("identity")$validmu) – Mark Sep 23 '22 at 00:50

You can't use cross-entropy loss with a linear model. Notice that it calculates $\log(\hat y)$ and $\log(1 - \hat y)$, while $\hat y$ can be negative or bigger than one, so the logs would be undefined (NaN in practice) and it simply won't work. To make it work, you would need to adapt the loss function so that it returns something like $\infty$ for $\hat y < 0$ and $\hat y > 1$, but then it's not the elegant cross-entropy loss that we know anymore. Another solution might be to transform $\hat y$ so that it is constrained, for example by truncating it (again, not nice) or by passing it through something like a logistic function...
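
For example, a rough sketch of the first workaround (treating any prediction outside $(0,1)$ as having infinite loss and handing the result to a generic optimizer; a sketch, not a recommendation):

# Sketch: return Inf whenever the linear predictor leaves (0, 1), then optimize as usual
penalized_ce <- function(beta, X, y) {
  yhat <- X %*% beta
  if (any(yhat <= 0) || any(yhat >= 1)) return(Inf)
  -mean(y * log(yhat) + (1 - y) * log(1 - yhat))
}
# Usage, assuming X is a design matrix with an intercept column and y is a 0/1 vector,
# starting from a point where every fitted value is valid:
# optim(c(mean(y), rep(0, ncol(X) - 1)), penalized_ce, X = X, y = y)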

Tim
  • This was my concern, but why wouldn’t that just force the optimization to pick parameters that assure all $\hat y_i\in (0,1)?$ – Dave Sep 22 '22 at 12:27
  • @Dave NaN is an invalid result for an optimization algorithm, there is no way it could use it as it doesn't tell it if it gets closer or farther away from the optimum. – Tim Sep 22 '22 at 12:31
  • @Tim I disagree; common in the OR community is to rewrite constraint violations as a cost of $\infty$ (which we can coerce NaNs to be), and algorithms based on line searches or trust regions can handle this (such as the BFGS implemented in R's "optim"). – John Madden Sep 22 '22 at 21:11
  • @JohnMadden you disagree with what? This is what I said in the answer. – Tim Sep 22 '22 at 21:27
  • @Tim Disagree with "NaN is an invalid result for an optimization algorithm, there is no way it could use it as it" – John Madden Sep 22 '22 at 22:08
  • @JohnMadden there is a difference between Inf and NaN, Inf is higher than all the other values, NaN cannot be compared to numerical values. – Tim Sep 22 '22 at 22:13
  • @Tim Sure but we can simply view Dave's $\hat{y}\in(0,1)$ as a nonlinear constraint, and then feel good about returning $\infty$ if it is violated. – John Madden Sep 22 '22 at 22:23
  • Hmm actually there may be issues with unbounded below costs. – John Madden Sep 23 '22 at 03:42
  • @JohnMadden and this is what I suggested in my answer :) – Tim Sep 23 '22 at 05:17

In the glm function, if family = binomial is chosen, then the link function must be logit, as in L3 <- glm(y ~ x1 + x2, family = binomial(link = "logit")); otherwise it will throw an error. If a linear link is chosen, then it assumes a linear model and the purpose of logistic regression is lost.

Secondly, if logistic regression is implemented from scratch, then either the frequentist route is taken, in which IRLS (Iteratively Reweighted Least Squares) is the most popular algorithm, or, if a Bayesian approach is adopted, there are two routes: pure MH (Metropolis-Hastings) sampling, or MH sampling within a Gibbs approach.
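
For reference, a bare-bones sketch of what IRLS looks like for the usual logit link (not the identity link asked about above); glm.fit adds safeguards such as step-halving that are omitted here:

# Bare-bones IRLS for logistic regression (logit link); no convergence checks or safeguards
irls_logit <- function(X, y, iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(iter)) {
    eta <- X %*% beta
    p   <- plogis(eta)                    # inverse-logit fitted probabilities
    w   <- as.vector(p * (1 - p))         # working weights
    z   <- eta + (y - p) / w              # working response
    beta <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
  }
  drop(beta)
}
# With the simulated data from the question, irls_logit(cbind(1, x1, x2), y)
# should land close to coef(L1)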

  • Thanks for responding; link = "probit" compiles with no errors, so something about this is false. – Dave Sep 22 '22 at 17:20
  • link = "probit" also applies to the binary classification task; only the link function is different, Gaussian in this case. – Sandipan Karmakar Sep 22 '22 at 17:22
  • You said that the link function must be logit, which is not the case. – Dave Sep 22 '22 at 17:25
  • See https://stats.stackexchange.com/questions/139917/r-binomial-family-with-identity-link for how to use identity link function with binomial regression. – kjetil b halvorsen Sep 22 '22 at 17:30
  • @Dave: Yes, I was wrong. Logistic regression inherently calls for the logit link function. But when a Gaussian link function is used inside the binomial family, it no longer stays logistic regression; it becomes probit regression. But both logit and probit have the same purpose, i.e. binary classification. – Sandipan Karmakar Sep 22 '22 at 17:34
  • And the linear probability model would have the same purpose as logistic or probit regression: estimating probability of class membership (which is not quite the same as classification). – Dave Sep 22 '22 at 17:35
  • How can a probability be linear? As far as I know it is a nonlinear function. I may be wrong. Except for the uniform distribution, no other probability function is linear. – Sandipan Karmakar Sep 22 '22 at 17:37
  • We posit that the probability of an event (conditional on the features) follows the first equation in my question, rather than following a transformation of $X\beta$ like we do in logistic or probit regression. That doesn’t mean we’re right, but it could be a viable model. – Dave Sep 23 '22 at 10:27