3

I know that OLS regression is linear and that its expected output is continuous, so predicted values can fall above 1 or below 0. Values that are not between 0 and 1 (here I mean values like 20 or 30, not values just slightly outside the range) have no meaning and cannot be interpreted as probabilities.

My question is: mathematically speaking, why is it not appropriate to use OLS regression when we have a binary dependent variable?

n1tk
  • 131
  • Is there a difference between a statistical and mathematical explanation? I believe there are many on this site and others that explain why it is not appropriate statistically. – Jon Aug 15 '17 at 01:21
  • A proof is what I’m interested in; statistically it is clear, but I can’t find a solid mathematical proof ... – n1tk Aug 15 '17 at 01:38
  • If the regression predicts one value to be 1.001 and another value to be 0.999, then I personally would not reject the prediction of 1.001 in the case you describe. As I understand your description, one way to interpret the results would be, "Is the regression value above or below 0.5?" – James Phillips Aug 15 '17 at 13:12
  • I doubt that you will find a mathematical proof. Sometimes OLS is used for this very thing (see https://en.wikipedia.org/wiki/Linear_probability_model). Probit and logit are usually used instead because, as you say, the linear model often can't be interpreted at all. – Michael Webb Aug 15 '17 at 14:10
  • 3
    Because the assumptions underlying OLS are not fulfilled when you have a binary dependent variable (e.g. the homoscedasticity assumption). See e.g. Basic Econometrics by Gujarati. –  Aug 15 '17 at 14:57
  • Of possible interest: https://stats.stackexchange.com/questions/589655/linear-probability-model-with-crossentropy-log-loss – Dave Feb 12 '23 at 06:09
  • I’m voting to close this question because it asks for mathematical proof of something that can't really be proven, but its first paragraph admits that the method can yield ridiculous results. – Peter Flom Dec 07 '23 at 15:14
  • @Peter There is a simple mathematical proof: OLS (at least when employed with a probability model) assumes iid conditional responses and when the slope is nonzero, those conditional responses are not identically distributed because they don't even have a common support. – whuber Dec 07 '23 at 20:15

1 Answer

1

This approach can (and likely will) produce predicted values that are impossible as probabilities (below 0 or above 1), and other methods, such as logistic regression, avoid this issue. That is one reason why OLS might not be preferred in this situation.
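As a quick illustration (a minimal sketch of my own, with an arbitrary simulated data-generating process, separate from the simulation further down), fitting OLS to a binary outcome whose probability changes steeply with the predictor readily produces fitted values outside $[0, 1]$:

set.seed(1)
x <- rnorm(200)
p <- plogis(3 * x)                       # true P(Y = 1 | X = x), assumed logistic here
y <- rbinom(200, 1, p)
fit <- lm(y ~ x)                         # OLS fit (linear probability model)
range(fitted(fit))                       # typically extends below 0 and above 1
mean(fitted(fit) < 0 | fitted(fit) > 1)  # share of impossible "probabilities"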

However, the math works out fine. Nothing will stop you from fitting an OLS model with the parameters estimated the usual way as $\hat\beta=(X^TX)^{-1}X^Ty$. If such a model works best for what you’re doing, as proponents of linear probability models believe, then go for it! Nothing about the derivation of $\hat\beta=(X^TX)^{-1}X^Ty$ makes a distributional assumption about the error term.
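To make that concrete, here is a small sketch (my own example with arbitrary simulated data) checking that the closed-form estimator $\hat\beta=(X^TX)^{-1}X^Ty$ applies unchanged when $y$ is binary and agrees with lm():

set.seed(42)
n <- 50
x <- runif(n)
y <- rbinom(n, 1, x)                       # binary response
X <- cbind(1, x)                           # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
beta_hat
coef(lm(y ~ x))                            # same estimates from lm()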

For instance, in the following simulation, the linear probability model's estimates center on the true parameters (intercept $0$, slope $1$): confidence intervals for the mean estimated intercept and slope across $1{,}000$ replicates are tight around the true values, even with a sample size of just ten.

set.seed(2023)
R <- 1000  # number of simulation replicates
N <- 10    # sample size within each replicate
intercepts <- slopes <- rep(NA, R)
for (i in 1:R){
  x <- runif(N, 0, 1)   # predictor, also the true P(Y = 1 | X = x)
  y <- rbinom(N, 1, x)  # binary response
  L <- lm(y ~ x)        # OLS fit (linear probability model)
  intercepts[i] <- summary(L)$coef[1, 1]  # estimated intercept
  slopes[i] <- summary(L)$coef[2, 1]      # estimated slope
}
t.test(intercepts)$conf.int[c(1, 2)]     # CI for mean intercept: (-0.01600467,  0.01664349)
t.test(slopes, mu = 1)$conf.int[c(1, 2)] # CI for mean slope: (0.9723355, 1.0277178)

If the true relationship is $\mathbb E\left[Y\vert X\right] = \beta_0 + \beta_1X$, as is the case here, then the linear model captures this relationship.

Like Frank Harrell in the comments, I have my doubts about how well a linear probability model will fit most data, but if it works...

Dave
  • 62,186
  • 4
    Not quite. The poor fitting OLS model will likely need a lot of added interactions to help make up for the lack of fit. – Frank Harrell Apr 05 '22 at 18:36