5

A latent variable model involving a binomial observed variable $Y$ can be constructed such that $Y$ is related to the latent variable $Y^*$ via

$$ Y = \begin{cases} 0, & \text{if } Y^* > 0 \\ 1, & \text{if } Y^* < 0. \end{cases} $$

The latent variable $Y^*$ is then related to a set of regression variables $X$ by the model $Y^* = X\beta + \varepsilon$. This results in a binomial regression model.

The variance of $\varepsilon$ cannot be identified and, when it is not of interest, is often assumed to equal one. If $\varepsilon$ is normally distributed, then a probit is the appropriate model; if $\varepsilon$ is log-Weibull distributed, then a logit is appropriate; and if $\varepsilon$ is uniformly distributed, then a linear probability model is appropriate.
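For concreteness, each case follows from the same one-line derivation (a sketch using this question's convention that $Y = 1$ when $Y^* < 0$, with $X\beta$ taken to include the intercept):

$$P(Y = 1 \mid X) = P(Y^* < 0 \mid X) = P(\varepsilon < -X\beta \mid X) = F_{\varepsilon}(-X\beta),$$

so the link is determined by the CDF $F_{\varepsilon}$: the standard normal CDF yields the probit model, and the logistic CDF yields the logit model (the logistic distribution arises, for example, as the difference of two independent log-Weibull variables).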

Maarten Buis

1 Answer

6

Let's try to validate the claim that if the error term of the underlying latent variable model is assumed uniformly distributed, then a Linear Probability model is appropriate.

The underlying latent variable model is (assuming a simple regression setting for simplicity; nothing essential changes with more regressors)

$$Y^* = b_0+ b_1X + \epsilon,\;\; \epsilon\mid X\sim U(-a,a)$$

where the limits for $U$ are chosen so that the error term has a zero expected value, conditional on the regressors. The cumulative distribution function here is $F_{\epsilon|X}(\epsilon\mid X) = \frac {\epsilon + a}{2a}$ for $\epsilon \in [-a, a]$,

and the observed model is (given how $Y$ is defined as a function of $Y^*$ in this specific question)

$$P(Y =1\mid X ) = P(Y^*<0\mid X) = P(b_0+ b_1X + \epsilon<0\mid X) = P(\epsilon <- b_0- b_1X\mid X)$$ $$=F_{\epsilon|X}(- b_0- b_1X\mid X) = \frac {- b_0- b_1X + a}{2a} = \frac {- b_0+a}{2a}+\frac {- b_1}{2a}X$$

$$\Rightarrow P(Y =1\mid X )= \beta_0 + \beta_1X$$

which is the Linear Probability model with the mapping

$$\beta_0 = \frac {- b_0+a}{2a},\;\; \beta_1=\frac{- b_1}{2a}$$
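As a quick sanity check of this mapping, here is a minimal simulation sketch (not part of the original answer; the parameter values $b_0 = 0$, $b_1 = 1$, $a = 2$ and the range of $X$ are assumptions chosen so that $-b_0 - b_1X$ stays inside $[-a, a]$, where the uniform CDF formula above is valid):

```python
# Minimal simulation sketch (illustrative parameters, not from the answer):
# verify beta_0 = (a - b_0)/(2a) and beta_1 = -b_1/(2a) via OLS on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
b0, b1, a = 0.0, 1.0, 2.0           # latent-model coefficients (assumed values)

X = rng.uniform(-1.5, 1.5, n)       # chosen so P(Y=1|X) stays strictly inside (0, 1)
eps = rng.uniform(-a, a, n)         # U(-a, a) error, zero conditional mean
Y_star = b0 + b1 * X + eps          # latent variable
Y = (Y_star < 0).astype(float)      # observed binary outcome, per the question's convention

# OLS of Y on X estimates the Linear Probability model
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, Y, rcond=None)

print("estimated:", beta_hat)                           # approx. [0.50, -0.25]
print("theory:   ", [(a - b0) / (2 * a), -b1 / (2 * a)])
```

With these values the mapping predicts $\beta_0 = 0.5$ and $\beta_1 = -0.25$, and the OLS estimates should land very close to them.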

  • I think the equation $$F_{\epsilon|X}(- b_0- b_1X\mid X) = \frac {- b_0- b_1X + a}{2a} $$ derives from $P[Z<k]=k$ for $Z\sim U(0,1)$, as made evident from your reply here: https://stats.stackexchange.com/a/105163/159259 . However, this only holds for $0 \leq k \leq 1$: for $k<0$, $P[Z<k]=0$, and, for $k>1$, $P[Z<k]=1$. Then, in the case of out-of-range values, I'd say that such an underlying latent variable foundation is in line with performing OLS estimation first (given that ML estimation wouldn't be feasible in that case) and then truncating the fitted values: values above 1 are moved to 1, and negative values are moved to 0. – Federico Tedeschi Jan 07 '23 at 13:58