It is well known that Probit regression can be formulated as a transformed linear model in the following sense: suppose that $Z = X\beta + \epsilon$ for some normally distributed $\epsilon$, and let $Y = 1[Z\geq 0]$. If we observe $Z$, we have an ordinary linear regression model; if we observe $Y$, we have a Probit regression model.
But sometimes you observe some $Y$s and some $Z$s, not all $Y$s or all $Z$s. Which one we observe is determined by the variable $I$, which is independent of $Z$ and the covariates $X$. For instance, we could observe:
set.seed(313)
x <- rnorm(10)
y <- 1 + 2 * x + rnorm(10) # Normal model.
i <- rbinom(10, 1, 0.7) # Random indicator: 0 if observed as linear,
                        #                   1 if observed as probit.
y[i == 1] <- y[i == 1] >= 0 # So i == 1 gets transformed to 0 or 1.
y
#> [1] 1.0000000 1.0000000 1.0000000 -1.8411864 1.0000000 0.0000000 1.0000000 1.0000000 0.4962323 3.3924775
In R, you would naïvely use lm to fit the observations that are not 0 or 1 and glm (with a probit link) to fit the rest; a sketch of this split follows the questions below. But I want to fit both models at once! Now, it is not hard to implement an estimator yourself, e.g. using the optim function together with the log-likelihood of the model, but I'm asking:
- Is there an R or Python package that fits these models efficiently?
- Are there any papers or books discussing these kinds of models?
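For concreteness, here is roughly what I mean by the naive split (a sketch only; it ignores that both parts share the same $\beta$ and residual sd), reusing x, y and i from the simulation above:

fit_lm  <- lm(y[i == 0] ~ x[i == 0])       # continuous part: ordinary least squares
fit_glm <- glm(y[i == 1] ~ x[i == 1],      # binary part: probit regression
               family = binomial(link = "probit"))
coef(fit_lm)
coef(fit_glm)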
Observe that this is not a hurdle model or a zero-inflated model (the Probit part can be both $0$ and $1$), not a Tobit model (which censors the data), and not a mixture model (we condition on the type of each observation). It's not hard to write down the likelihood for it:
$$\prod_{i=1}^{n}\left[\phi(y_{i}\mid\beta^Tx_{i})1[y_{i}\text{ is continuous}]+f(y_{i}\mid \beta^Tx_{i})1[y_{i}\text{ is binary}]\right],$$
where $\phi$ is the normal density and $f$ is the pmf of the Probit model.
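And here is a minimal sketch of the optim approach I have in mind, not a careful implementation: the function name negloglik and the log-sigma parameterization are just illustrative choices, and I assume the residual sd $\sigma$ is shared by both parts, as in the data-generating code above.

negloglik <- function(par, x, y, i) {
  beta  <- par[1:2]          # intercept and slope
  sigma <- exp(par[3])       # log-parameterized so that sigma > 0
  mu    <- beta[1] + beta[2] * x
  # Continuous observations contribute a normal density term.
  ll_cont <- dnorm(y[i == 0], mean = mu[i == 0], sd = sigma, log = TRUE)
  # Binary observations contribute a Bernoulli term with
  # P(y = 1) = P(Z >= 0) = pnorm(mu / sigma).
  p      <- pnorm(mu[i == 1] / sigma)
  ll_bin <- dbinom(y[i == 1], size = 1, prob = p, log = TRUE)
  -(sum(ll_cont) + sum(ll_bin))
}

fit <- optim(c(0, 0, 0), negloglik, x = x, y = y, i = i, method = "BFGS")
fit$par # estimates of (beta0, beta1, log(sigma))

So the question is really whether there is existing software (or literature) for this, rather than how to roll my own.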