1

It is well known that Probit regression can be formulated as transformed linear model in the following sense: Suppose that $Z = X\beta + \epsilon$ for some normally distributed $\epsilon$ and let $Y = 1[Z\geq 0]$. If we observe $Z$ we have an ordinary regression model. If we observe $Y$ we have a Probit regression model.

But sometimes you observe some $Y$s and some $Z$s, not all $Y$s or all $Z$s. What we observed is determined by the variable $I$, which is independent of $Z$ and the covariates $X$. For instance, we could observe:

set.seed(313)
x <- rnorm(10)
y <- 1 + 2 * x + rnorm(10) # Normal model.
i <- rbinom(10, 1, 0.7) # Randomly sampled indices. 0 if linear, 
                        # 1 if probit.
y[i == 1] <- y[i == 1]  >= 0 # So i == 1 gets transformed to 0 or 1.
y
#>  [1]  1.0000000  1.0000000  1.0000000 -1.8411864  1.0000000  0.0000000  1.0000000  1.0000000  0.4962323  3.3924775

In R, you would naïvely use lm to fit the variables that are not 0 or 1 and glm to fit the rest. But I want to fit both models at once! Now, it is not hard to implement an estimator using e.g. the optim function together with the log-likelihood of the model, but I'm asking if:

  1. Is there an R or Python package that fits these models efficiently?
  2. Are there any papers or books discussing these kinds of models?

Observe that this is not a hurdle model or about zero-inflated data (the Probit part can be both $0$ and $1$), not a Tobit model (which censors data), nor a mixture model (we condition on the type of the variables). It's not hard to write down the likelihood for it:

$$\prod_{i=1}^{n}\left[\phi(y_{i}\mid\beta^Tx_{i})1[y_{i}\text{ is continuous}]+f(y_{i}\mid \beta^Tx_{i})1[y_{i}\text{ is binary}]\right],$$
where $\phi$ is the normal density and $f$ is the pmf of the Probit model.

Grassie
  • 343
  • Mixture models may be relevant. They're often invoked for dealing with zero-inflated data. – mkt Oct 06 '22 at 05:52
  • I wouldn't call this a zero-inflated data problem though, as that's about problems with more zeros than the model implies. This, on the other hand, is a problem involving two different kinds of data types, continuous and binary (which could be encoded as true and false instead, making it clear it's not about zero-inflated data) with a fully specified model. – Grassie Oct 06 '22 at 06:58
  • Yes, I understand that. I was not suggesting that you use a zero-inflated model, but that they are the most common example of mixture models - which I still think might be useful to you. – mkt Oct 06 '22 at 07:17
  • The hurdle models mentioned in the answer may be what you're after. If not, your question might benefit from more detail to help us understand the problem. – mkt Oct 06 '22 at 08:45

0 Answers0