
I have a dataset with information belonging to medical patients (e.g. age, gender, height, etc.). Suppose that the response variable is whether or not the patient has a specific disease - thus, the goal would be to understand the effects and significance of the different explanatory variables on having the disease.

Now, imagine that this is a rare disease and only 5% of the patients in the dataset have the disease. If you fit a regression model to this data (e.g. Logistic Regression), the model might not have observed enough disease cases to effectively "learn" the difference between diseased and non-diseased cases - and might therefore generalize poorly to new data.

I tried to read more about this online and came across "Zero Inflated Models" (https://en.wikipedia.org/wiki/Zero-inflated_model) and "Gamma Hurdle Models" (https://en.wikipedia.org/wiki/Hurdle_model) - these models seem to be appropriate for instances where there are many "Zeros" within the response variable. However, it seems that these models are intended for "Count Data", whereas the problem I am working with has a "Binary Response".

I tried to read more online to see if there are extensions of these frameworks for Logistic Models (e.g. Zero Inflated Logistic Model, Logistic Hurdle Model) and if it would be possible to implement these in R - but there does not seem to be anything at first glance.

I was thinking of just "tricking" my model into believing that the binary response variable I have is actually "count data" and then using Zero Inflated Models/Hurdle Models - but I feel that this is disingenuous and will likely result in problems later on.

As such, the closest thing I could find was "Weighted Logistic Regression" - but again, there do not seem to be many references or R implementations for this approach. Initially, I had thought of using an "Oversampling Approach" to correct for class imbalance, but I was advised that this might not be suitable (Logistic Regression With Imbalanced Data?).

Can someone please comment if it is possible to adapt Zero Inflated Models/Hurdle Models to a Logistic Regression? What kind of strategies can I employ in such a problem?

stats_noob
  • I thought of this over the summer and have a draft question saved to my desktop. I’m glad someone got around to posting! – Dave Nov 24 '22 at 15:31
  • @ Dave: great minds think alike lol! I would be very interested in seeing your question! – stats_noob Nov 24 '22 at 15:56
  • It was the same idea: whether we can use a zero-inflated model to improve our probability predictions. I’ll look later or tomorrow to see if there are any additional pieces, though. – Dave Nov 24 '22 at 16:51

4 Answers


Logistic regression will not "state that all future patients do not have the disease". Logistic regression yields probabilistic predictions, i.e., probabilities that a patient has the disease.

In the case of a rare disease, this probability may be extremely low (for a patient that is essentially healthy - no need for action), or very low (better to run another non-invasive test), or "merely" low. If the disease in question is rare but dangerous, it may make sense to run an invasive test, e.g., taking biopsies, even if the predicted probability your logistic regression yields is only $\hat{p}=0.2$. You need to adapt your decision thresholds (possibly multiple ones, as here!) to the costs of decisions.
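For illustration, a minimal R sketch of a cost-based cutoff (the costs and the predicted probabilities below are entirely made up) could look like this:

```r
# Choose the decision threshold that minimizes expected cost,
# rather than defaulting to a 0.5 cutoff. Costs are illustrative only.
cost_fn <- 100   # cost of missing a diseased patient (false negative)
cost_fp <- 5     # cost of an unnecessary invasive follow-up (false positive)

expected_cost <- function(threshold, p_hat) {
  flagged <- p_hat >= threshold
  # expected loss per patient: missed disease if not flagged, needless test if flagged
  mean(p_hat * (!flagged) * cost_fn + (1 - p_hat) * flagged * cost_fp)
}

p_hat <- runif(1000, 0, 0.3)             # stand-in for predictions from a fitted model
thresholds <- seq(0.01, 0.99, by = 0.01)
costs <- sapply(thresholds, expected_cost, p_hat = p_hat)
thresholds[which.min(costs)]             # cost-optimal cutoff, far below 0.5 here
```

With these (assumed) costs the optimal cutoff lands near $5/(5+100) \approx 0.05$, which is why a predicted probability of 0.2 can already warrant action.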

Thus, while a "zero-inflated logistic regression" could in principle make sense (e.g., in the case where we suspect two data generating processes to be at work, one of which always yields a zero), that does not seem to be the case here. Logistic regression can deal quite well with rare instances of the target variable. If all goes well, it will simply output low probabilities. If these are well-calibrated, this is precisely what should happen.
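A quick simulated sketch (made-up numbers, roughly 5% prevalence) illustrates this behavior:

```r
# Plain logistic regression on a rare (~5%) binary outcome: it does not
# "predict all zeros", it simply outputs low predicted probabilities.
set.seed(1)
n <- 5000
x <- rnorm(n)
p_true <- plogis(-3.3 + 0.8 * x)   # true probabilities, ~5% prevalence overall
y <- rbinom(n, 1, p_true)

fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

mean(y)          # observed prevalence, around 0.05
summary(p_hat)   # mostly small probabilities, as expected for a rare outcome

# crude calibration check: observed disease rate within deciles of predicted risk
tapply(y, cut(p_hat, quantile(p_hat, 0:10 / 10), include.lowest = TRUE), mean)
```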

And no, oversampling (or weighting, which is essentially the same as oversampling) won't address a non-problem.

Stephan Kolassa
  • @Stephan Kolassa: thank you for your reply! I made a few corrections, thank you for pointing those out! I am not interested in individual predictions but in estimating the effects of the explanatory variables on the response. – stats_noob Nov 24 '22 at 08:30
  • "Logistic regression can deal quite well with rare instances of the target variable. " . Can you please explain why this is? – stats_noob Nov 24 '22 at 08:33
  • "Thus, while a "zero-inflated logistic regression" could in principle make sense (e.g., in the case where we suspect two data generating processes to be at work, one of which always yields a zero), that does not seem to be the case here" . Even with the clarifications I added, a zero hurdle model is still not suited? – stats_noob Nov 24 '22 at 08:34
  • ... However, what happens in a zero-inflated logistic regression? Essentially, a zero-inflated model simply estimates a mixture model with two components, one a constant zero and one "something else", whether that is Poisson, Negbin or logistic regression. So adding zero inflation means that we have to estimate even more parameters, namely how the mixture probabilities depend on any covariates. Thus, the problem of imprecise estimates will only get worse, unless we can simplify the models involved in using zero inflation. ... – Stephan Kolassa Nov 24 '22 at 10:22
  • ... That is why I would only recommend using zero inflation if we have grounds for suspecting multiple data generating processes to be involved. In this case, a zero inflated model is well specified, and a non-zero inflated one is simply misspecified. But you will still have the issue of low information and high parameter variance. There is no way around the fact that having seen few cases means we have low information. The only remedy is to collect more data. – Stephan Kolassa Nov 24 '22 at 10:24
  • As to how logistic regression deals with rare instances: it simply predicts (and fits) low probabilities. That parameter estimates are highly variable is not due to the logistic regression, per above, but to the fact that we have little information. – Stephan Kolassa Nov 24 '22 at 10:25
  • @Stephan Kolassa: thank you for your replies! Do you think something like this might be a good choice for my problem? https://r.iq.harvard.edu/docs/zelig.pdf – stats_noob Nov 24 '22 at 15:57
  • http://docs.zeligproject.org/articles/zelig_relogit.html – stats_noob Nov 24 '22 at 15:58

While the answer by Stephan gives a good overview of the bigger picture, IMHO the answer in the narrow sense is that

No, zero-inflated logistic regression does not make much sense

Why? Assume the true data-generating process is indeed a mixture of a Bernoulli distribution and a constant zero, specifically:

$$ \begin{align} y_i &= \begin{cases} 0 & z_i = 0 \\ \bar{y}_i & z_i = 1 \end{cases}\\ \bar{y}_i &\sim \text{Bernoulli}(p_i)\\ z_i &\sim \text{Bernoulli}(\theta_i) \end{align} $$

Where both $p_i$ and $\theta_i$ are some function of some predictors (usually a logit-transformed linear predictor term). We can quickly see that the outcome is 1 if and only if both $z_i$ and $\bar{y}_i$ are 1, so $P(y_i = 1) = \theta_i p_i$ and thus simply $y_i \sim \text{Bernoulli}(\theta_i p_i)$. This means $y_i$ can only give you information about the product $\theta_i p_i$ and cannot disentangle the individual contributions of the "logistic regression" and "zero inflation" components, unless you make strong restrictive assumptions about the possible forms of predictors for $\theta_i$ and $p_i$. (In theory, there is a very tiny difference, as this "zero-inflated" formulation implies a slightly different link function and thus different behavior of continuous predictor terms than the logistic regression, but I think this is highly unlikely to be relevant to any practical analysis task.)
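A tiny simulation (with constant $\theta$ and $p$, purely for illustration) makes the point concrete: the zero-inflated draw is indistinguishable from a plain Bernoulli draw with success probability $\theta p$.

```r
# A zero-inflated Bernoulli is just a Bernoulli with probability theta * p,
# so binary data cannot separate the two components.
set.seed(1)
n <- 1e5
theta <- 0.6   # P(z = 1): not a "structural" zero
p     <- 0.3   # success probability of the Bernoulli component

z     <- rbinom(n, 1, theta)
y_bar <- rbinom(n, 1, p)
y_zi  <- z * y_bar                 # zero-inflated outcome
y_pl  <- rbinom(n, 1, theta * p)   # plain Bernoulli with the product probability

mean(y_zi)   # ~0.18
mean(y_pl)   # ~0.18: the same marginal distribution as y_zi
```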

A similar line of reasoning applies to hurdle logistic regression.

So a standard logistic regression model is likely sensible, but it is known that maximum likelihood estimators can be biased when there is little information in the data (small sample size and/or rare events), and bias-corrected methods such as Firth's correction (e.g. via logistf) are thus likely to be preferable to R's glm or similar.
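A minimal sketch of what that could look like (the variable names and the simulated data are made up for illustration):

```r
# Firth's bias-corrected (penalized likelihood) logistic regression via logistf,
# compared with the ordinary maximum likelihood fit from glm.
# install.packages("logistf")
library(logistf)

set.seed(1)
n <- 400
dat <- data.frame(age = rnorm(n, 50, 10), height = rnorm(n, 170, 10))
dat$disease <- rbinom(n, 1, plogis(-5 + 0.04 * dat$age))   # rare outcome, ~5% prevalence

fit_firth <- logistf(disease ~ age + height, data = dat)   # Firth-penalized estimates
summary(fit_firth)                                         # profile-penalized-likelihood CIs

fit_ml <- glm(disease ~ age + height, data = dat, family = binomial)  # ordinary ML, for comparison
summary(fit_ml)
```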

The case would be different if you had zero inflation of a binomial response with more than one trial - then you could in fact, at least in some cases, learn something about the zero-inflation/hurdle component separately from the success probability.
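A rough simulated sketch of why extra trials help (constant probabilities, made-up numbers): the zero-inflation component creates more all-zero observations than a plain binomial with the same marginal success probability would, so the excess zeros become identifiable.

```r
# With a binomial response of size 5, zero inflation shows up as an excess of
# all-zero observations relative to a plain binomial fit - unlike the Bernoulli case.
set.seed(1)
n <- 1e5
size <- 5
theta <- 0.6   # P(not a structural zero)
p <- 0.3       # per-trial success probability

z <- rbinom(n, 1, theta)
y <- z * rbinom(n, size, p)    # zero-inflated binomial counts out of 5 trials

mean(y == 0)                   # observed share of zeros: 0.4 + 0.6 * 0.7^5 ≈ 0.50
p_marginal <- mean(y) / size   # marginal per-trial success probability ≈ 0.18
dbinom(0, size, p_marginal)    # zeros a plain binomial would imply: ≈ 0.37
# The gap between these two zero shares is the information that lets a
# zero-inflated/hurdle binomial model identify its components.
```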

  • @Martin Modrák: Thank you for your reply! Do you think something like this might work? http://docs.zeligproject.org/articles/zelig_relogit.html – stats_noob Nov 24 '22 at 15:59
  • @stats_noob I've never seen this package before, but it looks plausible. Generally, standard logistic regression often works quite well. There are a bunch of bias-correcting methods. The only one I've used previously is Firth's correction implemented in the logistf package. Not sure how that differs from the correction in Zelig's relogit, but Zelig's reference (King & Zeng 2001: "Logistic regression in rare events data") cites Firth favorably and claims that results are numerically highly similar. – Martin Modrák Nov 24 '22 at 16:45
  • Thank you so much! Do you think if you have time, you could please elaborate more on the second paragraph (i.e. the mathematical details)? I would be interested in learning more about this! – stats_noob Nov 25 '22 at 10:09
  • @stats_noob There are many directions that one could elaborate on. Do you have a specific question in mind? – Martin Modrák Nov 25 '22 at 15:23
  • "A similar line of reasoning applies to hurdle logistic regression"
  • – stats_noob Nov 26 '22 at 03:13
  • "it is known that maximum likelihood estimators can be biased when there is little information in the data"
  • – stats_noob Nov 26 '22 at 03:14
  • Why does a point mass and another Bernouli distribution make another Bernouli?
  • – stats_noob Nov 26 '22 at 03:15