
I have a dataset with information belonging to medical patients (e.g. age, gender, height, etc.). Suppose that the response variable is whether or not the patient has a specific disease - thus, the goal would be to understand the effects and significance of the different explanatory variables on having the disease.

Now, imagine that this is a rare disease and only 5% of the patients in the dataset have the disease. If you fit a regression model to this data (e.g. Logistic Regression), the model might not have observed enough disease cases to effectively "learn" the difference between diseased and non-diseased cases - and might therefore generalize poorly to new data.

I tried to read more about this online and came across "Zero Inflated Models" (https://en.wikipedia.org/wiki/Zero-inflated_model) and "Gamma Hurdle Models" (https://en.wikipedia.org/wiki/Hurdle_model) - these models seem to be appropriate for instances where there are many "Zeros" within the response variable. However, it seems that these models are intended for "Count Data", whereas the problem I am working with has a "Binary Response".

I tried to read more online to see if there are extensions of these frameworks for Logistic Models (e.g. Zero Inflated Logistic Model, Logistic Hurdle Model) and if it would be possible to implement these in R - but there does not seem to be anything at first glance.

I was thinking of just "tricking" my model into believing that the binary response variable I have is actually "count data" and then using Zero Inflated Models/Hurdle Models - but I feel that this is disingenuous and will likely result in problems later on.

As such, the closest thing I could find was "Weighted Logistic Regression" - but again, there do not seem to be many references or R implementations for this approach. Initially, I had thought of using an "Oversampling Approach" to correct for class imbalance, but I was advised that this might not be suitable (Logistic Regression With Imbalanced Data?).

Can someone please comment if it is possible to adapt Zero Inflated Models/Hurdle Models to a Logistic Regression? What kind of strategies can I employ in such a problem?

stats_noob
  • I thought of this over the summer and have a draft question saved to my desktop. I’m glad someone got around to posting! – Dave Nov 24 '22 at 15:31
  • @ Dave: great minds think alike lol! I would be very interested in seeing your question! – stats_noob Nov 24 '22 at 15:56
  • It was the same idea: whether we can use a zero-inflated model to improve our probability predictions. I’ll look later or tomorrow to see if there are any additional pieces, though. – Dave Nov 24 '22 at 16:51

4 Answers


Logistic regression will not "state that all future patients do not have the disease". Logistic regression yields probabilistic predictions, i.e., probabilities that a patient has the disease.

In the case of a rare disease, this probability may be extremely low (for a patient that is essentially healthy - no need for action), or very low (better to run another non-invasive test), or "merely" low. If the disease in question is rare but dangerous, it may make sense to run an invasive test, e.g., taking biopsies, even if the predicted probability your logistic regression yields is only $\hat{p}=0.2$. You need to adapt your decision thresholds (possibly multiple ones, as here!) to the costs of decisions.
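For illustration, a minimal R sketch of a cost-based cutoff (the costs and the predicted probabilities below are entirely made up) could look like this:

```r
# Choose the decision threshold that minimizes expected cost,
# rather than defaulting to a 0.5 cutoff. Costs are illustrative only.
cost_fn <- 100   # cost of missing a diseased patient (false negative)
cost_fp <- 5     # cost of an unnecessary invasive follow-up (false positive)

expected_cost <- function(threshold, p_hat) {
  flagged <- p_hat >= threshold
  # expected loss per patient: missed disease if not flagged, needless test if flagged
  mean(p_hat * (!flagged) * cost_fn + (1 - p_hat) * flagged * cost_fp)
}

p_hat <- runif(1000, 0, 0.3)             # stand-in for predictions from a fitted model
thresholds <- seq(0.01, 0.99, by = 0.01)
costs <- sapply(thresholds, expected_cost, p_hat = p_hat)
thresholds[which.min(costs)]             # cost-optimal cutoff, far below 0.5 here
```

With these (assumed) costs the optimal cutoff lands near $5/(5+100) \approx 0.05$, which is why a predicted probability of 0.2 can already warrant action.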

Thus, while a "zero-inflated logistic regression" could in principle make sense (e.g., in the case where we suspect two data generating processes to be at work, one of which always yields a zero), that does not seem to be the case here. Logistic regression can deal quite well with rare instances of the target variable. If all goes well, it will simply output low probabilities. If these are well-calibrated, this is precisely what should happen.
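A quick simulated sketch (made-up numbers, roughly 5% prevalence) illustrates this behavior:

```r
# Plain logistic regression on a rare (~5%) binary outcome: it does not
# "predict all zeros", it simply outputs low predicted probabilities.
set.seed(1)
n <- 5000
x <- rnorm(n)
p_true <- plogis(-3.3 + 0.8 * x)   # true probabilities, ~5% prevalence overall
y <- rbinom(n, 1, p_true)

fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

mean(y)          # observed prevalence, around 0.05
summary(p_hat)   # mostly small probabilities, as expected for a rare outcome

# crude calibration check: observed disease rate within deciles of predicted risk
tapply(y, cut(p_hat, quantile(p_hat, 0:10 / 10), include.lowest = TRUE), mean)
```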

And no, oversampling (or weighting, which is essentially the same as oversampling) won't address a non-problem.

Stephan Kolassa
  • @Stephan Kolassa: thank you for your reply! I made a few corrections, thank you for pointing those out! I am not interested in individual predictions but in estimating the effects of the explanatory variables on the response. – stats_noob Nov 24 '22 at 08:30
  • "Logistic regression can deal quite well with rare instances of the target variable. " . Can you please explain why this is? – stats_noob Nov 24 '22 at 08:33
  • "Thus, while a "zero-inflated logistic regression" could in principle make sense (e.g., in the case where we suspect two data generating processes to be at work, one of which always yields a zero), that does not seem to be the case here" . Even with the clarifications I added, a zero hurdle model is still not suited? – stats_noob Nov 24 '22 at 08:34
  • ... However, what happens in a zero-inflated logistic regression? Essentially, a zero-inflated model simply estimates a mixture model with two components, one a constant zero and one "something else", whether that is Poisson, Negbin or logistic regression. So adding zero inflation means that we have to estimate even more parameters, namely how the mixture probabilities depend on any covariates. Thus, the problem of imprecise estimates will only get worse, unless we can simplify the models involved in using zero inflation. ... – Stephan Kolassa Nov 24 '22 at 10:22
  • ... That is why I would only recommend using zero inflation if we have grounds for suspecting multiple data generating processes to be involved. In this case, a zero inflated model is well specified, and a non-zero inflated one is simply misspecified. But you will still have the issue of low information and high parameter variance. There is no way around the fact that having seen few cases means we have low information. The only remedy is to collect more data. – Stephan Kolassa Nov 24 '22 at 10:24
  • As to how logistic regression deals with rare instances: it simply predicts (and fits) low probabilities. That parameter estimates are highly variable is not due to the logistic regression, per above, but to the fact that we have little information. – Stephan Kolassa Nov 24 '22 at 10:25
  • @Stephan Kolassa: thank you for your replies! Do you think something like this might be a good choice for my problem? https://r.iq.harvard.edu/docs/zelig.pdf – stats_noob Nov 24 '22 at 15:57
  • http://docs.zeligproject.org/articles/zelig_relogit.html – stats_noob Nov 24 '22 at 15:58

While the answer by Stephan gives a good overview of the bigger picture, IMHO the answer in the narrow sense is that

No, zero-inflated logistic regression does not make much sense

Why? Assume the true data-generating process is indeed a mixture of a Bernoulli distribution and a constant zero, specifically:

$$ \begin{align} y_i &= \begin{cases} 0 & z_i = 0 \\ \bar{y}_i & z_i = 1 \end{cases}\\ \bar{y}_i &\sim \text{Bernoulli}(p_i)\\ z_i &\sim \text{Bernoulli}(\theta_i) \end{align} $$

Where both $p_i$ and $\theta_i$ are some function of some predictors (usually a logit-transformed linear predictor term). We can quickly see that the outcome is 1 if and only if both $z_i$ and $\bar{y}_i$ are 1, so $P(y_i = 1) = \theta_i p_i$ and thus simply $y_i \sim \text{Bernoulli}(\theta_i p_i)$. This means $y_i$ can only give you information about the product $\theta_i p_i$ and cannot disentangle the individual contributions of the "logistic regression" and "zero inflation" components, unless you make strong restrictive assumptions about the possible forms of predictors for $\theta_i$ and $p_i$. (In theory, there is a very tiny difference, as this "zero-inflated" formulation implies a slightly different link function and thus different behavior of continuous predictor terms than the logistic regression, but I think this is highly unlikely to be relevant to any practical analysis task.)
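A tiny simulation (with constant $\theta$ and $p$, purely for illustration) makes the point concrete: the zero-inflated draw is indistinguishable from a plain Bernoulli draw with success probability $\theta p$.

```r
# A zero-inflated Bernoulli is just a Bernoulli with probability theta * p,
# so binary data cannot separate the two components.
set.seed(1)
n <- 1e5
theta <- 0.6   # P(z = 1): not a "structural" zero
p     <- 0.3   # success probability of the Bernoulli component

z     <- rbinom(n, 1, theta)
y_bar <- rbinom(n, 1, p)
y_zi  <- z * y_bar                 # zero-inflated outcome
y_pl  <- rbinom(n, 1, theta * p)   # plain Bernoulli with the product probability

mean(y_zi)   # ~0.18
mean(y_pl)   # ~0.18: the same marginal distribution as y_zi
```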

A similar line of reasoning applies to hurdle logistic regression.

So a standard logistic regression model is likely sensible, but it is known that maximum likelihood estimators can be biased when there is little information in the data (small sample size and/or rare events), and bias-corrected methods such as Firth's correction (e.g. via logistf) are thus likely to be preferable to R's glm or similar.
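A minimal sketch of what that could look like (the variable names and the simulated data are made up for illustration):

```r
# Firth's bias-corrected (penalized likelihood) logistic regression via logistf,
# compared with the ordinary maximum likelihood fit from glm.
# install.packages("logistf")
library(logistf)

set.seed(1)
n <- 400
dat <- data.frame(age = rnorm(n, 50, 10), height = rnorm(n, 170, 10))
dat$disease <- rbinom(n, 1, plogis(-5 + 0.04 * dat$age))   # rare outcome, ~5% prevalence

fit_firth <- logistf(disease ~ age + height, data = dat)   # Firth-penalized estimates
summary(fit_firth)                                         # profile-penalized-likelihood CIs

fit_ml <- glm(disease ~ age + height, data = dat, family = binomial)  # ordinary ML, for comparison
summary(fit_ml)
```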

The case would be different if you had zero inflation of a binomial response with more than one trial - then you could in fact, at least in some cases, learn something about the zero-inflation/hurdle component separately from the success probability.
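A rough simulated sketch of why extra trials help (constant probabilities, made-up numbers): the zero-inflation component creates more all-zero observations than a plain binomial with the same marginal success probability would, so the excess zeros become identifiable.

```r
# With a binomial response of size 5, zero inflation shows up as an excess of
# all-zero observations relative to a plain binomial fit - unlike the Bernoulli case.
set.seed(1)
n <- 1e5
size <- 5
theta <- 0.6   # P(not a structural zero)
p <- 0.3       # per-trial success probability

z <- rbinom(n, 1, theta)
y <- z * rbinom(n, size, p)    # zero-inflated binomial counts out of 5 trials

mean(y == 0)                   # observed share of zeros: 0.4 + 0.6 * 0.7^5 ≈ 0.50
p_marginal <- mean(y) / size   # marginal per-trial success probability ≈ 0.18
dbinom(0, size, p_marginal)    # zeros a plain binomial would imply: ≈ 0.37
# The gap between these two zero shares is the information that lets a
# zero-inflated/hurdle binomial model identify its components.
```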

  • @Martin Modrák: Thank you for your reply! Do you think something like this might work? http://docs.zeligproject.org/articles/zelig_relogit.html – stats_noob Nov 24 '22 at 15:59
  • @stats_noob I've never seen this package before, but it looks plausible. Generally, standard logistic regression often works quite well. There are a bunch of bias-correcting methods. The only one I've used previously is Firth's correction implemented in the logistf package. Not sure how that differs from the correction in Zelig's relogit, but Zelig's reference (King & Zeng 2001: "Logistic regression in rare events data") cites Firth favorably and claims that results are numerically highly similar. – Martin Modrák Nov 24 '22 at 16:45
  • Thank you so much! Do you think if you have time, you could please elaborate more on the second paragraph (i.e. the mathematical details)? I would be interested in learning more about this! – stats_noob Nov 25 '22 at 10:09
  • @stats_noob There are many directions that one could elaborate on. Do you have a specific question in mind? – Martin Modrák Nov 25 '22 at 15:23
  • "A similar line of reasoning applies to hurdle logistic regression"
  • – stats_noob Nov 26 '22 at 03:13
  • "it is known that maximum likelihood estimators can be biased when there is little information in the data"
  • – stats_noob Nov 26 '22 at 03:14
  • Why does a point mass and another Bernouli distribution make another Bernouli?
  • – stats_noob Nov 26 '22 at 03:15