
I'm struggling with how to use logistic regression when I have a dataset where 99% of the responses are 0 ('NO') and 1% are 1 ('YES'). I'm hesitant to use something like a zero-inflated model, since my response is not a count. Any hint on where I can find a solution, or some technique to apply?

I'm not 'stuck' on this model; I'd like to compare it with others, like a decision tree. But I ran into this same problem using that approach.

Ga13
  • What do you want to use the logistic regression or decision tree to do? – Dave Aug 16 '21 at 19:35
  • The reason is that I need the interpretability of the coefficients, and, in the case of logistic regression, because I have a binary response variable. – Ga13 Aug 16 '21 at 19:37
  • But what do you want to do? – Dave Aug 16 '21 at 19:37
  • Get how much each covariate explains the response variable. – Ga13 Aug 16 '21 at 19:40
  • What happens when you fit the logistic regression? The unequal number of YES and NO responses is not a problem for logistic regression. – Dave Aug 16 '21 at 19:40
  • What's key is the absolute number of Y=1 regardless of the proportion having Y=1. What is that frequency? That will be the limiting sample size with regard to ability to develop a reliable model. – Frank Harrell Aug 16 '21 at 20:58

1 Answer


This is a common occurrence in machine learning and exactly the issue that gets discussed here so often. The typical recommendation is to do nothing, as class imbalance is minimally problematic. Even the bounty-earning answer to the linked question discusses a solution to cost minimization through clever experimental design, rather than what to do once you have the data.
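To make "do nothing" concrete, here is a minimal sketch on simulated data (the covariate, sample size, and true coefficients are all made up for illustration): ordinary maximum-likelihood logistic regression, fit by Newton–Raphson with plain NumPy, recovers the coefficients just fine even when only about 1% of the responses are 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (purely illustrative): one covariate, roughly 1% positives.
n = 20_000
x = rng.normal(size=n)
eta_true = -5.0 + 1.0 * x                      # intercept chosen so Y=1 is rare
y = rng.binomial(1, 1 / (1 + np.exp(-eta_true)))

# Ordinary maximum-likelihood logistic regression via Newton-Raphson;
# nothing about the fitting has to change because of the imbalance.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    eta = np.clip(X @ beta, -30, 30)           # clip to avoid overflow in exp
    mu = 1 / (1 + np.exp(-eta))                # fitted probabilities
    grad = X.T @ (y - mu)                      # score vector
    hess = X.T @ (X * (mu * (1 - mu))[:, None])  # observed information
    beta = beta + np.linalg.solve(hess, grad)  # Newton step

print(y.mean())   # around 0.01: the outcome really is rare
print(beta)       # close to the true (-5.0, 1.0)
```

The fitted model simply predicts small probabilities for most observations, which is exactly what it should do when the event is rare.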

I have thought about whether a zero-inflated Bernoulli likelihood might be helpful, but this kind of hierarchy winds up not making sense to me: you can represent the low probability of category $1$ through the zero-inflated model, or you can just predict low probabilities and use a model that is not as complicated (one that deals with just a Bernoulli variable instead of some zero-inflated variable, since a binary variable is so simple).

An issue you might encounter is that all or almost all of the outcomes might be predicted to belong to the majority category (typically coded as $0$). This is an artifact of your software imposing an arbitrary threshold (typically a probability of $0.5$) to bin the continuous outputs into discrete categories. As you change the threshold, you can get anything from all observations predicted as category $0$ to all predicted as category $1$. However, this means that you no longer evaluate the original model but, instead, the model in conjunction with a decision rule that maps predictions below a threshold to one category and predictions above the threshold to the other.
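A quick sketch of that threshold artifact (the predicted probabilities here are drawn from a made-up Beta distribution just to mimic a rare-event model's output; they are not from any real fit):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predicted probabilities for a ~1%-positive problem:
# mostly tiny, almost never above 0.5.
probs = rng.beta(0.1, 9.9, size=10_000)        # mean about 0.01

# The default 0.5 cutoff labels (almost) everything as the majority class...
preds_default = (probs >= 0.5).astype(int)
print(preds_default.sum())

# ...while a lower cutoff, a decision rule separate from the model itself,
# produces many positive predictions from the very same probabilities.
preds_low = (probs >= 0.05).astype(int)
print(preds_low.sum())
```

The model's probability outputs are identical in both cases; only the decision rule bolted on top changes, which is why evaluating the thresholded predictions is not the same as evaluating the model.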

Dave