Consider an example with 100 subjects. Let's say I wish to study the impact of literacy level (0 = limited, 1 = adequate), anxiety level (0 = no anxiety, 1 = severe anxiety), and sex (0 = female, 1 = male) on a decision the subjects have made.

There are two possible decisions: A and B.

The frequency of decision A is 98 and the frequency of decision B is 2.

In such a scenario, is it reasonable to use logistic regression with the binary decision as the dependent variable? Why or why not?

An116

1 Answer

A rule of thumb is to have $10$-$15$ members of the minority class per feature included in your regression. In your case, you have two such members, meaning that even one feature is too many to include.

Imbalance itself isn't really a problem, but a low number of minority-class observations might be. This is to say that $98000$ and $2000$ might not be so problematic, but $98$ and $2$ probably are.
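To make the rule of thumb concrete, here is a minimal Python sketch that computes minority-class observations per feature for the two scenarios above. The helper names `events_per_variable` and `max_supported_features` are hypothetical, introduced only for illustration, and the default threshold of 10 is simply the lower end of the quoted range, not a hard cutoff.

```python
def events_per_variable(n_minority: int, n_features: int) -> float:
    """Minority-class observations available per candidate feature."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return n_minority / n_features


def max_supported_features(n_minority: int, threshold: int = 10) -> int:
    """Largest feature count that keeps at least `threshold` minority
    observations per feature (threshold is an assumed rule-of-thumb value)."""
    return n_minority // threshold


# Three candidate features from the question: literacy, anxiety, sex.
for n_majority, n_minority in [(98, 2), (98_000, 2_000)]:
    epv = events_per_variable(n_minority, n_features=3)
    k = max_supported_features(n_minority)
    print(f"{n_majority} vs {n_minority}: "
          f"{epv:.2f} minority observations per feature, "
          f"{k} features supported at a threshold of 10")
```

Both scenarios have the same 98:2 ratio, but only the larger one has enough minority-class observations to support the three features in question.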

Unless you get more data, I do not see much hope for saying anything other than there being a $2\%$ chance of decision $\text{B}$, perhaps with a confidence interval.
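As a rough illustration of that last point, the sketch below computes a confidence interval for the proportion of decision B using only the Python standard library. The Wilson score interval is my choice here, since it tends to behave better than the simple normal approximation when the proportion is near zero; the answer does not specify a method, and the function name `wilson_interval` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 2 occurrences of decision B among 100 subjects, as in the question.
low, high = wilson_interval(successes=2, n=100)
print(f"Estimated P(B) = {2 / 100:.2%}, 95% interval: {low:.2%} to {high:.2%}")
```

With only 2 events in 100, the interval runs from roughly half a percent to about seven percent, which shows how uncertain even the intercept-only estimate is before any predictors are added.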

Dave
  • That's what I thought. Thank you so much, this is so illuminating. – An116 Jan 05 '24 at 15:47
  • (+1) Also, one might look up logistic regression in connection with the terms 'narrow distribution' or 'rare events', as there is a literature covering specialized techniques for these situations. – rolando2 Jan 05 '24 at 15:58
  • It is helpful to consider answering the question in the case where there is no X variable, and to consider the resulting uncertainty in the estimate. Then realize that the problem is worse when you introduce X variables. – BigBendRegion Jan 06 '24 at 11:39
  • "Unless you get more data"... they can also do synthetic data generation, but to do that they'd need an a priori assumption about, or knowledge of, the distribution of the response variable as a function of {literacy, anxiety, sex}. – smci Jan 06 '24 at 19:51