My dependent variable is a yes/no measure; I'm trying to predict the probability that an event leads to a yes or a no. The independent variable is a categorical predictor with 3 levels. My data is unbalanced: I have an unequal number of events in each of the 3 levels of the independent variable (311, 180, and 238). Is that okay for a binomial regression, or do I need an equal number of events across the 3 levels? (I'm not asking about unequal yes/no answers within each level, but about the total number of events in each level.) I would appreciate any advice on this.
Calling this "binomial logistic regression" may be overkill. If your only predictor is categorical, there is nothing to "model" beyond a lookup table. The problem reduces to fitting 3 separate binomial distributions, one for each condition. You can then compare those distributions to understand the effect of the predictor. – nanoman Mar 17 '22 at 10:09
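A minimal sketch of the "lookup table" point in the comment above. Only the trial totals (311, 180, 238) come from the question; the yes-counts and the level names `A`, `B`, `C` are hypothetical, made up for illustration. With a single categorical predictor, the maximum-likelihood logistic regression coefficients (treatment coding, `A` as reference) are just the log-odds of the observed proportions, so the fitted probabilities reproduce the per-level proportions.

```python
import math

# Trial totals are from the question; the yes-counts are HYPOTHETICAL.
trials = {"A": 311, "B": 180, "C": 238}
yes = {"A": 150, "B": 100, "C": 95}


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


# The "lookup table": observed proportion of yes responses per level.
props = {level: yes[level] / trials[level] for level in trials}

# Coefficients of the saturated logistic model with A as reference level:
# intercept = log-odds for A, other coefficients = log-odds differences from A.
intercept = logit(props["A"])
coef = {"B": logit(props["B"]) - intercept, "C": logit(props["C"]) - intercept}

# Fitted probabilities recovered from the coefficients.
fitted = {
    "A": inv_logit(intercept),
    "B": inv_logit(intercept + coef["B"]),
    "C": inv_logit(intercept + coef["C"]),
}

for level in trials:
    print(level, round(props[level], 4), round(fitted[level], 4))
```

The two printed columns agree for every level, which is the sense in which there is nothing to model beyond the table of proportions. Unequal trial totals do not change this; they only affect how precisely each proportion is estimated.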
2 Answers
That is completely fine. Logistic regression will happily work with your data.
Roughly speaking, your parameter estimates will be somewhat less precise than if you had more data in your minority conditions, but you should definitely have enough data for estimation precision to be a minor concern.
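As a rough illustration of how precision varies with group size, here is a small sketch using the usual standard error of a proportion, $\sqrt{p(1-p)/n}$. The group sizes are from the question; the value $p = 0.5$ is an assumption (it gives the largest standard error), and the formula assumes independent trials.

```python
import math


def se_proportion(p, n):
    """Standard error of a sample proportion from n independent trials."""
    return math.sqrt(p * (1 - p) / n)


p_assumed = 0.5  # ASSUMPTION: worst case; the question does not give the yes-rates
group_sizes = [311, 180, 238]  # event totals per level, from the question

ses = {n: se_proportion(p_assumed, n) for n in group_sizes}
for n, se in ses.items():
    print(f"n = {n}: SE = {se:.4f}")
```

Under these assumptions the standard errors run from roughly 0.028 for the largest group to roughly 0.037 for the smallest, so the imbalance costs some precision in the smaller groups but not much.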
You have $729$ observations in total. You would have had the most powerful model by allocating $729/3=243$ subjects to each group.
However, despite your imbalance, you still have a large number of subjects ($180$) in your smallest group, so you still have a powerful model.
The way I think about this is that, if I have the funding to collect observations on, say, $51$ subjects split across two conditions, I am forced to allocate them unevenly: $25$ in one group and $26$ in the other. If I suddenly found myself with additional funding to collect data on one more subject, I would want to balance the groups at $26$ each, not end up with $25$ and $27$.
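One way to see this algebraically, as a sketch under simplifying assumptions: suppose two groups of sizes $n_1$ and $n_2$ with $n_1 + n_2 = N$ fixed, and a common per-observation variance $\sigma^2$ (for a yes/no outcome with success probability $p$, $\sigma^2 = p(1-p)$, so this holds exactly only when the groups share the same $p$, for example under the null hypothesis). The symbols $n_1$, $n_2$, $N$, $\sigma^2$, and $\bar{y}_i$ are introduced here for illustration. The variance of the difference in group means is

$$
\operatorname{Var}(\bar{y}_1 - \bar{y}_2) = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right) = \sigma^2\,\frac{N}{n_1 n_2}.
$$

For fixed $N$, the product $n_1 n_2$ is largest when $n_1 = n_2 = N/2$, so the balanced split gives the smallest variance and hence the most power for a test of the difference. With $N = 52$:

$$
\frac{52}{26 \cdot 26} = \frac{52}{676} \approx 0.07692
\qquad\text{versus}\qquad
\frac{52}{25 \cdot 27} = \frac{52}{675} \approx 0.07704,
$$

so $26$ and $26$ beats $25$ and $27$, though only slightly. Mild imbalance costs very little.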
I should clarify that the events come from a sample of 21 infants, each contributing a different number of data points at each of the 3 levels (data loss is an issue in infant studies, and we don't always get responses for all trials) -- would this cause a problem or does it still not matter? – RAZ Mar 16 '22 at 22:15
Is there any simple way of showing algebraically how this works? I.e. showing how power depends on the balance between classes? – Richard Hardy Mar 17 '22 at 09:48
Right. Though I am not sure how you are defining power of a model, so this was why I wanted to see some formulas. – Richard Hardy Mar 17 '22 at 09:50