In sufficiently large samples, rare events have a negligible influence on estimated coefficients. In smaller samples, however, they can affect the estimates noticeably.
Recall that Maximum Likelihood Estimation (MLE) in a Logit regression is only asymptotically unbiased: in finite samples bias is present, and it is most pronounced when the sample is small. The bias vanishes as the sample size $n$ grows, but for any fixed $n$ it becomes larger as events become increasingly rare.
In a Logit regression, the MLE estimator $\hat{\beta}_{MLE}$ maximises the log-likelihood
$$
\ell(\beta \mid y) = \sum_{y_t = 1} \log(p_t) + \sum_{y_t = 0} \log(1 - p_t)
$$
where $p_t = \frac{1}{1+\exp(-X_t\beta)}$. Maximising this objective pushes $p_t$ towards 1 for observations with $y_t = 1$ and towards 0 for observations with $y_t = 0$.
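This objective is easy to evaluate directly. A minimal sketch in Python (standard library only; the function name and toy data are my own, not from the text):

```python
import math

def log_likelihood(beta, X, y):
    """Logit log-likelihood: sum of log(p_t) over events (y_t = 1)
    plus sum of log(1 - p_t) over non-events (y_t = 0)."""
    ll = 0.0
    for x_t, y_t in zip(X, y):
        p_t = 1.0 / (1.0 + math.exp(-x_t * beta))  # p_t = 1 / (1 + exp(-X_t * beta))
        ll += math.log(p_t) if y_t == 1 else math.log(1.0 - p_t)
    return ll

# Sanity check: at beta = 0 every p_t equals 0.5,
# so the log-likelihood is n * log(0.5) regardless of y.
print(log_likelihood(0.0, [1.0, 2.0, 3.0], [1, 0, 1]))
```

Any candidate $\beta$ can be scored this way; the MLE is the value at which this function peaks.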
In large samples, the sample accurately reflects the population distribution and the estimator is consistent. In small samples, however, when $p_t$ is extremely small (rare events), the event count $\sum_t y_t$ follows a right-skewed distribution whose median lies below its mean, so more than half of all samples contain fewer $y_t = 1$ instances than expected. That is to say,
$$
\Pr\left(\sum_{t=1}^{n} y_t \leq n\,E(y_t)\right) > 0.5
$$
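When the $y_t$ are i.i.d. Bernoulli draws, $\sum_t y_t$ is Binomial$(n, p)$ and this inequality can be checked exactly. A small sketch (standard library Python; the function name and the illustrative values $n = 100$, $p = 0.03$ are my own):

```python
import math

def prob_fewer_than_expected(n, p):
    """Exact Pr(sum y_t <= n * E(y_t)) for i.i.d. y_t ~ Bernoulli(p),
    i.e. the Binomial(n, p) CDF evaluated at the mean n * p."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(math.floor(n * p) + 1))

# With rare events the count distribution is right-skewed, so its
# median sits below its mean and this probability exceeds 0.5.
print(prob_fewer_than_expected(100, 0.03))
```

Running this with $n = 100$ and $p = 0.03$ gives a probability well above one half, in line with the inequality.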
Consequently, in a typical small sample the MLE 'sees' relatively too many $y_t = 0$ observations and underestimates $p_t$. Since $p_t$ is increasing in $X_t\beta$, an underestimate of the event probabilities translates into an underestimate of the coefficients. In other words, $\hat{\beta}_{MLE} < \beta$.
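This downward bias can be seen in a Monte Carlo experiment. For an intercept-only Logit the MLE has the closed form $\hat{\beta} = \log\bigl(\bar{y}/(1-\bar{y})\bigr)$, which makes the simulation cheap. A sketch under my own assumptions (standard library Python; the parameter values $n = 60$, $p = 0.05$ and all names are mine; degenerate all-zero or all-one samples, where the MLE diverges, are dropped):

```python
import math
import random

def intercept_logit_mle(y):
    """Closed-form MLE for an intercept-only Logit: beta_hat = logit(ybar)."""
    ybar = sum(y) / len(y)
    return math.log(ybar / (1.0 - ybar))

def mean_mle_estimate(n=60, p=0.05, sims=20000, seed=1):
    """Average the intercept MLE over many simulated small samples
    with rare events (true intercept: logit(p))."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(sims):
        y = [1 if rng.random() < p else 0 for _ in range(n)]
        if 0 < sum(y) < n:  # skip samples where the MLE does not exist
            estimates.append(intercept_logit_mle(y))
    return sum(estimates) / len(estimates)

beta_true = math.log(0.05 / 0.95)  # true intercept, about -2.944
# The averaged estimate falls below beta_true, illustrating the downward bias.
print(mean_mle_estimate(), beta_true)
```

The average estimate lands visibly below the true intercept, illustrating that with small $n$ and rare events the Logit MLE is biased downward.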