You wouldn't expect "balance" in these data, so it isn't quite right to call them imbalanced. "Imbalance" is a label that usually signals "problem" to reviewers and audiences who are used to interpreting blocked or randomized trial results.
Logistic regression with categorical covariates is pretty straightforward - it just gives you back what you see. The fitted probabilities equal the empirical proportions, so species 4 would have an estimated proportion of 3/(409+3), which is a very small value. This is called a "saturated model," and there's nothing wrong with saturated models for estimation. If you go on to add other adjustment or stratification variables, however, the model can quickly get out of hand: separation or small-sample bias is likely to make the odds ratios, predictions, etc. unstable.
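To see the "saturated model reproduces the empirical proportions" point concretely, here is a minimal sketch in Python with statsmodels. Only the 3-out-of-412 figure for species 4 comes from the example above; the other species' counts are made-up placeholders.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cell counts: events / totals per species.
# Species 4 uses the 3/(409 + 3) figure from the text; the rest are placeholders.
counts = pd.DataFrame({
    "species": ["1", "2", "3", "4"],
    "events": [120, 85, 40, 3],
    "total":  [300, 250, 200, 412],
})

# Expand to one row per observation for a plain logistic fit.
rows = []
for _, r in counts.iterrows():
    rows += [{"species": r["species"], "y": 1}] * int(r["events"])
    rows += [{"species": r["species"], "y": 0}] * int(r["total"] - r["events"])
df = pd.DataFrame(rows)

# Saturated model: one parameter per species, nothing else.
fit = smf.logit("y ~ C(species)", data=df).fit(disp=False)

# Fitted probabilities match the raw proportions.
print(fit.predict(counts).round(5))
print((counts["events"] / counts["total"]).round(5))
```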
Alan Agresti proposed some nifty adjustments for small cells that amount to basically "adding something": you can, for instance, just add 1 or 2 (or more) to the cells. In doing so you trade an unbiased estimator for a biased one with lower variance - and, hopefully, lower MSE.
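A minimal sketch of that bias-variance trade-off, to check whether the added count actually pays off for a cell this size. The true proportion is set near the 3/412 figure above, and the added constant c is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p_true, c = 412, 3 / 412, 2       # sample size, assumed true proportion, added count
x = rng.binomial(n, p_true, size=100_000)

p_raw = x / n                        # unbiased MLE
p_adj = (x + c) / (n + 2 * c)        # "add something" shrinkage estimator

for name, est in [("raw", p_raw), ("adjusted", p_adj)]:
    bias = est.mean() - p_true
    var = est.var()
    mse = ((est - p_true) ** 2).mean()
    print(f"{name:9s} bias={bias:+.2e}  var={var:.2e}  mse={mse:.2e}")
```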
If you perform inference, simulation studies show that logistic regression is well behaved in terms of power and type I error control even when the sample size is small. So you can test differences among Species 1, 2, 3, and 4: while some cell counts are small, the differences in proportions are striking and the overall $N$ is large for each species, so the comparisons are well powered. Alternatively, you can use Fisher's exact test, which even allows cells to be exactly 0, but it is conservative (it does not exactly attain the nominal alpha level) and tends to have lower power than the Pearson chi-square test.
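For a single pairwise comparison, say species 3 versus species 4, both tests are one-liners in scipy. The 2x2 table below is hypothetical apart from the 3/409 split taken from the example.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table: rows = species 3 vs species 4,
# columns = events vs non-events (only the 3 / 409 row is from the example).
table = np.array([[40, 160],
                  [3, 409]])

chi2, p_chi2, dof, _ = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)

print(f"Pearson chi-square: chi2={chi2:.2f}, p={p_chi2:.2g}")
print(f"Fisher's exact:     OR={odds_ratio:.1f}, p={p_fisher:.2g}")
```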
In more complicated (non-saturated) logistic models, people often speak of "events per variable": the number of events in the less prevalent outcome category should be at least 10 (or 20, or more) per variable. So 4 adjustment variables would require 40 or 80 "events" (and even more non-events) for adequate estimation. This heuristic rarely works out as neatly as stated, and it's easy to run little simulations tailored to your data structure and assess power and precision that way, as in the sketch below.
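A minimal simulation sketch of that last point: generate data under assumed proportions (only the species-4 rate comes from the text), refit the saturated model repeatedly, and look at the rejection rate for the coefficient of interest. The counts, "true" proportions, and number of replicates are all assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Assumed design: per-species sample sizes and "true" event proportions
# (only the species-4 rate, 3/412, comes from the example above).
n_per_species = np.array([300, 250, 200, 412])
p_true = np.array([0.40, 0.34, 0.20, 3 / 412])

def one_replicate():
    """Simulate counts, fit the saturated logit, return the species-4 p-value (or NaN)."""
    events = rng.binomial(n_per_species, p_true)
    y, X = [], []
    for i, (n, e) in enumerate(zip(n_per_species, events)):
        row = [1.0] + [float(j == i) for j in range(1, 4)]  # intercept + dummies; species 1 = reference
        y += [1] * e + [0] * (n - e)
        X += [row] * n
    try:
        fit = sm.Logit(np.array(y), np.array(X)).fit(disp=False)
        return fit.pvalues[3]
    except Exception:              # e.g. perfect separation when a cell has 0 events
        return np.nan

pvals = np.array([one_replicate() for _ in range(500)])
ok = ~np.isnan(pvals)
print("fits that failed (separation etc.):", (~ok).mean())
print("rejection rate at alpha = 0.05:    ", (pvals[ok] < 0.05).mean())
```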