
Poisson regression is commonly used to analyse count data. However, when dealing with rare events it no longer seems appropriate. At least, the usual graphical criteria for assessing model fit, such as plots of Pearson residuals or deviance residuals, are inappropriate.

How should we deal with that?

Consider the following toy example. The coefficient estimates are actually close to the 'true' values, which of course we never know in practice. So how can we evaluate and measure the model fit?

Edit to clarify the question: I think the significance tests for the parameters in the summary output use the Pearson and deviance residuals, at least implicitly. So if these residuals are not appropriate in this case, how reliable are the significance tests?

```r
# Toy-Example (Poisson Regression, Rare Event)
set.seed(11)

N <- 100000

# Two categorical variables
I1 <- sample(c(0,1), size = N, replace = TRUE)
I2 <- sample(c(0,1), size = N, replace = TRUE)

lambda <- 0.05 * 0.9^I1 * 1.2^I2

Y <- rep(0, N)
for (i in 1:N){
  Y[i] <- rpois(1, lambda = lambda[i])
}

model <- glm(Y ~ I1 + I2, family = poisson())
summary(model)

# Coefficients are close to the true values
exp(model$coefficients)

# However, the residuals look awful
res0 <- Y - model$fitted.values
res1 <- residuals(model, "pearson")
res2 <- residuals(model, "deviance")

par(mfrow = c(2,3))
plot(model$fitted.values, res0, main = "Raw Residuals")
plot(model$fitted.values, res1, main = "Pearson Residuals")
plot(model$fitted.values, res2, main = "Deviance Residuals")
boxplot(res0, main = "Raw Residuals")
boxplot(res1, main = "Pearson Residuals")
boxplot(res2, main = "Deviance Residuals")
```

Student
  • When using a PRNG in your example it's good practice to include a fixed seed, otherwise the results will differ from run to run. The impact here will be small because your N is large, but it's only one extra line of code. – PBulls Dec 07 '23 at 11:14
  • When $Y=0$ (almost $95%$ of cases) your residuals are small (about $-0.05$ for the raw residuals, about $-0.23$ for the Pearson residuals and about $-0.32$ for the deviance residuals), but eyes are drawn in your graphs to the other cases, both for the almost $5%$ of cases where $Y=1$ and the tiny number of cases where $Y$ is $2$ or more. Your top row of charts makes no attempt to distinguish these scale differences. You need those residuals to be large, to drive the estimation of coefficients when $Y$ is usually $0$ no matter what the values of $I_1$ and $I_2$. – Henry Dec 07 '23 at 11:57
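Henry's figures can be checked directly from the standard Poisson residual formulas. A minimal sketch, using the baseline rate $\lambda = 0.05$ from the question's simulation:

```r
# Residuals for an observation with Y = 0 when the fitted mean is mu = 0.05
# (the baseline rate in the question's simulation)
mu <- 0.05
y  <- 0
raw     <- y - mu              # raw residual: -0.05
pearson <- (y - mu) / sqrt(mu) # (y - mu)/sqrt(mu): about -0.224
# Poisson deviance residual: sign(y - mu) * sqrt(2*(y*log(y/mu) - (y - mu)));
# the y*log(y/mu) term is 0 when y = 0, leaving -sqrt(2*mu)
dev <- -sqrt(2 * mu)           # about -0.316
round(c(raw = raw, pearson = pearson, deviance = dev), 3)
```

These match the values quoted in the comment, so almost 95% of observations sit in three narrow bands near zero while the eye is drawn to the few large positive residuals.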

3 Answers


Welcome to CV. Poisson regression is fine for rare events. In fact, it got started with modeling the number of soldiers kicked to death by horses --- not a common event, even in the days of large cavalry forces! The problem is that the plots you made aren't well suited to the purpose you are using them for when running Poisson regression.

The pattern of residuals in the plots is due to the discrete nature of your dependent variable. This has been discussed on CV before, e.g. here. A better alternative is discussed here.
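The "better alternative" referred to above is based on randomized quantile residuals (Dunn & Smyth, 1996). A hand-rolled sketch for a Poisson fit, just to show the idea (packages such as DHARMa and `statmod::qresiduals` automate this):

```r
# Randomized quantile residuals for a Poisson fit: spread each discrete
# count uniformly over its probability mass, then map to the normal scale
set.seed(1)
y   <- rpois(2000, lambda = 0.05)   # rare-event counts, as in the question
fit <- glm(y ~ 1, family = poisson())
mu  <- fitted(fit)
# draw uniformly between the CDF just below and at the observed count
u  <- runif(length(y), ppois(y - 1, mu), ppois(y, mu))
rq <- qnorm(u)                      # approx. N(0,1) if the model is right
qqnorm(rq); qqline(rq)              # hugs the line, unlike the raw residuals
```

Because the discreteness is smoothed away, these residuals are continuous and approximately standard normal under a correctly specified model, so the usual diagnostic plots become readable.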

Peter Flom
  • Thanks for your answer and the references. However, the summary of the glm fit yields significance tests for the parameters. Don't these significance tests work on the basis of the Pearson and deviance residuals, at least implicitly? – Student Dec 10 '23 at 15:33
  • I don't think so, but I'm not sure. They are based on the standard error of the estimate. – Peter Flom Dec 10 '23 at 18:46
  • 1
    "[...] it got started with modeling the number of soldiers kicked to death by the horses [...]" Whaaaat!? That's fascinating! Thank you for sharing that (+1). – Galen Dec 10 '23 at 18:50
  • See Quine MP, Seneta E. Bortkiewicz's data and the Law of Small Numbers. International Statistical Review 1987; 55: 173–81. (The original is by Bortkiewicz, 1898 and is in German) – Peter Flom Dec 10 '23 at 18:54
  • @Student see this answer for an explanation of the significance tests performed by summary.glm(). They are based on the overall log-likelihood of the data set as a function of parameter values, not on individual deviance residuals. Even in ordinary linear regression, you get (nominal) significance tests for coefficients regardless of whether the assumptions are met. For evaluating assumptions in Poisson models, see the last link provided by Peter Flom. – EdM Dec 10 '23 at 19:06
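The point that the tests in `summary.glm()` are Wald tests (each estimate divided by its likelihood-based standard error), rather than functions of individual residuals, can be checked directly. A small sketch on simulated data:

```r
# The z values reported by summary(glm(...)) are Estimate / Std. Error,
# where the standard errors come from the curvature of the log-likelihood
# at the fit -- not from the Pearson or deviance residuals individually
set.seed(11)
I1 <- sample(c(0, 1), 5000, replace = TRUE)
Y  <- rpois(5000, 0.05 * 0.9^I1)
m  <- glm(Y ~ I1, family = poisson())
tab <- summary(m)$coefficients
all.equal(tab[, "z value"], tab[, "Estimate"] / tab[, "Std. Error"])  # TRUE
```

So the nominal tests are produced whether or not the residual plots look reasonable; their reliability rests on the model being correctly specified, which is a separate check.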

Peter Flom is correct (+1). For completeness, here's the recommended graphical evaluation of the model in the OP, following the method in this answer based on the DHARMa package. (I have version 0.4.3 installed on this computer; 0.4.6 is the current version.)

```r
library(DHARMa)
res <- simulateResiduals(model)
plot(res)
```

[Figure: DHARMa evaluation of model]

Those residuals are certainly well behaved.
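Beyond the plot, DHARMa also provides formal tests on the simulated residuals. A short sketch, refitting the toy model from the question so the block stands on its own:

```r
# Formal DHARMa tests on the simulated (scaled) residuals
library(DHARMa)
set.seed(11)
N  <- 10000
I1 <- sample(c(0, 1), N, replace = TRUE)
I2 <- sample(c(0, 1), N, replace = TRUE)
Y  <- rpois(N, 0.05 * 0.9^I1 * 1.2^I2)
model <- glm(Y ~ I1 + I2, family = poisson())
res <- simulateResiduals(model)
testUniformity(res)    # KS test of the scaled residuals against uniform
testDispersion(res)    # simulation-based test for over/underdispersion
testZeroInflation(res) # compares observed vs simulated zero counts
```

For a correctly specified model like this one, all three tests should come back non-significant.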

EdM

I agree with the answers above, but I think one point is missing.

The method of Dunn and Smyth (1996) may serve as a graphical criterion, which was the main focus of the question. However, I have not found a result in their paper that makes me comfortable judging the sensitivity of such a graphical criterion.

In the following example (with slightly disturbed data), the likelihood-based significance tests indicate that something is wrong. The randomized quantile residuals do not seem to be sensitive to such a misspecification:

```r
# Toy-Example (Poisson Regression, Rare Event)
set.seed(19)

N <- 10000

# Two categorical variables
I1 <- sample(c(0,1), size = N, replace = TRUE)
I2 <- sample(c(0,1), size = N, replace = TRUE)

lambda <- 0.05 * 0.9^I1 * 1.2^I2

Y <- rep(0, N)

# Error
Error <- rbinom(length(Y), 1, prob = 1/10)

for (i in 1:N){
  # Disturbance
  Y[i] <- rpois(1, lambda = lambda[i]) + Error[i]
}

model <- glm(Y ~ I1 + I2, family = poisson())
summary(model)

# Method of Dunn and Smyth (1996)
library(DHARMa)
res <- simulateResiduals(model)
plot(res)
```

Student