I am trying to understand how simulation plays a role in checking model assumptions when the residuals do not have exact distributions.
I took this Poisson regression model:
$$Y_i \mid X_i \sim \mathrm{Poisson}(\lambda_i), \qquad \lambda_i = e^{\beta_0 + \beta_1 X_i}$$
How can I see if the model assumptions are being met?
Some ideas come to mind:
Approach 1: I can bootstrap the data and fit the model on each bootstrap sample. If the model assumptions are met, then at each $x_i$ the distribution of the residuals across bootstrap samples should have mean $0$ and variance $\lambda_i$. (This approach seems like the most work computationally, since multiple models are fit.)
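A minimal numpy sketch of this idea. The toy dataset, the true coefficients $(0.5, 0.8)$, and the hand-rolled Newton-Raphson fit are all illustrative assumptions, not part of the question:

```python
import numpy as np

def fit_poisson(x, y, iters=25):
    """Fit log(lambda) = b0 + b1*x by Newton-Raphson (illustrative helper)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        lam = np.exp(X @ beta)
        # score = X'(y - lam), Hessian = X' diag(lam) X
        beta = beta + np.linalg.solve((X * lam[:, None]).T @ X, X.T @ (y - lam))
    return beta

rng = np.random.default_rng(1)
n, B = 200, 200
x = rng.uniform(0, 2, n)
y = rng.poisson(np.exp(0.5 + 0.8 * x))   # toy data from a true Poisson model

boot_resid = np.empty((B, n))
for b in range(B):
    idx = rng.integers(0, n, n)          # resample (x_i, y_i) pairs with replacement
    beta_b = fit_poisson(x[idx], y[idx])
    # raw residuals of the original data under the bootstrap fit
    boot_resid[b] = y - np.exp(beta_b[0] + beta_b[1] * x)

mean_resid = boot_resid.mean()           # should be near 0 if assumptions hold
```

Here `boot_resid[:, i]` collects the residuals at observation $i$ across all bootstrap fits, so its mean and variance can be compared against $0$ and $\hat\lambda_i$.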
Approach 2: I fit the model on the original data, then simulate new datasets from the fitted model and compare the real $Y$ values against the simulated ones. If the model assumptions are met, the model should be able to "reproduce" the observed data very closely. (I don't know how to measure "closeness" across all $x_i$.) This approach seems like less work, since only one model is fit.
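A sketch of this approach in numpy, again with an assumed toy dataset and a hand-rolled fit. One possible (assumed, not the only) way to measure "closeness" is a predictive p-value for a chosen summary statistic:

```python
import numpy as np

def fit_poisson(x, y, iters=25):
    """Fit log(lambda) = b0 + b1*x by Newton-Raphson (illustrative helper)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        lam = np.exp(X @ beta)
        beta = beta + np.linalg.solve((X * lam[:, None]).T @ X, X.T @ (y - lam))
    return beta

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 2, n)
y = rng.poisson(np.exp(0.5 + 0.8 * x))        # toy data from a true Poisson model

b0, b1 = fit_poisson(x, y)                    # one fit on the original data
lam_hat = np.exp(b0 + b1 * x)
sims = rng.poisson(lam_hat, size=(1000, n))   # 1000 datasets simulated from the fit

# Where does the observed statistic fall among the simulated ones?
sim_var = sims.var(axis=1)
p_value = (sim_var >= y.var()).mean()         # near 0 or 1 would signal misfit
```

The choice of statistic (variance here) is up to the analyst; overdispersion checks, maxima, or per-$x_i$ comparisons would follow the same pattern.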
Approach 3 (the DHARMa approach, https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html): I fit the model on the original data, then generate multiple simulated values of $y_i$ at each $x_i$. For a given $x_i$, I build the empirical CDF of the simulated $y_i$ and find where the actually observed $y_i$ sits on that CDF. If the model fits the data well, these positions should be uniformly distributed between $0$ and $1$ (I am not sure why - is this because the quantile residuals of a correctly specified model are uniform?), so across observations they should average out to about the 50% level. I repeat this empirical CDF construction for each $x_i$ and check whether the observed $y_i$ are, on average, near the 50% level. (This approach also involves fitting only one model.)
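A numpy sketch of this scaled-residual idea (toy data and hand-rolled fit are assumptions; the randomization within the ECDF tie interval mirrors how DHARMa handles discrete counts, though this is a simplified version of what the package does):

```python
import numpy as np

def fit_poisson(x, y, iters=25):
    """Fit log(lambda) = b0 + b1*x by Newton-Raphson (illustrative helper)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        lam = np.exp(X @ beta)
        beta = beta + np.linalg.solve((X * lam[:, None]).T @ X, X.T @ (y - lam))
    return beta

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 2, n)
y = rng.poisson(np.exp(0.5 + 0.8 * x))        # toy data from a true Poisson model

b0, b1 = fit_poisson(x, y)
lam_hat = np.exp(b0 + b1 * x)
sims = rng.poisson(lam_hat, size=(1000, n))   # simulated y_i at each x_i

# Position of each observed y_i in the ECDF of its own simulations; drawing
# uniformly within the tie interval [P(sim < y_i), P(sim <= y_i)] handles
# the discreteness of count data
lower = (sims < y).mean(axis=0)
upper = (sims <= y).mean(axis=0)
u = rng.uniform(lower, upper)  # should look Uniform(0, 1) under a correct model
```

A histogram or QQ-plot of `u` against the Uniform(0, 1) distribution is then the diagnostic: systematic clustering away from uniformity indicates a violated assumption.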
Are all 3 approaches correct? Is this the right use of simulation to validate model assumptions when the residuals do not have exact distributions?