2

I'm going to provide a simulated case. However, the question is of a general nature (see end of the post). Let's suppose we have some data generated in this way:

set.seed(123)
n <- 10000
x <- rpois(n, 2)
x2 <- rpois(n, 2)
dat <- data.frame("Dependent"= x + rpois(n, 1), "Independent"= x)
newDat <- data.frame("Independent" = x2, "Dependent"= x2 + rpois(n, 1))

If we were seeing this data for the first time, we would do some EDA and conclude that a closely approximating distribution could be a Poisson:

hist(dat$Dependent, main = "Histogram of response", xlab = "Dependent")

enter image description here

Hence, we fit a glm with Poisson process and check the fit and Pearson residuals:

plot(dat$Dependent, fitted(glmPoisson), main="Fitted vs real values", xlab="Real values", ylab = "Fitted")
abline(0,1, col="red")
plot(fitted(glmPoisson), residuals(glmPoisson, type = "pearson"), main="Fitted vs Pearson residuals", ylab="Pearson residuals", xlab = "Fitted")

From these simple plots, we clearly see that the fit is awful and residuals are clearly showing a pattern. Also, it is not able to model unseen data by the same process:

enter image description here

enter image description here enter image description here

However, if we run the same analysis assuming an underlying Gaussian process, all these problems disappear and the fit/predictive power is excellent:

glmGaussian <- glm(Dependent ~ Independent, data = dat, family = gaussian)
summary(glmGaussian)

plot(dat$Dependent, fitted(glmGaussian), main="Fitted vs real values", xlab="Real values", ylab = "Fitted")
abline(0,1, col="red")
plot(fitted(glmGaussian), residuals(glmGaussian, type = "pearson"), main="Fitted vs Pearson residuals", ylab="Pearson residuals", xlab = "Fitted")

plot(newDat$Independent, predict(glmGaussian, type = "response", newdata = newDat), main = "New actual values vs predictions", xlab = "New values", ylab = "Predictions")
abline(0,1, col="red")

Hence:

  1. Why, even if the underlying process is Poisson, the model is better when Gaussian?
  2. In general, how can one decide a-priori the functional process, given that even in this simulated case we would have got the model completely wrong?
OnlyAL
  • 98
  • 1
    You haven't indicated how you performed the original fit. Most likely it used a default log link rather than the linear link you used to generate the data. It's wonderful how clearly your results demonstrate the model you used is the wrong one! – whuber Mar 08 '19 at 13:07
  • 1
    What matters is not the marginal distribution of your dependent variable (shown in your first histogram) but its conditional distribution given your independent variable $x$. This conditional distribution is not Poisson but a shifted Poisson distribution with a constant variance and mean depending linearly on $x$. Since the variance is constant, ordinary linear regression will be appropriate and robust given a sufficient sample size. – Jarle Tufto Mar 08 '19 at 13:10
  • 1
    @whuber Gotcha! Good point. I did not think about that. I exactly used a log-link when fitting the Poisson – OnlyAL Mar 08 '19 at 15:27
  • See also: https://stats.stackexchange.com/questions/142338/goodness-of-fit-and-which-model-to-choose-linear-regression-or-poisson/142353#142353 – kjetil b halvorsen Dec 03 '19 at 04:49

0 Answers0