
When fitting a Poisson regression on data with low expected values, the intercept estimate has a small bias even when the model is perfectly specified. Below, I simulate data as $y \sim \text{Poisson}(\exp(\beta_0))$ and then fit the intercept-only GLM $\log(E[y]) = \beta_0$. On average, the estimates are slightly biased downwards. The bias is small, but I would like to understand why it happens.

I could understand why this would happen if $\beta_0$ were a large negative number and the data were mostly zeros, but for the $\beta_0$ values I chose the data are almost always non-zero. Why does this happen?

    # function to run the simulation for one set of beta values
    run_sim <- function(b0, n = 50, R = 10000){
      # simulate y values, fit an intercept-only Poisson GLM,
      # and keep the estimated intercept
      b0_estimates <- sapply(1:R, function(i){
        tmp <- data.frame(y = rpois(n, exp(b0)))
        mod_col <- glm(y ~ 1, data = tmp, family = poisson)
        b0_hat <- coef(mod_col)[1]
        return(b0_hat)
      })

      # get the bias
      mean_bias <- mean(b0_estimates) - b0

      return(mean_bias)
    }

    # simulate for beta0 values ranging from 1 to 10
    b0_vec <- 1:10
    bias_vec <- sapply(b0_vec, function(b0){
      run_sim(b0, R = 10000)
    })

    # plot the results
    plot(b0_vec, bias_vec, xlab = 'true b0', ylab = 'b0 bias')


1 Answer


The score function is exactly unbiased: $$E_{\beta_0}\Big[\sum_i x_i(y_i-\mu_i)\Big]=0.$$ In your case that simplifies to $$E_{\beta_0}\Big[\sum_i (y_i-\exp\beta_0)\Big]=0.$$ The parameter estimate is a non-linear function of the score, so that tells us it won't be exactly unbiased.
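A quick numerical check (a minimal sketch, reusing the question's $n = 50$ and simulating at the true $\beta_0$) confirms that the score averages to zero across replications:

    # sketch: at the true beta0, the intercept-only score sum(y - exp(b0))
    # averages to zero across replications (up to Monte Carlo error)
    set.seed(42)
    n <- 50; b0 <- 1; R <- 10000
    scores <- replicate(R, sum(rpois(n, exp(b0)) - exp(b0)))
    mean(scores)  # close to 0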

Can we work out the direction of the bias? Well, the mean of $Y$ is $\exp \beta_0$, so $\beta_0=\log E[Y]$ and $\hat\beta_0=\log \bar Y$. The logarithm is concave, and $E[\bar Y]=E[Y]=\exp\beta_0$, so we can use Jensen's inequality to see that the bias is downward. (Or draw a picture, like the one here, only the other way up.)
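To get a rough size for this Jensen gap, a standard second-order Taylor (delta-method) expansion of $\log \bar Y$ around $\mu = \exp\beta_0$ gives $E[\log \bar Y] \approx \log\mu - \operatorname{Var}(\bar Y)/(2\mu^2)$, and since $\operatorname{Var}(\bar Y) = \exp(\beta_0)/n$ the bias is approximately $-\exp(-\beta_0)/(2n)$. A minimal sketch comparing this approximation with simulation:

    # sketch: delta-method approximation to the intercept bias,
    # bias(b0) ~ -exp(-b0) / (2 * n), checked against simulation
    set.seed(1)
    n <- 50; R <- 10000
    approx_bias <- function(b0) -exp(-b0) / (2 * n)
    sim_bias <- function(b0){
      mean(replicate(R, log(mean(rpois(n, exp(b0)))))) - b0
    }
    b0_vec <- 1:5
    cbind(b0 = b0_vec,
          approx = approx_bias(b0_vec),
          simulated = sapply(b0_vec, sim_bias))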

Thomas Lumley
  • Thank you! This is clear and helpful. I have two follow-up questions:
    1. Is there a way to calculate the bias/Jensen's gap in this case, i.e. to calculate $E[\hat{\beta}_0] - \beta_0$? It should be some function of $n$ and $\beta_0$. I am having trouble dealing with the distribution of $\log \bar{Y}$.
    – Nick Link Apr 11 '23 at 14:41
  • 2. The downward bias on the intercept term is still present when there are other coefficients in the model. How can we show this is due to Jensen's inequality in a similar way? When I try to show it similarly I end up with $\beta_0 = \log E[\bar{Y}] - \beta_1 X_1 - \beta_2 X_2 - \cdots$ and $E[\hat{\beta}_0] = E[\log (n\bar{Y})] - E[\log(\sum_{i=1}^n \exp(\hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2} + \cdots))]$. I don't know where to go from there.
    – Nick Link Apr 11 '23 at 14:48
  • You can't just use Jensen's inequality when it's a multidimensional problem. In fact, the bias is typically not defined when you have a covariate, because there's a tiny but non-zero probability that $\hat\beta$ is infinite.
    – Thomas Lumley Apr 11 '23 at 23:09
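A minimal sketch of that last point, using hypothetical data: if every response in one covariate group happens to be zero, the fitted slope runs off towards $-\infty$, and glm() stops at a huge negative value with an enormous standard error:

    # sketch: with a covariate, an all-zero group makes the MLE diverge
    set.seed(7)
    x <- rep(0:1, each = 5)
    y <- c(rpois(5, exp(1)), rep(0, 5))  # all responses zero when x = 1
    coef(glm(y ~ x, family = poisson))   # slope is effectively -infinity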