Why does this demonstration of a GLM have a confident fit so far from the truth?

Question

sample code:

set.seed(123)
n = 500
x = rnorm(n)
a = -5
b = 0.5
y = exp(a + b * x + rnorm(n))
plot(x, y, pch = 20)
m = glm(y ~ x, family = gaussian(link = "log"))
sample_coefs = MASS::mvrnorm(n = 5000, mu = m$coefficients, Sigma = vcov(m))
sample_x = seq(-4, 4, 0.05)
sample_lines = sample_coefs |> apply(1, (pars){exp(pars[1] + sample_x * pars[2])}) |> apply(1, (x)quantile(x, c(0.1, 0.9)))
polygon(x = c(sample_x, rev(sample_x)), y = c(sample_lines[1,], rev(sample_lines[2,])), col = rgb(1, 0, 0, 0.5))
y_true = exp(a + sample_x * b)
lines(sample_x, y_true, col = "blue", lwd = 2)
y_est = exp(m$coefficients[1] + sample_x * m$coefficients[2])
lines(sample_x, y_est, lwd = 2)
y_buffer_90 = qnorm(p = 0.9, sd = summary(m)$sigma)
lines(x = sample_x, y = m$coefficients[1] + sample_x * m$coefficients[2] + y_buffer_90)
lines(x = sample_x, y = m$coefficients[1] + sample_x * m$coefficients[2] - y_buffer_90)

The thick black line shows the estimated relationship. The red area shows the 80% confidence interval of the estimated relationship. The blue line shows the true relationship.

As can be seen, the truth is far from the fit. Why does this happen?

See here. Your simulation of the data does not correspond to the model you fit. The GLM model is $y = \exp{(a + bx)} + N(0, \sigma^2)$ or equivalently $y = N(\exp{(a + bx)}, \sigma^2)$. Try this: y <- exp(a + b * x) + rnorm(n, 0, 0.001) and fit the GLM again. — COOLSerdash, Jul 13 '23 at 07:11
@COOLSerdash, spot on, thanks! If you would like to elevate your comment to an answer I will mark is as accepted. — Chechy Levas, Jul 13 '23 at 08:01

score 9 · Accepted Answer · answered Jul 13 '23 at 08:12

Remember that the GLM models the conditional mean of $Y$ and consists of three parts:

The conditional distribution of $Y$ (here assumed to be Gaussian)
The link function (here a log-link).
The linear predictor.

In particular, your model is the following: $$ \log(\operatorname{E}(Y|X)) = -5 + 0.5x $$ Together with the conditional distribution, we have $$ Y = \operatorname{N}(\exp{(-5 + 0.5*x)}, \sigma^2) $$ So the correct data simulation corresponding to this model would be

y = exp(a + b * x) + rnorm(n, 0, 0.001)

Note that I've chosen $\sigma = 0.001$ here because otherwise the model wouldn't converge for me. Additional information can be found in this post.

Plotting the model fit and the true DGP, we now have an excellent agreement:

library(visreg)
set.seed(123)
n = 500
x = rnorm(n)
a = -5
b = 0.5
y = exp(a + b * x) + rnorm(n, 0, 0.001)
plot(x, y, pch = 20)
m = glm(y ~ x, family = gaussian(link = "log"), start = c(1, 1))
visreg(m, "x", scale = "response", rug = FALSE)
points(y~x, pch = 1, cex = 0.5)
curve(((x)(exp(a + b*x)))(x), from = min(x), to = max(x), add = TRUE, col = "red")

The two lines are barely distinguishable.

thanks for also telling me about visreg. much easier! – Chechy Levas Jul 13 '23 at 08:43 — Chechy Levas, Jul 13 '23 at 08:43

Why does this demonstration of a GLM have a confident fit so far from the truth?

1 Answers1