5

sample code:

set.seed(123)
n = 500
x = rnorm(n)
a = -5
b = 0.5

y = exp(a + b * x + rnorm(n)) plot(x, y, pch = 20)

m = glm(y ~ x, family = gaussian(link = "log"))

sample_coefs = MASS::mvrnorm(n = 5000, mu = m$coefficients, Sigma = vcov(m)) sample_x = seq(-4, 4, 0.05) sample_lines = sample_coefs |> apply(1, (pars){exp(pars[1] + sample_x * pars[2])}) |> apply(1, (x)quantile(x, c(0.1, 0.9))) polygon(x = c(sample_x, rev(sample_x)), y = c(sample_lines[1,], rev(sample_lines[2,])), col = rgb(1, 0, 0, 0.5))

y_true = exp(a + sample_x * b) lines(sample_x, y_true, col = "blue", lwd = 2) y_est = exp(m$coefficients[1] + sample_x * m$coefficients[2]) lines(sample_x, y_est, lwd = 2)

y_buffer_90 = qnorm(p = 0.9, sd = summary(m)$sigma) lines(x = sample_x, y = m$coefficients[1] + sample_x * m$coefficients[2] + y_buffer_90) lines(x = sample_x, y = m$coefficients[1] + sample_x * m$coefficients[2] - y_buffer_90)

plot output

The thick black line shows the estimated relationship. The red area shows the 80% confidence interval of the estimated relationship. The blue line shows the true relationship.

As can be seen, the truth is far from the fit. Why does this happen?

Chechy Levas
  • 1,255
  • 1
    See here. Your simulation of the data does not correspond to the model you fit. The GLM model is $y = \exp{(a + bx)} + N(0, \sigma^2)$ or equivalently $y = N(\exp{(a + bx)}, \sigma^2)$. Try this: y <- exp(a + b * x) + rnorm(n, 0, 0.001) and fit the GLM again. – COOLSerdash Jul 13 '23 at 07:11
  • @COOLSerdash, spot on, thanks! If you would like to elevate your comment to an answer I will mark is as accepted. – Chechy Levas Jul 13 '23 at 08:01

1 Answers1

9

Remember that the GLM models the conditional mean of $Y$ and consists of three parts:

  1. The conditional distribution of $Y$ (here assumed to be Gaussian)
  2. The link function (here a log-link).
  3. The linear predictor.

In particular, your model is the following: $$ \log(\operatorname{E}(Y|X)) = -5 + 0.5x $$ Together with the conditional distribution, we have $$ Y = \operatorname{N}(\exp{(-5 + 0.5*x)}, \sigma^2) $$ So the correct data simulation corresponding to this model would be

y = exp(a + b * x) + rnorm(n, 0, 0.001)

Note that I've chosen $\sigma = 0.001$ here because otherwise the model wouldn't converge for me. Additional information can be found in this post.

Plotting the model fit and the true DGP, we now have an excellent agreement:

library(visreg)

set.seed(123) n = 500 x = rnorm(n) a = -5 b = 0.5

y = exp(a + b * x) + rnorm(n, 0, 0.001) plot(x, y, pch = 20)

m = glm(y ~ x, family = gaussian(link = "log"), start = c(1, 1))

visreg(m, "x", scale = "response", rug = FALSE) points(y~x, pch = 1, cex = 0.5) curve(((x)(exp(a + b*x)))(x), from = min(x), to = max(x), add = TRUE, col = "red")

GLMfit

The two lines are barely distinguishable.

COOLSerdash
  • 30,198