
I am reading this post and am still confused about the different ways of fitting exponential data. Specifically, why am I getting different results with the following code? Could anyone help me write down the objective functions for the different models?

For lm it is $\|X\beta-\log(y)\|_2^2$, but what about the glm cases?

The reason I ask for the objective functions is that much of the literature focuses on the algorithmic details of iteratively re-weighted least squares, but puts little emphasis on the high-level objective.

last_14 = data.frame(rbind(
c(3460,  14,    0),
c(3558,  17,    1),
c(3802,  21,    2),
c(3988,  22,    3),
c(4262,  28,    4),
c(4615,  36,    5),
c(4720,  40,    6),
c(5404,  47,    7),
c(5819,  54,    8),
c(6440,  63,    9),
c(7126,  85,   10),
c(7905, 108,   11),
c(8733, 118,   12),
c(9867, 200,   13)))
names(last_14) = c('World', 'US', 'days')

fit_lm = lm(log(World) ~ days, last_14)
fit_glm = glm(formula = World ~ days,  data=last_14, family=gaussian(link='log'))
fit_glm2 = glm(formula = World ~ days,  data=last_14, family=poisson())
Haitao Du

1 Answer


Summary

Linear model with least squares (Gaussian distributed observations)

fit_lm = lm(log(World) ~ days, last_14)

$$\sum_{\forall i} (\log(y_i) - X_i \beta)^2$$

Non-linear model with least squares (Gaussian distributed observations)

using glm (with the Gaussian distribution family)

fit_glm = glm(formula = World ~ days,  data=last_14, 
family=gaussian(link='log'))

or using non-linear least squares (NLS)

fit_nls = nls(World ~ exp(a+b*days), start = list(a = 8, b = 0.1), data=last_14)

$$\sum_{\forall i} (y_i - e^{X_i \beta})^2$$

Poisson regression (Poisson distributed observations)

using glm (with the Poisson distribution family)

fit_glm2 = glm(formula = World ~ days,  data=last_14, family=poisson())

$$\sum_{\forall i} \left(e^{X_i \beta} -(X_i \beta)y_i\right)$$

(this is the negative Poisson log-likelihood, up to a term that does not depend on $\beta$)

GLM

The relationship for GLM can be written down as

$$Y_i = f( X_i \beta) + \epsilon_i$$

Sometimes people instead use the link function $f^{-1}$ to linearize the equation

$$f^{-1}(Y_i) = f^{-1}\left( f(X_i \beta) + \epsilon_i \right) \neq X_i \beta + \epsilon_i$$

But that is not the same. Note the inequality and how $\epsilon_i$ enters differently (an example with $f(x)=\exp(x)$: $\log(\exp(1)+1) \neq 1+1$).
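As a quick numerical check of that example in R:

log(exp(1) + 1)   # 1.313262..., the link applied after the error has been added
1 + 1             # 2, the error added on the already linearized scale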


The difference between glm with link function and linearized least squares

The difference is that the error terms are incorporated differently. We can write it down more explicitly for a logarithm/exponential function.

Let the linearized relationship lm(log(World) ~ days) be

$$\log(y_i) = a + b x_i + \epsilon_i$$

Then the non-linearized relationship is:

$$y_i = e^{a + b x_i + \epsilon_i}$$

and this is not the same as the model fitted by glm(World ~ days, family=gaussian(link='log')), which is

$$y_i = e^{a + b x_i} + \epsilon_i$$

The error term $\epsilon_i$ occurs differently in the formula.
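As a rough illustration of this difference, here is a minimal simulation sketch (the variables x, y_mult, y_add and the noise levels below are made up for illustration). Data generated with the error inside the exponent is recovered by the linearized lm fit, while data with additive error is recovered by the glm fit with log link:

set.seed(1)
x <- 0:13
a <- 8; b <- 0.1
y_mult <- exp(a + b*x + rnorm(length(x), sd = 0.05))    # error inside the exponent
y_add  <- exp(a + b*x) + rnorm(length(x), sd = 200)     # error added afterwards

coef(lm(log(y_mult) ~ x))                                # should be close to a = 8, b = 0.1
coef(glm(y_add ~ x, family = gaussian(link = 'log')))    # should be close to a = 8, b = 0.1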


The difference between different families

In the case of the Gaussian/Normal family the following two are the same:

$$Y\vert X \sim \mathcal{N}(\mu = h(X), \sigma^2 )$$

or

$$Y = h(X) + \epsilon \quad \text{where} \quad \epsilon \sim N(0,\sigma^2)$$

This separation into a sum of a deterministic component $h(X)$ and an error/noise term $\epsilon$ does not work the same way for other families. For instance, for the Poisson distribution the noise term is larger when the mean is larger.
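A quick sketch of this mean-variance relation for the Poisson distribution (the particular means and the sample size below are arbitrary):

# for the Poisson distribution the variance equals the mean,
# so the noise around a large mean is larger than around a small mean
set.seed(1)
sapply(c(5, 50, 500), function(m) var(rpois(1e5, lambda = m)))
# roughly 5, 50, 500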


Poisson distribution with log link

The log likelihood for a single observation $z$ is (up to a term that does not depend on $\beta$)

$$L = z X\beta - e^{X\beta}$$

and

$$\frac{\partial L}{\partial \beta_i} = \left( z - e^{X\beta} \right) x_i$$
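As a sanity check of this derivative, here is a small sketch comparing the analytic gradient with a central finite-difference approximation for a single observation (the values of z, x and beta below are arbitrary):

# dL/dbeta_i = (z - exp(x . beta)) * x_i for one observation z with covariate row x
z <- 5
x <- c(1, 3)                     # intercept and one covariate
beta <- c(0.2, 0.4)
loglik <- function(b) z * sum(x*b) - exp(sum(x*b))

analytic_grad <- (z - exp(sum(x*beta))) * x
numeric_grad  <- sapply(1:2, function(i) {
  h <- 1e-6
  e <- replace(numeric(2), i, h)
  (loglik(beta + e) - loglik(beta - e)) / (2*h)
})
cbind(analytic_grad, numeric_grad)   # the two columns should agree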

In the framework of GLM the optimum of this likelihood function is found by iteratively re-weighted least squares, solving in each iteration

$$L_{iteration} = 0.5\, w(Y^\prime - X\beta)^2$$

with derivative

$$\frac{ \partial L_{iteration}}{\partial \beta_i} = w (Y^\prime - X\beta) x_i$$

and the transformation between the two would be (check https://www.jstor.org/stable/2344614 for the details):

$$Y^\prime = X\beta + \frac{z - e^{X\beta}}{e^{X\beta}}$$

and

$$w = e^{X\beta}$$

where we do not know $e^{X\beta}$, but the current estimate $e^{X\hat\beta}$ can be used and the result improved iteratively.

Intuitively

You could loosely see a GLM as approximating the more general exponential-family distribution with Gaussian noise, with $\theta = X\beta$

$$Y \approx f(\theta) + \epsilon \quad \text{where} \quad \epsilon \sim N(0,w\sigma^2) $$

where

  • the weight $w$ relates to the non-homogeneity of the variance (e.g. in the case of the Poisson distribution $\sigma^2 = \mu$)

and in linearized form

$$f^{-1}(Y) \approx \theta + \epsilon + \frac{Y-f(\theta + \epsilon)}{\partial f(\theta) / \partial \theta } \quad \text{where} \quad \epsilon \sim N(0,w\sigma^2) $$

where

  • the term $\frac{Y-f(\theta + \epsilon)}{\partial f(\theta) / \partial \theta }$ relates to the non-linearity in the effect of errors on the response when a link function is applied to the response (i.e. the model of the error distribution is for $Y$ and not for $f^{-1}(Y)$, and that needs to be corrected). So this is an additional correction, aside from the weights, which only correct for the non-homogeneity in the variance of $Y\vert X$ and not of $f^{-1}(Y) \vert X$. (See the short sketch after this list.)
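A short sketch of this working response in practice (assuming last_14 and fit_glm2 from the question are in the workspace): evaluated at the converged fit, a single weighted least-squares step on $Y^\prime$ reproduces the glm coefficients.

eta <- predict(fit_glm2, type = "link")             # X beta-hat
mu  <- exp(eta)                                     # e^(X beta-hat)
Yprime <- eta + (last_14$World - mu)/mu             # working response
w <- mu                                             # working weights
coef(lm(Yprime ~ days, data = last_14, weights = w))   # matches coef(fit_glm2)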

Computational demonstration

days <- last_14$days
US <- last_14$US

iterating manually

Y <- last_14$US
X <- cbind(rep(1,14), last_14$days)
coef <- c(2, 0.3)                          # begin solution
yp <- exp(X %*% coef)
for (i in 1:100) {
  w <- as.numeric(yp)                      # weights
  Yprime <- log(yp) + (Y - yp)/yp          # working response (y-values)
  coef <- solve(crossprod(X, w*X), crossprod(X, w*Yprime))
  yp <- exp(X %*% coef)                    # new solution
}

glm function

modglm <- glm(US ~ days,
              family = poisson(link = "log"),
              control = list(epsilon = 10^-20, maxit = 100))

direct optimization of likelihood

Loption = "Poisson" L <- function(x) { a <- x[1] b <- x[2] Xb <- a+bdays if (Loption == "Poisson") { return(-sum(YXb-exp(Xb))) } if (Loption == "Gaussian loglink") { return(sum((Y-exp(Xb))^2)) } if (Loption == "linearized model") { return(sum((log(Y)-Xb)^2)) } }

start <- c(a = 2, b = 0.3)
modoptim <- optim(par = start, fn = L)

These give the same results:

> # glm model
> modglm$coefficients
(Intercept)        days 
  2.4750654   0.2030466

> # optimizing likelihood function
> modoptim$par
        a         b 
2.4745912 0.2031048 

> # manual computation
> coef
          [,1]
[1,] 2.4750654
[2,] 0.2030466

Computations for other cases

Below are the other cases. Note that the GLM fit with the Gaussian family can alternatively be done with nls.

> ###for the other cases
> 
> Loption = "Gaussian loglink"
> optim(par = start,fn = L)$par
        a         b 
2.1735638 0.2315177 
> glm(formula = US ~ days,  data=last_14, family=gaussian(link='log'))

Call: glm(formula = US ~ days, family = gaussian(link = "log"), data = last_14)

Coefficients:
(Intercept)         days  
     2.1736       0.2315  

Degrees of Freedom: 13 Total (i.e. Null);  12 Residual
Null Deviance:      35020 
Residual Deviance: 1375         AIC: 110

> nls(US ~ exp(a + b*days), start = list(a = 2, b = 0.2))
Nonlinear regression model
  model: US ~ exp(a + b * days)
   data: parent.frame()
     a      b 
2.1736 0.2315 
 residual sum-of-squares: 1375

Number of iterations to convergence: 7 
Achieved convergence tolerance: 3.19e-06

> Loption = "linearized model"
> optim(par = start, fn = L)$par
        a         b 
2.5917459 0.1879523 

> lm(log(US) ~ days)

Call: lm(formula = log(US) ~ days)

Coefficients:
(Intercept)         days  
     2.5918       0.1879  

  • $Y_i = f( X_i \beta) + \epsilon_i$ confuses me. What would $\epsilon_i$ be for a logistic regression? – Dave May 13 '20 at 16:24
  • We often take $\epsilon \sim N(0,\sigma^2)$ but you could use any other distribution (in the case of GLM it is limited to a distribution from a particular exponential family). Also the function $f$ can be anything. For logistic regression $f$ is the logistic function. – Sextus Empiricus May 13 '20 at 16:38
  • More precise would be to write $Y\vert X \sim D(\mu = f(X\beta))$: Y can be seen as following a distribution $D$ for which the mean parameter $E(Y\vert X)$ is equal to $f(X\beta)$. The description as a linear sum of a deterministic part and a random part indeed makes more sense for least squares (normal distribution). – Sextus Empiricus May 13 '20 at 16:43
  • When optimizing the likelihood function for the Poisson case, modoptim$par is not exactly equal to the manual calculation or the glm results. Could you help me understand why? – Haitao Du May 21 '20 at 08:16
  • @HaitaoDu I am not sure, but I guess that the (small) discrepancy is because the algorithm used in optim stops early, before reaching the exact solution (optim iteratively improves the estimate, obtaining an estimate with a higher likelihood each step and getting closer to the exact solution each step, and stops when the improvement steps are smaller than a certain factor times the machine precision). You can change the parameters to make optim return a more accurate result. – Sextus Empiricus May 21 '20 at 09:03
  • Thanks for your comments. I think control = list(reltol=1e-30) fixed it. – Haitao Du May 21 '20 at 09:32
  • Example: for Nelder-Mead you can change the reltol parameter like this: modoptim <- optim(par = start, fn = L, method = 'Nelder-Mead', control = list(reltol=10^-12)). That reltol parameter defines when the algorithm stops (that is, when the algorithm decides that it has converged close enough to the solution). For different methods there are different parameters, and there will be different ways to get more precise results (e.g. for the gradient-based methods it might be useful to change the scale or ndeps parameters, which have an effect on the precision of the computation of the gradient). – Sextus Empiricus May 21 '20 at 09:51
  • This is by far the clearest explanation of the GLM which I have ever read. – hasManyStupidQuestions Jul 09 '20 at 20:20