summary
Linear model with least squares (Gaussian distributed observations)
fit_lm = lm(log(World) ~ days, last_14)
$$\sum_{\forall i} (\log(y_i) - X_i \beta)^2$$
Non-linear model with least squares (Gaussian distributed observations)
using a GLM (with the Gaussian distribution family)
fit_glm = glm(formula = World ~ days, data=last_14,
family=gaussian(link='log'))
or using non-linear least squares (NLS)
fit_nls = nls(World ~ exp(a+b*days), start = list(a = 8, b = 0.1), data=last_14)
$$\sum_{\forall i} (y_i - e^{X_i \beta})^2$$
Poisson regression (Poisson distributed observations)
using a GLM (with the Poisson distribution family)
fit_glm2 = glm(formula = World ~ days, data=last_14, family=poisson())
$$\sum_{\forall i} (e^{X_i \beta} -(X_i \beta)y_i)$$
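The snippets above (and the demonstrations below) assume a data frame last_14 with columns days, World and US, containing 14 days of count data from the question. That data is not reproduced here; the following is only a hypothetical stand-in with made-up numbers so that the code can be run (the outputs printed further below come from the original data, not from this toy frame):
# hypothetical stand-in for the question's last_14 data frame (made-up numbers)
set.seed(0)
days <- 1:14
last_14 <- data.frame(days  = days,
                      World = round(exp(4.0 + 0.20*days) + rnorm(14, sd = 10)),
                      US    = rpois(14, lambda = exp(2.5 + 0.20*days)))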
GLM
The relationship for GLM can be written down as
$$Y_i = f( X_i \beta) + \epsilon_i$$
Sometimes people instead use the link function $f^{-1}$ to linearize the equation:
$$f^{-1}(Y_i) = f^{-1}\left( f(X_i \beta) + \epsilon_i \right) \neq X_i \beta + \epsilon_i$$
But that is not the same thing. See the last inequality and how $\epsilon_i$ enters differently (an example with $f(x)=\exp(x)$: $\log(\exp(1)+1) \neq 1+1$).
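A one-line numeric check of that example:
log(exp(1) + 1)   # 1.3133..., not 1 + 1 = 2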
The difference between glm with link function and linearized least squares
The difference is that the error terms are incorporated differently. We can write it down more explicitly for a logarithm/exponential function.
Let the linearized relationship lm(log(World) ~ days) be
$$\log(y_i) = a + b x_i + \epsilon_i$$
Then the non-linearized relationship is:
$$y_i = e^{a + b x_i + \epsilon_i}$$
and this is not like the glm(World ~ days, family=gaussian(link='log'))
$$y_i = e^{a + b x_i} + \epsilon_i$$
The error term $\epsilon_i$ occurs differently in the formula.
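A small simulation sketch (with made-up parameters a = 1, b = 0.1) showing that the two are different models and give different estimates on the same data:
# made-up data where the noise enters inside the exponent (multiplicative noise)
set.seed(1)
x <- 1:50
y <- exp(1 + 0.1*x + rnorm(50, sd = 0.3))
coef(lm(log(y) ~ x))                                 # error additive on the log scale
coef(glm(y ~ x, family = gaussian(link = "log"),
         start = c(1, 0.1)))                         # error additive on the original scale
With this generating mechanism the linearized lm matches the data-generating model, while the glm with log link assumes the additive form, so the two coefficient estimates differ; which one is appropriate depends on how the noise actually enters.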
The difference between different families
In the case of the Gaussian/Normal family the following two are the same:
$$Y\vert X \sim \mathcal{N}(\mu = h(X), \sigma^2 )$$
or
$$Y = h(X) + \epsilon \quad \text{where} \quad \epsilon \sim N(0,\sigma^2)$$
This separation into a sum of a deterministic component $h(X)$ and an error/noise term $\epsilon$ does not work the same way for other families. For instance, for the Poisson distribution the noise term is larger for a larger mean.
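A quick check of that statement for the Poisson distribution (simulated, so the exact numbers will vary):
# for a Poisson variable the variance equals the mean, so the noise grows with the mean
set.seed(1)
c(var(rpois(1e5, lambda = 5)), var(rpois(1e5, lambda = 50)))   # roughly 5 and 50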
Poisson distribution with log link
The log-likelihood for a single observation $z$ (up to a constant that does not depend on $\beta$) is
$$L = z X\beta - e^{X\beta}$$
and
$$\frac{\partial L}{\partial \beta_i} = \left( z - e^{X\beta} \right) x_i$$
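At the optimum this derivative is zero for every coefficient. A hedged check with made-up data (the variable names here are not from the question):
# the Poisson score equations t(X) %*% (z - exp(X beta)) = 0 hold at the glm estimate
set.seed(2)
x <- 1:20
z <- rpois(20, lambda = exp(1 + 0.1*x))
fit <- glm(z ~ x, family = poisson())
X <- cbind(1, x)
crossprod(X, z - exp(X %*% coef(fit)))   # both entries close to zero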
In the GLM framework the optimum of this likelihood function is found by iteratively reweighted least squares, solving at each iteration
$$L_{iteration} = 0.5\, w (Y^\prime - X\beta)^2$$
with derivative
$$\frac{ \partial L_{iteration}}{\partial \beta_i} = w (Y^\prime - X\beta) x_i$$
and the transformation between the two would be (check https://www.jstor.org/stable/2344614 for the details):
$$Y^\prime = X\beta + \frac{z - e^{X\beta}}{e^{X\beta}}$$
and
$$w = e^{X\beta}$$
where we do not know $e^{X\beta}$, but the current estimate $e^{X\hat\beta}$ can be used instead, and the result improved iteratively.
Intuitively
You could see GLM as loosely approximating the more general exponential family with Gaussian noise, with $\theta = X\beta$:
$$Y \approx f(\theta) + \epsilon \quad \text{where} \quad \epsilon \sim N(0,w\sigma^2) $$
where
- the weight $w$ relates to the non-homogeneity of the variance of the distribution (e.g. in the case of the Poisson distribution $\sigma^2 = \mu$)
and in linearized form
$$f^{-1}(Y) \approx \theta + \epsilon + \frac{Y-f(\theta + \epsilon)}{\partial f(\theta) / \partial \theta } \quad \text{where} \quad \epsilon \sim N(0,w\sigma^2) $$
where
- the term $\frac{Y-f(\theta + \epsilon)}{\partial f(\theta) / \partial \theta }$ relates to the non-linearity in the effect of errors on the response when a link function is applied to the response (i.e. the model of the error distribution is for $Y$ and not for $f^{-1}(Y)$, and that needs to be corrected; this is an additional correction, aside from the weights, which only correct for the non-homogeneity in the variance of $Y\vert X$ and not of $f^{-1}(Y) \vert X$)
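A sketch of this intuition with made-up data: at convergence the Poisson glm fit coincides with a weighted least-squares fit of the working response $\log(\mu) + (z-\mu)/\mu$ with weights $w = \mu$ (the full iteration with the question's data follows in the next section):
# made-up Poisson data; at the converged glm fit the weighted least-squares
# fit of the working response reproduces the glm coefficients
set.seed(4)
x <- 1:30
z <- rpois(30, lambda = exp(1 + 0.1*x))
fit <- glm(z ~ x, family = poisson())
mu <- fitted(fit)
zprime <- log(mu) + (z - mu)/mu          # working response
coef(lm(zprime ~ x, weights = mu))       # same as coef(fit), up to numerical precision
coef(fit)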
Computational demonstration
days <- last_14$days
US <- last_14$US
iterating
Y <- last_14$US
X <- cbind(rep(1,14), last_14$days)      # design matrix with intercept column
coef <- c(2, 0.3)                        # starting solution
yp <- exp(X %*% coef)                    # current fitted means
for (i in 1:100) {
  w      <- as.numeric(yp)                    # weights w = mu
  Yprime <- log(yp) + (Y - yp)/yp             # working response y'
  coef   <- solve(crossprod(X, w * X), crossprod(X, w * Yprime))
  yp     <- exp(X %*% coef)                   # new fitted means
}
glm function
modglm <- glm(US ~ days,
family = poisson(link = "log"),
control = list(epsilon = 10^-20, maxit = 100))
direct optimization of likelihood
Loption = "Poisson"
L <- function(x) {
  a <- x[1]
  b <- x[2]
  Xb <- a + b*days
  if (Loption == "Poisson") {
    return(-sum(Y*Xb - exp(Xb)))          # minus the Poisson log-likelihood
  }
  if (Loption == "Gaussian loglink") {
    return(sum((Y - exp(Xb))^2))          # residual sum of squares, log link
  }
  if (Loption == "linearized model") {
    return(sum((log(Y) - Xb)^2))          # residual sum of squares on the log scale
  }
}
start <- c(a=2,b=0.3)
modoptim <- optim(par = start,fn = L)
These give the same results:
> # glm model
> modglm$coefficients
(Intercept) days
2.4750654 0.2030466
> # optimizing likelihood function
> modoptim$par
a b
2.4745912 0.2031048
> # manual computation
> coef
[,1]
[1,] 2.4750654
[2,] 0.2030466
>
Computations for other cases
Below are the other cases. Note that the GLM with Gaussian family and log link can alternatively be fitted with nls.
> ###for the other cases
>
> Loption = "Gaussian loglink"
> optim(par = start,fn = L)$par
a b
2.1735638 0.2315177
> glm(formula = US ~ days, data=last_14, family=gaussian(link='log'))
Call: glm(formula = US ~ days, family = gaussian(link = "log"), data = last_14)
Coefficients:
(Intercept) days
2.1736 0.2315
Degrees of Freedom: 13 Total (i.e. Null); 12 Residual
Null Deviance: 35020
Residual Deviance: 1375 AIC: 110
> nls(US ~ exp(a+b*days), start = list(a=2,b=0.2))
Nonlinear regression model
model: US ~ exp(a + b * days)
data: parent.frame()
a b
2.1736 0.2315
residual sum-of-squares: 1375
Number of iterations to convergence: 7
Achieved convergence tolerance: 3.19e-06
>
>
> Loption = "linearized model"
> optim(par = start,fn = L)$par
a b
2.5917459 0.1879523
> lm(log(US) ~ days)
Call:
lm(formula = log(US) ~ days)
Coefficients:
(Intercept) days
2.5918 0.1879
Comments:
- modoptim$par is not exactly equal to the manual calculation or the glm results. Could you help me to understand why? – Haitao Du May 21 '20 at 08:16
- optim stops already early, before reaching the exact solution (optim uses a gradient method to iteratively improve the estimate, obtaining an estimate with a higher likelihood each step and getting closer to the exact solution each step, and it stops when the steps of the improvements are smaller than a certain factor times the machine precision). You can change the parameters to make optim return a more accurate result. – Sextus Empiricus May 21 '20 at 09:03
- control = list(reltol=1e-30) fixed it. – Haitao Du May 21 '20 at 09:32
- The reltol parameter can be set like this: modoptim <- optim(par = start, fn = L, method = 'Nelder-Mead', control = list(reltol=10^-12)). That reltol parameter defines when the algorithm stops (that is, when the algorithm decides that it has converged close enough to the solution). For different methods there are different parameters and there will be different ways to get more precise results (e.g. for gradient-based methods it might be useful to change the scale or ndeps parameters, which have an effect on the precision of the computation of the gradient). – Sextus Empiricus May 21 '20 at 09:51
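Following the comment thread, a sketch of the more precise optim call (this reuses L, start, Y and days defined above):
# tighten reltol so optim runs closer to the exact optimum before declaring convergence
Loption <- "Poisson"
modoptim2 <- optim(par = start, fn = L, method = "Nelder-Mead",
                   control = list(reltol = 1e-12))
modoptim2$par   # agrees with modglm$coefficients to more digits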