
As I understand GAM fitting, a penalized likelihood is maximized by penalized iteratively re-weighted least squares (P-IRLS) to obtain the parameter estimates. The penalized log-likelihood is defined as:

$$\ell_p(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \frac{1}{2} \sum_j \lambda_j \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S}_j \boldsymbol{\beta}$$

The structure is quite close to ridge regression with an L2 penalty, except for the extra $\mathbf{S}$ matrix in the penalty term, but from the book I learned that $\mathbf{S}$ is zero for non-smooth terms. That is to say, if I specify a model without any smooth terms, the likelihood carries no penalization at all. However, I ran a comparative experiment and found that the GAM results are quite close to the ridge regression results, though not exactly equivalent.

Did I miss some detail of the GAM algorithm under which the non-smooth terms are also penalized?
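
For concreteness, here is a minimal sketch of the kind of comparison I mean (simulated data; the variable names and the fixed ridge $\lambda$ are illustrative choices, not results from a real experiment):

    library(mgcv)
    library(glmnet)

    set.seed(1)
    n <- 200
    x <- rnorm(n)
    z <- rnorm(n)
    y <- 1 + 2 * x - 0.5 * z + rnorm(n)

    # GAM with no smooth terms at all
    m_gam <- gam(y ~ x + z)

    # Ridge regression (alpha = 0) at an arbitrary fixed lambda
    m_ridge <- glmnet(cbind(x, z), y, alpha = 0, lambda = 0.1)

    coef(m_gam)    # an unpenalized fit
    coef(m_ridge)  # shrunk towards zero, so close to but not equal to the above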

1 Answer


There is no penalization applied by default to the parametric terms of a GAM fitted by {mgcv}. If you tried to fit:

gam(y ~ x + z)

you would get back the equivalent (up to details of the actual algorithm and implementation, of course) of

glm(y ~ x + z)

because these terms are not subject to any penalization.
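
A quick check with simulated data (the data and variable names here are just illustrative) bears this out:

    library(mgcv)

    set.seed(1)
    n <- 200
    x <- rnorm(n)
    z <- rnorm(n)
    y <- 1 + 2 * x - 0.5 * z + rnorm(n)

    m_gam <- gam(y ~ x + z)
    m_glm <- glm(y ~ x + z)

    all.equal(coef(m_gam), coef(m_glm))  # TRUE: both are unpenalized fits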

However, what you say is correct: the penalty is a ridge-type penalty on the coefficients $\boldsymbol{\beta}$; it just doesn't apply to the non-smooth terms in the model (by default). Hence

gam(y ~ s(x) + z)

would see a ridge-like penalty controlling the wiggliness (smoothness) of the s(x) term, while the z term would not be subject to any penalty.
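
You can see this in the fitted object; a small sketch with simulated data (names illustrative):

    library(mgcv)

    set.seed(1)
    df <- data.frame(x = runif(200), z = rnorm(200))
    df$y <- sin(2 * pi * df$x) + 0.5 * df$z + rnorm(200, sd = 0.3)

    m <- gam(y ~ s(x) + z, data = df, method = "REML")
    m$sp  # a single smoothing parameter, for s(x); z carries no penalty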

You can place ridge penalties on the parametric terms in the model (the z term above) using the paraPen mechanism, via the paraPen argument to gam(); there the penalty matrix $\mathbf{S}$ takes the form of an identity matrix.
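
A hedged sketch of how that might look (the data are simulated, and the exact list structure should be checked against ?gam.models, which documents paraPen):

    library(mgcv)

    set.seed(1)
    df <- data.frame(x = runif(200), z = rnorm(200))
    df$y <- sin(2 * pi * df$x) + 0.5 * df$z + rnorm(200, sd = 0.3)

    # z contributes a single coefficient, so the identity penalty is 1x1;
    # a term with k coefficients would use diag(k)
    m <- gam(y ~ s(x) + z,
             data = df,
             paraPen = list(z = list(diag(1))),
             method = "REML")

    m$sp  # now two smoothing parameters: one for s(x), one for the ridge on z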

Gavin Simpson
  • I haven't been able to find practical examples of how to implement an L2-like penalization for the parametric terms with gam, e.g., how to specify a paraPen list that would be analogous to a lambda coefficient in glmnet. If I understand correctly, to penalize the parametric terms we pass an identity matrix to paraPen as you mentioned (with nrow and ncol equal to the length of the coefficient vector?), but how would the equivalent of lambda be specified with paraPen? Or is it learned during fitting along with the penalties for the smooth terms? – Darren Oct 13 '22 at 01:09
  • It's learned alongside the other model coefficients and smoothing parameters. I recently posted an answer here that uses paraPen. – Gavin Simpson Oct 22 '22 at 10:26