Significance testing on Generalized Additive Mixed Models (GAMMs) - mgcv::gam

Question

I am wondering whether anyone has any insight on comparing the performance of two GAMMs?

Specifically, I want to compare two models:

A "nested model", with a smoothed age term s(age) and a sex term sex specified as separate predictor variables.
A "complex model" with a s(age,by=sex) term.

My understanding is that the former is essentially a nested version of the latter, as the latter will fit two smoothed/spline terms--one for male, one for female--while the former will just fit a single, global smoothed/spline term (if this is the correct terminology).

I want to determine whether there is significant sexual dimorphism in ageing trajectories. In a standard multiple linear regression (MLR) model, I would be able to include an age^n*sex interaction and check the p-value, but the addition of the smoothed term complicates matters (the models also contain a random effect, hence the need for a GAMM). The summary(complex_model) printout specifies a separate p-value for the smoothed fit for each sex, making the interpretation of sexual dimorphism more complicated. I don't think that it is an interaction term in the same sense as in the MLR model. However, it seems clear to me that a significant improvement in overall model performance in a complex model vs a nested model would nevertheless serve as evidence of sexual dimorphism.

Since both models have log likelihoods, I could see an argument for using a Likelihood Ratio Test (LRT). However, I understand that GAMMs have properties that make their analysis less straightforward than that of some other types of models, so I am a bit wary of using the LRT uncritically. I also know that I can compare the AICs of both models, but it is not clear to me how to test for "a significant difference" between two AIC values, or if this is even possible. I am thus wondering whether there is another robust way of comparing the performance of these two models, particularly one that provides a p-value.

Here is a little more information, in case it is helpful:

The models are being generated in R using mgcv::gamm()
The summary() I refer to is the "lme" option, since the "gam" option seems to disregard the random effect specified in the model.
The printout of the "lme" summary reads "Linear mixed-effects model fit by maximum likelihood", so it might not technically be a GAMM, despite being produced with the mgcv::gamm() command(?)

Edit 30 Nov 2023 the exact model specification, as per Shawn Hemelstrand's comment, is:

complex_model <- mgcv::gamm(strength ~ sex + s(age, by=sex) + ethnicity, random=list(strength_measurement_device = ~1+ usage_date_of_strength_measurement_device), data=df)

The "device*date" random effect is necessary because the dataset includes a large number of measurement devices, many of which I know to have been miscalibrated, or otherwise become uncalibrated over time; however, I do not have good enough calibration records to manually correct all of the data, and unsurprisingly the measurement 'drift' is not identical across all devices. So, I think that the random effect is required, and can't be substituted with a simpler assumption, thus necessitating the use of a GAMM rather than a GAM (my understanding is that a GAMM is required in order to accommodate fixed effects, random effects, and smoothing terms within the same model, but please correct me if I am wrong).

The summary(complex_model$lme) print-out reports the random effects (although I am not really sure how to interpret them); however, I can find no mention of the random effect or its constituent predictor variables in the summary(complex_model$gam) print-out, and the formula is explicitly specified as strength ~ sex + s(age, by=sex) + ethnicity in the print-out, hence my inference that the "gam" model 'disregards'--i.e. does not incorporate--the random effect. This was also in keeping with my understanding that a GAM would not contain a random effect (see above).

Hopefully this clarifies things somewhat.

I wrote up a lengthy answer based on this question but then realized you noted that your model disregards your random effects in the gam implementation. Can you please show the exact model specification? I might be able to help with that depending on what the problem is. For now, I just provide a general answer about model comparison. — Shawn Hemelstrand, Nov 29 '23 at 23:06
"my understanding is that a GAMM is required" well yes, but that doesn't mean you need to go to gamm(). gam() is perfectly content fitting simple random effects via s(f, bs = "re"). The $gam component of the model you fitted is conditional upon the random effects, it's just that it doesn't report info about them. But good luck interpreting the output of the $lme component if you want to look at the smooths. There's a duality between penalized smooths and random effects; they are two views on the same thing, & we can represent GAMs as mixed models & vice versa. — Gavin Simpson, Dec 01 '23 at 10:19
@GavinSimpson thank you very much for your response. Just to double-check that I am not misunderstanding you. The call brought up by summary(complex_model$gam) does account for the random effects, but doesn't reference them in the summary, so the reported AIC has taken the random effects into account? I am not so interested in interpreting the smooths themselves, so the $gam print-out should be fine for my purposes, if it is actually a GAMM in essence. Thank you again Shawn and Gavin. — PhelsumaFL, Dec 04 '23 at 09:51

Shawn Hemelstrand · Answer 1 · 2023-11-29T23:03:39.547

Model Comparison for GAMMs: AIC

Both approaches you mention (comparing AIC values and obtaining a p-value) are possible for GAMMs, but each behaves differently than it does for other models, so I will take them in turn, starting with AIC. The AIC in mgcv is derived slightly differently from a typical AIC score. The details are in Wood et al. (2016), but in a nutshell, there were originally two types of AIC used for GAMs: the marginal AIC and the conditional AIC. Per Simon Wood's canonical text on GAMs:

Marginal AIC is based on the (frequentist) marginal likelihood of the model: that is on the likelihood obtained by treating all penalized coefficients as random effects and integrating them out of the joint density of response data and random effects. The number of coefficients to use for the AIC penalty is then just the number of fixed effects plus the number of variance and smoothing parameters.

Conditional AIC is based on the likelihood of all the coefficients at their maximum penalized likelihood (MAP) estimates. The number of coefficients in the penalty then has to be based on some estimate of the effective number of parameters, in order to account for the fact that the coefficient estimates are penalized.
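As a rough sketch of the two definitions quoted above (the symbols are my own shorthand for illustration, not Wood's exact notation): write $\hat{l}_M$ for the maximized log marginal likelihood, $n_f$ for the number of fixed effects, $n_\theta$ for the number of variance and smoothing parameters, $l(\hat\beta)$ for the log-likelihood at the penalized coefficient estimates, and $\tau$ for some estimate of the effective degrees of freedom.

$$
\mathrm{AIC}_{\text{marginal}} = -2\,\hat{l}_M + 2\,(n_f + n_\theta),
\qquad
\mathrm{AIC}_{\text{conditional}} = -2\,l(\hat\beta) + 2\,\tau .
$$

The two versions differ both in which likelihood they use and in how they count parameters for the penalty.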

The problem with marginal AIC is that the underlying marginal likelihood underestimates variance components and so "oversmooths", which often biases it strongly toward simpler models. Conditional AIC neglects smoothing parameter uncertainty, which biases it toward larger models. Both problems are even more pronounced for GAMMs, per Wood et al. (2016):

Greven and Kneib (2010) showed that this is overly likely to select complex models, especially when the model contains random effects: the difficulty arises because τ0 neglects the fact that the smoothing parameters have been estimated and are, therefore, uncertain (a marginal AIC based on the frequentist marginal likelihood, in which unpenalized effects are not integrated out, is equally problematic, partly because of underestimation of variance components and consequent bias toward simple models).

Interestingly, BIC is not usually reported for modern GAMs. As highlighted in Wood et al. (2016):

When viewing smoothing from a Bayesian perspective, the smooths have improper priors (or alternatively vague priors of convenience) corresponding to the null space of the smoothing penalties. This invalidates model selection via marginal likelihood comparison.

Thus, while AIC is used in practice much as you would use it in other regression contexts, it is defined differently for GAMMs. In any case, the AIC calculated in mgcv accounts for these issues by applying a simple correction to the effective degrees of freedom, and it is obtained by calling AIC(fit) on each model (internally, the corrected version of the log-likelihood is extracted and passed to the AIC function).
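As a minimal sketch of that AIC comparison, here is a version using simulated data. Everything about the data is an assumption made for illustration (the simulated effect sizes, and a single random intercept per device standing in for the more complex random effect in the question). The models are fitted with gam() and s(device, bs = "re") rather than gamm(), and with method = "ML" so that the smoothing parameter uncertainty correction is available.

```r
library(mgcv)

# Simulated stand-in for the real data (all values are hypothetical)
set.seed(1)
n <- 400
df <- data.frame(
  age    = runif(n, 20, 80),
  sex    = factor(sample(c("F", "M"), n, replace = TRUE)),
  device = factor(sample(paste0("d", 1:10), n, replace = TRUE))
)
device_effect <- rnorm(10, sd = 2)  # one random offset per device
df$strength <- 50 - 0.004 * (df$age - 30)^2 +
  ifelse(df$sex == "M", 8 - 0.05 * (df$age - 20), 0) +
  device_effect[as.integer(df$device)] + rnorm(n, sd = 3)

# Nested model: one shared age smooth plus a sex offset
nested_model <- gam(strength ~ sex + s(age) + s(device, bs = "re"),
                    data = df, method = "ML")

# Complex model: a separate age smooth for each sex
complex_model <- gam(strength ~ sex + s(age, by = sex) + s(device, bs = "re"),
                     data = df, method = "ML")

# Data frame with one row per model: degrees of freedom and AIC
aic_table <- AIC(nested_model, complex_model)
print(aic_table)
```

The model with the lower AIC is preferred. Note that this gives a ranking rather than a p-value.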

Model Comparison for GAMMs: P-Values

I'm not a giant fan of $p$-values as a means of comparing models, and they are particularly problematic for GAMs with random effects. The issue with a typical LRT is noted in Wood (2017):

It is tempting to try to compare GAMs using a generalized likelihood ratio test (appendix A.5, p. 411). One possibility is to use the frequentist marginal likelihood, counting the number of fixed effects plus number of smoothing parameters and variance parameters in order to obtain appropriate degrees of freedom. An alternative is to use the (conditional) likelihood along with effective degrees of freedom. Neither approach works for testing whether a random effect is needed, since in the marginal case the null model is restricting the variance parameters to the edge of the feasible parameter space, and in the conditional case we can not really view effective degrees of freedom as representing the number of unpenalized coefficients needed to approximate the penalized model.

Simon Wood then shows some simulations that apply this test to different types of GAM comparisons and notes:

As expected, the test is clearly useless for comparing models differing in random effect structure. For other comparisons the test seems to provide a reasonable approximation provided the smoothing parameter uncertainty correction is applied, which in practice requires use of REML or ML smoothing parameter selection.

The QQ plots from those simulations in Wood (2017) show just how badly the LRT performs for random effects compared to fixed main effects and interactions.

Thus, for GAMMs, the AIC score is a clearer contender for proper model selection than a $p$-value from an LRT.
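For a comparison like the one in the question, where both models share the same random-effect structure and differ only in the age smooths, the approximate test described in the quote can be sketched with anova() on two gam() fits. This is an illustration on simulated data (all data values are hypothetical, and a single random intercept per device is assumed), not a recommendation to rely on the resulting p-value.

```r
library(mgcv)

# Simulated stand-in for the real data (all values are hypothetical)
set.seed(2)
n <- 400
df <- data.frame(
  age    = runif(n, 20, 80),
  sex    = factor(sample(c("F", "M"), n, replace = TRUE)),
  device = factor(sample(paste0("d", 1:10), n, replace = TRUE))
)
device_effect <- rnorm(10, sd = 2)
df$strength <- 50 - 0.004 * (df$age - 30)^2 +
  ifelse(df$sex == "M", 8 - 0.05 * (df$age - 20), 0) +
  device_effect[as.integer(df$device)] + rnorm(n, sd = 3)

# Both fits use ML smoothing parameter selection and the SAME random effect
m0 <- gam(strength ~ sex + s(age) + s(device, bs = "re"),
          data = df, method = "ML")
m1 <- gam(strength ~ sex + s(age, by = sex) + s(device, bs = "re"),
          data = df, method = "ML")

# Approximate generalized likelihood ratio test between the two fits
lrt <- anova(m0, m1, test = "Chisq")
print(lrt)
```

The p-value here is approximate, and per the quoted passage it should not be used when the two models differ in their random effects.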

Computational Details

You also noted that you fit your model with the gamm function, which operates differently from the gam function (which can also fit random effects). As noted by Gavin Simpson, the AIC can differ considerably between the two because of the way each calculates it. I'm not familiar with the details of those differences, but if you compare models, all of them should be fitted with the same function.
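To make the distinction concrete, here is a small sketch that fits the same simple model both ways on simulated data (the data and the single random intercept per device are assumptions for illustration). It only shows where each AIC comes from; the two values are not meant to be compared with each other.

```r
library(mgcv)  # also loads nlme, which gamm() relies on

# Simulated stand-in for the real data (all values are hypothetical)
set.seed(3)
n <- 300
df <- data.frame(
  age    = runif(n, 20, 80),
  device = factor(sample(paste0("d", 1:8), n, replace = TRUE))
)
device_effect <- rnorm(8, sd = 2)
df$strength <- 50 - 0.004 * (df$age - 30)^2 +
  device_effect[as.integer(df$device)] + rnorm(n, sd = 3)

# gamm(): random effect given in lme style; result is a list
fit_gamm <- gamm(strength ~ s(age), random = list(device = ~1),
                 data = df, method = "ML")

# gam(): the same random intercept written as a penalized "re" smooth
fit_gam <- gam(strength ~ s(age) + s(device, bs = "re"),
               data = df, method = "ML")

print(names(fit_gamm))            # the $lme and $gam components
aic_gamm <- AIC(fit_gamm$lme)     # AIC taken from the lme representation
aic_gam  <- AIC(fit_gam)          # AIC computed by mgcv for the gam fit
print(c(gamm_lme = aic_gamm, gam = aic_gam))
```

Whichever function you choose, fit both the nested and the complex model with it and compare AIC values only within that pair.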

References

Wood, S. N. (2017). Generalized additive models: An introduction with R (2nd ed.). CRC Press, Taylor and Francis Group.
Wood, S. N., Pya, N., & Säfken, B. (2016). Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association, 111(516), 1548–1563. https://doi.org/10.1080/01621459.2016.1180986
