
I have fit a GAM with two continuous independent variables and one discrete covariate with two levels ("gray" and "brown"). All continuous variables are standardized, and the model runs fine:

mod3 = gam(cont1 ~ disc1 * cont2 * cont3 + 
             s(cont2, cont3, by=disc1, k=4),
           data=data_so)
summary(mod3)

Family: gaussian
Link function: identity

Formula: cont1 ~ disc1 * cont2 * cont3 + s(cont2, cont3, by = disc1, k = 4)

Parametric coefficients:
                            Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)              0.015009690 0.038191859  0.39301 0.69440639
disc1brown              -0.032717084 0.066537035 -0.49171 0.62304187
cont2                    0.212856916 0.018793283 11.32622 < 2.22e-16
cont3                    0.005457858 0.019394264  0.28142 0.77845557
disc1brown:cont2         0.061748331 0.021911015  2.81814 0.00493549
disc1brown:cont3         0.016191669 0.027321828  0.59263 0.55357838
cont2:cont3              0.118947202 0.035662553  3.33535 0.00088654
disc1brown:cont2:cont3  -0.098224982 0.071036399 -1.38274 0.16708507

Approximate significance of smooth terms:
                                edf    Ref.df       F   p-value
s(cont2,cont3):disc1gray  1.8362810 2.0676743 1.08907 0.1846056
s(cont2,cont3):disc1brown 0.7998747 0.7998747 9.65092 0.0055757

Rank: 10/14
R-sq.(adj) = 0.128   Deviance explained = 13.5%
GCV = 0.88013   Scale est. = 0.87182   n = 914

However, the approximate significance of each smooth term does not match my a priori expectation. More importantly, it also doesn't match the visualized model fit, according to which "gray" should be non-linear and "brown" roughly linear:

vis.gam(mod3, view=c("cont2","cont3"), plot.type="contour", 
        cond=list(disc1="gray"), main="disc1: gray")
vis.gam(mod3, view=c("cont2","cont3"), plot.type="contour", 
        cond=list(disc1="brown"), main="disc1: brown")

[Figure: vis.gam contour plots of the fitted surface for disc1 = "gray" and disc1 = "brown"]

If I fit the model separately to each class, the results match the figure (high F / low p-value for the approximate significance of the smooth term in the "gray" model, and the opposite for the "brown" model):

mod3a = gam(cont1 ~ cont2 * cont3 + 
              s(cont2, cont3, k=4),
            data=data_so[disc1=="gray"])
summary(mod3a)

Family: gaussian
Link function: identity

Formula: cont1 ~ cont2 * cont3 + s(cont2, cont3, k = 4)

Parametric coefficients:
                Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)  0.021064013 0.038786204  0.54308  0.5872749
cont2        0.175710375 0.019469841  9.02475 < 2.22e-16
cont3       -0.003258934 0.019493737 -0.16718  0.8672855
cont2:cont3  0.118727033 0.036447026  3.25752  0.0011869

Approximate significance of smooth terms:
                    edf   Ref.df        F    p-value
s(cont2,cont3) 1.590295 1.828967 34.76433 < 2.22e-16

Rank: 5/7
R-sq.(adj) = 0.127   Deviance explained = 13.2%
GCV = 0.91631   Scale est. = 0.90938   n = 609

mod3b = gam(cont1 ~ cont2 * cont3 +
              s(cont2, cont3, k=4),
            data=data_so[disc1=="brown"])
summary(mod3b)

Family: gaussian
Link function: identity

Formula: cont1 ~ cont2 * cont3 + s(cont2, cont3, k = 4)

Parametric coefficients:
               Estimate Std. Error  t value          Pr(>|t|)
(Intercept) -0.01997134 0.05075102 -0.39352           0.69422
cont2        0.17893076 0.02588782  6.91177 0.000000000028642
cont3        0.01326603 0.02519297  0.52658           0.59888
cont2:cont3  0.02072222 0.05873067  0.35283           0.72446

Approximate significance of smooth terms:
                    edf   Ref.df       F p-value
s(cont2,cont3) 1.075582 1.075582 2.71988 0.06871

Rank: 5/7
R-sq.(adj) = 0.133   Deviance explained = 14.2%
GCV = 0.80732   Scale est. = 0.79673   n = 305

What is going on here? Might be related: Why do gam predictions not match gam smooth?

  • Two comments: 1) There usually isn't a need to use standardized scaling for GAM(M)s, and it makes the model less interpretable. 2) You should really consider using something like method=REML, as the default GCV is not recommended. – Shawn Hemelstrand Mar 13 '23 at 10:44
  • ad 1) I standardized the data because they are on vastly different scales (how does mgcv::gam deal with that?), and because I use the same standardized data in analyses where standardization is required. ad 2) Do you mean always, or just in this context? – mluerig Mar 13 '23 at 10:52
  • If by "the visualized model fit" you mean the plots of the smoothers, then I don't understand the question. p-values take uncertainty into account, but the plots do not depict uncertainty. Also, I don't understand why you include both linear effects for the continuous predictors and a smoother. Why not just include a tensor product smoother? – Roland Mar 13 '23 at 11:22
  • @mluerig Roland alludes to what I was already going to say. Tensor products include the raw data without any need for standardization. It's one of the massive benefits of a GAM over a GLM, because there is no need to standardize variables for interactions. As for GCV, it almost always undersmooths. Read page 8 of https://www.frontiersin.org/article/10.3389/fevo.2018.00149/full – Shawn Hemelstrand Mar 13 '23 at 11:51

1 Answer


There are a number of issues to address here. First, you should almost never use GCV: it typically undersmooths and creates overly "wiggly" fits in your GAM (Simpson, 2018). Second, if you use a tensor product term, there is no need to standardize the variables in the model (see Baayen et al., 2017 and Pedersen et al., 2019 for more on these points). I explain why standardization is problematic from a practical point of view in this answer.
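As a minimal sketch of the first point (assuming the data_so object from your question is available), you can refit the very same formula with REML and compare how wiggly the smooths come out under each criterion:

library(mgcv)

## same model as in the question, but with REML smoothness selection
## instead of the default GCV.Cp criterion
mod3_reml <- gam(cont1 ~ disc1 * cont2 * cont3 +
                   s(cont2, cont3, by = disc1, k = 4),
                 data = data_so,
                 method = "REML")

## GCV fits tend to come out wigglier (higher edf) than REML fits;
## compare the effective degrees of freedom of the smooth terms
summary(mod3_reml)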

But by far the biggest issue I see is that you have modeled most of your variables as parametric terms, which means no curvilinearity is estimated for them in your regression. To estimate smooth main effects as well as the smooth interaction of the variables, you should give each its own spline. I assume what you actually wanted was something like this:

mod3 <- gam(cont1
            ~ disc1
            + s(cont2)
            + s(cont3)
            + ti(cont2,
                 cont3,
                 by = disc1,
                 k = c(4, 4)),
            data = data_so,
            method = "REML")

An example of what this does is shown in Baayen et al. (2017); there they use ti() for the main-effect splines, but s() for main effects is functionally equivalent:

[Figure from Baayen et al. (2017): decomposition of a smooth interaction into main-effect smooths plus a ti() interaction surface]
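For concreteness, here is a sketch of the two equivalent decompositions (again assuming data_so; the factor is left out for brevity):

## main effects as s(), pure interaction as ti() ...
m_s  <- gam(cont1 ~ s(cont2) + s(cont3) + ti(cont2, cont3),
            data = data_so, method = "REML")

## ... or everything as ti(), as in Baayen et al. (2017)
m_ti <- gam(cont1 ~ ti(cont2) + ti(cont3) + ti(cont2, cont3),
            data = data_so, method = "REML")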

This may be part of why your interactions don't match expectation. Revise your model to have proper spline terms first, and then try to understand what is going on thereafter.
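Once refit, a quick way to check the revised model and to compare the per-level surfaces against your original plots (a sketch reusing the vis.gam calls from your question):

## basis-dimension and residual diagnostics for the revised model
gam.check(mod3)
summary(mod3)

## fitted surface for each factor level, as in the question
vis.gam(mod3, view = c("cont2", "cont3"), plot.type = "contour",
        cond = list(disc1 = "gray"),  main = "disc1: gray")
vis.gam(mod3, view = c("cont2", "cont3"), plot.type = "contour",
        cond = list(disc1 = "brown"), main = "disc1: brown")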

Citations

  • Baayen, H., Vasishth, S., Kliegl, R., & Bates, D. (2017). The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language, 94, 206–234. https://doi.org/10.1016/j.jml.2016.11.006
  • Pedersen, E. J., Miller, D. L., Simpson, G. L., & Ross, N. (2019). Hierarchical generalized additive models in ecology: An introduction with mgcv. PeerJ, 7, e6876. https://doi.org/10.7717/peerj.6876
  • Simpson, G. L. (2018). Modelling palaeoecological time series using generalised additive models. Frontiers in Ecology and Evolution, 6(149), 1–21. https://doi.org/10.3389/fevo.2018.00149