Q1
It is likely safest to use s() for your univariate smooths, as (at least at one time) Simon Wood suggested he might make specifying univariate smooths via the tensor product functions te(), ti(), and t2() deprecated or defunct.
The issue of different scales only comes into play when you want a smooth interaction; if you have two or more covariates whose units differ, or where you expect the degree of wiggliness in one covariate to differ from that in the other covariates, you should use a tensor product. For spatial coordinates, we might still choose a tensor product smooth over a 2D TPRS smooth (s(x, y, bs = "tp")) if the change in the x direction is expected to be more wiggly than that in the y direction.
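As a minimal sketch of the two choices (y, x, z, and dat are placeholders, not from your model):

library(mgcv)
## isotropic 2D thin plate spline: assumes x and z are on comparable scales
m_iso <- gam(y ~ s(x, z, bs = "tp"), data = dat, method = "REML")
## tensor product: invariant to the relative scaling of x and z, with a
## separate wiggliness penalty for each margin
m_te  <- gam(y ~ te(x, z), data = dat, method = "REML")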
Q2
With the default TPRS smoother, you don't need to set knots. Technically there is a knot at each unique combination of covariates involved in the smooth. Each of those knots has a basis function, which is then eigendecomposed and the eigenvalues and associated eigenvectors are sorted in terms of the magnitude of the eigenvalues. The first k eigenvectors associated with the k largest eigenvalues are then taken as the new basis of size k that is actually used to represent the smooth effect $f(x_i)$ in the model. This is known as a low-rank thin plate regression spline.
This spline is computationally demanding to set up (the eigendecomposition is the expensive part, even with algorithms that find only the first k eigenvalues and eigenvectors). As such, for tensor product smooths, where one naturally has more data because smoothing in more dimensions is more demanding of data, the default basis for the marginal smooths is a cubic regression spline (CRS). These splines do have knots, but the actual positions of those knots typically don't make much difference to the resulting spline, unless you have very unevenly spaced values of the covariate in one or more of the marginal smooths. By default, the knots for a CRS are placed at the boundaries of the data and then at evenly spaced quantiles of the covariate, which has the effect of concentrating knots where you have data.
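To make that concrete, a sketch with placeholder names; the te() call spells out the default CRS marginals explicitly:

library(mgcv)
## low-rank TPRS: no knots for you to choose, k eigen-based basis functions
b1 <- gam(y ~ s(x, k = 10), data = dat, method = "REML")
## tensor product with CRS marginals; knots are placed automatically at the
## boundaries and at quantiles of each covariate
b2 <- gam(y ~ te(x, z, bs = c("cr", "cr"), k = c(10, 10)), data = dat, method = "REML")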
The general advice with modern GAMs like those implemented in {mgcv} is to set k to be as large as you think you need to achieve the expected wiggliness, and then add a little bit. We add a little bit (increase k above what you actually think is the right wiggliness) because a basis of size k + a (for some a smaller than k) is a much richer basis for functions of wiggliness k than a basis of only size k. Basically, we are trying to ensure that the basis expansion we create to represent a smooth function in the model is rich (large) enough to include the true function, or a close approximation to it.
Having fitted the model, an important extra diagnostic step is to check whether the basis size(s) you specified were large enough. The k.check() function in {mgcv} provides one such heuristic test; it looks for extra structure in the residuals when they are ordered by the covariate(s) of the individual smooths. An alternative approach is simply to take the deviance residuals from the model and fit a constant-variance, identity-link model with those residuals as the response and a smooth of the same covariates involved in the smooth you wish to test, but with double the k used for that smooth. If any of the smooths in this new model show significant wiggliness, that's a good sign the k you used in the original fit was too low, so you can go back, double k, and refit your original model. You may require several rounds of this if you guessed low on many of the ks initially.
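A sketch of both checks, assuming the original model m used s(x, k = 10) (names and values are placeholders):

library(mgcv)
k.check(m)   # heuristic basis-size test, also reported by gam.check()
## manual check: refit the deviance residuals with double the original k,
## assuming no rows of `data` were dropped when fitting m
chk   <- transform(data, r = residuals(m, type = "deviance"))
m_chk <- gam(r ~ s(x, k = 20), data = chk, method = "REML")
summary(m_chk)   # significant wiggliness in s(x) suggests the original k was too low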
You don't need to set k on the random effect smooths (bs = "re"). You are best leaving this at the default (IIRC it is ignored anyway) for this basis. The ridge penalty on these effects will perform the required shrinkage of the model coefficients to give you the same thing as estimates (posterior modes/means) of the random effects.
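For example (placeholder names again), the k argument is simply omitted for the "re" terms:

library(mgcv)
## Subject and Item must be factors; no k is needed for the "re" basis
m_re <- gam(y ~ s(x) + s(Subject, bs = "re") + s(Item, bs = "re"),
            data = dat, method = "REML")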
Q3
Once you as the user have set k for each of the (marginal) smooths in the model, where k is the expected upper limit (plus a bit) on the wiggliness of the functional effect on the response, a wiggliness penalty is created for each smooth. This penalty is used to avoid overly wiggly, i.e. overfitted, estimated smooths. The parameters for the basis functions of each smooth, any fixed effects, and the smoothing parameters are chosen to maximise the penalised log-likelihood of the data. The smoothing parameters are what control how much penalty we pay for having a wiggly function. The penalised log-likelihood trades off fit for generality; we avoid overfitting because, all else equal, we prefer smooth(er) functions to complex (wiggly) ones.
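In notation of my own (not from your model), the criterion being maximised has the general form

$$\ell_p(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \tfrac{1}{2}\sum_j \lambda_j \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S}_j \boldsymbol{\beta}$$

where $\ell$ is the log-likelihood, $\mathbf{S}_j$ is the penalty matrix measuring the wiggliness of the $j$th smooth, and $\lambda_j$ is its smoothing parameter: the larger $\lambda_j$, the heavier the price paid for wiggliness and the smoother the resulting estimate.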
When fitting with method = "REML" or method = "ML" (neither is the default, but you should certainly use one of them over the default), the model you are fitting is an empirical Bayesian one, with (improper) Gaussian priors on the coefficients. These priors encode the same idea: all else equal, we favour smoother functions over wigglier ones.
Q4
The smoothing parameters are either estimated or fixed at values you specify via the sp argument. You can also use fx = TRUE in one or more smooths to leave that particular smooth unpenalised, in which case it uses k degrees of freedom (minus something for the identifiability constraints).
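A sketch of both options (names and values are purely illustrative):

library(mgcv)
## fix the smoothing parameters for the two smooths instead of estimating them
m_sp <- gam(y ~ s(x, k = 20) + s(z, k = 20), sp = c(0.5, 2), data = dat, method = "REML")
## leave one smooth unpenalised; it then uses (close to) k degrees of freedom
m_fx <- gam(y ~ s(x, k = 20, fx = TRUE) + s(z), data = dat, method = "REML")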
Invariably you don't need to do anything to the smoothing parameters and can just let the model estimate them. The main things that control the resulting wiggliness are the values of k that you specified. The smoothing parameters just control how much penalty we pay for having a wiggly function. Hence you should think in terms of the wiggliness of your functions and the associated penalty (which is a function of k), rather than the smoothing parameters themselves.
In summary, you should think about the size of the basis you need for each smooth and not worry about smoothing parameters. Once we have a basis representation for the functions you want to estimate, the penalty will largely take care of the overfitting problem. How well all this works depends on all the other modelling decisions you made: have you got the right response distribution, link function, model terms, and so on.
As for your specific model, with the size of data you are (potentially) fitting, you might get better performance fitting with bam(), or, as you have a binomial GAMM, via gamm4::gamm4(), as the latter is more efficient when it comes to fitting random effects. You would have to change the model structure though, as you don't want to use the random effect splines in the gamm4::gamm4() model:
m <- gamm4::gamm4(Outcome ~ s(Control_1)       # control vars
                  + s(Control_2)
                  + s(X1)                       # main effects
                  + s(X2)
                  + s(X3)
                  + s(Main)                     # main variable of interest
                  + ti(Main, X1, k = c(5, 5))   # tensor product interactions
                  + ti(Main, X2, k = c(5, 5))
                  + ti(Main, X3, k = c(5, 5)),
                  ## note: if gamm4 rejects the ti() terms (it does not support
                  ## te-type tensor smooths), t2() tensor products are the alternative
                  random = ~ (1 | Subject) + (1 | Item), # REs as an lme4-style formula
                  REML = TRUE,                  # restricted ML
                  data = data,
                  family = binomial)            # logistic