Coding Nested Effect in gam() via Implicit Nesting

Question

I'm trying to find the best specification of a nested mixed effects model using gam(). I don't have an easy dataset to share, so I simulate an example below and include a figure to help explain. Basically, if my data are coded for implicit nesting of a random effect, do I have to tell gam() that the variable is nested?

I have multiple Sites sampled each year, each site contains a set of sample blocks (e.g. "A1:A5"). For a given day of sampling, every Site is included, but within each site I select a random subset of blocks to take samples from.

Therefore it is possible for individual sample blocks to be sampled multiple times, while others may not have been sampled at all.

I am interested in a Site and Year effect on my response variable, but want to account for 'Block ID' as a random effect. Block is clearly nested within each site; however, I have given each block a unique block_id name (as the figure suggests), and so may have 'implicit nesting' built into my dataframe.

The question is whether I can simply include 'block_id' as a random effect on its own, or do I need to explicitly define the nesting structure between 'block_id' and 'site' for gam() to run appropriately?

Versions of this question have been asked, but I can't seem to relate it to my particular problem.

glm example

gam example 1

gam example 2

A brief reprex is below. Code chunk 1 creates data with uneven sampling of 'block_id' across years and sites. Code chunk 2 shows my attempt at using gam() to model the results.


library(tidyverse)
library(mgcv)
library(lme4) # provides lmer(), used in the model code further down
#################################
#################################
# Clunky, but generates fake data roughly like I have:
# cell id's
cells <- data.frame(
  site = rep(c("A", "B", "C"), each = 5),
  block.temp = rep(seq(1, 5, 1), 3)
) %>%
  mutate(block = paste0(site, "_", block.temp)) %>%
  select(-block.temp)
# setup year and area sampling
d <- data.frame(
  year = rep(c(2012, 2013, 2014), each = 1000),
  site = rep(c("A", "B", "C"), each = 1000)
)
# Fill in uneven amounts of sampling of block_id by site
d$block_id <- NA_character_
site.names <- unique(cells$site)
for (i in seq_along(site.names)) {
  cells.temp <- cells$block[cells$site == site.names[i]]
  d$block_id[d$site == site.names[i]] <- sample(cells.temp, 1000, replace = TRUE)
}
d <- d %>%
  mutate(
    block_id = as.factor(block_id),
    site = as.factor(site),
    response.var = runif(3000, 20, 60)
  )
table(d$site, d$block_id)

Below I first include an lme4::lmer() model specification where block_id is explicitly defined as nested within site.

Then two options for a specification with gam given the way I have coded block_id in my data. Is the first gam1 appropriate?


glm1 = lmer(response.var ~ site + year + (1|site:block_id), data = d)
# Is this one correct?
gam1 = gam(response.var ~ site + s(year, k = 3) + s(block_id, bs = 're'), data = d)
gam2 = gam(response.var ~ site + s(year, k = 3) + s(block_id, by = site, bs = 're'), data = d)
summary(gam1)
summary(gam2)

score 2 · Accepted Answer · answered Apr 10 '21 at 20:07

I'm a little rusty on my {lme4} syntax, but as far as I can recollect, the specification you use there isn't estimating separate variances for site and for block_id within site; it estimates a single variance for the site:block_id grouping.
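To make the grouping point concrete, here is a minimal base R sketch on a made-up toy data frame (the name toy and its columns are assumptions for illustration, mirroring the question's labelling scheme). It checks that, with unique block labels, the site:block_id combinations that actually occur identify exactly the same groups as block_id alone.

```r
# Toy data mirroring the question's coding: block labels are unique across
# sites ("A_1", ..., "C_5"), so each label already identifies its site.
toy <- expand.grid(site = c("A", "B", "C"), block.temp = 1:5,
                   stringsAsFactors = FALSE)
toy$block_id <- factor(paste0(toy$site, "_", toy$block.temp))
toy$site <- factor(toy$site)

# Grouping factor built from site and block_id together; drop = TRUE discards
# combinations that never occur (e.g. site A with block "B_1").
site_block <- interaction(toy$site, toy$block_id, drop = TRUE)

n_groups_block <- nlevels(toy$block_id)
n_groups_site_block <- nlevels(site_block)

# Every block_id maps to exactly one site:block_id group, and vice versa.
cross_tab <- table(toy$block_id, site_block)
one_to_one <- all(rowSums(cross_tab > 0) == 1) &&
  all(colSums(cross_tab > 0) == 1)

print(c(block_id = n_groups_block, site_block = n_groups_site_block))
print(one_to_one)
```

If the block labels were reused across sites (say 1 to 5 in every site), block_id alone would have only 5 levels and the two groupings would no longer match.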

As such, I think the equivalent model in gam() is:

m <- gam(response.var ~ site + year + s(block_id, bs = 're'), data = d, method = "REML")

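It's not quite the same: lmer() detects rank deficiency (in the simulated data year and site are perfectly confounded, since each site has a single year) and drops the year effect, so there are some subtle differences, but the random effects are very close: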

> ranef(glm1)
$`site:block_id`
        (Intercept)
A:A_1 -0.1539891045
A:A_2  0.2353362669
A:A_3  0.1003993708
A:A_4 -0.0699466347
A:A_5 -0.1117998985
B:B_1 -0.1781176150
B:B_2  0.0727643997
B:B_3  0.0127993045
B:B_4 -0.0129643181
B:B_5  0.1055182290
C:C_1  0.1870077165
C:C_2  0.2821937644
C:C_3 -0.0882142162
C:C_4 -0.0002309663
C:C_5 -0.3807562984
with conditional variances for “site:block_id”

and

> data.frame(`(Intercept)` = coef(m)[grepl("s(block_id)", names(coef(m)), fixed = TRUE)], check.names = FALSE)
                 (Intercept)
s(block_id).1  -0.1538140720
s(block_id).2   0.2350678304
s(block_id).3   0.1002870399
s(block_id).4  -0.0698668742
s(block_id).5  -0.1116739242
s(block_id).6  -0.1779188819
s(block_id).7   0.0726834196
s(block_id).8   0.0127859331
s(block_id).9  -0.0129484046
s(block_id).10  0.1053979339
s(block_id).11  0.1868009316
s(block_id).12  0.2818730608
s(block_id).13 -0.0881112529
s(block_id).14 -0.0002306384
s(block_id).15 -0.3803321011

Dropping the year term from the GAM makes the models effectively the same:

> m0 <- gam(response.var ~ site + s(block_id, bs = 're'), data = d, method = "REML")
> data.frame(`(Intercept)` = coef(m0)[grepl("s(block_id)", names(coef(m0)), fixed = TRUE)], check.names = FALSE)
                 (Intercept)
s(block_id).1  -0.1540066503
s(block_id).2   0.2353631761
s(block_id).3   0.1004106310
s(block_id).4  -0.0699546303
s(block_id).5  -0.1118125265
s(block_id).6  -0.1781375363
s(block_id).7   0.0727725172
s(block_id).8   0.0128006447
s(block_id).9  -0.0129659135
s(block_id).10  0.1055302878
s(block_id).11  0.1870284447
s(block_id).12  0.2822259128
s(block_id).13 -0.0882245380
s(block_id).14 -0.0002309992
s(block_id).15 -0.3807988204

Looking at the variance components, these models seem effectively equivalent from that perspective too:

> gam.vcomp(m0)
Standard deviations and 0.95 confidence intervals:
               std.dev       lower     upper
s(block_id)  0.4157148  0.06103053  2.831678
scale       11.6121233 11.32129661 11.910421
Rank: 2/2
> summary(glm1)
Linear mixed model fit by REML ['lmerMod']
Formula: response.var ~ site + year + (1 | site:block_id)
   Data: d
REML criterion at convergence: 23226.2
Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.73795 -0.86059 -0.03097  0.86817  1.80403
Random effects:
 Groups        Name        Variance Std.Dev.
 site:block_id (Intercept)   0.1728  0.4157 
 Residual                  134.8418 11.6121 
Number of obs: 3000, groups:  site:block_id, 15
Fixed effects:
            Estimate Std. Error t value
(Intercept)  39.9943     0.4116  97.163
siteB        -0.1622     0.5822  -0.279
siteC        -0.7978     0.5823  -1.370
Correlation of Fixed Effects:
      (Intr) siteB 
siteB -0.707       
siteC -0.707  0.500
fit warnings:
fixed-effect model matrix is rank deficient so dropping 1 column / coefficient

So, assuming the model you want is the lmer one I think the GAM version I showed is as close as you'll get to it.
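As a rough check of the implicit-versus-explicit coding point, here is a minimal sketch on made-up data. The sim data frame, its column names, the effect sizes and the seed are all assumptions for illustration, not the question's data. It fits the same random intercept two ways in gam(): once with unique block labels via s(block_id, bs = "re"), and once with labels reused across sites via s(site, block, bs = "re"), which mgcv documents as building the random effect from the site:block combinations.

```r
library(mgcv)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical data: 3 sites x 5 blocks, 40 observations per block.
sim <- expand.grid(site = c("A", "B", "C"), block = 1:5, obs = 1:40,
                   stringsAsFactors = FALSE)
sim$block_id <- factor(paste0(sim$site, "_", sim$block))  # unique labels
sim$block    <- factor(sim$block)                         # labels reused across sites
sim$site     <- factor(sim$site)

# A real block-level effect (sd = 2) plus observation noise (sd = 1).
block_effect <- rnorm(nlevels(sim$block_id), sd = 2)
sim$y <- 10 + block_effect[as.integer(sim$block_id)] + rnorm(nrow(sim))

# Implicit nesting: unique block labels, one random intercept per label.
m_implicit <- gam(y ~ site + s(block_id, bs = "re"),
                  data = sim, method = "REML")

# Explicit nesting: reused labels, one random intercept per site:block pair.
m_explicit <- gam(y ~ site + s(site, block, bs = "re"),
                  data = sim, method = "REML")

# Largest absolute difference between the two sets of fitted values.
max_diff <- max(abs(fitted(m_implicit) - fitted(m_explicit)))
print(max_diff)
```

The two fits should agree up to numerical tolerance, which suggests that with uniquely labelled blocks nothing extra is needed to tell gam() about the nesting.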

Thank you! Just to clarify: when I'm using my actual dataset, the lmer specification (glm1) does not drop the year term, so maybe this behavior is an artifact of how I 'simulated' data for the example? With that said, assuming that the lmer specification is the one that I want, and that 'year' is retained in the final model, you are suggesting that your model "m" is the closest gam() specification (as opposed to 'm0')? — Kodiakflds, Apr 13 '21 at 00:49
Yes, you are not using the nesting operator (the / in (1 | site/block_id)), so site:block_id is equivalent to block_id given the way you have coded block_id. Including year doesn't change anything as that is independent of the random effects terms; just add it back in to the gam(). I removed it because the other terms had larger differences to the lmer() model if I kept year in the gam() one, and I wanted to illustrate that the random effect structures were resulting in essentially the same model fits. — Gavin Simpson, Apr 13 '21 at 21:07
Hello! Quick question @GavinSimpson, does s(site,year, bs='re') imply all the same sites are present in every year (crossed design)? Assuming both are factors. — Nate, May 22 '23 at 18:43
@Nate no, it will estimate effects for the columns of model.matrix( ~ site:year - 1, data = foo) (IIRC; I think the -1 is needed so we get the dummy coding). So you'll get estimates for the combinations of site and year that are present in the data. Note that if you want to predict at (site, year) pairs not in the data, you need to have those sites/years coded as levels of the respective factor and specify drop.unused.levels = FALSE when you call gam() etc so that it retains the original levels allowing you to predict for an unobserved (site, year) pair. — Gavin Simpson, May 23 '23 at 08:51
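A small base R sketch of the model.matrix() call mentioned in that last comment, on a made-up data frame (foo and its values are assumptions for illustration). It only shows which columns model.matrix() builds; how gam() then handles them is not shown here.

```r
# Hypothetical data: site B was never sampled in 2013.
foo <- data.frame(
  site = factor(c("A", "A", "B")),
  year = factor(c("2012", "2013", "2012"))
)

# One indicator column per combination of factor levels.
X <- model.matrix(~ site:year - 1, data = foo)

print(colnames(X))
print(colSums(X))  # number of observations falling in each combination
```

Here the unobserved pair (site B, 2013) still gets a column because both levels exist in the factors, but that column is all zeros, so the data carry no information about it.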
