I'm trying to find the best specification of a nested mixed effects model using gam. I don't have an easy dataset to share, but simulate an example below - and have figure to help explain. Basically, if my data is coded for implicit nesting of a random effect, do I have to tell gam() the variable is nested?
I have multiple Sites sampled each year, each site contains a set of sample blocks (e.g. "A1:A5"). For a given day of sampling, every Site is included, but within each site I select a random subset of blocks to take samples from.
Therefore it is possible for individual sample blocks to be sampled multiple times, while others may not have been sampled at all.

I am interested in a Site and Year effect on my response variable, but want to account for 'Block ID' as a random effect. Block is clearly nested within each site; however, I have given each block a unique block_id name (As the figure suggests), and so may have 'implicit nesting' built into my dataframe.
The question is whether I can simply include 'block_id' as a random effect on its own, or do I need to explicitly define the nesting structure between 'block_id' and 'site' for gam() to run appropriately?
Versions of this question have been asked, but I can't seem to relate it to my particular problem.
A brief reprex is below, code chunk 1 creates data with uneven sampling of 'block_id' across years and sites. code chunk 2 shows my attempt at using gam() to model results
library(tidyverse)
library(mgcv)
#################################
#################################
Clunky, but generates fake data roughly like I have:
cell id's
cells <- data.frame(
site = rep(c("A", "B", "C"), each = 5),
block.temp = rep(seq(1, 5, 1), 3)
) %>%
mutate(block = paste0(site, "_", block.temp)) %>%
select(-block.temp)
setup year and area sampling
d <- data.frame(
year = rep(c(2012, 2013, 2014), each = 1000),
site = rep(c("A", "B", "C"), each = 1000)
)
Fill in uneven amounts of sampling of block_id by site
site.names <- unique(cells$site)
for (i in 1:length(site.names)) {
cells.temp <- cells$block[cells$site == site.names[i]]
d$block_id[d$site == site.names[i]] <- sample(cells.temp, 1000, replace = TRUE)
}
d <- d %>%
mutate(
block_id = as.factor(block_id),
site = as.factor(site),
response.var = runif(3000, 20,60)
)
table(d$site, d$block_id)
Below I first include a lme4::lmer() model specification where block_id is explicitly defined as nested within site
Then two options for a specification with gam given the way I have coded block_id in my data. Is the first gam1 appropriate?
glm1 = lmer(response.var ~ site + year + (1|site:block_id), data = d)
Is this one correct?
gam1 = gam(response.var ~ site + s(year, k = 3) + s(block_id, bs = 're'), data = d)
gam2 = gam(response.var ~ site + s(year, k = 3) + s(block_id, by = site, bs = 're'), data = d)
summary(gam1)
summary(gam2)
block_id | sitesosite:block_idis equivalent toblock_idgiven the way you have codedblock_id. Includingyeardoesn't change anything as that is independent of the random effects terms, just add it back in to thegam()- I removed it because the other terms had larger differences to thelmer()model if I keptyearin thegam()one and I wanted to illustrate the random effect structures were resulting in the same model fits, essentially. – Gavin Simpson Apr 13 '21 at 21:07model.matrix( ~ site:year - 1, data = foo)(IIRC; I think the -1 is needed so we get the dummy coding). So you'll get estimates for the combinations ofsiteandyearthat are present in the data. Note that if you want to predict at (site, year) pairs not in the data, you need to have those sites/years coded as levels of the respective factor and specifydrop.unused.levels = FALSEwhen you callgam()etc so that it retains the original levels allowing you to predict for an unobserved (site, year) pair. – Gavin Simpson May 23 '23 at 08:51