Can fixed-effects become biased due to random structure misspecification
Yes they can. Let's do a simulation in R to show it.
We will simulate data according to the following model:
Y ~ treatment + time + (1 | site) + (time | subject)
So we have fixed effects for treatment and time, random intercepts for subject nested within site and random slopes for time over subject. There are many things that we can vary with this simulation and obviously there is a limit to what I can do here. But if you (or others) have some suggestions for altering the simulations, then please let me know. Of course you can also play with the code yourself :)
In order to look at bias in the fixed effects we will do a Monte Carlo simulation. We will make use of the following helper function to determine if the model converged properly or not:
hasConverged <- function (mm) {
if ( !(class(mm)[1] == "lmerMod" | class(mm)[1] == "lmerModLmerTest")) stop("Error must pass a lmerMod object")
retval <- NULL
if(is.null(unlist(mm@optinfo$conv$lme4))) {
retval = 1
}
else {
if (isSingular(mm)) {
retval = 0
} else {
retval = -1
}
}
return(retval)
}
So we will start by setting up the parameters for the nested factors:
n_site <- 100; n_subject_site <- 5; n_time <- 2
which are the number of sites, the number of subjects per site and the number of measurements within subjects.
So now we simulate the factors:
dt <- expand.grid(
time = seq(0, 2, length = n_time),
site = seq_len(n_site),
subject = seq_len(n_subject_site),
reps = 1:2
) %>%
mutate(
subject = interaction(site, subject),
treatment = sample(0:1, size = n_site * n_subject_site,, replace =
TRUE)[subject],
Y = 1
)
X <- model.matrix(~ treatment + time, dt) # model matrix for fixed effects
where we also add a column of 1s for the reponse at this stage in order to make use of the lFormula function in lme4 which can construct the model matrix of random effects Z:
myFormula <- "Y ~ treatment + time + (1 | site) + (time|subject)"
foo <- lFormula(eval(myFormula), dt)
Z <- t(as.matrix(foo$reTrms$Zt))
Now we set up the parameters we will use in the simulations:
# fixed effects
intercept <- 10; trend <- 0.1; effect <- 0.5
SDs of random effects
sigma_site <- 5; sigma_subject_ints <- 2; sigma_noise <- 1; sigma_subj_slopes <- 0.5
correlation between intercepts and slopes for time over subject
rho_subj_time <- 0.2
betas <- c(intercept, effect, trend) # Fixed effects parameters
Then we perform the simulations:
n_sim <- 200
# vectrs to store the fixed effects from each simulations
vec_intercept <- vec_treatment <- vec_time <- numeric(n_sim)
for (i in 1:n_sim) {
set.seed(i)
u_site <- rnorm(n_site, 0, sigma_site) # standard deviation of random intercepts for site
cormat <- matrix(c(sigma_subject_ints, rho_subj_time, rho_subj_time, sigma_subj_slopes), 2, 2) # correlation matrix
covmat <- lme4::sdcor2cov(cormat)
umat <- MASS::mvrnorm(n_site * n_subject_site, c(0, 0), covmat, empirical = TRUE) # simulate the random effects
u_subj <- c(rbind(umat[, 1], umat[, 2])) # lme4 needs the random effects in this order (interleaved) when there are slopes and intercepts
u <- c(u_subj, u_site)
e <- rnorm(nrow(dt), 0, sigma_noise) # residual error
dt$Y <- X %% betas + Z %% u + e
m0 <- lmer(myFormula, dt)
summary(m0) %>% coef() -> dt.tmp
if(hasConverged(m0)) {
vec_intercept[i] <- dt.tmp[1, 1]
vec_treatment[i] <- dt.tmp[2, 1]
vec_time[i] <- dt.tmp[3, 1]
} else {
vec_intercept[i] <- vec_treatment[i] <- vec_time[i] <- NA
}
}
And finally we can check for bias:
mean(vec_intercept, na.rm = TRUE)
## [1] 10.04665
mean(vec_treatment, na.rm = TRUE)
## 0.497358
mean(vec_time, na.rm = TRUE)
## [1] 0.09761494
...and these agree closely with the values used in the simulation: 10, 0.5 and 0.1.
Now, let us repeat the simulations, based on the same model:
Y ~ treatment + time + (1 | site) + (time|subject)
but instead of fitting this model, we will fit:
Y ~ treatment + time + (1 | site)
So we just need to make a simple change:
m0 <- lmer(myFormula, dt)
to
m0 <- lmer(Y ~ treatment + time + (1 | site), data = dt )
And the results are:
mean(vec_intercept, na.rm = TRUE)
## [1] 10.04169
mean(vec_treatment, na.rm = TRUE)
##[1] 0.5068864
mean(vec_time, na.rm = TRUE)
##[1] 0.09761494
So that's all good.
Now we make a simple change:
n_site <- 4
So now, instead of 100 sites, we have 4 sites. We retain the number of subjects per site (5) and the number of time points per subject (2).
For the "correct" model, the results are:
mean(vec_intercept, na.rm = TRUE)
## 10.16447
mean(vec_treatment, na.rm = TRUE)
## [1] 0.422812
mean(vec_time, na.rm = TRUE)
## [1] 0.1049933
Now, while the intercept and time are close to unbiased, the treatment fixed effect is a little off (0.42 vs 0.5, a bias of around -15% which perhaps stregthens the argument for not fitting random intercepts at all for such a small group even when the random structure is correct). But, if we fit the "wrong" model, the results are:
mean(vec_intercept, na.rm = TRUE)
## [1] 10.0194
mean(vec_treatment, na.rm = TRUE)
## [1] 0.7084542
mean(vec_time, na.rm = TRUE)
## [1] 0.1029664
So now we find the bias of around +42%
As mentioned above, there are a huge number of possible ways this simulation can be altered and adapted, but it does show that biased fixed effects can result when the random structure is wrong, as requested.