I have proportion data $Y \in [0,1]$. Without random effects, this is fit in glm without too much trouble, with family = quasibinomial to adjust the standard errors, although I actually bootstrap to correct the SEs.
I'm trying to fit a generalized linear mixed model, with e.g. one fixed effect and two random effects. I had hoped glmer and glmmTMB would fit the mixed models version of glm's model above. The problem is that they don't agree.
Here's an example of the data:
require(lme4)
require(glmmTMB)
## each subject looks at each object for a proportion of trial length
subjects = 15
objects = 20
## five looks per pair
n = subjects * objects * 5
## x is some measured variable
x_var = rnorm(n)
rand_subject = rnorm(subjects, 0, 0.3)
rand_object = rnorm(objects, 0, 0.3)
dat <- data.frame(subject = as.character(rep(1:subjects, times = n / subjects)),
object = as.character(rep(1:objects, each = n / objects)),
x_var = x_var, stringsAsFactors = FALSE)
dat$subj_val <- rand_subject[as.numeric(dat$subject)]
dat$obj_val <- rand_object[as.numeric(dat$object)]
## expit scale
expit <- function(x){exp(x) / (1 + exp(x))}
y_linear = dat$subj_val + dat$obj_val + dat$x_var * 0.8
expected_val = expit(y_linear)
From here, we can just set $y = 1$ with probability expected_val, to get a regular binomial. Or from Papke Wooldrige, our assumption is on E(g(x)). So we can simulate values in $[0,1]$ inclusive. I do this below with a beta distribution. Note that as the 0.7 below tends towards 1, all $y$ values will be in $\{0,1\}$.
SD_shrink = 0.7
sd_val = sqrt(expected_val * (1 - expected_val)) * SD_shrink
alpha_val = expected_val^2 * (1 - expected_val) / sd_val^2 - expected_val
beta_val = alpha_val * (1 / expected_val - 1)
y <- rbeta(n, shape1 = alpha_val, shape2 = beta_val)
## dat$y = rbinom(n, 1, prob = expected_linear)
dat$y = y
Now to run the models:
glmer_reg <- glmer(as.formula('y ~ x_var + (1 | subject) + (1 | object)'),
data = dat,
family = binomial('logit'),
control = glmerControl(optimizer="bobyqa"))
tmb_reg_bin <- glmmTMB(as.formula('y ~ x_var + (1 | subject) + (1 | object)'),
data = dat,
family = list(family="binomial",link="logit"))
[I could also run TMB with beta family with the above data, because the simulation doesn't produce exact zeros and ones (e.g. max would be 0.99999999), but my actual data has real zeros and ones]
Problem: these two models don't always give similar results. In my simulations, their point estimates often aren't even in the confidence interval of the other. Here's an example run:
summary(glmer_reg)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: y ~ x_var + (1 | subject) + (1 | object)
Data: dat
Control: glmerControl(optimizer = "bobyqa")
AIC BIC logLik deviance df.resid
1817.5 1838.7 -904.7 1809.5 1496
Scaled residuals:
Min 1Q Median 3Q Max
-2.55678 -0.52916 -0.02636 0.54188 2.47618
Random effects:
Groups Name Variance Std.Dev.
object (Intercept) 0.08794 0.2966
subject (Intercept) 0.13320 0.3650
Number of obs: 1500, groups: object, 20; subject, 15
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.07381 0.12873 -0.573 0.566
x_var 0.95226 0.06968 13.667 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
x_var -0.024
################
################
summary(tmb_reg_bin)
Family: binomial ( logit )
Formula: y ~ x_var + (1 | subject) + (1 | object)
Data: dat
AIC BIC logLik deviance df.resid
1878.8 1900.1 -935.4 1870.8 1496
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
object (Intercept) 0.03967 0.1992
subject (Intercept) 0.07637 0.2763
Number of obs: 1500, groups: object, 20; subject, 15
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02566 0.10099 -0.254 0.799
x_var 0.82458 0.06555 12.579 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is typical of my results: glmer is much better in loglikelihood / AIC etc, but TMB is closer to the true simulated x value (and this can be verified with an oracle glm(family = quasibinomial), plugging in the true subject / object values). Glmer seems to be inflating the estimate.
Notes: increasing the SD_shrink towards 1 makes these models converge. As SD_shrinks gets smaller, the above problem is made much worse: glmer's fit is even better in terms of loglik, while its bias grows worse too; TMB's random effects std. dev estimate seems to go wrong.
I'm guessing I'm wrong about the two models specifying the same model. Which model is closer to specifying the framework of fitting the expected value of a proportion?
glmeris not able to deal with continuous response and binomial family, for whatever technical reason. I don't know what happens with the loglikelihood, but my impression is that one should not believe any output of this model... (Apart from that, too bad that the overdispersion trick did not help getting rid of the bias, this seemed to work in my data as reported in the linked thread.) – amoeba Sep 19 '17 at 22:43