
I have proportion data $Y \in [0,1]$. Without random effects, this can be fit with glm without too much trouble, using family = quasibinomial to adjust the standard errors, although in practice I bootstrap to correct the SEs.
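
For concreteness, the fixed-effects-only version I have in mind is roughly the following (a sketch, using the dat data frame simulated below; the case-resampling bootstrap here is just one simple way to get the SE):

## quasibinomial glm on the proportion response, plus a simple case-resampling
## bootstrap for the SE of the x_var coefficient (illustrative only)
glm_fit <- glm(y ~ x_var, family = quasibinomial('logit'), data = dat)

boot_est <- replicate(500, {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(glm(y ~ x_var, family = quasibinomial('logit'), data = dat[idx, ]))["x_var"]
})
sd(boot_est)  ## bootstrapped SE for x_var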

I'm trying to fit a generalized linear mixed model with, e.g., one fixed effect and two random effects. I had hoped glmer and glmmTMB would both fit the mixed-model version of the glm model above. The problem is that they don't agree.

Here's an example of the data:

require(lme4)
require(glmmTMB)

## each subject looks at each object for a proportion of trial length
subjects = 15
objects = 20
## five looks per pair
n = subjects * objects * 5

## x is some measured variable
x_var = rnorm(n)
rand_subject = rnorm(subjects, 0, 0.3)
rand_object = rnorm(objects, 0, 0.3)

dat <- data.frame(subject = as.character(rep(1:subjects, times = n / subjects)),
                  object = as.character(rep(1:objects, each = n / objects)),
                  x_var = x_var, stringsAsFactors = FALSE)

dat$subj_val <- rand_subject[as.numeric(dat$subject)]
dat$obj_val <- rand_object[as.numeric(dat$object)]

## expit scale
expit <- function(x){exp(x) / (1 + exp(x))}

y_linear = dat$subj_val + dat$obj_val + dat$x_var * 0.8
expected_val = expit(y_linear)

From here, we could just set $y = 1$ with probability expected_val to get a regular binomial response. Or, following Papke and Wooldridge, the assumption is only on the conditional mean, $E(y \mid x) = g(x\beta)$, so we can simulate values anywhere in $[0,1]$ inclusive. I do this below with a beta distribution whose mean is expected_val. Note that as SD_shrink (0.7 below) tends towards 1, the beta degenerates to a Bernoulli and all $y$ values end up in $\{0,1\}$.

SD_shrink = 0.7
sd_val = sqrt(expected_val * (1 - expected_val)) * SD_shrink

alpha_val = expected_val^2 * (1 - expected_val) / sd_val^2 - expected_val
beta_val = alpha_val * (1 / expected_val - 1)

y <- rbeta(n, shape1 = alpha_val, shape2 = beta_val)

## dat$y = rbinom(n, 1, prob = expected_val)  ## the plain-binomial alternative
dat$y = y
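
(A quick sanity check on the moment matching, purely illustrative: the residuals from expected_val should average roughly zero, and their pooled SD should match the shrunken target.)

## sanity check: y should track expected_val, with SD shrunk by SD_shrink
tapply(dat$y - expected_val, cut(expected_val, 5), mean)  ## roughly zero
c(empirical = sd(dat$y - expected_val),
  target = sqrt(mean(sd_val^2)))                          ## roughly equal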

Now to run the models:

glmer_reg <- glmer(as.formula('y ~ x_var + (1 | subject) + (1 | object)'),
                   data = dat,
                   family = binomial('logit'),
                   control = glmerControl(optimizer="bobyqa"))

tmb_reg_bin <- glmmTMB(as.formula('y ~ x_var + (1 | subject) + (1 | object)'),
                       data = dat,
                       family = list(family="binomial",link="logit"))

[I could also run glmmTMB with a beta family on the data above, because the simulation doesn't produce exact zeros and ones (e.g. the max would be something like 0.99999999), but my actual data has real zeros and ones.]
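
For completeness, that beta fit would look something like this (a sketch; beta_family is the beta response in recent glmmTMB versions, older ones used the list-style family spec):

## beta response on the simulated data; works here only because the simulated y
## stays strictly inside (0,1) - it would fail on my real 0/1 values
tmb_reg_beta <- glmmTMB(y ~ x_var + (1 | subject) + (1 | object),
                        data = dat,
                        family = beta_family(link = "logit"))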

Problem: these two models don't always give similar results. In my simulations, each model's point estimate is often not even inside the other's confidence interval. Here's an example run:

summary(glmer_reg)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: y ~ x_var + (1 | subject) + (1 | object)
   Data: dat
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
  1817.5   1838.7   -904.7   1809.5     1496 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.55678 -0.52916 -0.02636  0.54188  2.47618 

Random effects:
 Groups  Name        Variance Std.Dev.
 object  (Intercept) 0.08794  0.2966  
 subject (Intercept) 0.13320  0.3650  
Number of obs: 1500, groups:  object, 20; subject, 15

Fixed effects:
            Estimate Std. Error z value            Pr(>|z|)    
(Intercept) -0.07381    0.12873  -0.573               0.566    
x_var        0.95226    0.06968  13.667 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
      (Intr)
x_var -0.024

################
################

summary(tmb_reg_bin)
 Family: binomial  ( logit )
Formula:          y ~ x_var + (1 | subject) + (1 | object)
Data: dat

     AIC      BIC   logLik deviance df.resid 
  1878.8   1900.1   -935.4   1870.8     1496 

Random effects:

Conditional model:
 Groups  Name        Variance Std.Dev.
 object  (Intercept) 0.03967  0.1992  
 subject (Intercept) 0.07637  0.2763  
Number of obs: 1500, groups:  object, 20; subject, 15

Conditional model:
            Estimate Std. Error z value            Pr(>|z|)    
(Intercept) -0.02566    0.10099  -0.254               0.799    
x_var        0.82458    0.06555  12.579 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is typical of my results: glmer looks much better on log-likelihood / AIC etc., but glmmTMB is closer to the true simulated coefficient on x_var (which can be verified with an "oracle" glm(family = quasibinomial) that plugs in the true subject / object values). glmer seems to be inflating the estimate.
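
To spell out the oracle check (a sketch; treating the true simulated subject/object effects as a known offset is one way to "plug them in"):

## "oracle" fit: condition on the true subject/object effects via an offset,
## so only the x_var coefficient is estimated; quasibinomial handles the
## non-integer response
oracle_glm <- glm(y ~ x_var, family = quasibinomial('logit'),
                  offset = subj_val + obj_val, data = dat)
summary(oracle_glm)$coefficients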

Notes: increasing SD_shrink towards 1 makes these two models converge to each other. As SD_shrink gets smaller, the problem above gets much worse: glmer's fit looks even better in terms of log-likelihood while its bias grows, and glmmTMB's random-effects std. dev. estimates seem to go wrong.
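
To make that concrete, here is roughly how I vary SD_shrink (a sketch that wraps the simulation above in a function and compares the two packages' x_var estimates; note it passes a family object to glmmTMB, which recent versions accept):

## regenerate y at several SD_shrink values (re-using expected_val, dat and n
## from above) and compare the fitted x_var coefficients; the true value is 0.8
sim_fit <- function(SD_shrink) {
  sd_val <- sqrt(expected_val * (1 - expected_val)) * SD_shrink
  alpha_val <- expected_val^2 * (1 - expected_val) / sd_val^2 - expected_val
  beta_val <- alpha_val * (1 / expected_val - 1)
  dat$y <- rbeta(n, shape1 = alpha_val, shape2 = beta_val)

  m_glmer <- glmer(y ~ x_var + (1 | subject) + (1 | object), data = dat,
                   family = binomial('logit'),
                   control = glmerControl(optimizer = "bobyqa"))
  m_tmb <- glmmTMB(y ~ x_var + (1 | subject) + (1 | object), data = dat,
                   family = binomial('logit'))
  c(glmer = unname(fixef(m_glmer)["x_var"]),
    glmmTMB = unname(fixef(m_tmb)$cond["x_var"]))
}

sapply(c(0.3, 0.5, 0.7, 0.9), sim_fit)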

I'm guessing I'm wrong that the two calls specify the same model. Which one comes closer to the framework of fitting the expected value of a proportion?

Colman
  • short answer: I wouldn't trust lme4 with non-integer responses for a binomial. I'll look at this more when I get a chance. – Ben Bolker Sep 18 '17 at 22:34
  • See https://stats.stackexchange.com/questions/233366. I think using a binomial family for a continuous response variable is not appropriate, so it's not a big wonder that different libraries treat it differently. Using a beta mixed model would make sense too; too bad that you have exact 0s and 1s. You can either try a zero/one-inflated beta, or see option #4 in the linked answer for a glmer hack. – amoeba Sep 19 '17 at 17:17
  • In fact, your problem (glmer not agreeing with glm) seems to be identical to my problem in the linked question, so I vote to close as a duplicate. – amoeba Sep 19 '17 at 17:30
  • Hey @amoeba, I read your question, and in some respects this is a response to Ben's proposed solutions (very useful). My issue is not that glmer and glm disagree necessarily - in nonlinear models with random effects, they don't have to agree - it's that glmer and glmmTMB disagree, while in theory they are fitting the same model; further, that the usual methods to choose between competing models (e.g. log-likelihood-based tests) seem to be giving the wrong answer, in terms of matching the simulated data. So, that's a problem, and I'm looking for a reference for which model is fitting the theory. – Colman Sep 19 '17 at 22:20
  • So @amoeba, while some parts overlap, I don't think the initial question you asked is quite what I'm having trouble understanding, and in the end it's great that you (for me and you!) found glmmTMB to work better - but unanswered is which one is a more reasonable theoretical match (and/or what's the diff), and why does glmer fit so much better, while looking worse. The answer could even be that my simulation is "wrong" in terms of the theory. – Colman Sep 19 '17 at 22:27
  • Also @amoeba - I didn't address your other useful comments. Conceptually I don't think zero/one inflated beta makes sense for the data, and sadly fits poorly in terms of density. And in terms of the overdispersion trick, the fit is still "good" in loglik, but maintains the bias issue, which is my source of worry here. Finally I don't follow what you mean by not appropriate - in terms of the Papke Wooldridge paper linked, I think it's well defined in terms of expected value. – Colman Sep 19 '17 at 22:31
  • I think what both you and me found out experimentally is that glmer is not able to deal with continuous response and binomial family, for whatever technical reason. I don't know what happens with the loglikelihood, but my impression is that one should not believe any output of this model... (Apart from that, too bad that the overdispersion trick did not help getting rid of the bias, this seemed to work in my data as reported in the linked thread.) – amoeba Sep 19 '17 at 22:43
  • I mean, I presume it's not properly implemented, but it's hard to say: it returns a very good fit, better than glmmTMB. I trust my simulations to a degree, but not enough to say that I know that glmer is making a mistake (of course, Ben saying not to trust it indicates it's not done right - but I didn't know that before). This warning is also reported by glm, but that still gives the right estimates (with conservative SE values - so suitable coverage). I've read that the warning was put in because generally when non-int is put in, users had intended to use weights (so, y = p/n, weights = n). – Colman Sep 19 '17 at 22:45
  • By the way, when you say that overdispersion trick did not work, did you implement it as per point #4 in my linked answer, i.e. with "fake" weights? – amoeba Sep 19 '17 at 22:46
  • I misread your overdispersion trick - I tried something else, sorry. For your trick, I think that only works because for whatever reason, letting y = p/n and setting weights = n is in R a valid way of giving actual binomial data - equivalent to (successes, fails). So by doing that, glmer just fits an actual binomial - but then this is not your data, and you get misleading SE values (far too small). – Colman Sep 19 '17 at 22:50
  • I was doing it some time ago, but I recall getting appropriate SE values (when using the extra random effect for each row, as written there). If you select n big enough, then this "rounded" data will be very close to the actual data. I am not saying it's the way to go, it's more of a hack. – amoeba Sep 19 '17 at 22:53
  • I just tried it in simulation, and with the rowid add on, still gives very small SEs - with K = 100, approx 10 times too small, which makes sense, since it's similar to essentially pretending you have 100 times more data than you do. It does "undo" the bias (also makes sense, since it's now a true binomial), but that only is right if the glmer "bias" is really bias (which indeed I believe it is, but I don't think this shows it to be). – Colman Sep 19 '17 at 23:05
  • Hmm, interesting. I might need to get back to that answer and clarify this point. Thanks. – amoeba Sep 19 '17 at 23:08

0 Answers