Estimating Multilevel Logistic Regression Models

Question

The following multilevel logistic model with one explanatory variable at level 1 (individual level) and one explanatory variable at level 2 (group level) :

$$\text{logit}(p_{ij})=\pi_{0j}+\pi_{1j}x_{ij}\ldots (1)$$ $$\pi_{0j}=\gamma_{00}+\gamma_{01}z_j+u_{0j}\ldots (2)$$ $$\pi_{1j}=\gamma_{10}+\gamma_{11}z_j+u_{1j}\ldots (3)$$

where , the group-level residuals $u_{0j}$ and $u_{1j}$ are assumed to have a multivariate normal distribution with expectation zero . The variance of the residual errors $u_{0j}$ is specified as $\sigma^2_0$ , and the variance of the residual errors $u_{1j}$ is specified as $\sigma^2_1$ .

I want to estimate the parameter of the model and I like to use R command glmmPQL .

Substituting equation (2) and (3) in equation (1) yields ,

$$\text{logit}(p_{ij})=\gamma_{00}+\gamma_{10}x_{ij}+\gamma_{01}z_j+\gamma_{11}x_{ij}z_j+u_{0j}+u_{1j}x_{ij}\ldots (4)$$

There are 30 groups$(j=1,...,30)$ and 5 individual in each group .

R code :

   #Simulating data from multilevel logistic distribution 
   library(mvtnorm)
   set.seed(1234)

   J <- 30             ## number of groups
   n_j <- rep(5,J)     ## number of individuals in jth group
   N <- sum(n_j)

   g_00 <- -1
   g_01 <- 0.3
   g_10 <- 0.3
   g_11 <- 0.3

   s2_0 <- 0.13  ##variance corresponding to specific ICC
   s2_1 <- 1     ##variance standardized to 1
   s01  <- 0     ##covariance assumed zero

   z <- rnorm(J)
   x <- rnorm(N)

   #Generate (u_0j,u_1j) from a bivariate normal .
   mu <- c(0,0)
  sig <- matrix(c(s2_0,s01,s01,s2_1),ncol=2)
  u <- rmvnorm(J,mean=mu,sigma=sig,method="chol")

  pi_0 <- g_00 +g_01*z + as.vector(u[,1])
  pi_1 <- g_10 + g_11*z + as.vector(u[,2])
  eta <- rep(pi_0,n_j)+rep(pi_1,n_j)*x
  p <- exp(eta)/(1+exp(eta))

  y <- rbinom(N,1,p)

Now the parameter estimation .

  #### estimating parameters 
  library(MASS)
  library(nlme)

  sim_data_mat <- matrix(c(y,x,rep(z,n_j),rep(1:30,n_j)),ncol=4)
  sim_data <- data.frame(sim_data_mat)
  colnames(sim_data) <- c("Y","X","Z","cluster")
  summary(glmmPQL(Y~X*Z,random=~1|cluster,family=binomial,data=sim_data,,niter=200))

OUTPUT :

      iteration 1
      Linear mixed-effects model fit by maximum likelihood
      Data: sim_data 

      Random effects:
      Formula: ~1 | cluster
              (Intercept)  Residual
      StdDev: 0.0001541031 0.9982503

      Variance function:
      Structure: fixed weights
      Formula: ~invwt 
      Fixed effects: Y ~ X * Z 
                      Value Std.Error  DF   t-value p-value
      (Intercept) -0.8968692 0.2018882 118 -4.442404  0.0000
      X            0.5803201 0.2216070 118  2.618691  0.0100
      Z            0.2535626 0.2258860  28  1.122525  0.2712
      X:Z          0.3375088 0.2691334 118  1.254057  0.2123
      Correlation: 
           (Intr) X      Z     
      X   -0.072              
      Z    0.315  0.157       
      X:Z  0.095  0.489  0.269

      Number of Observations: 150
      Number of Groups: 30

Why does it take only $1$ iteration while I mentioned to take $200$ iterations inside the function glmmPQL by the argument niter=200 ?
Also p-value of group-level variable $(Z)$ and cross-level interaction $(X:Z)$ shows they are not significant . Still why in this article, they keep the group-level variable $(Z)$ and cross-level interaction $(X:Z)$ for further analysis ?
Also How are the degrees of freedom DF being calculated ?

It doesn't match with the relative bias of the various estimates of the table . I tried to calculate the relative bias as :

 #Estimated Fixed Effect parameters :

 hat_g_00 <- -0.8968692 #overall intercept
 hat_g_10 <- 0.5803201  # X
 hat_g_01 <-0.2535626   # Z
 hat_g_11 <-0.3375088   #X*Z

fixed <-c(g_00,g_10,g_01,g_11)
hat_fixed <-c(hat_g_00,hat_g_10,hat_g_01,hat_g_11)


#Estimated Random Effect parameters :

hat_s_0 <-0.0001541031  ##Estimated Standard deviation of random intercept 
hat_s_1 <-  0.9982503 

std  <- c(sqrt(0.13),1) 
hat_std  <- c(0.0001541031,0.9982503) 

##Relative bias of Fixed Effect :
rel_bias_fixed <- ((hat_fixed-fixed)/fixed)*100
[1] -10.31308  93.44003 -15.47913  12.50293

##Relative bias of Random Effect :
rel_bias_Random <- ((hat_std-std)/std)*100
[1] -99.95726  -0.17497

Why doesn't the relative bias match with the table ?

score 7 · Accepted Answer · edited Apr 13 '17 at 12:44

7

There are perhaps too many questions here. Some comments:

you might consider using glmer from the lme4 package (glmer(Y~X*Z+(1|cluster),family=binomial,data=sim_data)); it uses Laplace approximation or Gauss-Hermite quadrature, which are generally more accurate than PQL (although the answers are very similar in this case).
The niter argument specifies the maximum number of iterations; only one iteration was actually necessary
I'm not sure what your question is about the interaction term. Whether you should drop non-significant interaction terms or not is a bit of a can of worms, and depends both on your statistical philosophy and on the goals of your analysis (e.g. see this question)
the denominator degrees of freedom are being calculated according to a simple 'inner-outer' heuristic a simple 'inner-outer' rule described on page 91 of Pinheiro and Bates (2000), which is available on Google Books ... it is generally a reasonable approximation but the computation of degrees of freedom is complex, especially for GLMMs
if you're trying to replicate "A simulation study of sample size for multilevel logistic regression models" by Moineddin et al. (DOI: 10.1186/1471-2288-7-34), you need to run a large number of simulations and compute averages, not just compare a single run. Furthermore, you should probably try to get closer to their methods (coming back to my first point, they state that they use SAS PROC NLMIXED with adaptive Gauss-Hermite quadrature, so you'd be better off with e.g. glmer(...,nAGQ=10); it still won't match exactly, but it'll probably be closer than glmmPQL.

edited Apr 13 '17 at 12:44

Community

1

answered Jun 14 '15 at 17:16

Ben Bolker

43,543

Could you please explain it a more little bit that to replicate http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1955447/table/T1/ , I need to run a large number of simulations and compute averages . Does it mean , say , I have to simulate data $300$ times from multilevel logistic distribution and estimate their parameters each time and to take the average of the estimates ? But If I do say , won't the value of estimated parameter become equal to true value of parameter according to $\mathbb E[\hat\theta]=\theta$ ? – ABC Jun 15 '15 at 00:09
glmer() estimates the variance of random intercept , $\sigma_0^2$ . But I am not getting any estimate of other variance component (residual variance component) , $\sigma_1^2$ from the result produces by summary(glmer(Y~X*Z+(1|cluster),family=binomial,data=sim_data,nAGQ=10)) – ABC Jun 15 '15 at 00:53
will answer more tomorrow, but (1) yes, you do need to simulate many times. The definition of bias is the average difference between the estimated and the true value ... (2) technically, there is no residual variance component in GLMMs with fixed scale parameters (Bernoulli, binomial, Poisson responses), although there are various ways of quantifying analogous values from the output of a GLMM fit – Ben Bolker Jun 15 '15 at 02:10
It will be very helpful if you explain it a little bit more about simulating many times . (2)How can i quantify analogous value of residual variance from the output of GLMM fit ? – ABC Jun 15 '15 at 14:45
What specifically is confusing you about simulating many times? For example, the paper you are trying to replicate explains that they generated 1000 data sets for EACH combination of parameters. I recommend replicating what they did more precisely before comparing results. – Ryan Simmons Jun 17 '15 at 13:51
@RyanSimmons that is , i need to generate data and estimate parameters $1000$ times . and then taking the average of the each estimate . Then Why should there will still relative bias $\frac{\hat\theta-\theta}{\theta}\times 100$ ? I hoped this relative bias would $0$ . Because according to unbiased property $\mathbb E(\hat\theta)=\theta$ .So the numerator of relative bias will become $0$ . – ABC Jun 17 '15 at 20:44
@RyanSimmons Can you please give me some reference (example) about simulating many times and estimating parameters from any distribution . – ABC Jun 17 '15 at 20:48
In this example http://stats.stackexchange.com/questions/134132/determine-maximum-likelihood-estimate-mle-of-loglogistic-distribution/134618#134618 , is there the idea of simulating many times ? – ABC Jun 17 '15 at 20:57
perhaps the idea of simulating many times involves in this answer http://stats.stackexchange.com/questions/22406/power-analysis-for-ordinal-logistic-regression/22410#22410 . right ? – ABC Jun 17 '15 at 21:10
2

you're assuming that the approximations we use for GLMM estimation are unbiased. That's probably not true; most of the better approximations (not PQL) are asymptotically unbiased, but they're still biased for finite-size samples. – Ben Bolker Jun 18 '15 at 03:31
1

@ABC: Yes, both of those links contain examples for how to replicate a chunk of code multiple times. It should be easy to wrap your code in a function and run the replicate command, for example. – Ryan Simmons Jun 18 '15 at 13:22
1

@ABC: As for the other part of your question, I am a bit confused as to what is bothering you. You are generating random numbers; without rounding or an infinitely large number of replications, you are never going to get exactly 0 with the bias (or, indeed, an exactly precise estimate of ANY parameter). However, with a large enough number of replications (say, 1000), you are likely to get a very small (close to 0) bias. The paper you cite that you are trying to replicate demonstrates this. – Ryan Simmons Jun 18 '15 at 13:24
For the model in equation (4) in the post , I specified the glmer model as Y~X*Z+(1|cluster) , i.e., glmer(Y~X*Z+(1|cluster),family=binomial,data=sim_data) . Is it correct ? It seems to me glmer model for equation (4) will be Y~X*Z+(X|cluster) , i.e., glmer(Y~X*Z+(X|cluster),family=binomial,data=sim_data) . Is the latter glmer model correct ? – ABC Jul 20 '15 at 07:02
Assuming the model Y~X*Z+(X|cluster) fit the model in equation (4) best , I tried glmer(Y~X*Z+(X|cluster),family=binomial,data=sim_data,nAGQ=10) . But It shows Error in updateGlmerDevfun(devfun, glmod$reTrms, nAGQ = nAGQ) : nAGQ > 1 is only available for models with a single, scalar random-effects term . I have read the link https://github.com/lme4/lme4/issues/123 . But there is no , lme4.0 package for R version 3.2.1 and I am using lme4 1.1-8 version . – ABC Jul 22 '15 at 02:27

Estimating Multilevel Logistic Regression Models

OUTPUT :

1 Answers1

Linked