3

Lets say individuals are nested within each ID and I am trying to a predict level 1 outcome Y from a level 1 predictor X with random slopes and intercepts. Using the nlme package in R, I ran the following:

> library(nlme)
> set.seed(123)
> Y=rnorm(100)+seq(.01,1,.01)
> X=rep(1:10,10)
> ID=sort(rep(1:10,10))
> Model=lme(method="ML",Y~X,random=~X|ID)
> Model
Linear mixed-effects model fit by maximum likelihood
  Data: NULL 
  Log-likelihood: -136.9366
  Fixed: Y ~ X 
(Intercept)           X 
 0.45724997  0.02511926 

Random effects:
 Formula: ~X | ID
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev     Corr  
(Intercept) 0.42405364 (Intr)
X           0.01936927 -0.87 
Residual    0.91060089       

Number of Observations: 100
Number of Groups: 10 
> RandomEffects=random.effects(Model)
> RandomEffects[1:5,]
  (Intercept)            X
1 -0.29776233  0.009528802
2 -0.11581782  0.002489661
3 -0.58731522  0.022490872
4  0.09029364 -0.004322318
5 -0.11383183  0.004338493
> cor(RandomEffects)
            (Intercept)          X
(Intercept)   1.0000000 -0.9952382
X            -0.9952382  1.0000000
>

As you can see, the slope/intercept correlation for the output (-0.87) vs the one I calculated (-.995) are not consistent, although they are close. Why is this the case? How can I manually calculate and come up with the outputted correlation?

I find this inconsistency in every analysis, although the correlations are always close. I welcome any ideas. Thank you for your input.

Hotaka
  • 1,174
  • 9
  • 14

2 Answers2

9

You're conflating two slightly different concepts. In short, the output of VarCorr() is based on the estimated variance-covariance matrix of the unconditional distribution of the random effects, and is not dependent on the observed data $\mathbf{\mathcal{Y}}$. The correlations that you calculated manually are based on the conditional modes of the random effects, which are where the density of $\mathbf{\mathcal{Y}}$ is maximal given the random effects, which means that they are conditional on the observed data (paraphrased from here).

The correlation output of VarCorr() (what you see in the output of print(model)) is based on the estimate of $\mathbf{\Sigma}$ for the marginal, or unconditional, distribution of the random effects, given by

$\mathbf{\mathcal{B}} \backsim \mathcal{N}(\mathbf{0},\mathbf{\Sigma})$

Where $\mathbf{\mathcal{B}}$ is the multivariate normal vector of random effects, with mean 0 and q x q variance-covariance matrix $\mathbf{\Sigma}$, and q is the number of columns in the random effects model matrix (see the lme4 paper for details, and you can check in lme4 with getME(model, "q")).

The conditional modes are where the density of the conditional distribution of $(\mathbf{\mathcal{Y}}|\mathbf{\mathcal{B}} = \mathbf{\mathcal{b}})$ takes the highest value. They are calculated for each level of the grouping factor of the random effect, and so the correlation of these can change depending on what data you've observed.

As for figuring out the manual calculation, see fortunes::fortune(250). The relevant code is in nlme:::VarCorr.pdMat or lme4:::mkVarCorr (I think the lme4 code is a little easier to read). I think the way lme4 does it is:

  1. Takes $\sigma$ (the residual standard deviation/scale factor), random effect component names, number of terms for each component, and $\theta$ (the random effects parameters as Cholesky factors) as input.

  2. For each random effects term $\mathcal{i}$,

    • Takes the cross product of the scale factor and the transpose of $\Lambda_i$ (which is dependent on $\theta$), which gives the variance-covariance matrix
    • Calculates the standard deviation as the square root of the diagonal of the vcov matrix
    • Calculates the correlation as covariance (non-diagonal terms) divided by the standard deviation

A (not very flexible) version of how mkVarCorr calculates the correlation:

library(lme4)

fm1 <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy, REML = FALSE)

getRanCorr <- function(mod) {
    n_ranefs <- ncol(ranef(mod)[[1]])
    theta <- getME(mod, "theta")
    sig <- sigma(mod)    
    Li <- diag(nrow = n_ranefs)

    Li[lower.tri(Li, diag = TRUE)] <- theta
    val <- tcrossprod(sig * Li)
    stddev <- sqrt(diag(val))    
    corr <- t(val/stddev)/stddev # or cov2cor(val), but I suspect there's a
                                 # reason for this
    corr[2,1]
    }

getRanCorr(fm1)
##[1] 0.08131969

VarCorr(fm1)
##Groups   Name        Std.Dev. Corr 
##Subject  (Intercept) 23.7806       
##         Days         5.7168  0.081
##Residual             25.5918       

And with your model,

mod <- lmer(Y ~ X + (X|ID), data = df, REML = FALSE)

getRanCorr(mod)
##[1] -0.870463

attr(VarCorr(mod)$ID,"correlation")[2,1]
##[1] -0.870463
0

This reminds me of a similar inquiry a professor brought up to me when studying autocorrelation which was why do they measure the correlation based on the residuals and not the x and y variables considering they both gave similar but different correlations like your situation and it came down to the residuals being a more conservative test... maybe the test ran in your mixed model is more conservative in some way to your approach.

DaveRowan
  • 108
  • Well correlations are not a significance test, so what do you mean by conservative? Maybe the output shows the correlation in the population while mine is sample-dependent? – Hotaka May 21 '15 at 10:29