
I’m working with a toy data set to build a basic understanding of longitudinal data analysis, and I’m having a difficult time with a conceptual understanding of the intraclass correlation (ICC) as a proportion of variance. As I understand it, one way to interpret the ICC in a random-intercept model is as the proportion of total outcome variance attributable to between-cluster differences in means over time. In mathematical terms, if the model is represented as $$y_{ti} = \beta_{0} + U_{0i} + e_{ti},$$ where $U_{0i}\sim\mathcal{N}(0, \tau^{2}_{U_{0}})$ and $e_{ti}\sim\mathcal{N}(0, \sigma^{2}_{e})$, then the ICC is defined as $$\text{ICC} = \frac{Var(U_{0i})}{Var(U_{0i}) + Var(e_{ti})} = \frac{\tau^{2}_{U_{0}}}{\tau^{2}_{U_{0}} + \sigma^{2}_{e}}.$$ Here are a couple of figures from the data. Plot A is a histogram of the responses, with the solid red line representing the grand mean. Plot B is a spaghetti plot of the individual trajectories across $t = 3$ time points. I have color-coded a few individuals to show different patterns in the response.
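To make the formula concrete, here is a quick numeric illustration of the definition with made-up variance components (the values 16 and 48 are arbitrary, not estimates from these data):

```r
# Hypothetical variance components, purely for illustration
tau2   <- 16   # between-cluster (random-intercept) variance
sigma2 <- 48   # within-cluster (residual) variance

icc <- tau2 / (tau2 + sigma2)
icc  # 0.25: a quarter of the total variance lies between clusters
```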
[Figure: (A) histogram of piat responses with the grand mean marked in red; (B) spaghetti plot of individual trajectories across the three waves, with a few individuals highlighted]

library(tidyverse)
library(ggthemes)
library(ggpubr)
library(nlme)

df <- structure( list( id = rep(1:89, each = 3), wave = rep(1:3, 89), piat = c(18, 35, 59, 18, 25, 28, 18, 23, 32, 18, 31, 50, 18, 33, 53, 18, 28, 31, 17, 28, 28, 17, 29, 41, 28, 26, 26, 16, 20, 21, 18, 28, 42, 18, 31, 57, 18, 41, 49, 18, 31, 52, 18, 33, 44, 18, 43, 46, 37, 28, 29, 16, 27, 27, 18, 18, 26, 18, 38, 47, 18, 22, 26, 22, 27, 45, 14, 37, 37, 18, 40, 35, 29, 31, 43, 17, 23, 21, 19, 36, 57, 4, 27, 28, 18, 24, 45, 18, 44, 51, 18, 31, 51, 24, 39, 48, 18, 34, 29, 17, 36, 37, 18, 21, 28, 16, 38, 47, 18, 40, 47, 22, 36, 34, 18, 41, 48, 19, 40, 46, 18, 28, 36, 18, 28, 47, 20, 36, 45, 18, 28, 41, 16, 34, 43, 25, 45, 49, 18, 24, 31, 17, 27, 29, 19, 32, 48, 43, 55, 61, 20, 24, 32, 18, 46, 47, 18, 29, 40, 28, 56, 61, 18, 20, 40, 21, 30, 38, 21, 22, 47, 18, 22, 21, 16, 21, 32, 29, 37, 48, 21, 32, 47, 19, 20, 26, 18, 34, 41, 28, 30, 31, 21, 28, 43, 18, 25, 32, 20, 24, 39, 22, 30, 42, 26, 47, 45, 22, 45, 40, 22, 41, 53, 18, 24, 40, 26, 28, 28, 26, 28, 30, 18, 24, 26, 31, 41, 62, 17, 19, 28, 42, 29, 30, 28, 42, 50, 23, 39, 55, 28, 47, 53, 26, 44, 32, 19, 35, 51, 18, 32, 40, 29, 47, 45, 16, 20, 59, 22, 49, 64, 22, 23, 23, 23, 40, 45) ), row.names = c(NA, -267L), class = c("tbl_df", "tbl", "data.frame") )

########## Making Figures -----------------------------------
g <- ggplot(df) + 
  aes(x = piat) + 
  geom_histogram(color = "grey", fill = "grey80") + 
  coord_flip() + 
  theme_bw()

rand_ids <- c(16, 86, 88)

rand_df <- df %>% 
  filter(id %in% rand_ids)

p1 <- g + 
  geom_vline(data = rand_df, aes(xintercept = piat, color = factor(id))) + 
  ggthemes::scale_color_colorblind() +
  geom_vline(xintercept = mean(df$piat), color = "red", size = 2, alpha = 0.5) + 
  scale_y_continuous(limits = c(0, 70))

p2 <- ggplot(df) + 
  aes(x = wave, y = piat, group = id) + 
  geom_point(alpha = 0.1) + 
  geom_line(alpha = 0.1) + 
  geom_point(data = rand_df, aes(color = factor(id)), size = 2) + 
  geom_line(data = rand_df, aes(color = factor(id)), size = 1) + 
  ggthemes::scale_color_colorblind()

ggarrange(p1, p2, labels = c("A", "B"), common.legend = TRUE)

When the random-intercept model is fit, we estimate the grand mean over time ($\beta_{0}$), the cluster-specific deviations from the grand mean (summarized by their variance, $\tau^{2}_{U_{0}}$), and the residual variance ($\sigma^{2}_{e}$). Here the random-intercept variance is extremely small, so the ICC estimate is essentially 0, and I’m having a hard time conceptualizing why this is the case.

########## Empty-Means Model -----------------------------------
empty_means <- lme(piat ~ 1, random = ~ 1|id, data = df)
summary(empty_means)

Linear mixed-effects model fit by REML
  Data: df
       AIC      BIC    logLik
  2084.744 2095.495 -1039.372

Random effects:
 Formula: ~1 | id
        (Intercept) Residual
StdDev: 0.001475647 11.91709

Fixed effects:  piat ~ 1
               Value Std.Error  DF  t-value p-value
(Intercept) 31.22472 0.7293139 178 42.81382       0

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
 -2.2845111  -0.8579881  -0.2705962   0.7783177   2.7502760

Number of Observations: 267
Number of Groups: 89

########## ICC by hand -----------------------------------
0.001475647^2/(0.001475647^2 + 11.91709^2)
[1] 1.533291e-08

Here are my questions:

  1. Does the output imply that the estimates for the cluster-level means across time ($\beta_{0} + U_{0i}$) are all essentially at the grand mean for every cluster?
  2. If so, why is this the case, since empirical cluster-level means have a wide range? Is this due to massive shrinkage towards the grand mean?
  3. How would I go about reporting the ICC? Would it be better to add in the effect of time and talk about the variation between clusters once wave is accounted for?
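Regarding question 2, the gap between "wide empirical spread of cluster means" and "near-zero estimated $\tau^{2}_{U_{0}}$" can be seen with the classic one-way ANOVA (method-of-moments) view of the ICC. This sketch uses simulated data with no true cluster effect, laid out like the real data (89 ids, 3 waves); the values, seed, and estimator are illustrative assumptions, not what REML computes internally:

```r
# Method-of-moments (one-way ANOVA) ICC on simulated pure-noise data.
# Cluster means still vary by chance, yet the ICC estimate is ~0.
set.seed(1)
n <- 89; k <- 3
sim <- data.frame(id = rep(1:n, each = k),
                  y  = rnorm(n * k, mean = 31, sd = 12))

grand   <- mean(sim$y)
means_i <- tapply(sim$y, sim$id, mean)

# Between-id and within-id mean squares
msb <- k * sum((means_i - grand)^2) / (n - 1)
msw <- sum((sim$y - means_i[sim$id])^2) / (n * (k - 1))

icc_hat <- (msb - msw) / (msb + (k - 1) * msw)
icc_hat  # near 0, and can even come out negative; REML truncates at 0
```

The point is that with only $k = 3$ weakly correlated observations per id, the observed spread of cluster means is about what pure residual noise would produce, so essentially none of it is credited to $\tau^{2}_{U_{0}}$.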
LISpace

1 Answer


I was really intrigued by this, so I searched around a bit. There is a technical explanation related to the estimation of small variances in mixed models. To answer your first and second questions, the model reports a near-zero random-intercept variance because the estimated between-ID variance really is tiny. As explained here, nlme will report near-zero values and lme4 will report exactly zero when the model finds a very small proportion of variance for the random intercept.

You demonstrate that you do have non-zero variance among the ID clusters; however, recall that the ICC can equivalently be defined as the proportion of variance between clusters or as the average correlation within clusters. You only have three observations per ID, and there do not seem to be clear differences in overall "levels" between clusters in these data. The deviations among clusters are likely being treated as less meaningful/systematic because of the small within-cluster sample, as explained here.

I would answer your third question affirmatively. If I run the following (or a version without random slopes for wave), the model gives interpretable estimates.

library(lme4)  # lmer() comes from lme4, not nlme

growth_model <- lmer(piat ~ wave + (1 + wave | id), data = df)
summary(growth_model)

On a side note, I would suggest giving your wave variable a meaningful 0 value. If you use wave = rep(0:2, 89), you can interpret the fixed intercept as the estimated value at the first wave.
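The same recode can also be done after the fact by shifting the existing variable. A minimal sketch, using a small toy frame in place of df (the frame and the wave0 name are illustrative assumptions):

```r
# Toy frame standing in for df; the recode itself is one line
toy <- data.frame(id = rep(1:2, each = 3), wave = rep(1:3, 2))
toy$wave0 <- toy$wave - 1   # waves 1,2,3 become 0,1,2

toy$wave0  # 0 1 2 0 1 2
# then e.g. lmer(piat ~ wave0 + (1 + wave0 | id), data = df)
```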

dcoy
  • I think that clears it up. Let me repeat it back to you in my own words to see if I understand correctly. The within-cluster variance is high as a function of a) the small within-cluster sample size and b) weakly correlated responses. As shown in the spaghetti plot, relatively large initial values are no guarantee of large values at future time points, so the within-cluster data are noisy to the point where any between-cluster differences that might exist get drowned out. If the within-cluster responses were more correlated (e.g., large time-1 responses associated with large time-2 responses, etc.) – LISpace Jul 08 '22 at 00:24
  • or there were more time points in general, then the within-cluster variance would likely be smaller and the between-cluster differences could be more easily parsed. – LISpace Jul 08 '22 at 00:24
  • Yes, I think what you described must be the reason for the low variance. And because it is low enough to impede precise estimation, the mixed-model software reports it as near zero (nlme) or exactly zero (lme4). For the "under the hood" discussion, you can refer to those more detailed responses. I didn't mark this as a duplicate question, because I think yours had more nuance in some areas. This and this from the package owner might work as a footnote reference. – dcoy Jul 08 '22 at 15:22
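The scenario the comments describe (stronger within-id correlation implying a recoverable ICC) can be sketched with simulated data that contains a genuine random intercept. All numbers here, including the seed and the true $\tau = \sigma = 5$ (so a true ICC of 0.5), are made up for illustration, and the estimator is the classic one-way ANOVA moment estimator rather than REML:

```r
# When responses within an id ARE correlated (a real random
# intercept), the moment estimator recovers a clearly non-zero ICC.
set.seed(2)
n <- 89; k <- 3
u  <- rnorm(n, 0, 5)                              # true cluster effects, tau = 5
y  <- 31 + rep(u, each = k) + rnorm(n * k, 0, 5)  # sigma = 5, true ICC = 0.5
id <- rep(1:n, each = k)

grand   <- mean(y)
means_i <- tapply(y, id, mean)
msb <- k * sum((means_i - grand)^2) / (n - 1)
msw <- sum((y - means_i[id])^2) / (n * (k - 1))

(msb - msw) / (msb + (k - 1) * msw)  # close to the true ICC of 0.5
```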