I’m working with a toy data set to get a basic understanding of longitudinal data analysis and I’m having a difficult time with a conceptual understanding of the intraclass correlation (ICC) as a proportion of variance. As I understand it, one way to interpret the ICC in a random-intercept model is it’s the proportion of original outcome variance attributable to the between-cluster mean differences over time. In mathematical terms, if the model is represented as
$$y_{ti} = \beta_{0} + U_{0i} + e_{ti}$$
Where $U_{0i}\sim\mathcal{N}(0, \tau^{2}_{U_{0}})$ and $e_{ti}\sim\mathcal{N}(0, \sigma^{2}_{e})$, then the ICC is defined as
$$\text{ICC} = \frac{Var(U_{0i})}{Var(U_{0i}) + Var(e_{ti})} = \frac{\tau^{2}_{U_{0}}}{\tau^{2}_{U_{0}} + \sigma^{2}_{e}}.$$
Here are a couple of figures from the data. Plot A is a histogram of the responses, with the solid red line representing the grand mean. Plot B is a spaghetti plot of the individual trajectory across $t = 3$ time points. I have color coded a few individuals to show different patterns in the response.

library(tidyverse)
library(ggthemes)
library(ggpubr)
library(nlme)
df <- structure(
list(
id = rep(1:89, each = 3),
wave = rep(1:3, 89),
piat = c(18, 35, 59, 18, 25, 28, 18, 23, 32, 18, 31, 50, 18, 33, 53, 18, 28,
31, 17, 28, 28, 17, 29, 41, 28, 26, 26, 16, 20, 21, 18, 28, 42, 18,
31, 57, 18, 41, 49, 18, 31, 52, 18, 33, 44, 18, 43, 46, 37, 28, 29,
16, 27, 27, 18, 18, 26, 18, 38, 47, 18, 22, 26, 22, 27, 45, 14, 37,
37, 18, 40, 35, 29, 31, 43, 17, 23, 21, 19, 36, 57, 4, 27, 28, 18,
24, 45, 18, 44, 51, 18, 31, 51, 24, 39, 48, 18, 34, 29, 17, 36, 37,
18, 21, 28, 16, 38, 47, 18, 40, 47, 22, 36, 34, 18, 41, 48, 19, 40,
46, 18, 28, 36, 18, 28, 47, 20, 36, 45, 18, 28, 41, 16, 34, 43, 25,
45, 49, 18, 24, 31, 17, 27, 29, 19, 32, 48, 43, 55, 61, 20, 24, 32,
18, 46, 47, 18, 29, 40, 28, 56, 61, 18, 20, 40, 21, 30, 38, 21, 22,
47, 18, 22, 21, 16, 21, 32, 29, 37, 48, 21, 32, 47, 19, 20, 26, 18,
34, 41, 28, 30, 31, 21, 28, 43, 18, 25, 32, 20, 24, 39, 22, 30, 42,
26, 47, 45, 22, 45, 40, 22, 41, 53, 18, 24, 40, 26, 28, 28, 26, 28,
30, 18, 24, 26, 31, 41, 62, 17, 19, 28, 42, 29, 30, 28, 42, 50, 23,
39, 55, 28, 47, 53, 26, 44, 32, 19, 35, 51, 18, 32, 40, 29, 47, 45,
16, 20, 59, 22, 49, 64, 22, 23, 23, 23, 40, 45)
),
row.names = c(NA, -267L),
class = c("tbl_df", "tbl", "data.frame")
)
########## Making Figures -----------------------------------
g <- ggplot(df) +
aes(x = piat) +
geom_histogram(color = "grey", fill = "grey80") +
coord_flip() +
theme_bw()
rand_ids <- c(16, 86, 88)
rand_df <- df %>%
filter(id %in% rand_ids)
p1 <- g +
geom_vline(data = rand_df, aes(xintercept = piat, color = factor(id))) +
ggthemes::scale_color_colorblind() +
geom_vline(xintercept = mean(df$piat), color = "red", size = 2, alpha = 0.5) +
scale_y_continuous(limits = c(0, 70))
p2 <- ggplot(df) +
aes(x = wave, y = piat, group = id) +
geom_point(alpha = 0.1) +
geom_line(alpha = 0.1) +
geom_point(data = rand_df, aes(color = factor(id)), size = 2) +
geom_line(data = rand_df, aes(color = factor(id)), size = 1) +
ggthemes::scale_color_colorblind()
ggarrange(p1, p2, labels = c("A", "B"), common.legend = TRUE)
When the random intercept model is run, we are estimating the grand mean over time term ($\beta_{0}$), the cluster-specific deviations from the grand mean (as measured by its variance, $\tau^{2}_{U_{0}}$), and the residual variance ($\sigma^{2}_{e}$). Here the random intercept variance is extremely small, and therefore the ICC estimate is essentially 0, and I’m having a hard time conceptualizing why this is the case.
########## Empty-Means Model -----------------------------------
empty_means <- lme(piat ~ 1, random = ~ 1|id, data = df)
summary(empty_means)
Linear mixed-effects model fit by REML
Data: df
AIC BIC logLik
2084.744 2095.495 -1039.372
Random effects:
Formula: ~1 | id
(Intercept) Residual
StdDev: 0.001475647 11.91709
Fixed effects: piat ~ 1
Value Std.Error DF t-value p-value
(Intercept) 31.22472 0.7293139 178 42.81382 0
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.2845111 -0.8579881 -0.2705962 0.7783177 2.7502760
Number of Observations: 267
Number of Groups: 89
########## ICC by hand -----------------------------------
0.001475647^2/(0.001475647^2 + 11.91709^2)
[1] 1.533291e-08
Here are my questions:
- Does the output imply that the estimates for the cluster-level means across time ($\beta_{0} + U_{0i}$) are all essentially at the grand mean for every cluster?
- If so, why is this the case, since empirical cluster-level means have a wide range? Is this due to massive shrinkage towards the grand mean?
- How would I go about reporting the ICC? Would it be better to add in the effect of time and talk about the variation between clusters once wave is accounted for?
nlme) or precisely zero (forlme4). For the "under the hood" discussions, you can refer to those more detailed responses. I didn't mark this as a repeat question, because I think yours had more nuance in some areas. This and this from the package owner might work as a footnote reference. – dcoy Jul 08 '22 at 15:22