
I am investigating how learning of a task improves over time. I have five individuals who take the same test repeatedly. My assumption is that the number of minutes they need to complete the task will decrease with the number of tries. However, I don't have equal sample sizes per individual, so I can't use a repeated-measures ANOVA to examine whether the time needed decreased significantly. What are my alternatives? I am completely lost.

(The data show how many times each individual has taken the test and how many minutes each attempt took to complete.)

Individual1 <- c(20,18,15,14,11,11,10,9)
Individual2 <- c(35,29,29,25,13,10,7,6)
Individual3 <- c(19,19,13,12,12)
Individual4 <- c(15,10,9,4)
Individual5 <- c(23,15,17,14,11,13,8,5)
User108
  • Check linear mixed effects models. – utobi Dec 14 '22 at 19:49
  • Related https://stats.stackexchange.com/a/592274/56940 – utobi Dec 14 '22 at 19:53
  • Thanks I'll check it! – User108 Dec 14 '22 at 20:23
  • Generalized least squares is also possible. A table in Chapter 7 of Frank Harrell's course notes compares several approaches to analyzing longitudinal data like these. As you noticed, repeated-measures ANOVA has many limitations that other approaches can overcome, although there is no one method best for all data. – EdM Dec 14 '22 at 22:08
  • If you only have 5 subjects total, you could treat them as fixed effects and include an interaction between subject and trial-number in a regression model. That might work better than a mixed model with so few subjects. – EdM Dec 14 '22 at 22:26

1 Answer


A table in Chapter 7 of Frank Harrell's course notes compares several approaches to analyzing longitudinal data like these. As you noticed, repeated-measures ANOVA has strict requirements. Other approaches provide more flexibility, although each has its own assumptions and there is no one method best for all data.

Let's put your data together into a useful data frame. I assume that what's of interest is the number of trials attempted rather than the calendar time elapsed between attempts.

# Combine the five vectors into one long-format data frame:
# one row per attempt, with the individual's ID and the attempt number
trialDat <- data.frame(
   timeToComplete = c(Individual1, Individual2, Individual3, Individual4, Individual5),
   ID = rep(c("1", "2", "3", "4", "5"),
            times = c(length(Individual1), length(Individual2), length(Individual3),
                      length(Individual4), length(Individual5))),
   trialNo = c(seq_along(Individual1), seq_along(Individual2), seq_along(Individual3),
               seq_along(Individual4), seq_along(Individual5))
   )
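
As a quick check, you can confirm the unequal numbers of attempts per individual (the counts below are implied by the vectors above):

# number of recorded attempts per individual
table(trialDat$ID)
#  1  2  3  4  5
#  8  8  5  4  8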

Plotting the data is always a good idea. The lines are simple linear regression lines for each Individual.

library(ggplot2)
ggplot(trialDat,
       aes(x = trialNo, y = timeToComplete, group = ID, color = ID)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

[Plot of timeToComplete vs. trialNo by ID, with a separate linear fit for each individual]

With only 5 individuals you might simply treat ID as a fixed effect and allow the slopes of the learning curves to differ among individuals. That essentially recapitulates what was done in the plot above: trialNo is treated as linearly associated with timeToComplete, although more flexible modeling might be called for in general.

lmMod <- lm(timeToComplete~ID*trialNo,data=trialDat)

That provides a different fit for each individual. You can use the Anova() function in the R car package to give a simple summary of the results.

car::Anova(lmMod)
# Anova Table (Type II tests)
# 
# Response: timeToComplete
#             Sum Sq Df F value    Pr(>F)    
# ID          600.51  4  34.050 2.358e-09 
# trialNo    1040.35  1 235.954 1.389e-13 
# ID:trialNo  218.24  4  12.374 1.634e-05 
# Residuals   101.41 23                      

That suggests a significant association between timeToComplete and trialNo, and that the association differs among individuals (ID:trialNo interaction).
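
If the question is specifically which individuals improved, you can extract a slope estimate (with a confidence interval) for each individual from lmMod. A minimal sketch, assuming the emmeans package (not used elsewhere in this answer) is installed:

# estimated change in timeToComplete per additional trial, separately by ID;
# slopes whose confidence intervals exclude 0 suggest reliable improvement
library(emmeans)
emtrends(lmMod, ~ ID, var = "trialNo")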

A linear mixed model, as suggested by @utobi, models some regression coefficients as having Gaussian distributions among the individuals, rather than fitting separate lines for each. The lme4 package is often used for such modeling. This model allows both for different intercepts and different slopes among the individuals, with a correlation between slopes and intercepts. This site's lmer cheat sheet shows how to set up such models.

lmerMod <- lme4::lmer(timeToComplete ~ trialNo + (1 + trialNo | ID),
                      data = trialDat)
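
To see what the mixed model estimates, you can inspect the fixed effects and the per-individual coefficients implied by the random effects; for example:

summary(lmerMod)   # fixed effects plus estimated variances and slope-intercept correlation
coef(lmerMod)$ID   # per-individual intercepts and slopes, shrunken toward the overall fit

Note that with only 5 individuals the variance components may be estimated imprecisely (see the comments below).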

Generalized least squares has many applications. It's well suited to longitudinal data; for this purpose you specify a correlation structure for the observations within each individual. The nlme package implements that approach. This example assumes a "continuous autoregressive" (corCAR1) form for the correlation.

glsMod <- nlme::gls(timeToComplete~trialNo, data=trialDat,
              correlation=nlme::corCAR1(form= ~trialNo|ID))
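
The usual nlme accessors let you inspect the fit, including the estimated continuous-AR(1) correlation parameter; for example:

summary(glsMod)           # coefficients, correlation structure, residual standard error
nlme::intervals(glsMod)   # approximate confidence intervals, including the correlation parameter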

The Harrell reference cited above pays particular attention to generalized least squares. His rms package has a Gls() interface to the nlme function.
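
A sketch of that interface, under the assumption that the rms package is installed (the arguments mirror those of nlme::gls()):

library(rms)
GlsMod <- Gls(timeToComplete ~ trialNo, data = trialDat,
              correlation = nlme::corCAR1(form = ~ trialNo | ID))
GlsMod   # prints coefficients and the estimated within-individual correlation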

EdM
  • Thank you so much for this detailed guideline and the explanations! Do I understand correctly that I would use generalized least squares to analyze each individual separately? I'm aiming to find out how many, and especially which, individuals improved significantly. The mixed model (lmer) will help me get an idea of the overall performance of the group. – User108 Dec 16 '22 at 11:16
  • @User108 no, gls assumes a correlation structure within each individual, but the analysis is over all individuals together. It might be considered more of a "marginal," population-level approach than a mixed model (lmer), which also estimates the distribution of variations among individuals. Either method could be used, depending on the assumptions you are willing to make. See the linked Harrell course notes. It's often considered best to have 6-10 individuals before using a mixed model, although that might have more to do with getting good estimates of the variances among individuals. – EdM Dec 16 '22 at 15:24