6

I'm new to mixed effects models and am trying to use the lmer() function from the lme4 R package to specify a random effects structure.

In my experiment, subjects are spread out over 11 non-overlapping groups. Groups were fairly big (hundreds of subjects) and were tested on at least 4 successive days. On each day, subject performance was measured under two experimental conditions (hungry, satiated) and during each of these conditions, each subject contributed zero to potentially many data points. Groups were tested sequentially, i.e., on each day only one group was tested.

After reading through lectures, tutorials, and posts on here, I think my model should look like this

response_time ~ experimental_condition + (1 | group_id/day) + (1 | subject_id)

and the subject_id needs to be unique across the entire data set.

Does this look okay? And under which (hypothetical) circumstances would one nest experimental_condition within day?

Thanks for any help!

Edit: I should have mentioned that day is currently coded as the day of the experiment. In other words, it is not unique across groups (i.e., there is a day 1, day 2, ... for each group).

Tim
  • 115
  • What is your research question ? Presumably your primary interest is in estimating the effect of experimental_condition, while accounting for the non-independence of observations due to the random structure your experiment has. Please confirm or provide further detail. – Robert Long Nov 13 '23 at 11:53
  • So each subject has been measured first while hungry, and then while satiated ? Do the groups represent treatment levels, or they just happened because of the experimental setup ? – CaroZ Nov 13 '23 at 12:30
  • @Robert Long: That's correct, I'm interested in the effect of experimental condition on response_time. Sorry for the omission. – Tim Nov 13 '23 at 13:28
  • @CaroZ: Yes, the subjects were first measured when hungry and then measured again when satiated. And the groups represent honey bee colonies, not treatment levels. – Tim Nov 13 '23 at 13:47
  • For the nested random effects, you have specified (1 | group_id/day) which implies that day is nested within group_id. This means that each day is unique within a group but not across groups so for example, day 1 for group 1 is different from day 1 for group 2). Is that the case ? – Robert Long Nov 13 '23 at 15:24
  • Maybe there is something I misunderstood about the day: do the days correspond to something synchronized for the colonies, like for example the colony you measured at day 6 was actually 4 days post manipulation, and the colony you measured at day 7 was also 4 days after manipulation ? – CaroZ Nov 13 '23 at 18:13
  • @Robert Long: I'm unsure. As per experimental design, day 1 is the same for all groups. However, in practice this was not the case: Some days were cloudy and my subjects (honey bees) didn't feed as much as on sunny days, which affected experimental_condition (i.e., how satiated the bees were). Also, each group (bee colony) started feeding itself on a different day. For example, in group 1 the treatment was effective from day 6 to day 13 while in group 2 it was effective from day 10 to day 13. Because of these things, I thought that day is unique within group. What do you think? – Tim Nov 13 '23 at 21:08
  • @CaroZ: Please see my previous comment. Does this answer your question? – Tim Nov 13 '23 at 21:12
  • It does. I just understood : you cannot make sure you got the same bee on different days, so you have to give a unique subject_id to each bee. In this case, I even wonder if this variable brings you anything. – CaroZ Nov 13 '23 at 23:49
  • @CaroZ: I actually do know which bee I measure. When we create an experimental colony, we attach a tiny barcode to each bee, which allows us to uniquely identify bees during the experiment. So subject_id is actually meaningful. I added it to the model to account for the fact that I repeatedly measure the same bee (within and across days). The reason for it being unique across the data set is that I think it needs to be a crossed random effect within group. – Tim Nov 14 '23 at 07:20
  • No I would say the same bee needs to have the same subject_ID, the bee is being repeated. There has to be +(1|group/bee) somewhere. – CaroZ Nov 14 '23 at 10:41
  • @CaroZ: I agree. Sorry for wording things ambiguously. What I meant to say is that subject_id is unique within a group and not repeated across groups. And if a particular bee was measured repeatedly, these measurements were associated with that bee's subject_id, not different subject IDs. – Tim Nov 14 '23 at 13:25

3 Answers3

5

Does this look okay?

Based on your description, and given your research question of estimating the effect of experimental_condition, while accounting for the non-independence of observations due to the random structure your experiment has, this does not look OK to me. The issue is with the random structure, and how to handle the day variable.

It appears that each and every subject belongs to one and only one group. Thus, subjects are nested within groups, so you need the term:

... + (1 | group_id / subject_id) + ...

which will fit random intercepts for each group and each subject within a group.

This leaves the question of how to treat the day variable: fixed or random. There isn't necessarily a black and white answer to this, but see the list of threads at the end of my answer for help on how to choose. The first thing to note is that day has only 4 levels. This isn't necessarily a problem if day is nested within group_id, since there will then be $n_{day} \times n_{group} = 44$ intercepts.

So, if treating day as random and nested within group we would have:

response_time ~ experimental_condition + (1|group_id/subject_id) + (1|group_id/day)

which expands to

response_time ~ experimental_condition + (1|group_id) + (1|group_id:subject_id) + (1|group_id)+ (1|group_id:day)

which then simplifies to:

response_time ~ experimental_condition + (1|group_id) + (1|group_id:subject_id) + (1|group_id:day)

Alternatively if day is not nested within group we wouldn't fit random intercepts with only 4 levels, so treating day as fixed would make more sense in that scenario:

response_time ~ experimental_condition + day + (1|group_id/subject_id)

In the this latter model you should consider whether to fit an interaction term in the fixed part if the effect of the experimental condition differs by day:

response_time ~ experimental_condition * day + (1|group_id/subject_id)

And under which (hypothetical) circumstances would one nest experimental_condition within day?

Nesting experimental_condition within day makes sense if each experimental_condition belongs to one and only one day. That does not seem to be the case with your design. This would also bring up the problem of whether to fit a factor as random or variable. See the following threads for much discussion on that topic:

What is the difference between fixed effect, random effect and mixed effect models?

How to determine random effects in mixed model

Understanding Random Effects in Linear Mixed Models

Can a variable be included in a mixed model as a fixed effect and as a random effect at the same time?

Choosing Random Effects to Include in a Linear Mixed Model

Robert Long
  • 60,630
  • I completely disagree. If subject_id is already unique and nested in group_id then (1|group_id / subject_id) = (1|group_id) + (1|subject_id), so your first point is wrong. Source(last bullet point): https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#nested-or-crossed – Lukas Lohse Nov 13 '23 at 21:05
  • inversely, since day is probably not uniquely specified and it makes perfect sense that different beehives are in different places and get different weather, etc... on the same day, it should be nested in group – Lukas Lohse Nov 13 '23 at 21:10
  • Well the question says "and the subject_id needs to be unique across the entire data set", so... Also is that link right? It just points to here – Lukas Lohse Nov 13 '23 at 21:35
  • Oh, you deleted your comment. I do want to walk back the "completely": I agree with you about nesting experimental_condition and if they only measured 1 group a day, day is also automatically nested. – Lukas Lohse Nov 13 '23 at 22:46
  • @Lukas Lohse : Were you referring to my previous comment ? I put it back under my answer. – CaroZ Nov 13 '23 at 22:57
  • @CaroZ uhm no, Robert Long had written one, but i actually like your comment. Good points. This whole chain is getting very confusing and I'm gonna go to bed and maybe delete/edit some stuff tomorrow – Lukas Lohse Nov 13 '23 at 23:12
  • @Lukas Lohse: day not being uniquely coded and the weather and other factors that I had no control over being different on the same day of the experiment for different groups are the reasons why I nested day within group in my preliminary model. I've added clarifications about this as a comment under my OP. You might want to have a look! – Tim Nov 14 '23 at 08:00
  • @LukasLohse Apologies for deleting my comment, you were rather too quick to respond and I hadn't noticed your reply plus it was also late where I am. Anyway, also to the OP: I have updated my answer. I think we all agree that subjects are nested in groups, leaving the question of how to treat day. – Robert Long Nov 14 '23 at 10:17
  • @Tim: what do you mean by day not being uniquely coded ? Each day should have a unique ID right ? – CaroZ Nov 14 '23 at 11:33
  • @CaroZ: Please see my edit to the OP. Does this answer your question? – Tim Nov 14 '23 at 13:33
  • @RobertLong no worries, i just came on a bit strong in my 1st comment and wanted to walk it back a little. I still disagree with you saying that "you need the term:..." when Tims setup doesn't need that term, even if you might prefer it for clarity. – Lukas Lohse Nov 14 '23 at 15:11
2

I think your model is fine. If subject_id is unique between groups, then you don't have to specify as nested in the model formula. See last bullet point here: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#nested-or-crossed

Also a quick demonstration in R:

library(lme4)
n <- 1000
set.seed("0815")
subject_id <- sample(1:100, size = n, replace = T)
subject_random_effect <- rnorm(100)
# randomly generate groups for the subjects
group_id <- sample(1:10, size = length(unique(subject_id)), replace = T)
group_random_effect <- rnorm(10)
# simluate response, by exploiting how R does indexing
y <- subject_random_effect[subject_id] + group_random_effect[group_id[subject_id]] + rnorm(n, sd = 0.5)

df <- data.frame(y,subject_id, group_id = group_id[subject_id])

perfectly identical models

summary(lmer(y ~ (1|group_id/subject_id), data = df)) summary(lmer(y ~ (1|group_id) + (1|subject_id), data = df))

I also agree with you that it makes perfect sense that different beehives are in different places and get different weather, etc... on the same day, day should be nested in group, although with only one group getting measured per day it doesn't actually matter. Optimally you'd have weather data and include adjustments based on strong domain knowledge, but that is likely unrealistic.

Spatial correlation, i.e. groups that are closer together are more similar/experience more similar weather is something you might consider, but the technical challenge is high: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#spatial-and-temporal-correlation-models-heteroscedasticity-r-side-models

With 11 groups you could also consider treating group as a fixed effect.

And under which (hypothetical) circumstances would one nest experimental_condition within day?

Nesting it would be weird, but you might want to look at effect variation, i.e. whether the difference in response time between hungry and satiated bees changes.

A formula that both teats group as fixed effect and looks at effect variation for experimental condition would be:

response_time ~ experimental_condition + group_id + (experimental_condition|group_id:day) + (1|subject_id)
Lukas Lohse
  • 2,482
  • I could obtain weather data and have considered adding adjustments to the model based on it it. However, from the feeding apparatus, I also know how much food the bees consumed (e.g. 152 ml in group 2 on day 11), which is probably a more relevant adjustment than the weather, as it pertains to the satiated experimental_condition more directly. What do you think? – Tim Nov 14 '23 at 09:03
  • I'll probably forgo making adjustments based on spatial correlations as the groups where tested in roughly the same place and the technical challenges seem high. – Tim Nov 14 '23 at 09:05
  • I sounds like you and Robert Long agree that nesting experimental_condition within day doesn't make much sense with respect to the research question. But thanks for explaining what that would do and for showing the corresponding model. That was educational. – Tim Nov 14 '23 at 09:17
  • your 1st comment: Effects on reaction time are more relevant and only variables that have an effect on both or the difference between the groups need to be involved. There exists extensive theory about this: https://en.wikipedia.org/wiki/Bayesian_network – Lukas Lohse Nov 14 '23 at 15:30
  • 2nd comment, sure, 3rd comment: What i defined is more of a random slopes model, which in this case is only subtlety different, but nevertheless i wouldn't mind it. – Lukas Lohse Nov 14 '23 at 15:32
0

Since individuals belong to colonies, I would nest the subject ID in the group :

response_time ~ experimental_condition + (1|group_id/subject_id)

Since on each group has its own unique days where it was tested, I would not further include "day" as a random factor.

CaroZ
  • 755
  • I would ave thought that you need the day variable in the model somewhere ? – Robert Long Nov 13 '23 at 16:12
  • @Lukas Lohse: were you referring to my previous comment here ? It was : "Since each day contains measurements of only 1 group, then group and day are completely colinear. I thought the variance caused by the day cannot be discriminated from the variance caused by the group." Actually I am wondering if +(group_id/subject_id)+(1|group_id/day) would be a good thing. – CaroZ Nov 13 '23 at 22:57
  • @CaroZ: I thought day needs to be in the model because I measure each group on more than one day, and there is uncontrolled day to day variability in subject performance. – Tim Nov 14 '23 at 08:25
  • Yes group is repeated across day, which is why I was proposing +(group_id/subject_id)+(1|group_id/day) , but I wonder if it is correct, and then if it is an overkill, because the variance caused by the days where you sampled is contained in the variance caused by the group. but I suppose you can make the same argument for (group_id/subject_id). – CaroZ Nov 14 '23 at 11:24