4

This is confusing because I find so many different sources claiming different things. I am going to try to keep it simple.

Several individuals are organised into 3 groups. We measure multiple variables at two time points, and of course we are interested in how the health intervention influences the outcome at the end. Here I present several possibilities.

  1. lme(variable ~ time:group, random = ~ time | subject)
  2. lme(variable ~ time:group + group, random = ~ time | subject)
  3. lme(variable ~ time:group + time, random = ~ time | subject)
  4. lme(variable ~ time:group + group + time, random = ~ time | subject)

Which one would you choose, and why? What are the differences between them, strictly speaking, in terms of what they adjust for?

My interest lies in capturing statistically significant differences across groups and time.

  • 2
By 2 time points, do you mean that you observe the subjects pre- and post-treatment? And can you also describe what kind of variable variable is? Continuous? Binary? It's better to provide relevant information about your study than to make the question too abstract. – dipetkov Mar 20 '23 at 13:26
The variables are gene expression, quantitative and continuous. If you take an example database, you can see that running the 4 models you get different terms and different values – Javier Hernando Mar 20 '23 at 14:00
As you work with gene expression data, have you looked into whether you can perform the analysis with a specialized package, which may make more sense than lme4 or nlme? The package that's often recommended on CV is bioconductor::limma. – dipetkov Mar 20 '23 at 14:27
The package you suggest is usually used for microarrays. I am working with RT-qPCR. There is another package focused on this, but it performs just relative quantification, without adjusting for covariates such as age or sex – Javier Hernando Mar 20 '23 at 14:30
  • Okay. I can see what you mean by keeping it simple. Then I'll just point out that none of the models you propose allows for the variance of variable to vary across groups. – dipetkov Mar 20 '23 at 14:35
I don't really understand what you mean. What is that about the variance in my models? I was keeping it simple so as not to talk about all the previous normalization steps – Javier Hernando Mar 20 '23 at 14:38
  • You say you are interested in capturing "the variance across groups and time". It's not clear, at least to me, whether you mean variance in the statistical sense or variability more generally. Do you want a model that allows that the variance of the outcome is different across both groups and time: $\operatorname{Var}(Y) = \sigma^2_{\text{time},\text{group}}$. – dipetkov Mar 20 '23 at 14:50
Sorry for the miscommunication, I meant variance in a general sense. What I want to find out is whether there are significant differences between groups across time. Whether the parameter is the variance or something else was unintended for the post. – Javier Hernando Mar 20 '23 at 15:00
  • Great. The clarifications ended up being unnecessary but I think it's still better to have clarity. And you already got a good answer from @LuckyPal. – dipetkov Mar 20 '23 at 15:02
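As an aside to dipetkov's point above: if one did want a model that allows the residual variance of variable to differ across groups, nlme supports this via a varIdent weights structure. A minimal sketch, assuming a long-format data frame dat with columns variable, time, group, and subject (all names here are placeholders):

```r
library(nlme)

# Random-intercept model with group-specific residual variances,
# specified through a varIdent variance structure
fit_hetero <- lme(variable ~ time * group,
                  random  = ~ 1 | subject,
                  weights = varIdent(form = ~ 1 | group),
                  data    = dat)
```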

2 Answers

5

With only 2 observations per subject, you probably cannot reliably estimate random slopes (subject-specific slopes) for time. Assuming "group" refers to the health intervention, the model I would first consider (with lme4's lmer) is

lmer(variable ~ time * group + (1 | subject))

This gives you the interaction between time and group (which is what you need to see whether your intervention had an effect on the variable), while taking into account the non-independence of your observations.

Another possibility would be to use a single-level (lm) regression with clustered standard errors.
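A sketch of that alternative, assuming a long-format data frame dat with columns variable, time, group, and subject (placeholder names), and using the sandwich and lmtest packages for the cluster-robust standard errors:

```r
library(sandwich)
library(lmtest)

# Ordinary least squares; the point estimates ignore the clustering
fit <- lm(variable ~ time * group, data = dat)

# Cluster-robust standard errors by subject to account for
# the repeated measurements on the same individuals
coeftest(fit, vcov = vcovCL, cluster = ~ subject)
```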

Sointu
  • 1,846
  • (1) what's wrong with nlme? When you want to fit GLMMs or have big data, lme4 is preferable, but in this case I don't see a reason to switch to lme4. (2) the number of time points (as long as it is >1) does not affect the estimation of the random slope of time. It is the subjects (and their individual time effect) which are random, not the time points, so the requirement concerns only the number of subjects. – LuckyPal Mar 20 '23 at 12:43
  • 1
    (1) Nothing as such, I avoid it for certain reasons but they don't apply here. (2) Really? That's interesting. I have been under the impression that while you can estimate random slopes with only 2 observations per cluster, it's typically too few in practice for a reliable estimation. Thanks for pointing this out! – Sointu Mar 20 '23 at 13:09
As I've been reading through the posts, this point about modeling random slopes with only 2 time points has been made several times. But I don't know for sure the fundamentals behind this statement. – Javier Hernando Mar 20 '23 at 14:06
  • 2
    @JavierHernando it sounds like you are new to mixed models. lme4::lmer is usually recommended over nlme::lme as nlme, while being more flexible, also assumes you know what you're doing and will let you fit models that actually don't meet MLM assumptions. In this case, including a random slope with only two time points. – David B Mar 20 '23 at 15:52
@DavidB Reading your comment, I don't find much to learn from it. Why is lme4 recommended over nlme? What does "assumes you know what you're doing" mean here? I know about residual correlation and cross-nested experiments. Beyond that, I'm pretty new. – Javier Hernando Mar 20 '23 at 16:19
  • 1
    @Sointu sorry, I have to take back my comment regarding the number of time points required for a random slope. You were right: 2 is not enough to reliably fit a model in nlme or lme4. I am coming more from the Bayesian side, where we sometimes have quite a different view on what it means to "reliably estimate" a parameter :-) @JavierHernando your comment to @DavidB comes across quite rude, particularly since his point about lme4 not allowing to even estimate the random slope was absolutely valid. – LuckyPal Mar 20 '23 at 22:18
  • @LuckyPal, that's cool! I'm currently trying to learn Bayesian statistics and yes, things seem to be different there :) – Sointu Mar 21 '23 at 08:16
  • @JavierHernando no need to use lme4, I edited that part of my answer. LuckyPal provided you with a good nlme code. – Sointu Mar 21 '23 at 08:20
@LuckyPal thank you for the insight. The random slope concept I worked with before used 3 time points, and I was quite doubtful about including it or not. I didn't intend to be rude; in fact I sensed it the other way around, and I came here because of my lack of knowledge. Nevertheless, apologies. I appreciate your comments, Sointu. Probably my model will be like lme(variable ~ time:group + group + time, random = ~ 1 | subject), omitting the random slope. I have run models 1 to 4, and there are surprising statistical differences – Javier Hernando Mar 21 '23 at 09:07
  • 1
@DavidB I didn't remember that you and I had exchanged information before in another post. I understand your comment now. Sorry, I misunderstood it – Javier Hernando Mar 22 '23 at 13:05
5

Your models 1-3 omit one or both main effects (except when your variables are coded as factors; see https://stackoverflow.com/questions/40729701/how-to-use-formula-in-r-to-exclude-main-effect-but-retain-interaction). You rarely want to omit main effects. Since only model 4 contains both main effects and the interaction, it is most likely the most appropriate one. It is usually written as time*group.
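To illustrate the shorthand (assuming a data frame dat with columns variable, time, group, and subject; these names are placeholders), the following two calls specify exactly the same fixed effects:

```r
library(nlme)

# Model 4 spelled out: both main effects plus the interaction
m4a <- lme(variable ~ time + group + time:group,
           random = ~ 1 | subject, data = dat)

# The * operator expands to the same terms: time + group + time:group
m4b <- lme(variable ~ time * group,
           random = ~ 1 | subject, data = dat)
```

(A random intercept rather than a random slope is used here, in line with the point below that a random slope is not estimable with only two time points.)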

For a discussion of interaction effects without main effects, look for example here: Including the interaction but not the main effects in a model.

Edit: I actually just learned that lme4 would not even let you estimate a random slope if you have just two time points, because you are trying to estimate just as many random effects as you have observations (or even fewer if there is missing data). As Ben Bolker points out, although with nlme the estimation appears to work, the model will not be able to actually distinguish between random slope variance and residual variation.

With health interventions whose outcome develops over time, the standard is to use an ANCOVA adjusted for baseline, which will turn out to be quite similar to the mixed model with random intercept. So if you have measured at baseline (i.e. at the time of randomization, or when the treatment was assigned), the usual way to estimate the intervention effect would be:

lm(variable ~ baseline + time*group)

Or if you are really interested in the variance of the random intercept, you could use

lme(variable ~ baseline + time*group, random= ~ time|subject)

but that will give (almost) identical results as the other model, and the variance of the random intercept will be (almost) the same as the variance of the baseline values.

Also be aware that interpreting the results can be somewhat challenging in the presence of interaction effects. You might want to read further into the topic, and examine the fitted model with pairwise comparisons (e.g. with the emmeans package) to get a better understanding of the group- and time-specific estimates.
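A sketch of such follow-up comparisons, assuming a fitted model object fit whose formula includes factors time and group (names are placeholders for your data):

```r
library(emmeans)

# Estimated marginal means for each group at each time point
emm <- emmeans(fit, ~ group | time)

# Pairwise group comparisons within each time point,
# with multiplicity adjustment
pairs(emm, adjust = "tukey")

# Change over time within each group
pairs(emmeans(fit, ~ time | group))
```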

LuckyPal
  • 1,860
I don't really get why I should provide the baseline value. As far as I know, linear mixed-effects models take into account individual variability at the beginning. It's not like a repeated-measures ANOVA, where you need to adjust for the baseline value. I will look into the interaction term without the main effects, because trust me, I have many statistics books but it is hard to find something about interactions with time and about which covariate and factor captures which effect. – Javier Hernando Mar 20 '23 at 14:04
  • 2
    To read about why it's recommended to include the baseline, see Best practice when analysing pre-post treatment-control designs. – dipetkov Mar 20 '23 at 15:01
  • @JavierHernando in the medical field, adjustment for baseline is the standard, although my impression (based on some quite basic simulations) is that the additional fixed effect for baseline reduces the residual variance only slightly in presence of a random intercept, and the treatment effects are actually identical. Sorry, I do not really understand the second half of your comment - did my post answer your question? – LuckyPal Mar 20 '23 at 15:13
Does the term time cover the baseline value? Given that I have 2 time points, I was assuming that the baseline value is already in the equation when I include time, and that including the baseline value of the variable would therefore be redundant – Javier Hernando Mar 20 '23 at 15:26
  • @JavierHernando the effect of time does not cover the effect of the baseline value, but the random intercept does capture it (almost) entirely. Still, standard practise is to include baseline in the model (in randomized medical interventions), but I don't believe there has been explicit statistical research on it in the context of mixed effects models. You can fit both models - my guess is you will see only very small differences in the estimates. – LuckyPal Mar 20 '23 at 15:42
I'll add links where @Dimitris Rizopoulos provides some insight: https://stats.stackexchange.com/questions/431374/baseline-adjustment-in-mixed-models-with-two-assessments

    https://stats.stackexchange.com/questions/478310/is-there-a-mathematical-proof-for-change-being-correlated-with-baseline-value/478335#478335

    Thanks

    – Javier Hernando Mar 20 '23 at 15:49