3

I'm dealing with longitudinal data and, to account for the dependence of observations within a cluster, I plan to use a generalized linear mixed model. I have a continuous response variable, and I'd like to fit a Gaussian mixed model. However, the estimated density of the response (even after a log transformation) does not look normal: it has two local maxima, the second of which is also the global maximum.

Is it appropriate to work with a Gaussian model?

utobi
  • it's not the response that has to be normal, but the response conditionally on the regressors, i.e. the residuals should be normal. – utobi Jan 04 '23 at 11:22
  • You are right, stupid point. Indeed, what I should care about is the conditional distribution, not the marginal one. Basically, I should fit the model and then test for the normality of the residuals, right? – Maximilian Jan 04 '23 at 11:27

3 Answers

5

It is unfortunately a common misunderstanding that Linear Mixed-Effects (LME) models, like any classical Linear Model (LM), assume that the response is normally distributed with suitable parameters. The truth is that LM(E) assume that the response is normal with suitable parameters conditionally on the covariates.
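To make the marginal versus conditional distinction concrete, here is a minimal simulation sketch. It assumes a plain linear model with a single binary covariate (no random effects), and all names and numbers are hypothetical. The response is bimodal marginally, yet the residuals are exactly what a Gaussian model expects.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Hypothetical binary covariate: it shifts the mean of y by 5 units,
# so the marginal density of y has two well-separated modes.
x = rng.integers(0, 2, size=n)
y = 1.0 + 5.0 * x + rng.normal(0.0, 1.0, size=n)

# Ordinary least squares fit of y on an intercept and x.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat


def excess_kurtosis(v):
    """Sample excess kurtosis; close to 0 for normal data."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z**4) - 3.0)


kurt_y = excess_kurtosis(y)          # clearly negative: bimodal marginal
kurt_resid = excess_kurtosis(resid)  # near 0: residuals look Gaussian
print(f"excess kurtosis of y: {kurt_y:.2f}, of residuals: {kurt_resid:.2f}")
```

A histogram of `y` would show two humps, while a histogram of `resid` would show one bell shape, even though the model is correctly specified.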

Reading David's answer made me recall that there is a subtle but important difference between the residuals of an LM and those of an LME. This difference is due to the presence of random effects. To check the residuals of an LME, one thus has to decide first what to do with the random effects. Two alternatives are possible:

(1) marginal residuals

(2) conditional residuals

Since the random effects are mere random variables, we could integrate them out of the model and then compute the implied residuals; those computed this way are called marginal residuals.

On the other hand, random effects are also parameters, albeit random ones. In some contexts, it is also of interest to estimate the random effects. Given estimates of the random effects, it is possible to consider residuals that are obtained conditionally on these estimates; these are called conditional residuals. For a full account of these issues see Pinheiro and Bates (2004) "Mixed-Effects Models in S and S-PLUS", Springer.
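The two kinds of residuals can be sketched numerically. The example below is only an illustration under strong assumptions: a balanced random-intercept model, simulated data, and simple method-of-moments variance estimates standing in for the REML fit a real mixed-model package would perform. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, n_times = 200, 5
sigma_b, sigma_e = 2.0, 1.0            # assumed true standard deviations
t = np.arange(n_times, dtype=float)    # same time grid in every cluster

b = rng.normal(0.0, sigma_b, size=n_clusters)             # random intercepts
e = rng.normal(0.0, sigma_e, size=(n_clusters, n_times))  # within-cluster errors
y = 1.0 + 0.5 * t + b[:, None] + e                        # rows are clusters

# Fixed effects by OLS (in this balanced design OLS and GLS coincide).
X = np.column_stack([np.ones(n_times), t])
X_all = np.tile(X, (n_clusters, 1))
beta_hat, *_ = np.linalg.lstsq(X_all, y.ravel(), rcond=None)

# Marginal residuals: only the fixed-effects mean is subtracted.
marginal_resid = y - X @ beta_hat

# Method-of-moments variance components (a stand-in for REML).
cluster_means = marginal_resid.mean(axis=1)
within = marginal_resid - cluster_means[:, None]
sigma_e2_hat = np.sum(within**2) / (n_clusters * (n_times - 1) - 1)
sigma_b2_hat = max(cluster_means.var(ddof=1) - sigma_e2_hat / n_times, 0.0)

# Predicted random intercepts: shrunken cluster means of marginal residuals.
shrink = sigma_b2_hat / (sigma_b2_hat + sigma_e2_hat / n_times)
b_hat = shrink * cluster_means

# Conditional residuals: predicted random effects are subtracted as well.
conditional_resid = marginal_resid - b_hat[:, None]

print("variance of marginal residuals:   ", marginal_resid.var())
print("variance of conditional residuals:", conditional_resid.var())
```

The marginal residuals carry both the between-cluster and the within-cluster variability, so they are correlated within a cluster; the conditional residuals are closer to the within-cluster errors, which is why the two can lead to different diagnostic plots.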

From the point of view of assumption verification (if that's ever useful, see the Side Note), this means that you should never check if the distribution of the response is normal-looking (e.g. by histograms, normality tests, etc.). You should instead look at the distribution of the residuals of that model.
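As a rough sketch of what looking at the residuals can mean in practice, the snippet below computes the points of a normal QQ-plot and the correlation between theoretical and sample quantiles. The residual vectors here are simulated placeholders, and the function names are hypothetical; in a real analysis `resid` would come from the fitted model.

```python
import numpy as np
from statistics import NormalDist


def qq_points(resid):
    """Return (theoretical normal quantiles, sorted residuals)."""
    r = np.sort(np.asarray(resid, dtype=float))
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n   # plotting positions in (0, 1)
    std_normal = NormalDist()
    theo = np.array([std_normal.inv_cdf(p) for p in probs])
    return theo, r


def qq_correlation(resid):
    """Correlation of the QQ-plot points; near 1 when residuals look normal."""
    theo, samp = qq_points(resid)
    return float(np.corrcoef(theo, samp)[0, 1])


rng = np.random.default_rng(1)
normal_resid = rng.normal(size=500)              # placeholder: well-behaved residuals
skewed_resid = rng.exponential(size=500) - 1.0   # placeholder: strongly skewed residuals

r_normal = qq_correlation(normal_resid)
r_skewed = qq_correlation(skewed_resid)
print(f"QQ correlation, normal: {r_normal:.4f}, skewed: {r_skewed:.4f}")
```

Plotting `theo` against the sorted residuals gives the usual QQ-plot; points bending away from a straight line in the tails indicate heavy tails or skewness.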

Side Note. Some statisticians would argue that checking the normality of the residuals is not useful at all. You can find many threads about this on this site, e.g. here

utobi
  • Thanks so much. In a few lines, if possible, how would you validate, generally speaking, the normality assumption for the conditional distribution of the response? – Maximilian Jan 04 '23 at 11:52
  • Extract the residuals of the model and inspect a QQ-plot, hunting for heavy tails or strong skewness, plus (possibly) a normality test. But again, I would not worry too much about normality, but rather about heteroscedasticity; check the link in my post for further details. Hope this helps. – utobi Jan 04 '23 at 11:55
  • Still a good point: a Gaussian model is homoskedastic, so checking for homoskedasticity is crucial, and a graphical check such as a QQ-plot may also be useful. Thanks a lot. – Maximilian Jan 04 '23 at 11:58
  • The very last point: one of the most famous tests for homoskedasticity is the Breusch-Pagan test. In a mixed model context, let's say I have a random intercept only. The variance of the random effect describes the variability between clusters, and the variance of the residuals describes the variability within clusters. Now, clusters are assumed to be independent and the errors are assumed to be independent of the random effects. Is it still okay to perform the Breusch-Pagan test? – Maximilian Jan 04 '23 at 12:09
  • Thanks for your further answer, but I do not agree that random effects are parameters. Indeed, if you have a simple LMM with a random intercept only, say $b_i$, then even with 100 clusters you have only one additional parameter to estimate, i.e., $\sigma_b^2$. – Maximilian Jan 05 '23 at 21:42
  • @Maximilian I agree with you. From a frequentist perspective it's not easy to see them as parameters, since they are random variables. However, seen through the eyes of a Bayesian, everything becomes an unknown parameter. – utobi Jan 05 '23 at 21:48
  • I see, I'm working within a frequentist approach. – Maximilian Jan 05 '23 at 21:50
  • But even in the frequentist approach it's not uncommon to see random parameters estimated. This is one example. Another one is factor analysis, in particular the estimation of factor scores. These are also random parameters estimated from data. – utobi Jan 05 '23 at 21:53
3

There are some issues when incorporating mixed effects, as you have two sources of residual variation, stemming from your level 1 and level 2 effects. It's been a while since I looked into this, but if I recall correctly, there is an ongoing debate about the appropriateness of different types of residuals. I think Santos Nobre and da Motta Singer (2007) give a good overview of the modelling challenges and also present the most commonly used methods.

If you're working in R, I suggest looking into the HLMdiag package. I remember finding it particularly helpful when diagnosing mixed models. For a Bayesian approach, I looked into DHARMa (its simulation-based residual diagnostics also work for frequentist fits such as lme4 models), which might also be worth checking out if it applies to your use case.

David
  • That's a good point. Indeed, as pointed out in another comment, we have two distinct sources of variability: the one captured by the random effects and the one left in the residual errors. Both the random effects and the residuals are assumed to be normally distributed. Does this imply that I should also check the normality of the random effects? Thanks for the references, I'm going to consult them. – Maximilian Jan 04 '23 at 12:42
  • If I remember correctly, normality of random effects is an assumption for LMMs, so I would check them, too. – David Jan 04 '23 at 13:16
  • nice to point out conditional vs marginal LME residuals; I completely forgot about that issue (+1). – utobi Jan 05 '23 at 10:02
1

If you have longitudinal data it might be a better idea to plot the response (y) as lines on the time axis (x). Then you can think about what model to use. You might prefer something different from a Gaussian mixed model, such as a GEE. What's the difference? Here

There are also other approaches that might be useful, but I don't have enough information on your problem to tell more.

utobi
jmarkov
  • In your opinion, can I fit a Gaussian model and then test the normality of the residuals to see whether the conditional distribution is normal? – Maximilian Jan 04 '23 at 11:42