
I am computing the following model in R, using lme4::lmer:

m3 = lmer(e ~ (X*Y*Z) + (1|ID/R), data = data_transform)

e is a continuous variable. X, Y, and Z are categorical factors. ID is the subject ID and R is the measurement repetition. The error was measured multiple times for each subject under the conditions given by X, Y, and Z (repeated measures). The design is slightly unbalanced.

My goal is to compute an ANOVA, using the lmer model:

m3_anova = Anova(m3, type = 3, test.statistic = "F")

which does work and produces plausible results in the context of my experiment.

Now I am checking the assumptions of the mixed-effects model, which, as far as I understand, are homoscedasticity and normality of the residuals. I check them as follows:

res = residuals(m3)
interaction_factor = with(data_transform, interaction(X,Y,Z))
leveneTest(res ~ interaction_factor, center = median)

which is significant with p << 0.001. Furthermore:

shapiro.test(res[1:5000])

is significant with p << 0.0001. I am already transforming my target variable, the error e, using rcompanion::transformTukey.

The residual variances vary from ~0.3 to ~1.

My questions are:

  1. Am I checking the assumptions correctly?
  2. Can I still run the ANOVA and if not what can I do to get normally distributed residuals and homoscedasticity?
  3. Edit: From the plots below, are my assumptions met or not? I feel like I lack the experience to judge them.

Edit:

I am using the packages:

library(lme4)
library(car)
library(lmerTest)

I also checked the normality assumption using a Q-Q plot (ggpubr::ggqqplot):

ggqqplot(res)

[Q-Q plot of the residuals]

I know that for a normal distribution the points are supposed to lie on the main diagonal. From this plot, there is obviously some deviation in the area around x = 2.

Furthermore, I plotted the residual variances across the factor-level combinations:

library(dplyr)  # for %>%, group_by(), summarise()

data_transform$residuals = res
var3 = data_transform %>% group_by(X, Y, Z) %>% summarise(Variance = var(residuals, na.rm = TRUE))
boxplot(Variance ~ X + Y + Z, data = var3)

which looks like:

[Boxplot of residual variances across the factor-level combinations]

Additionally, residuals vs. fitted values:

plot(m3)

[Residuals vs. fitted values from plot(m3)]

As suggested, I tried the DHARMa package:

plot(simulateResiduals(m3))

leading to:

[DHARMa simulated-residual diagnostic plots]

2 Answers


A few quick comments:

  • Don't use hypothesis tests to check model assumptions. Use plots of residuals (q-q plot, histogram, residuals vs. predicted values). You are testing a sample size of 5000. Such a large sample size is likely to be found non-normal and heteroscedastic even for small deviations from these ideals.

  • The transformTukey() function transforms a single vector of values. I'm not sure how you're using it, but it probably doesn't do what you want. At the least, it's not guaranteed that transforming your dependent variable will make your error normally distributed. The MASS package does have a Box-Cox transformation function, which may be more of what you want. But a generalized linear model may be more appropriate.

  • Does car support lmer models? (I didn't check.) But you might want to use the lmerTest package instead.
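The graphical checks from the first bullet can be sketched as follows. This is a minimal, self-contained example on simulated data: the model m and data frame d here are stand-ins for the question's m3 and data_transform, not the asker's actual objects.

```r
library(lme4)

# Simulated stand-in for the question's data: one factor, 20 subjects
set.seed(1)
d <- data.frame(
  X  = factor(rep(c("a", "b"), each = 100)),
  ID = factor(rep(1:20, times = 10))
)
d$e <- as.numeric(d$X) + rnorm(20)[d$ID] + rnorm(200)
m <- lmer(e ~ X + (1 | ID), data = d)

# Graphical checks instead of hypothesis tests
res <- residuals(m)
par(mfrow = c(1, 3))
qqnorm(res); qqline(res)                    # normality: points near the line
hist(res, breaks = 30, main = "Residuals")  # rough symmetry / shape
plot(fitted(m), res,
     xlab = "Fitted", ylab = "Residuals")   # homoscedasticity: even spread
abline(h = 0, lty = 2)
```

The same three plots on the real m3 are usually more informative at this sample size than the Levene and Shapiro-Wilk p-values.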

Sal Mangiafico
  • I added some plots to my question above. From the plots, I am not sure if my assumptions are met. For example, regarding the Q-Q plot: how do I know which deviation from the main diagonal is acceptable and which isn't?

    I will look into the MASS package and try that transformation.

    You are right, I am actually using lmerTest here.

    – hilberthotel Mar 19 '24 at 15:22
  • The q-q plot isn't perfect. But nothing in the real world is. Nothing in these plots would strike me as making the inference from the linear model invalid. – Sal Mangiafico Mar 19 '24 at 16:42
  • A simple reality check for plots would be: take the fitted model. Simulate data from it. Fit the same model to the simulated data. Plot the diagnostic plots for this model. Do this 19 times and compare the 19 diagnostic plots to the original diagnostic plot. Your original diagnostic plot will likely not stand out very much. The advantage of this is that it gives you an idea of the "natural variability" in data, which is usually much higher than we expect. – Stephan Kolassa Mar 19 '24 at 17:28
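The simulate-and-refit reality check described in the comment above can be sketched like this, again on a small simulated model rather than the asker's real m3:

```r
library(lme4)

# Toy model standing in for the real fitted model
set.seed(42)
d <- data.frame(
  X  = factor(rep(c("a", "b"), each = 100)),
  ID = factor(rep(1:20, times = 10))
)
d$e <- as.numeric(d$X) + rnorm(20)[d$ID] + rnorm(200)
m <- lmer(e ~ X + (1 | ID), data = d)

# 1 observed + 19 simulated q-q plots: does the observed one stand out?
par(mfrow = c(4, 5))
qqnorm(residuals(m), main = "observed"); qqline(residuals(m))
for (i in 1:19) {
  d$sim <- simulate(m)[[1]]               # data generated from the fitted model
  m_sim <- lmer(sim ~ X + (1 | ID), data = d)
  qqnorm(residuals(m_sim), main = paste("sim", i))
  qqline(residuals(m_sim))
}
```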
  1. You could use plot(simulateResiduals(model)) from the DHARMa package to check your model assumptions.
  2. You can use the varIdent function of the nlme package, or something equivalent to it in another package (see here for example)

... but instead of transforming your data, you could use a model which fits your data. GLMs for a negative binomial distribution are good at dealing with heteroscedasticity. This depends on the nature of your data, of course. You can try both and check the model assumptions afterwards.
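Point 2 above could look like the following. This is a minimal sketch on simulated data where group "b" truly has twice the residual SD; for the question's model, the variance would be conditioned on interaction(X, Y, Z) rather than a single factor.

```r
library(nlme)

# Simulated heteroscedastic data: residual SD is 2 in group "b", 1 in "a"
set.seed(7)
d <- data.frame(
  X  = factor(rep(c("a", "b"), each = 100)),
  ID = factor(rep(1:20, times = 10))
)
d$e <- as.numeric(d$X) + rnorm(20)[d$ID] +
       rnorm(200, sd = ifelse(d$X == "b", 2, 1))

# weights = varIdent(...) estimates one residual SD per level of X
# instead of forcing a single pooled variance
m_het <- lme(e ~ X, random = ~ 1 | ID, data = d,
             weights = varIdent(form = ~ 1 | X))
coef(m_het$modelStruct$varStruct, unconstrained = FALSE)  # relative SD of "b"
```

Comparing this fit against the constant-variance model with anova() shows whether the extra variance parameters are worth it.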

CaroZ
  • I will look into the DHARMa package. I tried to use nlme; however, the models did not converge. lme4 also has the "weights" parameter, which seems to be an option to handle heteroscedasticity. However, I do not know how to specify the weights. – hilberthotel Mar 19 '24 at 15:24
  • Could the lack of convergence come from a complete separation between levels of explanatory variables, i.e., in one treatment the response variable always has the same value, for example only 0s for the control? Or are some levels of your random factors collinear? You have a very high sample size! I agree with @Sal Mangiafico. – CaroZ Mar 19 '24 at 22:07
  • No, the response variable has very distinct values. The factors are collinear; that is why lmer tells me that the fixed-effects model matrix is rank-deficient. However, this is expected, since not all combinations of Y and Z are present in the data. This leads to another problem that I pointed out here: https://stats.stackexchange.com/questions/643050/nested-effects-in-lmer-model/643054#643054 (comments on this are also appreciated). – hilberthotel Mar 20 '24 at 13:32