How do I analyze bimodal distibuted data with a linear mixed model

Question

The experiment I am working on has the following design:

A B C D E F
B A D E F C
A B E F C D
B A F C D E

Each Letter represents a different level of the single factor called “system” analyzed in this experiment. The dataset contains eight years and the dependent variable we are analyzing is yield.
A and B can be grouped together, as well as C to F according to their system type. I am aware of the missing randomization between groups AB and CDEF, which was necessary due to regulations, as well of the missing randomization within these two Groups, which has simply not been made, sadly.
I am investigating if there are sigificant differnces in yield between the systes (A-F)

My data looks like this:

> str(data)
'data.frame':   192 obs. of  6 variables:
 $ year  : Factor w/ 8 levels "2012","2013",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ type  : Factor w/ 2 levels "org","pest": 1 1 1 1 1 1 1 1 1 1 ...
 $ system: Factor w/ 6 levels "dgst_org","cc_pest",..: 3 3 3 3 5 5 5 5 6 6 ...
 $ row   : Factor w/ 4 levels "row_1","row_2",..: 1 2 3 4 2 3 4 1 3 4 ...
 $ column: Factor w/ 6 levels "column_1","column_2",..: 6 5 4 3 6 5 4 3 6 5 ...
 $ yield : num  26.2 41.4 43.4 45 40.8 52.3 47.1 47.2 40.1 42.4 ...
> summary(data)
      year      type             system      row          column       yield

 2012   :24   org :128   dgst_org   :32   row_1:48   column_1:32   Min.   : 26.20

 2013   :24   pest: 64   cc_pest    :32   row_2:48   column_2:32   1st Qu.: 52.30

 2014   :24              cc_org     :32   row_3:48   column_3:32   Median : 62.95

 2015   :24              manure_pest:32   row_4:48   column_4:32   Mean   : 73.79

 2016   :24              manure_org :32              column_5:32   3rd Qu.:103.83

 2017   :24              fmyd_org   :32              column_6:32   Max.   :127.10
> head(data,20)
    year type     system   row   column yield
377 2012  org     cc_org row_1 column_6  26.2
378 2012  org     cc_org row_2 column_5  41.4
379 2012  org     cc_org row_3 column_4  43.4
380 2012  org     cc_org row_4 column_3  45.0
417 2012  org manure_org row_2 column_6  40.8
418 2012  org manure_org row_3 column_5  52.3
419 2012  org manure_org row_4 column_4  47.1
420 2012  org manure_org row_1 column_3  47.2
461 2012  org   fmyd_org row_3 column_6  40.1
462 2012  org   fmyd_org row_4 column_5  42.4
463 2012  org   fmyd_org row_1 column_4  39.5
464 2012  org   fmyd_org row_2 column_3  35.7
505 2012  org   dgst_org row_4 column_6  57.8
506 2012  org   dgst_org row_1 column_5  48.8
507 2012  org   dgst_org row_2 column_4  52.3
508 2012  org   dgst_org row_3 column_3  64.1
537 2013  org     cc_org row_1 column_6  41.2
538 2013  org     cc_org row_2 column_5  43.3
539 2013  org     cc_org row_3 column_4  57.2
540 2013  org     cc_org row_4 column_3  51.1

I tried to come up with a proper linear mixed effect model but ran into some Problems because of the poor experiment design.

The yield showed a bimodal distribution, which was as expected an effect of the system type.

I know understand that this is no Problem as long as the residuals of the model are normally distributed, whitch they are

> m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:column) + (1|year:row), data = data)
> summary(m1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ system + (1 | year) + (1 | year:system) + (1 | year:column) +      (1 | year:row)
   Data: data
REML criterion at convergence: 1262.4
Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2604 -0.4993  0.0596  0.5585  2.3880
Random effects:
 Groups      Name        Variance Std.Dev.
 year:column (Intercept)  0.01384 0.1176

 year:system (Intercept) 43.85302 6.6222

 year:row    (Intercept)  2.27887 1.5096

 year        (Intercept) 22.30702 4.7230

 Residual                26.42919 5.1409

Number of obs: 192, groups:  year:column, 48; year:system, 48; year:row, 32; year, 8
Fixed effects:
                  Estimate Std. Error      df t value Pr(>|t|)

(Intercept)         62.981      3.028  27.986  20.801  < 2e-16 ***
systemcc_pest       46.566      3.552  34.309  13.110 6.42e-15 ***
systemcc_org        -9.744      3.552  33.574  -2.743  0.00969 ** 
systemmanure_pest   47.147      3.552  34.309  13.274 4.49e-15 ***
systemmanure_org    -8.369      3.552  33.574  -2.356  0.02444 *

systemfmyd_org     -10.722      3.552  33.574  -3.019  0.00482 **

Signif. codes:  0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
            (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r
systmcc_pst -0.587

systemcc_rg -0.587  0.500

systmmnr_ps -0.587  0.500     0.500

systmmnr_rg -0.587  0.500     0.500     0.500

systmfmyd_r -0.587  0.500     0.500     0.500      0.500

My first idea was then to separate the whole dataset into two datasets (AB and CDEF) with each one having normally distributed data and checking for significant differences between the system, at first separately and then together.
My lmer model for the group CDEF was:
m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row) + (1|year:column))
I tried to add an additional random effect accounting for the interaction between row and column +(1|row:column)
but got an error message: boundary (singular) fit: see ?isSingular
The Model for the Group AB was:
m2 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row))
since only the rows where single replicates. I checked with the emmeans package if there are significant differences between the groups and found ones between F, with a higher yield, and CDE with lower yield. No differences were found between system A and B. After that I didn’t know how to continue and compare the two groups.

My second idea was to add a grouping variable taking account for the system type and creating a model that could compare the whole experiment at once.
The lmer model I came up with was:
m3 <- lmer(yield ~ type + system + (1|year) + (1|year:system) + (1|year:type) + (1|year:row))
again I ran into some Problems, I didn’t know how to properly nest my fixed effects, since they clearly are nested and how to take account for the columns.

As mentioned from Russ Lenth in the comments it does not make sense to split the Population since it’s an effect from the treatment

My Questions therefore are:

~~Should I divide my dataset and analyze the two system types (AB and CDEF) separately, if so how do I include columns in the AB model and what possibility do I have to compare AB and CDEF afterwards ?~~
Or should I make one model to rule them all and create a new grouping variable for system type and nest them properly and ignore the random effect for column ?
Or do you have any other Idea how this design could be handled ?

New Models

> m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row), data = data)
> summary(m1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ system + (1 | year) + (1 | year:system) + (1 | year:row)
   Data: data
REML criterion at convergence: 1262.4
Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2609 -0.4988  0.0592  0.5590  2.3885
Random effects:
 Groups      Name        Variance Std.Dev.
 year:system (Intercept) 43.868   6.623

 year:row    (Intercept)  2.276   1.509

 year        (Intercept) 22.305   4.723

 Residual                26.442   5.142

Number of obs: 192, groups:  year:system, 48; year:row, 32; year, 8
Fixed effects:
                  Estimate Std. Error      df t value Pr(>|t|)

(Intercept)         62.981      3.028  28.260  20.799  < 2e-16 ***
systemcc_pest       46.566      3.552  35.000  13.108  4.6e-15 ***
systemcc_org        -9.744      3.552  35.000  -2.743  0.00954 ** 
systemmanure_pest   47.147      3.552  35.000  13.272  3.2e-15 ***
systemmanure_org    -8.369      3.552  35.000  -2.356  0.02421 *

systemfmyd_org     -10.722      3.552  35.000  -3.018  0.00472 **

Signif. codes:  0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
            (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r
systmcc_pst -0.587

systemcc_rg -0.587  0.500

systmmnr_ps -0.587  0.500     0.500

systmmnr_rg -0.587  0.500     0.500     0.500

systmfmyd_r -0.587  0.500     0.500     0.500      0.500
> m2 <- lmer(yield ~ system + (1|year) + (1|year:row) +  (1|year:column), data = data)
> summary(m2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ system + (1 | year) + (1 | year:row) + (1 | year:column)
   Data: data
REML criterion at convergence: 1302.3
Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0617 -0.5748  0.1023  0.5824  2.7636
Random effects:
 Groups      Name        Variance Std.Dev.
 year:column (Intercept) 27.2467  5.2198

 year:row    (Intercept)  0.2432  0.4932

 year        (Intercept) 25.0757  5.0076

 Residual                38.6421  6.2163

Number of obs: 192, groups:  year:column, 48; year:row, 32; year, 8
Fixed effects:
                  Estimate Std. Error      df t value Pr(>|t|)

(Intercept)         62.981      2.281  12.319  27.616 1.87e-12 ***
systemcc_pest       46.566      2.229  75.612  20.889  < 2e-16 ***
systemcc_org        -9.744      1.554 116.002  -6.270 6.39e-09 ***
systemmanure_pest   47.147      2.229  75.612  21.149  < 2e-16 ***
systemmanure_org    -8.369      1.554 116.002  -5.385 3.84e-07 ***
systemfmyd_org     -10.722      1.554 116.002  -6.899 2.93e-10 ***

Signif. codes:  0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
            (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r
systmcc_pst -0.405

systemcc_rg -0.341  0.349

systmmnr_ps -0.405  0.757     0.349

systmmnr_rg -0.341  0.349     0.500     0.349

systmfmyd_r -0.341  0.349     0.500     0.349      0.500

A bimodal response is not a problem in itself. The distribution of the response is not important. I don't really understand the experimental design. Can you explain it in more detail ? What is your research question ? What is the total size of the dataset ? Please can you inlcude the output str(data), summary(data) and head(data, 20). Your model code implies that system is nested within year - so that any particular system belongs to one and only one year. That doesn't seem clear from the description. — Robert Long, Oct 08 '20 at 11:44
It's the distribution of the ERRORS that matters! It appears that you have a bimodal distribution due to treatment effects. If you subdivide the data based on that, it is throwing out the baby with the bathwater. — Russ Lenth, Oct 08 '20 at 12:43
Thanks Robert for your comments on providing additional information, I did like you recommended and added further data insights to my post. My research question is if there are significant differences in the systems mean yield over the whole time. The experimental design is kind of a non-resolvable row-column design at least the CDEF part, system should not be nested within year, I thought that this term takes account for random interaction effects between system and year. — Thomas Baumgartner, Oct 08 '20 at 16:18
Also Thank you Russ, as mentioned in my Post it is now clear to me why it would be a bad Idea to split the dataset, my problem still is how to take account for the huge differences between the two types. — Thomas Baumgartner, Oct 08 '20 at 16:18
thanks for the edits. Please you also include the output of summary(mymodel) for the model you posted the QQ plot of residuals for. It's not clear to me why you need so many random effects. system is not nested in year according to the data you showed above, so I don't know why you want (1|year) + (1|year:system) — Robert Long, Oct 09 '20 at 09:08
Thanks again @RobertLong for your Answer, I included the model summery output. I thougt since I have multiple years that I have to take account for a possible interaction of system and year which I wanted to do with the term (1|year:system). Leafing that term out of my model led to a singular fit warning in lmer, I tried the same now with lme without the warning and am now unsure whats the right model — Thomas Baumgartner, Oct 12 '20 at 08:38
Note that the variance of the year:column term is effectively zero compared to the other variance components, so I would remove that - you could make the same argument for year:row although it's not so obvious in that case. — Robert Long, Oct 12 '20 at 08:44
Thanks @RobertLong I actually already tryed removing it from my new model, it made a lot sense to do so since columns aren't complete replicas of all 6 systems due to the experimental design, but when removing the columns from the model I ran into singular fit Problems, I added the new model summary to the Question. The rows however are complete ones therefore I would stick to them. The Problem is, without the interaction term system I got Problems with the variances of the columns due to the design, but if I stick to the interaction my Post Hoc test wouldn't detect differences between C DEF — Thomas Baumgartner, Oct 12 '20 at 09:40

score 3 · Accepted Answer · answered Oct 13 '20 at 09:40

I try to sum up what I‘ve learned from the comments to close the question:

Linear mixed effect models do not necessarily need normally distributed data; here is a link to another Post dealing with the same question
Not the data itself but the residuals of the model should be normally distributed
One of the most important things to look at while working with lme models, is to find the right model syntax representing your experiment correctly, resources which helped me finding that are the following ones:
- A Hitchhiker's Guide to Mixed Models for Randomized Experiments by Piepho et al.
- Categorical random effects with lme4 by lionel
- This Post from amoeba R's lmer cheat sheet

How do I analyze bimodal distibuted data with a linear mixed model

1 Answers1