6

The experiment I am working on has the following design:

A B C D E F
B A D E F C
A B E F C D
B A F C D E

  • Each Letter represents a different level of the single factor called “system” analyzed in this experiment. The dataset contains eight years and the dependent variable we are analyzing is yield.
    A and B can be grouped together, as well as C to F according to their system type. I am aware of the missing randomization between groups AB and CDEF, which was necessary due to regulations, as well of the missing randomization within these two Groups, which has simply not been made, sadly.
  • I am investigating if there are sigificant differnces in yield between the systes (A-F)

My data looks like this:

> str(data)
'data.frame':   192 obs. of  6 variables:
 $ year  : Factor w/ 8 levels "2012","2013",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ type  : Factor w/ 2 levels "org","pest": 1 1 1 1 1 1 1 1 1 1 ...
 $ system: Factor w/ 6 levels "dgst_org","cc_pest",..: 3 3 3 3 5 5 5 5 6 6 ...
 $ row   : Factor w/ 4 levels "row_1","row_2",..: 1 2 3 4 2 3 4 1 3 4 ...
 $ column: Factor w/ 6 levels "column_1","column_2",..: 6 5 4 3 6 5 4 3 6 5 ...
 $ yield : num  26.2 41.4 43.4 45 40.8 52.3 47.1 47.2 40.1 42.4 ...

> summary(data) year type system row column yield
2012 :24 org :128 dgst_org :32 row_1:48 column_1:32 Min. : 26.20
2013 :24 pest: 64 cc_pest :32 row_2:48 column_2:32 1st Qu.: 52.30
2014 :24 cc_org :32 row_3:48 column_3:32 Median : 62.95
2015 :24 manure_pest:32 row_4:48 column_4:32 Mean : 73.79
2016 :24 manure_org :32 column_5:32 3rd Qu.:103.83
2017 :24 fmyd_org :32 column_6:32 Max. :127.10

> head(data,20) year type system row column yield 377 2012 org cc_org row_1 column_6 26.2 378 2012 org cc_org row_2 column_5 41.4 379 2012 org cc_org row_3 column_4 43.4 380 2012 org cc_org row_4 column_3 45.0 417 2012 org manure_org row_2 column_6 40.8 418 2012 org manure_org row_3 column_5 52.3 419 2012 org manure_org row_4 column_4 47.1 420 2012 org manure_org row_1 column_3 47.2 461 2012 org fmyd_org row_3 column_6 40.1 462 2012 org fmyd_org row_4 column_5 42.4 463 2012 org fmyd_org row_1 column_4 39.5 464 2012 org fmyd_org row_2 column_3 35.7 505 2012 org dgst_org row_4 column_6 57.8 506 2012 org dgst_org row_1 column_5 48.8 507 2012 org dgst_org row_2 column_4 52.3 508 2012 org dgst_org row_3 column_3 64.1 537 2013 org cc_org row_1 column_6 41.2 538 2013 org cc_org row_2 column_5 43.3 539 2013 org cc_org row_3 column_4 57.2 540 2013 org cc_org row_4 column_3 51.1

I tried to come up with a proper linear mixed effect model but ran into some Problems because of the poor experiment design.

The yield showed a bimodal distribution, which was as expected an effect of the system type. Histogramm

I know understand that this is no Problem as long as the residuals of the model are normally distributed, whitch they are

> m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:column) + (1|year:row), data = data)
> summary(m1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ system + (1 | year) + (1 | year:system) + (1 | year:column) +      (1 | year:row)
   Data: data

REML criterion at convergence: 1262.4

Scaled residuals: Min 1Q Median 3Q Max -3.2604 -0.4993 0.0596 0.5585 2.3880

Random effects: Groups Name Variance Std.Dev. year:column (Intercept) 0.01384 0.1176
year:system (Intercept) 43.85302 6.6222
year:row (Intercept) 2.27887 1.5096
year (Intercept) 22.30702 4.7230
Residual 26.42919 5.1409
Number of obs: 192, groups: year:column, 48; year:system, 48; year:row, 32; year, 8

Fixed effects: Estimate Std. Error df t value Pr(>|t|)
(Intercept) 62.981 3.028 27.986 20.801 < 2e-16 *** systemcc_pest 46.566 3.552 34.309 13.110 6.42e-15 *** systemcc_org -9.744 3.552 33.574 -2.743 0.00969 ** systemmanure_pest 47.147 3.552 34.309 13.274 4.49e-15 *** systemmanure_org -8.369 3.552 33.574 -2.356 0.02444 *
systemfmyd_org -10.722 3.552 33.574 -3.019 0.00482 **


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects: (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r systmcc_pst -0.587
systemcc_rg -0.587 0.500
systmmnr_ps -0.587 0.500 0.500
systmmnr_rg -0.587 0.500 0.500 0.500
systmfmyd_r -0.587 0.500 0.500 0.500 0.500

Q-Q Plot

  1. My first idea was then to separate the whole dataset into two datasets (AB and CDEF) with each one having normally distributed data and checking for significant differences between the system, at first separately and then together.
    My lmer model for the group CDEF was:
    m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row) + (1|year:column))
    I tried to add an additional random effect accounting for the interaction between row and column +(1|row:column)
    but got an error message: boundary (singular) fit: see ?isSingular
    The Model for the Group AB was:
    m2 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row))
    since only the rows where single replicates. I checked with the emmeans package if there are significant differences between the groups and found ones between F, with a higher yield, and CDE with lower yield. No differences were found between system A and B. After that I didn’t know how to continue and compare the two groups.
  1. My second idea was to add a grouping variable taking account for the system type and creating a model that could compare the whole experiment at once.
    The lmer model I came up with was:
    m3 <- lmer(yield ~ type + system + (1|year) + (1|year:system) + (1|year:type) + (1|year:row))
    again I ran into some Problems, I didn’t know how to properly nest my fixed effects, since they clearly are nested and how to take account for the columns.

As mentioned from Russ Lenth in the comments it does not make sense to split the Population since it’s an effect from the treatment

My Questions therefore are:

  • Should I divide my dataset and analyze the two system types (AB and CDEF) separately, if so how do I include columns in the AB model and what possibility do I have to compare AB and CDEF afterwards ?

  • Or should I make one model to rule them all and create a new grouping variable for system type and nest them properly and ignore the random effect for column ?

  • Or do you have any other Idea how this design could be handled ?

New Models

> m1 <- lmer(yield ~ system + (1|year) + (1|year:system) + (1|year:row), data = data)
> summary(m1)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ system + (1 | year) + (1 | year:system) + (1 | year:row)
   Data: data

REML criterion at convergence: 1262.4

Scaled residuals: Min 1Q Median 3Q Max -3.2609 -0.4988 0.0592 0.5590 2.3885

Random effects: Groups Name Variance Std.Dev. year:system (Intercept) 43.868 6.623
year:row (Intercept) 2.276 1.509
year (Intercept) 22.305 4.723
Residual 26.442 5.142
Number of obs: 192, groups: year:system, 48; year:row, 32; year, 8

Fixed effects: Estimate Std. Error df t value Pr(>|t|)
(Intercept) 62.981 3.028 28.260 20.799 < 2e-16 *** systemcc_pest 46.566 3.552 35.000 13.108 4.6e-15 *** systemcc_org -9.744 3.552 35.000 -2.743 0.00954 ** systemmanure_pest 47.147 3.552 35.000 13.272 3.2e-15 *** systemmanure_org -8.369 3.552 35.000 -2.356 0.02421 *
systemfmyd_org -10.722 3.552 35.000 -3.018 0.00472 **


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects: (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r systmcc_pst -0.587
systemcc_rg -0.587 0.500
systmmnr_ps -0.587 0.500 0.500
systmmnr_rg -0.587 0.500 0.500 0.500
systmfmyd_r -0.587 0.500 0.500 0.500 0.500

> m2 <- lmer(yield ~ system + (1|year) + (1|year:row) + (1|year:column), data = data) > summary(m2) Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest'] Formula: yield ~ system + (1 | year) + (1 | year:row) + (1 | year:column) Data: data

REML criterion at convergence: 1302.3

Scaled residuals: Min 1Q Median 3Q Max -3.0617 -0.5748 0.1023 0.5824 2.7636

Random effects: Groups Name Variance Std.Dev. year:column (Intercept) 27.2467 5.2198
year:row (Intercept) 0.2432 0.4932
year (Intercept) 25.0757 5.0076
Residual 38.6421 6.2163
Number of obs: 192, groups: year:column, 48; year:row, 32; year, 8

Fixed effects: Estimate Std. Error df t value Pr(>|t|)
(Intercept) 62.981 2.281 12.319 27.616 1.87e-12 *** systemcc_pest 46.566 2.229 75.612 20.889 < 2e-16 *** systemcc_org -9.744 1.554 116.002 -6.270 6.39e-09 *** systemmanure_pest 47.147 2.229 75.612 21.149 < 2e-16 *** systemmanure_org -8.369 1.554 116.002 -5.385 3.84e-07 *** systemfmyd_org -10.722 1.554 116.002 -6.899 2.93e-10 ***


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects: (Intr) systmcc_p systmcc_r systmmnr_p systmmnr_r systmcc_pst -0.405
systemcc_rg -0.341 0.349
systmmnr_ps -0.405 0.757 0.349
systmmnr_rg -0.341 0.349 0.500 0.349
systmfmyd_r -0.341 0.349 0.500 0.349 0.500

  • 4
    A bimodal response is not a problem in itself. The distribution of the response is not important. I don't really understand the experimental design. Can you explain it in more detail ? What is your research question ? What is the total size of the dataset ? Please can you inlcude the output str(data), summary(data) and head(data, 20). Your model code implies that system is nested within year - so that any particular system belongs to one and only one year. That doesn't seem clear from the description. – Robert Long Oct 08 '20 at 11:44
  • 3
    It's the distribution of the ERRORS that matters! It appears that you have a bimodal distribution due to treatment effects. If you subdivide the data based on that, it is throwing out the baby with the bathwater. – Russ Lenth Oct 08 '20 at 12:43
  • 1
    Thanks Robert for your comments on providing additional information, I did like you recommended and added further data insights to my post. My research question is if there are significant differences in the systems mean yield over the whole time. The experimental design is kind of a non-resolvable row-column design at least the CDEF part, system should not be nested within year, I thought that this term takes account for random interaction effects between system and year. – Thomas Baumgartner Oct 08 '20 at 16:18
  • Also Thank you Russ, as mentioned in my Post it is now clear to me why it would be a bad Idea to split the dataset, my problem still is how to take account for the huge differences between the two types. – Thomas Baumgartner Oct 08 '20 at 16:18
  • 2
    thanks for the edits. Please you also include the output of summary(mymodel) for the model you posted the QQ plot of residuals for. It's not clear to me why you need so many random effects. system is not nested in year according to the data you showed above, so I don't know why you want (1|year) + (1|year:system) – Robert Long Oct 09 '20 at 09:08
  • Thanks again @RobertLong for your Answer, I included the model summery output. I thougt since I have multiple years that I have to take account for a possible interaction of system and year which I wanted to do with the term (1|year:system). Leafing that term out of my model led to a singular fit warning in lmer, I tried the same now with lme without the warning and am now unsure whats the right model – Thomas Baumgartner Oct 12 '20 at 08:38
  • 2
    Note that the variance of the year:column term is effectively zero compared to the other variance components, so I would remove that - you could make the same argument for year:row although it's not so obvious in that case. – Robert Long Oct 12 '20 at 08:44
  • Thanks @RobertLong I actually already tryed removing it from my new model, it made a lot sense to do so since columns aren't complete replicas of all 6 systems due to the experimental design, but when removing the columns from the model I ran into singular fit Problems, I added the new model summary to the Question. The rows however are complete ones therefore I would stick to them. The Problem is, without the interaction term system I got Problems with the variances of the columns due to the design, but if I stick to the interaction my Post Hoc test wouldn't detect differences between C DEF – Thomas Baumgartner Oct 12 '20 at 09:40

1 Answers1

3

I try to sum up what I‘ve learned from the comments to close the question:

  1. Linear mixed effect models do not necessarily need normally distributed data; here is a link to another Post dealing with the same question
  2. Not the data itself but the residuals of the model should be normally distributed
  3. One of the most important things to look at while working with lme models, is to find the right model syntax representing your experiment correctly, resources which helped me finding that are the following ones: