
Currently cross-posted at https://stackoverflow.com/questions/63492814/interaction-not-significant-but-one-simple-effect-significant-linear-mixed-mod because I wasn't sure which site was more appropriate; StackOverflow tends to get more traffic and responses. I'm open to suggestions on where best to post, in the hope of getting useful feedback.


Background: I have fit a linear mixed model using lmer() (lme4 package) in R with two binary categorical predictors as dummy variables. One (Intervention) is within-subjects, while the other (Sex) is between-subjects. The model accounts for two levels of correlation with random effects (data structure and model code are described below). The outcome is a proportion, but the values are very well-behaved: the mean is around 0.5, the range is about 0.2 to 0.9, and they are approximately normally distributed. Accordingly, the residuals indicate that the assumptions (normality, equal variance) are met, so I don't think what I'm observing is due to violating the assumptions of a linear (mixed) model.

Issue: The following is true no matter what random effects structure I use (the candidates are listed below): in every case, the test statistic for the interaction between the two binary categorical predictors is about 1.7 in magnitude, while that for one of the binary predictors is always about 2.8 (the test statistic for the other is ~1.3). Although there is debate about how to accurately calculate p-values for these types of models (and whether or not we even should - I'm aware of this discussion point), it is clear that no matter the degrees of freedom used, the interaction term would not be considered statistically significant (with, say, $\alpha$ = 0.05), while the one predictor would. Note that the estimate for the individual predictor is a simple effect, since it is binary and dummy-coded. I used emmeans() to look at all four possible simple effects, and only one is statistically significant (the one with the test statistic of about 2.8).

I cannot figure out how the interaction could lack significance while one of the four possible simple effects is significant. I could see it if the test statistics/p-values were "borderline," making it a potential issue of power. However, here the ballpark p-value for the interaction term (test stat ~1.7) is about 0.09, while a rough p-value for the simple effect (test stat ~2.8) is about 0.007. It seems problematic to me that they could differ by an order of magnitude, and it makes me concerned that I am inherently modeling the data incorrectly, although if so, I can't see where the error is.

Data structure: Each subject has an observed proportion for each of six different images (out of 12 possible images they could have been randomly assigned): three images were viewed pre-intervention, and three were viewed post-intervention. Thus, there is potential correlation due to both subject and image, so these are treated as random effects. Lastly, Intervention is within-subjects, while Sex is between-subjects.

Here is a small dummy dataset (not the actual data; in the actual data the number of unique subjects is 59, with 29 of one sex and 30 of the other):

structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L), Image = c("B", "A", 
"G", "E", "C", "I", "C", "G", "L", "A", "D", "F", "E", "A", "K", 
"B", "C", "I", "D", "F", "H", "J", "L", "B", "D", "F", "A", "L", 
"C", "E", "J", "K", "F", "B", "A", "D"), Intervention = c("Pre", "Pre", "Pre", "Post", 
"Post", "Post", "Pre", "Pre", "Pre", "Post", "Post", "Post", "Pre", 
"Pre", "Pre", "Post", "Post", "Post", "Pre", "Pre", "Pre", 
"Post", "Post", "Post", "Pre", "Pre", "Pre", "Post", "Post", "Post", 
"Pre", "Pre", "Pre", "Post", "Post", "Post"), Sex = c("Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Male", "Male", "Male", "Male", 
"Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", 
"Male", "Male", "Male", "Male", "Male", "Male"), Prop = c(0.488277, 
0.236734, 0.41036, 0.745403, 0.464705, 0.625076, 0.5602122, 0.590909, 0.333266, 0.365954, 0.374941, 0.662141, 0.64877, 0.434947, 0.721343, 0.5288113, 0.782714, 
0.603777, 0.4480342, 0.629813, 0.347684, 0.41906, 0.553854, 0.639324, 0.389804, 0.49155, 0.355763, 0.695487, 0.537433, 0.650022, 0.54022, 0.58907, 0.666208, 
0.713883, 0.625882, 0.434924)), class = "data.frame", row.names = c(NA, -36L))
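
(In the models below I call factor() inside the formula, so R's default treatment (dummy) contrasts apply, with the alphabetically first level as the reference. For anyone reproducing this with the dummy data, and assuming the dput() output above has been assigned to data01, the equivalent explicit setup would be:)

# Assuming the dput() output above has been assigned to data01.
# With default treatment contrasts the alphabetically first level is the reference,
# so the coefficients below are relative to Sex = "Female" and Intervention = "Post".
data01$Subject      <- factor(data01$Subject)
data01$Image        <- factor(data01$Image)
data01$Sex          <- factor(data01$Sex)           # levels: Female (reference), Male
data01$Intervention <- factor(data01$Intervention)  # levels: Post (reference), Pre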

Candidate models considered, each with a different random-effects structure:

Model 1 (gave a convergence warning). Note the output shown is from my actual data (not the dummy dataset given above):

largest_lmer <- lmer(Prop ~ factor(Sex)*factor(Intervention) +
                            (1 | Image) +
                            (1 + Intervention | Subject), 
                     data = data01)

coef(summary(largest_lmer))

                                            Estimate Std. Error   t value
(Intercept)                               0.51415277 0.03503742 14.674389
factor(Sex)Male                           0.04019813 0.03006458  1.337059
factor(Intervention)Pre                   0.05123982 0.01830275  2.799569
factor(Sex)Male:factor(Intervention)Pre  -0.04238911 0.02509809 -1.688938

install.packages("emmeans")
library(emmeans)

largest_lmer_emm_Int <- emmeans(largest_lmer, ~ factor(Sex) | factor(Intervention))
pairs(largest_lmer_emm_Int)

Intervention = Post:
 contrast      estimate     SE   df t.ratio p.value
 Female - Male -0.04020 0.0301 57.3  -1.336  0.1867

Intervention = Pre:
 contrast      estimate     SE   df t.ratio p.value
 Female - Male  0.00219 0.0307 57.2   0.071  0.9434

Degrees-of-freedom method: kenward-roger

largest_lmer_emm_Sex <- emmeans(largest_lmer, ~ factor(Intervention) | factor(Sex))
pairs(largest_lmer_emm_Sex)

Sex = Female:
 contrast   estimate     SE   df t.ratio p.value
 Post - Pre -0.05124 0.0184 56.5  -2.789  0.0072    <- This is the significant simple effect

Sex = Male:
 contrast   estimate     SE   df t.ratio p.value
 Post - Pre -0.00885 0.0172 55.0  -0.515  0.6084

Degrees-of-freedom method: kenward-roger
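
Regarding the convergence warning on Model 1: one check I know of (just a sketch; I'm not claiming it resolves the warning) is to refit with all optimizers available to lme4 via allFit() and compare the fixed-effect estimates across optimizers:

# Sketch: probe the Model 1 convergence warning by refitting with every available
# optimizer; if the fixed effects barely change, the warning is likely benign.
library(lme4)
all_fits <- allFit(largest_lmer)
summary(all_fits)$fixef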

Model 2: All output similar to that from Model 1 (not repeated here):

medium_lmer <- lmer(Prop ~ factor(Sex)*factor(Intervention) + 
                           (1 | Image) +
                           (1 | Subject) +
                           (1 | Intervention:Subject), 
                    data = data01)

Model 3: All output similar to that from Model 1 (not repeated here):

smallest_lmer <- lmer(Prop ~ factor(Sex)*factor(Intervention) + 
                             (1 | Image) +
                             (1 | Subject), 
                      data = data01)
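
In case it's useful, the three random-effects structures can also be compared directly (a sketch; anova() refits the REML models with ML before computing the likelihood-ratio tests, and tests of variance components on the boundary are conservative):

# Sketch: likelihood-ratio / information-criterion comparison of the three candidates.
anova(smallest_lmer, medium_lmer, largest_lmer)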

As I mentioned, all of these candidate models gave roughly the test statistics noted above - they did not vary depending on the random effects included. Assumptions of the model (normality, equal variance) were met. Is there something else I'm missing? Or is it mathematically possible to have a non-significant interaction but a significant simple effect, when the two differ as much as these do in their test statistics/p-values?
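
(For completeness, the kind of residual checks I mean are along these lines, shown here for Model 1:)

# Sketch of the residual checks referred to above, for Model 1.
plot(largest_lmer)                                         # fitted values vs. residuals (equal variance)
qqnorm(resid(largest_lmer)); qqline(resid(largest_lmer))   # normality of residuals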

Meg
  • As for where to post... I would say it is like looking for lost keys: look near where they should be, not where there is more light or foot traffic. Here you have lots of statisticians albeit less traffic. I suspect statisticians will know better how to handle your question. – kurtosis Aug 20 '20 at 15:58
  • Thanks, @kurtosis. I see a lot of statisticians on StackOverflow too, as the intersection between stats and coding is very blurred. – Meg Aug 20 '20 at 16:13
  • "the intersection between stats and coding is very blurred" Will have to agree to disagree. Guess it depends on how we define "statistician." :-) – kurtosis Aug 20 '20 at 16:30
  • Since even theoretical statistical papers generally require a simulation and/or application to a dataset, they are indeed blurred in my eyes. And in my job, I do plenty of theory and application, each requiring the other, so I cannot separate the two. – Meg Aug 20 '20 at 16:38

2 Answers


I think there are a few potential issues here.

Your results tend to be the same under different random-effects setups. That is not so surprising: Liang and Zeger discuss how approximate random-effects models are often sufficient to get close to the truth and produce useful standard errors. The fixed effects should not change much, if at all, between the three models, since they are specified identically in all three. This is the good part.

The troubling part is that you seem to insist that the interaction should be significant. Do you have some theoretical reason for that belief, or is it just a prior not based on theory? You don't want to be the analyst who tortures the data until it falsely confesses, so it really sounds like you need to be willing to accept that the interaction is insignificant. That should not be surprising: interactions are often less significant than the main effects.

Another possible issue is heteroskedasticity. Proportions tend to be more variable when they are near 0.5 than when they are near 0 or 1. A typical correction is to transform the response to $\tilde{Y} = \sin^{-1}(\sqrt{Y})$ to stabilize the variance. That is a bit of a pain because you need to back-transform your predictions and the model coefficients are less intuitive, but the results will likely be cleaner. Weisberg's Applied Linear Regression, 2nd Ed. discusses this in Chapter 8.
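
A sketch of what that refit could look like, reusing the model and data names from the question (data01, Prop):

# Sketch: Model 1 from the question, refit on the arcsine-square-root scale.
library(lme4)
asin_lmer <- lmer(asin(sqrt(Prop)) ~ factor(Sex) * factor(Intervention) +
                    (1 | Image) +
                    (1 + Intervention | Subject),
                  data = data01)
coef(summary(asin_lmer))
# Back-transform predictions with sin(pred)^2 when reporting on the proportion scale.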

Finally, you ask "is it mathematically possible to have an insignificant interaction, but a significant simple effect that differ as much as these two do with regard to their test statistic/$p$-value?" Absolutely. Suppose we gather school children from Smallville and Littletown, show some of them videos on word roots and guessing at spelling, and then give them all spelling tests. We might see that town is almost significant (say Smallville has better schools), the treatment is very significant, but that the interaction of town and treatment is not at all significant (i.e. both towns' kids learn equally well from the video, so the interaction is immaterial). That would not even be unusual: I probably saw a hundred datasets like that in graduate school.
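
To make that concrete, here is a toy simulation of the spelling-test scenario (made-up effect sizes, plain lm() for simplicity): the treatment effect is real, the town effect is small, and there is no interaction in the data-generating process, so the fitted interaction will typically be nowhere near significant even though treatment clearly is:

# Toy simulation: real treatment (video) effect, small town effect, no interaction.
set.seed(1)
n     <- 50                                                  # children per town x treatment cell
town  <- factor(rep(c("Smallville", "Littletown"), each = 2 * n))
video <- factor(rep(rep(c("Video", "NoVideo"), each = n), times = 2))
score <- 70 + 3.5 * (town == "Smallville") + 8 * (video == "Video") + rnorm(4 * n, sd = 10)
summary(lm(score ~ town * video))
# Typically: the video term is clearly significant, the town term is borderline at best,
# and the town:video interaction is nowhere near significant.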

To summarize: I would be content with your random-effects modeling, transform your response, and be open to your interaction term not being significant. Don't torture the data; those confessions are rarely true. Good luck; hope it goes well!

kurtosis
  • 1/4 Thanks, @kurtosis. Although in theory the random effects hopefully don’t change our results much, this is not always true. In a logistic mixed model of these data, the results are sensitive to which random effects are/are not included. There could be other issues with the model leading to this, but one cannot assume that any old random effects will do. As a matter of fact, there are many conflicting opinions in the literature on how to best choose random effects so as to balance the type I error rate and power. – Meg Aug 20 '20 at 16:09
  • 2/4 I am not insisting the interaction be significant. I originally stumbled upon the significant simple effect in light of the insignificant interaction because I have also considered a similar logistic model (using 0/1 data instead of the proportions), and wanted estimates of the odds ratios between all groups as measures of effect size to report anyway, despite lack of statistical significance. That is, until I saw one of the simple effects was significant. That’s what led to me questioning if/how/when this could happen. – Meg Aug 20 '20 at 16:10
  • 3/4 As I mentioned in my post, I had no evidence of heteroskedasticity in the residuals after I fit the model. For completeness, however, I had already also tried an empirical logit transformation on the proportions, and the results are equivalent. But, as I said, I don’t think a transformation is necessary here because the proportions are already so well-behaved (as in my post). – Meg Aug 20 '20 at 16:10
  • 4/4 I also think you’re misunderstanding main vs. simple effects. I understand that you can readily have a significant main effect without a significant interaction (and, indeed, the Sex main effect is significant here). What I’m asking about are the simple effects: All four combinations of Sex and Intervention (change from male to female when holding intervention at “pre,” e.g.). It makes less (no?) sense to me how one of these simple effects can be significant if the interaction is not. – Meg Aug 20 '20 at 16:10
  • Good to hear that a logit transform performed similarly; and, yes, the random effects sometimes make a difference. One of your simple effects can be significant if the others are incredibly noisy. Also, presumably one of those simple effects is aliased with your baseline, so that is likely an issue with looking at the simple effects. – kurtosis Aug 20 '20 at 16:26
  • Would I expect aliasing in a 2x2 ANOVA (fit as a linear mixed model, so random effects could be incorporated)? – Meg Aug 20 '20 at 16:46
  • Typo two comments above: "and, indeed, the Sex main effect is significant here" should be "and, indeed, the Intervention main effect is significant here". – Meg Aug 20 '20 at 16:58
  • Aliasing is a direct result of identifiability. It happens but how it happens depends on how your factors are coded, not on the size of groups and treatments. So yes, you should expect this in a 2x2 setup. Most discussions of contrasts in $R$ discuss how this happens. – kurtosis Aug 20 '20 at 17:04
  • My data are coded as dummy variables (0/1), which is the natural way to then estimate simple effects, and I have not seen an example where aliasing has caused an issue with doing so. Indeed, I have seen the 2x2 ANOVA case as a straightforward example for the sake of illustrating simple effects, so it's unclear why it would be an issue here. (Note I have only seen aliasing discussed with effects coding (say, -1/1).) I have spent some time now searching various terms, and have not found an indication that aliasing would be at play here. Do you have a source for the 2x2, dummy-coded case? – Meg Aug 20 '20 at 17:24
  • See here: https://bbolker.github.io/stat4c03/notes/contrasts.pdf. Note that aliasing happens in your case; that is why you do not see an effect estimated for each of the simple effects. So Sex0:Intervention0, Sex1:Intervention0, and Sex0:Intervention1 would all not be reported, and the model summary would instead report the intercept, the Sex effect, and the Intervention effect. Only Sex1:Intervention1 would not be aliased and would be reported as Sex1:Intervention1. – kurtosis Aug 20 '20 at 17:48
  • Oh, yes, I know that not all simple effects will be output, but they can subsequently be calculated (with algebra using 0s and 1s, or using something like emmeans). p-values can be obtained by refitting the model with different baselines, or, again, using emmeans. I thought by aliasing you meant some things may never be estimable, but all simple effects in this 2x2 situation should be estimable, just by using algebra/changing baseline/using emmeans. It is unclear to me if/how this is related to the lack of a significant interaction despite a significant simple effect. – Meg Aug 20 '20 at 18:00
  • Ah, no. Everything you have is identifiable. It's just what gets reported. That is how the significance gets affected: how something is coded often implies a basis for comparison. So coding as $\pm$1 would be comparing to 0, while the usual treatment contrast compares to the baseline. I suspect your significant simple effect is (most of, but not all of) what is driving the significant InterventionPre effect. – kurtosis Aug 20 '20 at 18:53
  • Thanks for clarifying everything is identifiable. And yes, I want to compare back to baseline so I can get simple effects here. And I agree - the one significant simple effect is probably driving the significant Intervention main effect. What I can't get over is this: I would have missed this simple effect altogether had I never "stumbled upon" it (again, when estimating ORs when treating the outcome as binary), because an insignificant interaction should generally indicate no need to go on to look for significant simple effects. So this goes back to my question: How is this happening? – Meg Aug 20 '20 at 19:10
  • I'm not sure why you think an insignificant interaction means no significant simple effects. It just means one of your simple effects (Sex1:Intervention1) is insignificant. Does that help? – kurtosis Aug 20 '20 at 19:24
  • 1/2 In theory, no significant interaction tells you that there is no need to differentiate between the four groupings - the main effect for Sex and/or Intervention (here, just Intervention) is "enough.” So, now you collapse (get rid of the interaction), and conclude that pre- and post-interventions differ, but not that – specifically - females post-intervention differ from females pre-intervention. – Meg Aug 20 '20 at 19:35
  • 2/2 You’ve lost this level of detail (i.e., the simple effect). The only way to get this detail is with the interaction in the model. However, when an interaction is not significant, it is generally argued there is no need to leave it in the model, and now your chance to capture that simple effect is gone. Is there something I’m missing? – Meg Aug 20 '20 at 19:42