
I've seen some similar questions (ex, ex2), but hopefully this is not a duplicate. As mentioned in one of them, I'm using emmeans to do pairwise comparisons after fitting my linear mixed-effects model. Question: WHAT is the most appropriate effect-size estimate for these comparisons? Eta? Partial eta?

  • My model + my post-hoc:
library(lme4)
library(emmeans)
library(dplyr)   # for %>%

mod1 <- lmer(MY_CONT ~ YEAR * MY_GROUP + (1|ID), data = data, REML = FALSE)

group <- emmeans(mod1, ~ MY_GROUP | YEAR) %>% pairs(adjust = "Tukey")
year  <- emmeans(mod1, ~ YEAR | MY_GROUP) %>% pairs(adjust = "Tukey")

  • I've also seen the eff_size() option (such as here) from the same package, but I couldn't tell from its documentation which estimate it actually computes. I need some help understanding which estimate would be best for me and how to compute it in R. Thanks in advance!

  • Tips on how to report these results would be much appreciated too :)

  • Is eff_size() equivalent to Cohen's d?

  • EDIT for bounty:

Russ Lenth kindly answered my issue, but I still have some remaining questions:

  • A) What type of Cohen's d am I getting? (see)

  • B) Am I estimating it right?

the model is:

mod1 <- lmer(CONT_Y ~ MY_GROUP * YEAR + (1|ID), data = dfModels)

estimate the EMMs with emmeans:

group <- emmeans(mod1, ~ MY_GROUP | YEAR)
year  <- emmeans(mod1, ~ YEAR | MY_GROUP)

pairwise comparisons:

group_p <- emmeans(mod1, ~ MY_GROUP | YEAR) %>% pairs(adjust = "Tukey")
year_p  <- emmeans(mod1, ~ YEAR | MY_GROUP) %>% pairs(adjust = "Tukey")

correcting sigma and edf:

sigmaValues <- VarCorr(mod1)
sigmaValues

## total SD: square root of the sum of the squared Std.Dev values reported by VarCorr(mod1)
sigma <- sqrt((0.25743)^2 + (0.15054)^2)

calculate Cohen's d:

## general form (emm1 = an emmGrid object):
eff1 <- eff_size(emm1, sigma = sigma(mod1), edf = df.residual(mod1))

### before: check the lowest df
group_p

## and for group
eff1 <- eff_size(group, sigma = sigma, edf = 60)

### check the lowest df
year_p

## and for year
eff2 <- eff_size(year, sigma = sigma, edf = 60)

  • C) How can I adapt this for lmer models?
simp <- lm(CONT_Y ~ MY_GROUP * YEAR, data = dfModels)
## (this snippet mixes in fiber.lm from the eff_size() documentation example)
eff_size(pairs(emm), sigma(fiber.lm), df.residual(fiber.lm), method = "identity")
  • NOTE: I guess it goes without saying, but I DON'T have a background in math, so please bear with me :)

  • data:

data <- structure(list(PARTICIPANTS = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                                        3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 
                                        7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 
                                        10L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 
                                        14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 
                                        17L, 17L, 17L, 18L, 18L, 18L, 18L, 19L, 19L, 19L, 19L, 20L, 20L, 
                                        20L, 20L, 21L, 21L, 21L, 21L), CONT_Y = c(19.44, 20.07, 19.21, 
                                                                                  16.35, 11.37, 12.82, 19.42, 18.94, 19.59, 20.01, 19.7, 17.92, 
                                                                                  18.78, 19.21, 19.27, 18.46, 19.52, 20.02, 16.19, 19.97, 13.83, 
                                                                                  15.93, 14.79, 21.55, 18.8, 19.42, 19.27, 19.37, 17.14, 14.45, 
                                                                                  17.63, 20.01, 20.28, 17.93, 19.36, 20.15, 16.06, 17.04, 19.16, 
                                                                                  20.1, 16.44, 18.39, 18.01, 19.05, 18.04, 19.69, 19.61, 16.88, 
                                                                                  19.02, 20.42, 18.27, 18.43, 18.08, 17.1, 19.98, 19.43, 19.71, 
                                                                                  19.93, 20.11, 18.41, 20.31, 20.1, 20.38, 20.29, 13.6, 18.92, 
                                                                                  19.05, 19.13, 17.75, 19.15, 20.19, 18.3, 19.43, 19.8, 19.83, 
                                                                                  19.53, 16.14, 21.14, 17.37, 18.73, 16.51, 17.51, 17.06, 19.42
                                        ), CATEGORIES = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
                                                                    1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("A", 
                                                                                                                            "B"), class = "factor"), MY_GROUP = structure(c(1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
                                                                                                                                                                            1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L
                                                                                                                            ), .Label = c("G1", "G2"), class = "factor")), row.names = c(NA, 
                                                                                                                                                                                         -84L), class = c("tbl_df", "tbl", "data.frame"))

rename column:

data <- data %>% rename(., YEAR = CATEGORIES)

  • It doesn't seem that your interaction is significant. Why are you worried about the marginal effects? Also, why do you need cohen's d? I rarely, if ever, see it reported for mixed models. – David B Feb 05 '23 at 15:04
  • for publication purposes, I was required to report the effect size, @David B – Larissa Cury Feb 05 '23 at 15:46
  • @DavidB, long story short: I needed to get pairwise comparisons between the variables (if you check my last week's post, you'll see a post on t-tests as post hocs), then I saw that the pairwise emmeans comparisons would be a better fit than the t-tests, but now I was requested to report effect size... – Larissa Cury Feb 05 '23 at 15:47
  • Regression coefficients are a valid effect size. I don't think there's any consensus on how to create something like a Cohen's d for mixed models. – David B Feb 05 '23 at 19:22
  • Your Y's are language test scores. Presumably, these can be meaningfully interpreted on the raw scale, ie, it's well known what one more point on the test means? Even if effect sizes make sense for mixed models, is it going to be any more interpretable to look at normalized language test scores? – dipetkov Feb 05 '23 at 21:07
  • @DavidB, so I believe that what I'm being requested to do is not to calculate it for the model's betas, but rather for the pairwise comparisons between the estimated means. I just came across this article (it's open access); although the authors report the betas for the pairwise comparisons (instead of the t-tests, which was what I was doing), they also report Cohen's d: https://www.tandfonline.com/doi/full/10.1080/13825585.2021.1991262 – Larissa Cury Feb 06 '23 at 11:18
  • @dipetkov yeah, they can be interpretable in this sense and I don't think I'd need to normalize them in this case (should I?) because there's not a scale conflict between predictors (since I only have year and type of score)? (I usually standardize when I have a continuous predictor, to estimate a better fake 'zero' using the mean.) One way or the other, though, the thing is that I'm being required to calculate the effect sizes for the pairwise comparisons in order to publish it (I'm not saying I agree or not, but that's something I need to do now) – Larissa Cury Feb 06 '23 at 11:21
  • What definition of "effect size" are you working with? Who is compelling you to compute effect sizes? Cohen's d is usu. defined as a standardized mean difference: https://en.wikiversity.org/wiki/Cohen%27s_d. That's why I ask why standardize if you have perfectly interpretable mean differences. – dipetkov Feb 06 '23 at 11:34
  • @dipetkov, hi, so, "What definition of 'effect size' are you working with?" - that's basically why I created this post. I'm confused about what kind eff_size() is returning to me. I know it's Cohen's d from my issue on GitHub, but I don't know which kind of Cohen's d (link above) it is, due to the change in the calculation. I came across this article yesterday; I guess I'm being required to do something similar to calculate Cohen's d: https://www.tandfonline.com/doi/full/10.1080/13825585.2021.1991262 (but they reported the betas; I was reporting as a t-test) – Larissa Cury Feb 06 '23 at 11:48
  • I read the discussion on GitHub and I think the advice is: you can compute something that's called "Cohen's d" but these quantities are not well defined for mixed models. The linked paper seems to sidestep clarity by referring to "Cohen's d", without clarification. I suppose you can do that too. Or you can present your fixed effect comparisons and explain why they are the appropriate summaries for your analysis. – dipetkov Feb 06 '23 at 12:07
  • Agree w/ @dipetkov that, while you can calculate something like Cohen's d with a mixed model, there is debate about the best way. I personally think calling something 'Cohen's d', as in that paper, is a bit disingenuous. You say you are required to report pairwise effect sizes. Who requires it? What is the exact requirement you were given? – David B Feb 06 '23 at 13:46
  • @DavidB, so that's why I got confused. I've seen many different types of Cohen's d (I've linked one source above with 5 different ones). I was afraid to just say 'Cohen's d' for this exact reason; this is why I wanted to know which one I was actually computing (I think it's even different from the paper I linked in the comments). (Who = my supervisors:) 'We have to have effect sizes such as in the literature for comparing pairs.' 'We won't submit without them.' – Larissa Cury Feb 06 '23 at 13:49
  • Ok cool. I don't think there's a satisfactory answer to be had here on CV. If your advisor has something in mind, you should ask them. In particular, you can ask how they recommend pooling the sd across observations. – David B Feb 06 '23 at 15:50
  • Russ Lenth recently posted an answer to a similar question dealing with "effect sizes" in a mixed model here. In part: "in any but the simplest situations, the whole idea of effect size is flaky; people improvise something such as what I suggest, and then nobody really knows what you're talking about. You wind up with numbers, but they answer a question that can't be stated clearly." – EdM Feb 21 '23 at 17:19

1 Answer


You've gotten a lot of good advice in the comments here and in your discussion with Russ Lenth elsewhere. Here's some advice for a simpler way to deal with the effect-size issue and for what might be a more useful model of your data.

WHAT is the most appropriate effect size estimate ...?

I've found "effect size" to be something that sounds like it should be important but ends up being disappointing or even misleading. I find actual differences in magnitudes easier to think about. That might represent my background in biology and biochemistry rather than in social science, where I gather that "effect size" estimates tend to be expected.

In comments, you've found a good deal of skepticism that there is an "appropriate effect size estimate," particularly in a model with random effects. Cohen's $d$ is the ratio between a difference in some type of outcome estimate and a standard deviation estimate. With random effects it's not at all clear what should be included in that standard deviation estimate, as there are both among-participant and residual variances to evaluate.
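To make that ambiguity concrete, here is a minimal sketch (assuming the asker's mod1 has been fitted with lme4; the object names vc, sigma_resid, and sigma_total are just illustrative) of two candidate standard deviations one might put in that denominator:

vc <- as.data.frame(VarCorr(mod1))  # variance components: participant intercept and residual
sigma_resid <- sigma(mod1)          # residual SD only
sigma_total <- sqrt(sum(vc$vcov))   # sqrt(participant variance + residual variance)

Which of these, if either, belongs in a Cohen's $d$ for this model is exactly the point of disagreement.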

Nevertheless, you've been required to report some type of "effect sizes" for your study. Part of research training should involve learning how to deal with demands from supervisors or reviewers in a way that is honest to the data and respectful to the demands (however unrealistic or outdated those demands might be). I think that the Meltzer et al paper that you cited gives you a simple way to meet that demand. They say:

We compute Cohen’s $d$ as: $d = \frac{M_E-M_C}{s}$. For effect sizes of training effects (post – pre) within single groups ..., $M_E$ is the average score of the group post-training, $M_C$ is the average score pre-training, and $s$ is the standard deviation of the pre-training scores pooled across all ... groups. For effect size estimates of differences between groups, $M_E$ is the average difference score (post minus pre) of the experimental group... $M_C$ is the average difference score within the Control group, and $s$ is the standard deviation of difference scores within the Control group.

In your case, A presumably represents pre-training (or its equivalent), B represents post-training, and you have groups G1 and G2. I'd suggest reworking your data into a wide form, both for this task and for later suggestions about your model.

dataWide <- tidyr::pivot_wider(data,names_from=YEAR,values_from=CONT_Y)
## YEAR_B - YEAR_A differences for each PARTICIPANT/GROUP combination
dataWide[,"BAdiff"] <- dataWide$B-dataWide$A
sd(dataWide$A)
## [1] 2.272576
aggregate(BAdiff ~ MY_GROUP, data = dataWide, FUN = function(x) mean(x)/sd(dataWide$A))
##  MY_GROUP    BAdiff
## 1       G1 0.5033097
## 2       G2 0.2382444

What's shown as BAdiff in the output above are Cohen's $d$ values, as defined by Meltzer et al., for the post-pre (B-A) differences divided by the pooled standard deviation (2.27) of pre (A) values. You can report those Cohen's $d$ values and cite Meltzer et al.

For the other comparison, I'm not sure which would be considered the "control group" in your situation (or if that's even a consideration in your study). You would have to make a choice. The post-pre differences and their standard deviations for your groups are:

aggregate(BAdiff ~ MY_GROUP, data = dataWide, FUN = mean)
##   MY_GROUP    BAdiff
## 1       G1 1.1438095
## 2       G2 0.5414286
aggregate(BAdiff ~ MY_GROUP, data = dataWide, FUN = sd)
##   MY_GROUP   BAdiff
## 1       G1 2.317972
## 2       G2 2.803794
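Purely for illustration, if G2 were treated as the "control" group (that choice is yours to make and justify), Meltzer et al.'s between-group $d$ would be the difference in mean change scores divided by the SD of the control group's change scores, roughly 0.21 with these numbers:

## illustrative only: assumes G2 plays the role of the control group
m <- aggregate(BAdiff ~ MY_GROUP, data = dataWide, FUN = mean)
s <- aggregate(BAdiff ~ MY_GROUP, data = dataWide, FUN = sd)
(m$BAdiff[m$MY_GROUP == "G1"] - m$BAdiff[m$MY_GROUP == "G2"]) / s$BAdiff[s$MY_GROUP == "G2"]
## roughly 0.21, given the means and SDs shown above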

Reporting the "effect sizes" this way allows you to met the demands placed on you in a simple way, supported in the literature, that obviates your other questions (at least for this data set and its completely balanced design): "What is the Cohen' D type I'm getting?" "Am I estimating it right?" "How can I adapt this for lmers?" A reader like me might choose to ignore those Cohen's $d$ values and focus instead on the formal statistical analysis.

Suggestion for your model

Your mod1 seems to violate the assumptions about the distribution of residuals in a way that's troubling. Look at plot(mod1) and qqmath(mod1). The first suggests a rise in residuals with higher predicted values, and the second shows some substantial deviations from normality.
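If it helps to reproduce those diagnostics (qqmath() is a lattice function; lme4 supplies the method for fitted models):

library(lattice)
plot(mod1)    # standardized residuals vs. fitted values
qqmath(mod1)  # normal Q-Q plot of the residuals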

The raw data hint at what might be going on. Try this plot:

library(ggplot2)
ggplot(data = data, mapping = aes(x = YEAR, y = CONT_Y,
    group = MY_GROUP, color = MY_GROUP)) + geom_point() + 
    facet_wrap(facets = vars(PARTICIPANTS)) + geom_line()

It looks like there are typically very big changes in scores between the years when the YEAR_A score is very low (about 15 or lower), but not so much when the YEAR_A score is higher. That suggests you need to take the YEAR_A score into account in some way.

You thus might consider a different way to handle repeated measures over time. It can make sense to use the initial (pre-training, A) values as predictors in a model of later (post-training, B) values. That's particularly helpful when there are multiple later time points, but I think it can help you here too. I got a warning when I tried to do that with lmer, but for your type of study a generalized least squares (GLS) model can account for inter-individual correlations appropriately. See Chapter 7 of Frank Harrell's notes on Regression Modeling Strategies. You have to specify a correlation form; corCompSymm is equivalent to the assumption in repeated-measures ANOVA.

gls1 <- nlme::gls(B ~ A * MY_GROUP,       
         correlation = nlme::corCompSymm(form=~1|PARTICIPANTS), data = dataWide)

GLS is just an extension of linear regression (e.g., lm(B ~ A * MY_GROUP)) that takes the within-PARTICIPANT correlations into account. "GLS is equivalent to applying ordinary least squares to a linearly transformed version of the data." Wikipedia
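For intuition only, here is the corresponding ordinary least-squares fit (ols1 is just an illustrative name), which has the same fixed-effect structure but ignores the within-PARTICIPANT correlation that gls1 models explicitly:

ols1 <- lm(B ~ A * MY_GROUP, data = dataWide)
summary(ols1)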

The residuals seem much better behaved than in your mod1 (see plot(gls1) and qqnorm(gls1,abline=c(0,1))). With a continuous predictor A, explore the emtrends() function in emmeans.

emtrends(gls1, pairwise ~ MY_GROUP, var = "A", mode = "df.error", infer = TRUE)$contrasts
#  contrast estimate    SE df lower.CL upper.CL t.ratio p.value
#  G1 - G2     0.478 0.188 36   0.0969    0.859   2.544  0.0154
# 
# Degrees-of-freedom method: df.error 
# Confidence level used: 0.95 

In GROUP_G1 the YEAR_B values (called B in this model) are more positively associated with the values in YEAR_A (called A in this model) than is the case for GROUP_G2. I think that illustrates the difference between your two groups in a more helpful way than your mod1. You still can report the 4 comparisons of original interest, but I don't think that alone does justice to your data.

EdM
  • Hi @EdM, thank you very much. I still have some questions, would you mind answering them? The first is: why shouldn't I calculate the effect sizes based on the emmeans output, given that the effect sizes I need are for the derived pairwise differences of the model? Concerning the variables, – Larissa Cury Feb 12 '23 at 11:28
  • Participants were tested twice (Year A and Year B) in two different tests each year (G1 and G2). Hence, each student has 4 scores: Score 1: Year A - Test G1; Score 2: Year A - Test G2; Score 3: Year B - Test G1; Score 4: Year B - Test G2 (in which G1 and G2 represent different languages of testing, basically), and I need to know 4 things: A) Are G1 and G2 different in Year A? + Are G2 and G1 different in Year B? + Is G1 different between Year A and Year B? and finally, is G2 different between Year A and Year B? (basically, I need an effect of group and of year) – Larissa Cury Feb 12 '23 at 11:30
  • concerning the model suggestion, I really appreciate it, I was also kindly given that suggestion before (https://stats.stackexchange.com/questions/588224/lmer-violating-residuals-normality-assumption-what-should-i-do-when-enough-d) but the thing is: I don't think I know how to explain this estimation other than OLS, so I didn't go with it, given my knowledge limitations at the moment – Larissa Cury Feb 12 '23 at 11:35
  • back to the effect-size estimation, when I did eff1 <- eff_size(group, sigma = sigma, edf = 60), with sigma = sqrt((error SD)^2 + (subject SD)^2), my equation was basically (using group as an example): difference between the G1 and G2 means for Year A (or B, for the 2nd output) / total SD, right? with total SD = sum(model SD + subject SD). Question: but why do I have to square and then take the square root, since I input the SDs and not the variances? Why couldn't I just sum up the Std.Dev results of VarCorr(mod1) for both the model (residuals) and ID? – Larissa Cury Feb 12 '23 at 11:57
  • @LarissaCury You can use the emmeans effect sizes, but as comments note it's not clear how "appropriate" they are for mixed models. The Meltzer et al effect-size calculation is easy to explain and can be applied to all your comparisons. For the sigma value, the variance of a sum of independent variables is the sum of the variances; that doesn't work for standard deviations. That's why you square both the error and subject SDs, sum, then take the square root for the final estimate of sigma. – EdM Feb 12 '23 at 15:57
  • hi @EdM, thanks for the reply! My question is: if I'm reporting the pairwise comparisons between the estimated means, why isn't it wrong to report the effect sizes the way Meltzer et al reported them? I mean, I've reported estimated means, but the effect size is being calculated on raw means, no? – Larissa Cury Feb 12 '23 at 17:35
  • @LarissaCury Meltzer et al. (and I) see "effect sizes" as being something that can (and should, in my opinion) be considered separately from formal statistical analysis (via emmeans in your case). They say: "In addition to statistical significance testing, we estimated effect sizes for all outcome measures using a simple traditional approach..." (Emphasis added.) There is no harm in that separation of "effect size" reporting from statistical testing; there is no inference being made on effect sizes. Your choice. I added more about my suggested model, which is just a type of least squares. – EdM Feb 12 '23 at 17:57
  • thanks once more, @EdM! Could you go over "as being something that can (and should, in my opinion) be considered separately from formal statistical analysis (via emmeans in your case)"? I still don't get it, but I really want to. How can we report the tests on estimated means but the effect size on raw data? It's my first time reporting pairwise comparisons from emmeans (I was doing t-tests before, which I later realised didn't make sense) – Larissa Cury Feb 12 '23 at 18:42
  • @LarissaCury the eff_size help says: "There is substantial disagreement among practitioners on what is the appropriate sigma to use in computing effect sizes; or, indeed, whether any effect-size measure is appropriate for some situations." If you are comfortable with your rationale for sigma (I think the value is wrong in your question; for VarCorr() on your model I get SDs of PARTICIPANTS of 0.68 and Residual of 1.72) then go ahead and use the eff_size() values (which include CI for effect sizes). The Meltzer method is "traditional" and makes no assumptions. – EdM Feb 12 '23 at 20:51
  • hi @EdM, thanks again! So, I want to 'play it safe', so maybe I should go with Meltzer's. However, I couldn't grasp how to calculate the effects for my comparisons from your examples, because I don't have post- and pre-test groups; indeed, people are the same (thus the lmer), like this: – Larissa Cury Feb 13 '23 at 02:46
  • My tests are: Test 1: G1 YEAR A x G1 YEAR B + Test 2: G2 YEAR A x G2 YEAR B + Test 3: G1 YEAR A x G2 YEAR A + Test 4: G1 YEAR B x G2 YEAR B. I shouldn't have used 'groups'; instead, they're types of tests, with G1 being one language and G2 another language; all data come from the same participants, who were tested twice each year. In the big picture, I'm seeing an effect of Year (tests 1 and 2) and of my_group (tests 3 and 4) – Larissa Cury Feb 13 '23 at 02:48
  • PS: funny thing, I just checked and I'm now getting the same values as you from VarCorr(mod1): 0.67511 and 1.71697. I don't know what happened! – Larissa Cury Feb 13 '23 at 03:00
  • @LarissaCury you could discuss this with those who are asking for an "effect size." You have found two ways to report it: emmeans based on the lmer model, but with some necessary assumptions that might be disputed, and "traditional" calculations supported by the Meltzer et al. reference. Now that you have the correct SDs, the differences between the methods won't be large. Which do they prefer? If Meltzer, how would they apply the principles to your data? Or just go with emmeans if you are OK with the sigma choice. – EdM Feb 13 '23 at 09:02