Differences between factor effects model and dummy coding model

Question

I have a data set and noticed that fitting a factor effects model instead of a dummy coding model gives different results. In my model, tbStatus and treatment each have two levels, and weight gain is a continuous response variable.

These are the results for the factor effects model:

Model 1:
lm(formula = weightGain ~ tbStatus + treatment + treatment *
tbStatus, data = tb_data, contrasts = list(tbStatus = contr.sum,
treatment = contr.sum))
Coefficients:
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)           1.375      0.335        4.11      0.00031
tbStatus1            -1.656      0.335       -4.95      3.2e-05
treatment1            0.156      0.335        0.47      0.644
tbStatus1:treatment1  0.812      0.335        2.43      0.022

These are the results for a dummy coding version of the same model:

Model 2:
lm(formula = weightGain ~ tbStatus + treatment + treatment *
tbStatus, data = tb_data)
Coefficients:
                      Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)           0.687      0.669        1.03      0.313
tbStatus1             1.687      0.946        1.78      0.085
treatment1           -1.937      0.946       -2.05      0.050
tbStatus1:treatment1  3.250      1.338        2.43      0.022

In my first model, the treatment is clearly not significant, and in the second model, suddenly it is borderline significant at an alpha of 0.05. Why is this? Why did I get different results for these two models?

I've also noticed that the results of an ANOVA for each model are identical, but the conclusions made for these variables are different and I don't understand why.

dipetkov · Answer 1 · 2022-03-14T08:48:03.163

The data is the same, the formula is the same, the model fitting function is the same -> it's the same fitted model. But the intercept and the coefficients represent different quantities, even though they are given the same name in the summary.
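The "same fitted model" claim can be checked directly. The sketch below uses simulated data, not the asker's: the variable names mirror the question, but the cell means, sample size, and seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical balanced 2 x 2 data (8 observations per group).
d <- expand.grid(tbStatus = factor(c(0, 1)),
                 treatment = factor(c(0, 1)),
                 rep = 1:8)
d$weightGain <- rnorm(nrow(d),
                      mean = 1 + (d$tbStatus == "1") * (d$treatment == "1"))

# Same formula and data, two different codings of the factors.
fit_sum <- lm(weightGain ~ tbStatus * treatment, data = d,
              contrasts = list(tbStatus = contr.sum, treatment = contr.sum))
fit_trt <- lm(weightGain ~ tbStatus * treatment, data = d)

# Identical fitted values and residual standard error: one model,
# two parameterisations.
all.equal(fitted(fit_sum), fitted(fit_trt))
all.equal(sigma(fit_sum), sigma(fit_trt))

# The sequential ANOVA table does not depend on the coding either.
all.equal(anova(fit_sum)$`Sum Sq`, anova(fit_trt)$`Sum Sq`)

# Only the coefficients differ, even though the row names are the same.
cbind(sum = coef(fit_sum), treatment = coef(fit_trt))
```

This is consistent with the observation in the question that the ANOVA tables are identical while the coefficient tables are not.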

In the first table, you use the "sum to zero" contrast for both treatment and status. This means that the intercept is the unweighted average of the four group means, that is, the expected weight gain (of a person?) with "average status" and "average treatment". I'm not sure what status (hopefully, not gender!) and treatment stand for, but "average status" and "average treatment" likely don't correspond to anything real.

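In the second table, you use the default "treatment contrast". This means that the intercept represents the average weight gain of a person with status 0 given treatment 0. Here treatment1 is the difference in expected weight gain between treatment 1 and treatment 0 for a person with status 0. (The expected weight gain of a person with status 1 given treatment 1 is (Intercept) + tbStatus1 + treatment1 + tbStatus1:treatment1.) So the two treatment1 rows test different hypotheses: under the sum contrast, the treatment effect averaged over both statuses; under the treatment contrast, the treatment effect at status 0 only. With an interaction in the model these are different quantities, which is why their p-values differ.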
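As a rough check, the four group means implied by each table can be rebuilt from the printed coefficients. The sketch assumes both factors have levels "0" and "1" with "0" listed first (the question does not say so explicitly); the short coefficient names are made up for this illustration.

```r
# Coefficients copied from the two summaries in the question (as printed).
b_sum <- c(int = 1.375, status = -1.656, trt = 0.156, inter = 0.812)  # Model 1
b_trt <- c(int = 0.687, status = 1.687, trt = -1.937, inter = 3.250)  # Model 2

# The four groups; the level labels 0/1 are an assumption.
cells <- expand.grid(tbStatus = c(0, 1), treatment = c(0, 1))

# contr.sum codes the first level as +1 and the second as -1.
s_status <- ifelse(cells$tbStatus == 0, 1, -1)
s_trt <- ifelse(cells$treatment == 0, 1, -1)
cells$mean_sum <- b_sum[["int"]] + b_sum[["status"]] * s_status +
  b_sum[["trt"]] * s_trt + b_sum[["inter"]] * s_status * s_trt

# contr.treatment codes the reference level as 0 and the other level as 1.
cells$mean_trt <- b_trt[["int"]] + b_trt[["status"]] * cells$tbStatus +
  b_trt[["trt"]] * cells$treatment +
  b_trt[["inter"]] * cells$tbStatus * cells$treatment

cells  # the two columns of group means agree up to rounding

# Model 1's treatment1: treatment 0 minus the grand mean, averaged over status.
main_effect <- mean(cells$mean_trt[cells$treatment == 0]) - mean(cells$mean_trt)

# Model 2's treatment1: treatment 1 minus treatment 0, within status 0 only.
simple_effect <- cells$mean_trt[cells$tbStatus == 0 & cells$treatment == 1] -
  cells$mean_trt[cells$tbStatus == 0 & cells$treatment == 0]

c(main_effect = main_effect, simple_effect = simple_effect)
```

Both tables describe the same four group means; the small main effect and the large simple effect are two different summaries of them.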

The difference between the sum contrast (contr.sum) and the treatment contrast (the default) is easier to illustrate with a single categorical predictor that takes one of two values.

n <- 1000
# Make the groups imbalanced to illustrate that, with the sum contrast,
# the intercept is the average of the treatment effects, not the sample average.
treatment <- c(rep("0", 3 * n / 10), rep("1", 7 * n / 10))
effect0 <- 0
effect1 <- 1
# Make the error variance small so that the statistics are close to their expectations.
y <- ifelse(treatment == "0", effect0, effect1) + rnorm(n, sd = 0.3)
# The sample average is different from the average treatment effect
mean(y)
#> [1] 0.6920208
# With contr.sum coding, the intercept is the average of the treatment effects
(effect0 + effect1) / 2
#> [1] 0.5
# and treatment1 is the difference between treatment "0" and the average
# Yep, it's that confusing: contr.sum orders the levels from 1st to (k-1)st
# and the default coefficient names are not helpful.
(effect0) - (effect0 + effect1) / 2
#> [1] -0.5
broom::tidy(lm(y ~ treatment, contrasts = list(treatment = contr.sum)))
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    0.495    0.0103      48.0 1.95e-261
#> 2 treatment1    -0.492    0.0103     -47.6 4.80e-259
# With contr.treatment coding, the intercept is the effect of treatment "0"
effect0
#> [1] 0
# and treatment1 is the difference between treatment "1" and treatment "0"
# Since there is a single binary predictor, treatment1 under contr.treatment
# is -2 times treatment1 under contr.sum. Its standard error doubles as well,
# so the t statistic (up to sign) and the p-value are the same.
(effect1 - effect0)
#> [1] 1
broom::tidy(lm(y ~ treatment))
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)  0.00391    0.0173     0.226 8.21e-  1
#> 2 treatment1   0.983      0.0207    47.6   4.80e-259
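The rescaling between the two codings can be read off the contrast matrices. The sketch below uses base R only; the data are simulated with an arbitrary seed, and the names `g`, `coef_trt` and `coef_sum` are made up for this illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Coding of a two-level factor (rows = levels, column = the regressor).
contr.treatment(2)  # first level -> 0, second level -> 1
contr.sum(2)        # first level -> +1, second level -> -1

# Hypothetical imbalanced two-group data.
g <- factor(rep(c("0", "1"), times = c(30, 70)))
y <- ifelse(g == "0", 0, 1) + rnorm(length(g), sd = 0.3)

coef_trt <- summary(lm(y ~ g))$coefficients
coef_sum <- summary(lm(y ~ g, contrasts = list(g = contr.sum)))$coefficients

# The sum-coded regressor is 1 - 2 * (treatment-coded regressor), so the
# slope is rescaled by -2 and its standard error by 2.
coef_trt["g1", "Estimate"] / coef_sum["g1", "Estimate"]      # -2, up to rounding
coef_trt["g1", "Std. Error"] / coef_sum["g1", "Std. Error"]  # 2, up to rounding

# The t statistics therefore differ only in sign.
c(treatment = coef_trt["g1", "t value"], sum = coef_sum["g1", "t value"])
```

With a single binary predictor the choice of coding changes the scale and sign of the slope but not its test, which matches the identical p-values in the output above. In the question's model the conclusions differ because of the interaction term, not because of the rescaling.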

The question about sum vs. treatment contrast coding has been asked many times, e.g. "treatment and sum contrasts, inconsistent results".

Nicely explained illustration of the underlying issue. Good idea to include the link to a similar question. (+1) — EdM, Mar 14 '22 at 13:40
