Multivariable Logistic in R, without the "reference" level in a categorical predictor

Question

I am dealing with the following problem:
I have to do a multivariable logistic in R, using the command glm with the argument family = binomial. The dependent variable is (obv) a binary variable (worker) with a Yes/No outcome and the predictors are gender, student and type of event. gender and student are binary variables, while type of event is categorical with 5 possible outcomes.
When applying the logistic, in the outcome are shown only 4 levels of the variable type of event, using the other remaining level of that variable as a "reference" one. What I would like to have in the outcome is to maintain 1 level for the two binary variables (in my case genderMALE and studentYES), and to have the 5 levels for the categorical variable type of event, but I don't know how to do it.
I was wondering if there were any arguments of this function that could solve this.
I thought of using something as model.matrix but I am not sure this is the best solution.

Perhaps you have already tried this, but you could have the 5 levels for type of event by doing worker ~ 0 + gender + student + type_of_event. The 0 in the formula tells R that you don't want an intercept. However, this would also give you the 2 levels for both gender and student. I don't know how to do exactly what you are looking for (or whether it is possible at all), but someone else might be able to help. — Adrià Luz, Nov 18 '22 at 13:41
@AdriàLuz, many thanks for your comment. In any case, I am not sure to leave out the intercept is good... :( — autu_mn, Nov 18 '22 at 13:58
@AdriàLuz whether it is possible or not, I am pretty sure that is possible to do it on STATA, but don't know in R — autu_mn, Nov 18 '22 at 14:16
@autu_mn you are misreading how R's formula works in this case. The contrast is fundamentally changed when you drop the intercept and include a factor in the model. Instead of estimating a log odds ratio comparing other groups to referent, you get the log odds of response for each category. Convert the binary to continuous to keep the desired interpretation. — AdamO, Nov 18 '22 at 17:25
If I understand the question, this is a programming question, and a common one from R users. I think what you are looking for is an anova-like table and estimated marginal means. You can use e.g. library(car); Anova(model), where model is the name of the model object you defined with glm(). From there, you can use the emmeans package to get estimates for each category in the independent variable, and comparisons among those categories. — Sal Mangiafico, Nov 18 '22 at 17:43

AdamO · Answer 1 · 2022-11-18T17:58:15.643

To make Adria's comment crystal clear: R uses a so called ANOVA (sum) contrast when there's a polytomous factor variable with multiple levels in a linear model and the intercept is suppressed. This does NOT amount to removing the intercept, it just reframes the estimation of the model's effects. Again this is only in the case of factor variable adjustments. Here's a workable example

n<-1000
set.seed(123)
y <- rbinom(n, 1, 0.5)
x <- factor(sample(1:5, n, replace=T))
w <- rbinom(n, 1, 0.5)
f1<-glm(y ~ x + w, family=binomial)
f2<-glm(y ~ 0 + x + w, family=binomial)

provides the following outputs:

Default: treatment contrasts

> f1
Call:  glm(formula = y ~ x + w, family = binomial)
Coefficients:
(Intercept)           x2           x3           x4           x5            w

   -0.17073      0.42541      0.28683      0.19687      0.06605     -0.10189
Degrees of Freedom: 999 Total (i.e. Null);  994 Residual
Null Deviance:      1386 
Residual Deviance: 1380     AIC: 1392

Alternative: sum contrasts

> f2
Call:  glm(formula = y ~ 0 + x + w, family = binomial)
Coefficients:
      x1        x2        x3        x4        x5         w

-0.17073   0.25468   0.11610   0.02614  -0.10468  -0.10189
Degrees of Freedom: 1000 Total (i.e. Null);  994 Residual
Null Deviance:      1386 
Residual Deviance: 1380     AIC: 1392

Note, the degrees of freedom of the model are conserved, and the predictions are equal:

> all.equal(predict(f1), predict(f2))
[1] TRUE

the x1=(Intercept) from the two approaches, and the x2 coefficient in the ANOVA contrast is equal to the x2 from the treatment contrast model plus the intercept. Similarly for all other levels of X. In other words, specifying 0+ only drops the intercept by name only. Literally!

score 4 · Answer 2 · answered Nov 18 '22 at 18:11

This is better solved at the summary stage. There is really a hidden coefficient for the omitted level, but as it is fixed by zero, the usual summaries ignores it. There is a special R package that does include those, gtsummary.

For details and examples see What to do in a multinomial logistic regression when all levels of DV are of interest?

score 3 · Answer 3 · answered Nov 18 '22 at 18:01

To flesh-out my suggestion in the comments. It's possible that this is the type of output you are looking for. (Using AdamO's example data).

n<-1000
set.seed(123)
y <- rbinom(n, 1, 0.5)
x <- factor(sample(1:5, n, replace=T))
w <- rbinom(n, 1, 0.5)
f1<-glm(y ~ x + w, family=binomial)
summary(f2)
library(car)
Anova(f1)
Analysis of Deviance Table (Type II tests)

Response: y
LR Chisq Df Pr(>Chisq)
x   5.7791  4     0.2163
w   0.6419  1     0.4230
library(emmeans)
marginal = emmeans(f1, ~ x, type = "response")
marginal
x  prob     SE  df asymp.LCL asymp.UCL
1 0.445 0.0342 Inf     0.379     0.512
2 0.551 0.0359 Inf     0.480     0.620
3 0.516 0.0338 Inf     0.450     0.582
4 0.494 0.0367 Inf     0.423     0.565
5 0.461 0.0361 Inf     0.392     0.532

Results are averaged over the levels of: w
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
pairs(marginal)

contrast odds.ratio    SE  df null z.ratio p.value
x1 / x2       0.654 0.131 Inf    1  -2.120  0.2115
x1 / x3       0.751 0.145 Inf    1  -1.481  0.5750
x1 / x4       0.821 0.166 Inf    1  -0.976  0.8663
x1 / x5       0.936 0.188 Inf    1  -0.329  0.9975
.
.
< snip >

As a side note, to compare the two model's in @AdamO 's answer: The Anova table is different between f1 and f2 but the emmeans results are the same. Also, emmeans::joint_tests() could be used for an anova-like table. These results are also the same for f1 and F2. — Sal Mangiafico, Nov 18 '22 at 18:11
Good point, yes, the ANOVA contrast table has as the null hypothesis for each coefficient that it is equal to 0, rather than a test of equality to the referent level. One would assume based on the nature of desired output that that is the test of interest - but I've been wrong about many things before, and this may well be yet another. — AdamO, Nov 18 '22 at 18:21

Multivariable Logistic in R, without the "reference" level in a categorical predictor

3 Answers3

Default: treatment contrasts

Alternative: sum contrasts

Analysis of Deviance Table (Type II tests)

Response: y

LR Chisq Df Pr(>Chisq)

x 5.7791 4 0.2163

w 0.6419 1 0.4230

x prob SE df asymp.LCL asymp.UCL

1 0.445 0.0342 Inf 0.379 0.512

2 0.551 0.0359 Inf 0.480 0.620

3 0.516 0.0338 Inf 0.450 0.582

4 0.494 0.0367 Inf 0.423 0.565

5 0.461 0.0361 Inf 0.392 0.532

Results are averaged over the levels of: w

Confidence level used: 0.95

Intervals are back-transformed from the logit scale

contrast odds.ratio SE df null z.ratio p.value

x1 / x2 0.654 0.131 Inf 1 -2.120 0.2115

x1 / x3 0.751 0.145 Inf 1 -1.481 0.5750

x1 / x4 0.821 0.166 Inf 1 -0.976 0.8663

x1 / x5 0.936 0.188 Inf 1 -0.329 0.9975

.

.

< snip >