I am dealing with the following problem:
I have to do a multivariable logistic in R, using the command glm with the argument family = binomial.
The dependent variable is (obv) a binary variable (worker) with a Yes/No outcome and the predictors are gender, student and type of event. gender and student are binary variables, while type of event is categorical with 5 possible outcomes.
When applying the logistic, in the outcome are shown only 4 levels of the variable type of event, using the other remaining level of that variable as a "reference" one. What I would like to have in the outcome is to maintain 1 level for the two binary variables (in my case genderMALE and studentYES), and to have the 5 levels for the categorical variable type of event, but I don't know how to do it.
I was wondering if there were any arguments of this function that could solve this.
I thought of using something as model.matrix but I am not sure this is the best solution.
- 77,844
- 33
3 Answers
To make Adria's comment crystal clear: R uses a so called ANOVA (sum) contrast when there's a polytomous factor variable with multiple levels in a linear model and the intercept is suppressed. This does NOT amount to removing the intercept, it just reframes the estimation of the model's effects. Again this is only in the case of factor variable adjustments. Here's a workable example
n<-1000
set.seed(123)
y <- rbinom(n, 1, 0.5)
x <- factor(sample(1:5, n, replace=T))
w <- rbinom(n, 1, 0.5)
f1<-glm(y ~ x + w, family=binomial)
f2<-glm(y ~ 0 + x + w, family=binomial)
provides the following outputs:
Default: treatment contrasts
> f1
Call: glm(formula = y ~ x + w, family = binomial)
Coefficients:
(Intercept) x2 x3 x4 x5 w
-0.17073 0.42541 0.28683 0.19687 0.06605 -0.10189
Degrees of Freedom: 999 Total (i.e. Null); 994 Residual
Null Deviance: 1386
Residual Deviance: 1380 AIC: 1392
Alternative: sum contrasts
> f2
Call: glm(formula = y ~ 0 + x + w, family = binomial)
Coefficients:
x1 x2 x3 x4 x5 w
-0.17073 0.25468 0.11610 0.02614 -0.10468 -0.10189
Degrees of Freedom: 1000 Total (i.e. Null); 994 Residual
Null Deviance: 1386
Residual Deviance: 1380 AIC: 1392
Note, the degrees of freedom of the model are conserved, and the predictions are equal:
> all.equal(predict(f1), predict(f2))
[1] TRUE
the x1=(Intercept) from the two approaches, and the x2 coefficient in the ANOVA contrast is equal to the x2 from the treatment contrast model plus the intercept. Similarly for all other levels of X. In other words, specifying 0+ only drops the intercept by name only. Literally!
- 62,637
This is better solved at the summary stage. There is really a hidden coefficient for the omitted level, but as it is fixed by zero, the usual summaries ignores it. There is a special R package that does include those, gtsummary.
For details and examples see What to do in a multinomial logistic regression when all levels of DV are of interest?
- 77,844
To flesh-out my suggestion in the comments. It's possible that this is the type of output you are looking for. (Using AdamO's example data).
n<-1000
set.seed(123)
y <- rbinom(n, 1, 0.5)
x <- factor(sample(1:5, n, replace=T))
w <- rbinom(n, 1, 0.5)
f1<-glm(y ~ x + w, family=binomial)
summary(f2)
library(car)
Anova(f1)
Analysis of Deviance Table (Type II tests)
Response: y
LR Chisq Df Pr(>Chisq)
x 5.7791 4 0.2163
w 0.6419 1 0.4230
library(emmeans)
marginal = emmeans(f1, ~ x, type = "response")
marginal
x prob SE df asymp.LCL asymp.UCL
1 0.445 0.0342 Inf 0.379 0.512
2 0.551 0.0359 Inf 0.480 0.620
3 0.516 0.0338 Inf 0.450 0.582
4 0.494 0.0367 Inf 0.423 0.565
5 0.461 0.0361 Inf 0.392 0.532
Results are averaged over the levels of: w
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
pairs(marginal)
contrast odds.ratio SE df null z.ratio p.value
x1 / x2 0.654 0.131 Inf 1 -2.120 0.2115
x1 / x3 0.751 0.145 Inf 1 -1.481 0.5750
x1 / x4 0.821 0.166 Inf 1 -0.976 0.8663
x1 / x5 0.936 0.188 Inf 1 -0.329 0.9975
.
.
< snip >
- 11,330
- 2
- 15
- 35
-
As a side note, to compare the two model's in @AdamO 's answer: The Anova table is different between f1 and f2 but the emmeans results are the same. Also,
emmeans::joint_tests()could be used for an anova-like table. These results are also the same for f1 and F2. – Sal Mangiafico Nov 18 '22 at 18:11 -
1Good point, yes, the ANOVA contrast table has as the null hypothesis for each coefficient that it is equal to 0, rather than a test of equality to the referent level. One would assume based on the nature of desired output that that is the test of interest - but I've been wrong about many things before, and this may well be yet another. – AdamO Nov 18 '22 at 18:21
worker ~ 0 + gender + student + type_of_event. The0in the formula tellsRthat you don't want an intercept. However, this would also give you the 2 levels for both gender and student. I don't know how to do exactly what you are looking for (or whether it is possible at all), but someone else might be able to help. – Adrià Luz Nov 18 '22 at 13:41library(car); Anova(model), wheremodelis the name of the model object you defined withglm(). From there, you can use the emmeans package to get estimates for each category in the independent variable, and comparisons among those categories. – Sal Mangiafico Nov 18 '22 at 17:43