
No doubt this is a stupid question but I can't seem to find help anywhere online.

I want to do a logistic regression with 2 independent variables.

Ideally, I would like to see how each group compares to the overall mean, but for some reason I can't get R to do this. Whatever I try, I always get comparisons against a reference level of at least one of the explanatory variables. See the example and output below:

set.seed(123)
group1 <- rep(c("A", "B", "C"), times = 100)
group2 <- rep(c("D", "E", "F"), each = 100)
dat <- rbinom(300, 1, 0.5)

model <- glm(dat ~ group1 + group2 + 0, family = binomial)
summary(model)

#Coefficients:
#          Estimate Std. Error z value Pr(>|z|)
#group1A    0.14471    0.25837   0.560    0.575
#group1B   -0.17756    0.25991  -0.683    0.495
#group1C   -0.33858    0.26139  -1.295    0.195
#group2E    0.12461    0.28455   0.438    0.661
#group2F    0.04537    0.28466   0.159    0.873

If I did something similar with a linear regression, I would get an intercept which would be the mean of all the data, and the other variables would show how each group differs from that mean. However, here I am pretty sure that all the coefficients are showing me comparisons relative to group D rather than to the overall mean.

Surely it's not a matter of having to estimate too many variables. There are 300 data points. All I would like to know is the overall mean (log odds ratio) and how groups A-F differ from that mean. What am I missing?

EDIT: I went with a workaround that is a bit subpar. In a loop, I construct an indicator variable that is 1 for the group I'm interested in (e.g. group D) and 0 otherwise. Then I loop through the groups (about 100 for the problem I have), constructing a model each time with the indicator variable as an explanatory variable, alongside the other variables I wanted to account for. It takes a little time to run but it's not completely infeasible. A sketch of this is below.
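
For concreteness, here is a minimal sketch of that loop on the toy data above (the names and the set of adjustment covariates are illustrative; adapt them to the real problem):

# For each level of group1, build a 0/1 indicator and fit a model with it
# plus the other variables to adjust for (here just group2)
groups <- unique(group1)
results <- lapply(groups, function(g) {
  is_g <- as.integer(group1 == g)  # 1 if the observation is in group g, else 0
  fit  <- glm(dat ~ is_g + group2, family = binomial)
  coef(summary(fit))["is_g", ]     # log odds ratio of group g vs. the rest
})
names(results) <- groups
results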

Do let me know if there is another workaround where I could just put the raw data in to begin with and have everyone compared to the overall mean. A worked example would be nice.

  • Look into "deviation" contrast coding, accomplished in R using contr.sum (see the sketch after these comments). You can also compute contrasts post hoc to make the comparisons you want, rather than parameterize the model so that coefficients correspond to the quantities of interest. – Noah Jan 04 '23 at 19:14
  • Your example is slightly unfortunate in that the coefficients for A and B are almost exactly equal and the coefficient for E is almost 0 so almost equal to the invisible coefficient D. A different seed could avoid that and allow you to concentrate on your actual question – Henry Jan 04 '23 at 21:59
  • I've updated @Henry – Leonhard Euler Jan 05 '23 at 12:53
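
Following the first comment, here is a minimal sketch of deviation ("sum-to-zero") coding with R's built-in contr.sum; in a balanced design like this one, the intercept becomes the grand mean of the group-level log odds and each coefficient a deviation from it:

# Deviation coding: each printed coefficient is a group's deviation from the
# intercept; the omitted last level's deviation is minus the sum of the others
f1 <- factor(group1)
f2 <- factor(group2)
model_dev <- glm(dat ~ f1 + f2, family = binomial,
                 contrasts = list(f1 = contr.sum, f2 = contr.sum))
summary(model_dev)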

2 Answers


Surely it's not a matter of having to estimate too many variables.

It is exactly this matter. If you include all 6 dummy variables for the groups, the regressor matrix is not full rank.

You could increase all the group1 coefficients and decrease all the group2 coefficients by the same amount, and the model's predictions would stay exactly the same.

So there is no unique solution that maximizes the likelihood.
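
You can see the rank deficiency directly on the example data (a quick sketch):

# Indicator columns for all six groups: the three group1 columns and the
# three group2 columns each sum to a column of ones, so they are dependent
X <- cbind(model.matrix(~ group1 + 0), model.matrix(~ group2 + 0))
ncol(X)     # 6 columns
qr(X)$rank  # rank 5, so one coefficient is not identifiable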


Ideally, I would like to see how each variable compares to the mean

Compute the overall mean yourself and compare it with the coefficients.
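
For instance, on the example data (a sketch):

# Overall mean on the log odds scale, as a reference point
p_hat <- mean(dat)
log(p_hat / (1 - p_hat))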


If I did something similar with a linear regression, I would get an intercept which would be the mean of all the data

If by linear regression you mean ordinary least squares regression, then the behaviour is exactly the same. That you do not get an intercept in your example is not because of the logistic regression, but because you use the formula dat ~ group1 + group2 + 0 instead of dat ~ group1 + group2. The + 0 tells R not to include an intercept.

And note that the intercept is not the mean of all the data; it is the prediction for the reference categories.
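
For illustration, the same comparison with lm() on the example data (a sketch):

# With the default formula you get an intercept; it is the prediction for the
# reference cell (A and D), not the overall mean of the data
coef(lm(dat ~ group1 + group2))
# With + 0 the intercept is dropped and the group1 coefficients absorb it
coef(lm(dat ~ group1 + group2 + 0))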


Answer to the question in hand

All I would like to know is the overall mean (log odds ratio) and how groups A-F differ from that mean.

So a couple of clarifications right away. Firstly, you probably want to be comparing the group-level means to each other, not to the global mean. Certainly calculate the global mean as a reference, but the group means contribute to it, so they aren't independent of it, and standard statistical tests wouldn't apply to that comparison.

Secondly, every data point you've collected belongs to two groups: group1 is A, B, or C, while group2 is D, E, or F. So you can compare A vs. B vs. C, or D vs. E vs. F, or you can compare all the unique combinations A&D vs. B&D vs. C&D vs. A&E... - but you can't simply compare your 6 groups to each other in one go.
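
A quick way to see that structure on the example data (a sketch):

# Two crossed factors give nine unique combinations, not six comparable groups
combo <- interaction(group1, group2, sep = " & ")
table(combo)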

What I think you want to do is compare the proportions across groups, one variable at a time. The most direct way to answer that question is not logistic regression but a chi-squared test for proportions. The implementation in R is maybe less intuitive than the regression model, but you'll find plenty of examples, including here on Cross Validated.

Here's one way to perform such a test on your data:

dataset <- data.frame(
  outcome = dat,
  group1 = group1,
  group2 = group2
)

# Each transposed table has one row per group and two columns of counts
# (outcome 0 and outcome 1); prop.test then tests whether the proportion
# is the same across the rows
prop.test(t(matrix(table(dataset$outcome, dataset$group1), ncol = 3)))

prop.test(t(matrix(table(dataset$outcome, dataset$group2), ncol = 3)))

The tests show the proportions are all very similar, and none is significantly different from any of the others, either along the A, B, C dimension or along the D, E, F dimension.

Clarifications on regression

Surely it's not a matter of having to estimate too many variables.

As @Sextus Empiricus says in their answer, this is exactly the issue, and the problem is your categorical predictor variables (group1 and group2), not the binary outcome variable. Basically, whenever you use categorical predictors, one of your categories needs to serve as the reference; if you supplied all the categories, your model would be over-specified (if I know an observation isn't B and isn't C, then it must be A, so you don't provide any more information by supplying A explicitly).
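
You can see this in the design matrix R builds (a sketch; dataset is as defined above):

# With default treatment coding there is no group1A or group2D column;
# those reference levels live in the intercept
head(model.matrix(~ group1 + group2, data = dataset))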

In your case, the reference categories are A for group1 and D for group2. Your use of + 0 in the formula doesn't get around this: R still needs a reference level for group2, and group1's coefficients simply absorb the intercept (the coefficient labelled group1A really represents group1A AND group2D).

The way to interpret this group1A coefficient is that it is the log odds of getting 1 for an observation belonging to those reference categories. In this + 0 parameterization, each group1 coefficient plays that intercept-like role for its own group, while the group2E and group2F coefficients are offsets from D. So, say you have group1 = B and group2 = E: you would add the group2E coefficient to the group1B coefficient to get the log odds for that observation.
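
A quick check of that arithmetic on the example model (a sketch):

# Predicted log odds for an observation with group1 = B and group2 = E ...
predict(model, newdata = data.frame(group1 = "B", group2 = "E"))
# ... equals the sum of the two relevant coefficients
coef(model)["group1B"] + coef(model)["group2E"]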