
I am fitting a logistic regression with a few adjustment variables.

When I add an interaction term between a numeric variable (age) and a binary one, the estimates and the ORs become extremely large.

I know this can happen with an interaction between two categorical variables when, for example, one combination of categories is not represented. But I don't see how it can happen with a numeric variable. What should I check to find out what causes the high estimates when I add this interaction?

To give an idea of what I mean by a high OR, in my example it goes to: OR (CI) = 2353592.821 [6.017; 2.16e+12]

I've seen this post, but it deals with an interaction between two categorical variables:

Adding interactions to logistic regression leads to high SEs

As suggested in its first answer, the cause could be quasi-complete separation of the data.

# data used
d = structure(list(y = c(0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 
                         1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 
                         1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 
                         1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 
                         0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 
                         0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 
                         0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0), 
                   x1 = structure(c(75, 76, 77, 78, 79, 81, 82, 83, 81, 82, 80, 81, 98, 80, 83, 80, 84, 86, 
                                    82, 88, 86, 87, 87, 90, 90, 91, 85, 89, 89, 94, 91, 95, 85, 96, 
                                    88, 94, 91, 76, 86, 88, 88, 85, 83, 87, 85, 75, 79, 88, 88, 92, 
                                    77, 89, 86, 87, 87, 80, 88, 89, 81, 82, 82, 82, 82, 86, 94, 88, 
                                    77, 84, 83, 96, 83, 86, 94, 90, 79, 89, 80, 95, 79, 84, 88, 82, 
                                    92, 76, 89, 83, 83, 82, 87, 94, 83, 87, 75, 79, 78, 93, 81, 96, 
                                    87, 92, 76, 95, 82, 77, 85, 88, 76, 88, 77, 100, 84, 98, 86, 
                                    78, 95, 84, 82, 81, 86, 86, 84, 85, 82, 88, 87, 92, 82, 88, 95, 
                                    87, 85)), 
                   group = c("A", "A", "B", 
                             "B", "A", "B", "A", "A", "B", "A", "B", "B", "B", "B", "B", "B", 
                             "B", "B", "B", "B", "B", "B", "B", "A", "A", "B", "A", "B", "B", 
                             "A", "A", "A", "A", "A", "B", "B", "A", "A", "A", "B", "A", "A", 
                             "A", "A", "A", "A", "A", "A", "B", "B", "B", "A", "B", "A", "A", 
                             "B", "B", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
                             "A", "A", "B", "B", "B", "A", "B", "B", "A", "B", "B", "B", "B", 
                             "B", "A", "B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
                             "A", "A", "A", "A", "A", "B", "A", "A", "B", "A", "A", "B", "B", 
                             "A", "A", "A", "B", "B", "B", "A", "B", "A", "A", "B", "B", "B", 
                             "B", "A", "B", "A", "A", "B", "B", "A", "B", "B", "A")), 
              row.names = c(NA, -131L), 
              class = c("tbl_df", "tbl", "data.frame"))

plots

plot(y ~ x1, subset = group == "A", data = d)
plot(y ~ x1, subset = group == "B", data = d)

logistic regression

the individual models work fine

uni1 <- glm(y ~ x1, data = d, family = binomial)
uni2 <- glm(y ~ group, data = d, family = binomial)

round(exp(cbind(OR = coef(uni1), confint(uni1))), 3)
round(exp(cbind(OR = coef(uni2), confint(uni2))), 3)

adding the interaction

m2 <- glm(y ~ x1 * group, data = d, family = binomial)
summary(m2)
round(exp(cbind(OR = coef(m2), confint(m2))), 3)

results of m2

                    OR   2.5 %        97.5 %
(Intercept)      0.000   0.000  7.450000e-01
x1               1.097   0.986  1.231000e+00
groupB      199823.765   1.148  7.098758e+10
x1:groupB        0.885   0.764  1.017000e+00

As you can see if you run this code, the OR for groupB in m2 is incredibly large, yet the plots show no separation. That is why I am at a loss to explain it.

[Plot: y vs. x1 for group A]

[Plot: y vs. x1 for group B]

– BPeif

2 Answers


It sounds like you are suffering from quasi-complete separation. While it's easier to spot with factors, a numeric predictor whose outcomes jump from all 0s to all 1s past some cutoff also pushes the logistic regression toward an infinite slope. Presumably this is (almost) the case for age in one of your two categories.

For a more in-depth explanation and some advice on mitigating the problem, see here: https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-complete-or-quasi-complete-separation-in-logistic-regression-and-what-are-some-strategies-to-deal-with-the-issue/
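One mitigation discussed there is Firth's penalized-likelihood logistic regression; here is a minimal sketch with your data, assuming the logistf package is installed:

# Firth's bias-reduced logistic regression: the penalty keeps the
# coefficients finite even under (quasi-)complete separation
library(logistf)
mf <- logistf(y ~ x1 * group, data = d)
summary(mf)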

I have also amended their example to fit your problem:

# toy data: within group B, y is 0 for small x1 and 1 for large x1,
# i.e. group B is completely separated on x1
y  <- c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
x1 <- c(11, 1, 10, 2, 6, 3, 5, 3, 1, 2, 3, 3, 5, 6, 10, 11)
group <- c(rep("A", 8), rep("B", 8))
# the individual models work fine
summary(glm(y ~ x1, family = binomial))
summary(glm(y ~ group, family = binomial))
# the interaction model fails: huge estimates and standard errors
m2 <- glm(y ~ x1 * group, family = binomial)
summary(m2)

plot(y ~ x1, subset = group == "A")
plot(y ~ x1, subset = group == "B")
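For your own data, rather than eyeballing the plots, you can check whether the ranges of x1 for y = 0 and y = 1 overlap within each group; a quick sketch using the d from your question:

# min and max of x1 for each group/outcome combination; if the y = 0
# and y = 1 ranges barely overlap within a group, the fit is close to
# quasi-complete separation
aggregate(x1 ~ group + y, data = d, FUN = range)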

– Lukas Lohse
  • Thanks for your example! I looked at my data to see if it was like this, but when plotting I see no separation at all; I can edit my post with both graphs as you suggest in your answer – BPeif Jun 07 '23 at 12:16

My guess is that your problem is that you didn't mean-center age.

When you interact two variables, this changes the interpretation of the two main-effect coefficients: each coefficient now tells you the "effect" of that variable when the other variable (the one it's interacted with) is zero, regardless of whether zero is a meaningful value.

So in your case, if age is simply "how many years the person has been alive", then the group coefficient tells you the estimated effect of being in that group for someone who is zero years old; but in your dataset the youngest person is about 75. Your model is extrapolating far outside the data, which is why it gives you a totally bizarre answer.
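You can verify this with your own numbers (an illustration, reusing the m2 and d from your question): the group difference in log-odds at age a is coef(groupB) + a * coef(x1:groupB), and evaluated at the sample mean age it is modest:

# the huge groupB OR is the group contrast extrapolated to age 0;
# evaluated at the mean age it shrinks to an unremarkable value
b <- coef(m2)
exp(unname(b["groupB"] + mean(d$x1) * b["x1:groupB"]))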

The solution is to mean-center continuous variables when you use them in an interaction term. That way zero is a meaningful value (the mean). The coefficient of the interaction term itself won't change, but the main-effect terms will actually be interpretable.
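A minimal sketch with your data (x1_c is just a name I've introduced for the centered age):

# center age at its sample mean, then refit: the x1_c:groupB
# coefficient is unchanged, but the groupB main effect now refers to
# the group difference at the mean age instead of at age 0
d$x1_c <- d$x1 - mean(d$x1)
m3 <- glm(y ~ x1_c * group, data = d, family = binomial)
round(exp(cbind(OR = coef(m3), confint(m3))), 3)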

  • Thank you for your answer! I will keep this in mind for my next regressions! And sorry I can't upvote your answer, I don't have enough reputation – BPeif Jun 07 '23 at 13:06