
In my data, I have a continuous variable (say from 0.1 to 1), a nominal variable indicating the condition (3 conditions: no drug, drug, and baseline), and a binary dependent variable (0 or 1).

I know that if I want to find out how my continuous variable affects my binary variable, I would do a logistic regression. But what if I also want to know how this effect is influenced by the administration of the drug (the condition)?

Leon

1 Answer


If you had a continuous outcome, the obvious first place to look would be analysis of covariance (ANCOVA). Assume three groups $g$ (like you have), one of which is subsumed by the intercept (as usual), as well as a continuous covariate $x$.

$$ \mathbb E[Y\vert X] = \beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 x $$

This just adds the covariate to the usual ANOVA model: $ \mathbb E[Y\vert X] = \beta_0 + \beta_1 g_1 + \beta_2 g_2 $.
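As a sketch in R, assuming a hypothetical continuous outcome y_cont and a data frame dat holding y_cont, g, and x (your outcome is binary, so this is only for orientation), the ANCOVA model is just an additive formula:

fit_ancova <- lm(y_cont ~ g + x, data = dat) # R builds the g_1, g_2 dummies from the factor g
summary(fit_ancova)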

A drawback of the ANCOVA model is that it forces the slope on $x$ to be the same in every group. A remedy is to allow the covariate $x$ to interact with the group variables.

$$ \mathbb E[Y\vert X] = \beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 x + \beta_4 g_1 x + \beta_5g_2x $$
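With the same hypothetical y_cont, the interaction model only changes the + to a * in the formula:

fit_ancova_int <- lm(y_cont ~ g * x, data = dat) # expands to g + x + g:x
summary(fit_ancova_int)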

Since your outcome is binary, however, you would use some kind of logistic regression. The beauty of generalized linear models like logistic regression is that, at least in some sense, all you have to do is apply the link function to the left-hand side and leave the right-hand side as it is.

$$ \operatorname{logit}(\mathbb E[Y\vert X]) = \beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 x $$

or, with the interaction,

$$ \operatorname{logit}(\mathbb E[Y\vert X]) = \beta_0 + \beta_1 g_1 + \beta_2 g_2 + \beta_3 x + \beta_4 g_1 x + \beta_5 g_2 x $$

We can then do the usual inference on the regression parameters and the usual comparisons of nested models, keeping in mind that we now have a binomial GLM instead of a Gaussian one.

Depending on how you want to approach the problem (forcing equal slopes or allowing them to differ), one of those two would be a reasonable approach.
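In R formula syntax, and assuming a data frame dat with columns y, g, and x (the names here are placeholders for your own), a sketch of the two fits would be:

fit_equal_slopes <- glm(y ~ g + x, family = binomial, data = dat) # common slope for x
fit_diff_slopes  <- glm(y ~ g * x, family = binomial, data = dat) # group-specific slopes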

The R simulation below, influenced by another Cross Validated post, shows an example that uses the interaction. The way I designed the simulation, only the covariate x affects the binary outcome y, but one could tweak it so that the group and interaction terms influence y as well.

set.seed(2022)
N <- 75
x <- rnorm(N)                          # continuous covariate
g <- as.factor(sample(c("drug", "no drug", "baseline"), N, replace = TRUE))  # condition
z <- x                                 # linear predictor: only x matters, no group effect
y <- rbinom(N, 1, 1/(1 + exp(-z)))     # binary outcome via the inverse logit
L <- glm(y ~ x*g, family = binomial)   # logistic regression with interaction
summary(L)
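If you want the estimates on the odds ratio scale, exponentiating the coefficients (and, if you like, their profile-likelihood confidence intervals) is standard for any logistic regression, not anything specific to this simulation.

exp(coef(L))    # odds ratios for each term
exp(confint(L)) # profile-likelihood confidence intervals on the odds ratio scale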

If you want to test the null hypothesis that the group variable has no impact on y, it makes sense to do a "chunk test" of nested models: the full model above versus a reduced model that only considers x. For technical reasons beyond the scope of this post (and likely addressed elsewhere on Cross Validated), I like to do this with a likelihood ratio test, available in the lmtest package as the function lrtest.

library(lmtest)
set.seed(2022)
N <- 75
x <- rnorm(N)                          # continuous covariate
g <- as.factor(sample(c("drug", "no drug", "baseline"), N, replace = TRUE))  # condition
z <- x                                 # linear predictor: only x matters, no group effect
y <- rbinom(N, 1, 1/(1 + exp(-z)))     # binary outcome via the inverse logit
L_full   <- glm(y ~ x*g, family = binomial)  # full model: x, g, and their interaction
L_nested <- glm(y ~ x, family = binomial)    # reduced model: x only
lmtest::lrtest(L_nested, L_full) # p-value is 0.7476,
                                 # consistent with the lack of a group effect

This is also available in base R without the lmtest package, but you have to work a bit harder to get the p-value.

test <- anova(L_nested, L_full)          # deviance table, no p-value by default
1 - pchisq(test$Deviance[2], test$Df[2]) # same p-value of 0.7476142
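(If you ask anova for the chi-squared test explicitly, it reports the p-value for you; this is the same likelihood ratio comparison.)

anova(L_nested, L_full, test = "Chisq")  # reports the same p-value directly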
Dave