
Suppose I have a dataset with a continuous numerical variable $x$ and a binary numerical variable $y$ (with values 0 or 1).

How can I test the null hypothesis that the value of $x$ has no effect on the binary outcome $y$?

My idea was to use a two-sample t-test where one sample is the values of $x$ when $y=0$ and another sample is the values of $x$ when $y=1$. Would this be appropriate?
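For concreteness, the proposed comparison would look something like this in R (a sketch only; my_data, x, and y are placeholder names for the dataset and its columns):

t.test(my_data$x[my_data$y == 0], my_data$x[my_data$y == 1])  # Welch two-sample t-test of x between the two y groups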

1 Answer


Your approach would not be the best: it treats the predictor as the outcome, which is scientifically dubious. After all, your question is about $E(y \vert x)$, not $E(x \vert y)$.

The best way would be to perform a logistic regression. I'm not going to get into the details here; there are lots of resources for learning about logistic regression. Here is a small example in R.
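In model terms (the standard logistic-regression formulation, stated here for completeness rather than taken from the original answer), the model is

$$\log \frac{\Pr(y = 1 \mid x)}{1 - \Pr(y = 1 \mid x)} = \beta_0 + \beta_1 x,$$

and "no effect of $x$ on $y$" corresponds to the null hypothesis $H_0\colon \beta_1 = 0$, which is what the coefficient test below assesses.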

I've generated a continuous predictor and a binary outcome. In the plot below, I've binned the predictor and computed the average outcome within each bin. As the predictor increases, we seem to get more outcomes where $y=1$.

[Plot: proportion of observations with $y = 1$ within bins of $x$, increasing as $x$ increases.]
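The answer doesn't show the simulation code, so here is one way such data might have been generated and binned (a sketch; the data-generating coefficients are assumptions chosen to roughly match the fit below, and my_data matches the name used in the model call):

# Assumed data-generating process (not the answer's actual code)
set.seed(1)
n <- 1000
x <- rnorm(n)
p <- plogis(-0.8 + 0.2 * x)        # P(y = 1 | x) under a logistic model
y <- rbinom(n, size = 1, prob = p)
my_data <- data.frame(x = x, y = y)

# Bin the predictor and plot the proportion of y = 1 within each bin
my_data$bin <- cut(my_data$x, breaks = 10)
binned <- aggregate(y ~ bin, data = my_data, FUN = mean)
plot(binned$y, type = "b", xlab = "bin of x (low to high)",
     ylab = "proportion with y = 1")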

We can perform a test of association by fitting a logistic regression. In R,

model <- glm(y ~ x, data = my_data, family = binomial())
summary(model)

> summary(model)

Call:
glm(formula = y ~ x, family = binomial(), data = my_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0942  -0.8878  -0.8205   1.4394   1.8072  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.79111    0.06868 -11.518  < 2e-16 ***
x            0.20272    0.06929   2.926  0.00344 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1244.5  on 999  degrees of freedom
Residual deviance: 1235.8  on 998  degrees of freedom
AIC: 1239.8

Number of Fisher Scoring iterations: 4

Look at the row for x in the summary. It has an associated p value and an estimate of the effect. That estimate is the log odds ratio for a one-unit increase in the predictor (again, something you can easily read up on). If the estimate is positive, then as the predictor increases, so does the probability of a positive outcome; if it is negative, the opposite is true.
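If it helps, here is a small sketch (assuming the model object fitted above) of pulling out that row and converting the estimate to an odds ratio:

coef(summary(model))["x", ]   # estimate, std. error, z value and p value for x
exp(coef(model)["x"])         # odds ratio for a one-unit increase in x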

  • This is a little late, but I thought I'd comment instead of asking a new question. Just to be clear, is the p-value being less than 0.05 sufficient evidence for concluding that the variable x has a relationship with the variable y? Furthermore, does the same hold true for multiple logistic regression? I.e., if you throw in a bunch of different variables and all of them have p-values below the significance level, could you conclude that they are all significant predictors for the dependent variable y? – Machetes0602 Mar 10 '21 at 19:46
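For the multiple-predictor case raised in the comment, the same call extends naturally (a sketch; x2 and x3 are hypothetical extra columns, not part of the example above):

model2 <- glm(y ~ x + x2 + x3, data = my_data, family = binomial())
summary(model2)   # each coefficient's p value tests that predictor's effect, adjusted for the others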