
Suppose I have a dataset with a continuous numerical variable $x$ and a binary numerical variable $y$ (with values 0 or 1).

How can I test the null hypothesis that the value of $x$ has no effect on the binary outcome $y$?

My idea was to use a two-sample t-test where one sample is the values of $x$ when $y=0$ and another sample is the values of $x$ when $y=1$. Would this be appropriate?
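For concreteness, the proposed comparison would look something like this in R (a sketch only; my_data, x, and y are placeholder names for the dataset and its columns):

t.test(my_data$x[my_data$y == 0], my_data$x[my_data$y == 1])  # Welch two-sample t-test of x between the two y groups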

1 Answer


Your approach would not be the best: it treats the predictor as the outcome, which is scientifically dubious. After all, your question is about $E(y \vert x)$, not $E(x \vert y)$.

The best way would be to perform a logistic regression. I'm not going to get into the details here; there are lots of resources for learning about logistic regression. Here is a small example in R.
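In model terms (the standard logistic-regression formulation, stated here for completeness rather than taken from the original answer), the model is

$$\log \frac{\Pr(y = 1 \mid x)}{1 - \Pr(y = 1 \mid x)} = \beta_0 + \beta_1 x,$$

and "no effect of $x$ on $y$" corresponds to the null hypothesis $H_0\colon \beta_1 = 0$, which is what the coefficient test below assesses.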

I've generated a continuous predictor and a binary outcome. In the plot below, I've binned the predictor and computed the average outcome within each bin. As the predictor increases, we seem to get more outcomes where $y=1$.

[Plot: proportion of observations with $y = 1$ within bins of $x$, increasing as $x$ increases.]
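The answer doesn't show the simulation code, so here is one way such data might have been generated and binned (a sketch; the data-generating coefficients are assumptions chosen to roughly match the fit below, and my_data matches the name used in the model call):

# Assumed data-generating process (not the answer's actual code)
set.seed(1)
n <- 1000
x <- rnorm(n)
p <- plogis(-0.8 + 0.2 * x)        # P(y = 1 | x) under a logistic model
y <- rbinom(n, size = 1, prob = p)
my_data <- data.frame(x = x, y = y)

# Bin the predictor and plot the proportion of y = 1 within each bin
my_data$bin <- cut(my_data$x, breaks = 10)
binned <- aggregate(y ~ bin, data = my_data, FUN = mean)
plot(binned$y, type = "b", xlab = "bin of x (low to high)",
     ylab = "proportion with y = 1")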

We can perform a test of association by fitting a logistic regression. In R,

model <- glm(y ~ x, data = my_data, family = binomial())
summary(model)

> summary(model)

Call:
glm(formula = y ~ x, family = binomial(), data = my_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0942  -0.8878  -0.8205   1.4394   1.8072  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.79111    0.06868 -11.518  < 2e-16 ***
x            0.20272    0.06929   2.926  0.00344 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1244.5  on 999  degrees of freedom
Residual deviance: 1235.8  on 998  degrees of freedom
AIC: 1239.8

Number of Fisher Scoring iterations: 4

Look at the row for x in the summary. It has an associated p value and an estimate of the effect. That estimate is the log odds ratio for a one-unit increase in the predictor (again, something you can easily read up on). If the estimate is positive, then as the predictor increases, so does the probability of a positive outcome; if it is negative, the opposite is true.
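If it helps, here is a small sketch (assuming the model object fitted above) of pulling out that row and converting the estimate to an odds ratio:

coef(summary(model))["x", ]   # estimate, std. error, z value and p value for x
exp(coef(model)["x"])         # odds ratio for a one-unit increase in x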

  • This is a little late, but I thought I'd comment instead of asking a new question. Just to be clear, is the p-value being less than 0.05 sufficient evidence for concluding that the variable x has a relationship with the variable y? Furthermore, does the same hold true for multiple logistic regression? I.e., if you throw in a bunch of different variables and all of them have p-values below the significance level, could you conclude that they are all significant predictors for the dependent variable y? – Machetes0602 Mar 10 '21 at 19:46
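For the multiple-predictor case raised in the comment, the same call extends naturally (a sketch; x2 and x3 are hypothetical extra columns, not part of the example above):

model2 <- glm(y ~ x + x2 + x3, data = my_data, family = binomial())
summary(model2)   # each coefficient's p value tests that predictor's effect, adjusted for the others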