Correlation among 10 binary variables

Question

I have a dataset like this:

All these social determinants are binary variables. How can I find the correlation among them? With chisq.test? I have 11 variables, so there will be 55 pairs. Is there a convenient way to do this?

R code, function, logic

Correlation is defined for continuous variables, not binary variables. What is your research question? Food score does not look binary. — user2974951, Sep 20 '23 at 06:42
Perhaps an interesting read for you "BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data" https://journal.r-project.org/archive/2017/RJ-2017-022/RJ-2017-022.pdf — , Sep 20 '23 at 07:44
What are you actually trying to do? Correlations isn't it, but I don't know what is. Explain it to us in substantive terms. — Peter Flom, Sep 20 '23 at 11:26
You can feed your binary matrix with $k$ columns to cor and get a $k\times k$ correlation matrix out, MM <- matrix(runif(10*100)<0.3,ncol=10)+0; cor(MM). Of course correlations are defined for binary variables - they are just probably useless. So I agree with the other commenters that it might be best if you told us what you are actually trying to achieve. — Stephan Kolassa, Sep 20 '23 at 11:30
You might look into correspondence analysis, search this site! — kjetil b halvorsen, Sep 20 '23 at 12:51
@user2974951 While I don't recommend Pearson's correlation for binary variables b/c there are often better alternatives, correlation is defined for binary variables. Consider its construction through substituting the mean, variance, and covariance of an indicator function. — Galen, Sep 20 '23 at 13:52
The covariance of an indicator function is an independence gap which has Frechet bounds. — Galen, Sep 20 '23 at 13:54

score 0 · Answer 1 · answered Sep 20 '23 at 07:39

If your question can be understood as "how to apply a comparison function to every pair of columns in a data.frame", then I would suggest the following:

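set.seed(123)
# Simulated example data: 5 binary columns, 10 rows
df <- replicate(5, sample(0:1, 10, TRUE)) |>
  as.data.frame() |>
  setNames(c("skip.med", "stable.housing", "utility.bills", "legal.issues", "addiction"))
# Each column of comps holds one pair of column names
comps <- combn(colnames(df), 2) |>
  as.data.frame()
colnames(comps) <- sapply(comps, \(x) paste(x[[1]], "-", x[[2]]))
# Run the test on every pair; the result is a list named by pair
lapply(comps, \(x) {
  chisq.test(df[[x[[1]]]], df[[x[[2]]]])
})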

As suggested in the comments to your question, chisq.test may not be the best option, but you can easily change the function used within the lapply call.
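As one possible illustration of swapping the function, here is a minimal sketch that takes the test as an argument and collects the p-values in a single data frame. The helper name pairwise_p, the choice of fisher.test as the default, and the Holm adjustment are assumptions made for this example, not part of the answer above; the data are simulated.

```r
# Hypothetical helper (not from the answer): apply a two-argument test to
# every pair of columns and return one row of output per pair.
pairwise_p <- function(df, test = fisher.test) {
  pairs <- combn(colnames(df), 2)  # 2 x choose(k, 2) matrix of column names
  p <- apply(pairs, 2, function(nm) {
    test(df[[nm[1]]], df[[nm[2]]])$p.value
  })
  data.frame(var1 = pairs[1, ], var2 = pairs[2, ], p.value = p)
}

set.seed(123)
df <- as.data.frame(replicate(5, sample(0:1, 50, TRUE)))  # simulated 0/1 data

res <- pairwise_p(df)
# With many pairs, some adjustment for multiple comparisons is usually wise
res$p.adj <- p.adjust(res$p.value, method = "holm")
res
```

Any function that accepts two vectors and returns an object with a p.value element (chisq.test, for instance) can be passed as test.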

score 0 · Answer 2 · answered Sep 20 '23 at 11:29

I suggest going back to basics and using a measure that is tailored to binary responses. If two binary responses $A, B$ are independent then $\Pr(A=a, B=b) = \Pr(A=a)\times \Pr(B=b)$. You can use $\Pr(A=1,B=1) - \Pr(A=1)\times\Pr(B=1)$ as a measure of dependence of $A$ and $B$. This is estimated by computing the average product of the binary responses minus the product of the averages. This is like the numerator of a Pearson correlation coefficient.

This is implemented in the varclus function of the R Hmisc package: see similarity='bothpos' or 'ccbothpos', the latter being what I described above. You can print the similarity matrix, and varclus uses it to cluster the variables.
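For readers who want to see the arithmetic, the measure described above (average product minus product of averages) can be sketched in base R without Hmisc. The matrix X and its column names are made-up example data, and this is only an illustration of the formula, not the varclus implementation.

```r
# Chance-corrected co-occurrence for all pairs:
# Pr(A = 1, B = 1) - Pr(A = 1) * Pr(B = 1)
set.seed(1)
X <- matrix(rbinom(100 * 4, 1, 0.3), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))  # made-up 0/1 data

p_both <- crossprod(X) / nrow(X)        # [i, j]: proportion of rows where both are 1
p_one  <- colMeans(X)                   # marginal proportion of 1s per column
dep    <- p_both - outer(p_one, p_one)  # average product minus product of averages
round(dep, 3)
```

The off-diagonal entries equal the covariance with divisor n rather than n - 1, which is why this resembles the numerator of a Pearson correlation; values near zero are consistent with independence.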
