I have a dataset like this:

[Image: screenshot of the dataset, not reproduced here]

All these social determinants are binary variables. How can I find the correlation among them? With chisq.test? I have 11 variables, which gives 55 pairs. Is there a convenient way to do this?

R code, function, logic

  • Correlation is defined for continuous variables, not binary variables. What is your research question? Food score does not look binary. – user2974951 Sep 20 '23 at 06:42
  • Maybe start with contingency tables? – Roland Sep 20 '23 at 06:54
  • Perhaps an interesting read for you "BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data" https://journal.r-project.org/archive/2017/RJ-2017-022/RJ-2017-022.pdf –  Sep 20 '23 at 07:44
  • What are you actually trying to do? Correlations isn't it, but I don't know what is. Explain it to us in substantive terms. – Peter Flom Sep 20 '23 at 11:26
  • You can feed your binary matrix with $k$ columns to cor and get a $k\times k$ correlation matrix out, MM <- matrix(runif(10*100)<0.3,ncol=10)+0; cor(MM). Of course correlations are defined for binary variables - they are just probably useless. So I agree with the other commenters that it might be best if you told us what you are actually trying to achieve. – Stephan Kolassa Sep 20 '23 at 11:30
  • You might look into correspondence analysis, search this site! – kjetil b halvorsen Sep 20 '23 at 12:51
  • @user2974951 While I don't recommend Pearson's correlation for binary variables b/c there are often better alternatives, correlation is defined for binary variables. Consider its construction through substituting the mean, variance, and covariance of an indicator function. – Galen Sep 20 '23 at 13:52
  • The covariance of an indicator function is an independence gap which has Frechet bounds. – Galen Sep 20 '23 at 13:54

2 Answers

If your question can be understood as "how to apply a comparison function to every pair of columns in a data.frame", then I would suggest the following:

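set.seed(123)
# Simulate 10 observations of 5 binary variables
df <- replicate(5, sample(0:1, 10, TRUE)) |>
  as.data.frame() |>
  setNames(c("skip.med", "stable.housing", "utility.bills", "legal.issues", "addiction"))

# One column per pair of variables, named "var1 - var2"
comps <- combn(colnames(df), 2) |> as.data.frame()
colnames(comps) <- sapply(comps, \(x) paste(x[[1]], "-", x[[2]]))

# Apply the test to every pair; returns a named list of test results
lapply(comps, \(x) chisq.test(df[[x[[1]]]], df[[x[[2]]]]))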

As suggested in the comments to your question, chisq.test may not be the best option, but you can easily change the function used within the lapply call.
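As one possible variation on the loop above, here is a minimal sketch that swaps in fisher.test and collects the p-values in a single table. The simulated data, the choice of Fisher's exact test, and the Holm adjustment via p.adjust are all assumptions made for illustration and are not part of the original answer. With 55 pairs, some correction for multiple testing is worth considering.

```r
set.seed(123)
# Hypothetical data: 10 observations of 5 binary variables
df <- as.data.frame(replicate(5, sample(0:1, 10, TRUE)))
names(df) <- c("skip.med", "stable.housing", "utility.bills", "legal.issues", "addiction")

pairs <- combn(names(df), 2)  # 2 x 10 character matrix, one column per pair

# Fisher's exact test for each pair (assumed swap for chisq.test);
# factor() with fixed levels keeps every table 2 x 2
p_raw <- apply(pairs, 2, function(p) {
  tab <- table(factor(df[[p[1]]], levels = 0:1),
               factor(df[[p[2]]], levels = 0:1))
  fisher.test(tab)$p.value
})

# One row per pair, with raw and Holm-adjusted p-values
res <- data.frame(var1 = pairs[1, ], var2 = pairs[2, ],
                  p.value = p_raw,
                  p.holm = p.adjust(p_raw, method = "holm"))
res[order(res$p.value), ]
```

The result is a data frame with one row per pair, which is easier to scan than a list of 55 printed test objects.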

jkd

I suggest going back to basics and using a measure that is tailored to binary responses. If two binary responses $A, B$ are independent then $\Pr(A=a, B=b) = \Pr(A=a)\times \Pr(B=b)$. You can use $\Pr(A=1,B=1) - \Pr(A=1)\times\Pr(B=1)$ as a measure of dependence of $A$ and $B$. This is estimated by computing the average product of the binary responses minus the product of the averages. This is like the numerator of a Pearson correlation coefficient.
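To make the connection with Pearson's correlation explicit, write $p_{11} = \Pr(A=1,B=1)$, $p_A = \Pr(A=1)$ and $p_B = \Pr(B=1)$ (this notation is introduced here for illustration only). Because $A$ and $B$ take only the values 0 and 1, the measure is the covariance, and dividing it by the two standard deviations gives the Pearson correlation of the indicators (the phi coefficient):

$$p_{11} - p_A\,p_B = \operatorname{E}(AB) - \operatorname{E}(A)\operatorname{E}(B) = \operatorname{Cov}(A,B), \qquad \phi = \frac{p_{11} - p_A\,p_B}{\sqrt{p_A(1-p_A)\,p_B(1-p_B)}}$$

As a made-up numeric example, with $p_{11} = 0.20$, $p_A = 0.40$ and $p_B = 0.30$ the measure is $0.20 - 0.40\times 0.30 = 0.08$, and $\phi = 0.08/\sqrt{0.4\times 0.6\times 0.3\times 0.7} \approx 0.356$.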

This is implemented in the varclus function of the R Hmisc package: see similarity='bothpos' or 'ccbothpos', the latter being what I described above. You can print the similarity matrix, and varclus uses it to cluster the variables.
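A minimal sketch of the measure described above, computed by hand for every pair of columns with matrix operations. The simulated data and the names X, both_pos and cc are assumptions for illustration. The varclus call at the end runs only if Hmisc is installed, and its similarity matrix should correspond to this kind of chance-corrected measure, though that is not checked here.

```r
set.seed(123)
# Hypothetical data: 100 observations of 4 binary variables
X <- matrix(rbinom(400, 1, 0.3), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))
n <- nrow(X)

both_pos <- crossprod(X) / n      # entry (i, j): proportion of rows where both variables are 1
p <- colMeans(X)                  # marginal proportions Pr(variable = 1)
cc <- both_pos - tcrossprod(p)    # average product minus product of averages
round(cc, 3)

# Optional: let Hmisc compute a similarity of this kind and cluster the variables
if (requireNamespace("Hmisc", quietly = TRUE)) {
  vc <- Hmisc::varclus(X, similarity = "ccbothpos")
  print(vc)
}
```

The hand-computed matrix cc equals the covariance matrix of the columns with divisor n rather than n - 1, which is one way to sanity-check the calculation.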

Frank Harrell