Mixed dichotomous correlation matrix

Question

I have a data frame comprising more than two dozen variables, all of which are binary (0/1) with <5% missing data. These variables can be classified into groups that pertain to different aspects of health. For one group, disease for example, 0/1 represents the answer to a yes/no question: do you have a given disease? Another group is based on ordinal questions (e.g., how many days of the week do you perform a given activity: 1, 2, 3, 4, or 5+?), which are transformed into binary variables (i.e., 0 = answer 1-3; 1 = answer 4-5+). A third group is based on physical measures, which are transformed into binary variables using established cutpoints or an arbitrary one (e.g., within the 1st quartile or not).

I would like to perform some exploratory analyses (e.g., partial correlation analysis, factor analysis) to look at the relationships among these variables. My understanding is that a phi correlation would be more appropriate for the first group type described, while a tetrachoric correlation would be more appropriate for the latter two. For generating a correlation matrix on all of my variables, is one more appropriate than the other, or should I be considering a different approach? Preliminary partial correlation networks built from a phi correlation matrix look much closer to what I expected (disease groups cluster together, biologically similar variables are connected) than those built from a tetrachoric correlation matrix (more of a hairball in which seemingly everything is connected).
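To make the phi versus tetrachoric distinction concrete, here is a minimal Python sketch that computes both from a single 2x2 table. It is an illustration under stated assumptions, not the method used for the networks described above: the function names are hypothetical, the tetrachoric estimate uses the simple two-step approach (thresholds from the margins, then a bisection search for the latent correlation), and it assumes no empty margins and a latent bivariate normal distribution.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for thresholds and marginal CDFs


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]] (rows: x = 0, 1; columns: y = 0, 1)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


def _bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    q = (h * h - 2.0 * r * h * k + k * k) / (1.0 - r * r)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - r * r))


def _bvn_cdf(h, k, rho, steps=200):
    """P(Z1 <= h, Z2 <= k), using Phi(h) * Phi(k) plus the integral of the
    bivariate density over the correlation from 0 to rho (Simpson's rule)."""
    base = _STD.cdf(h) * _STD.cdf(k)
    if rho == 0.0:
        return base
    width = rho / steps  # steps must be even for Simpson's rule
    total = _bvn_density(h, k, 0.0) + _bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * _bvn_density(h, k, i * width)
    return base + total * width / 3.0


def tetrachoric(a, b, c, d, tol=1e-8):
    """Two-step tetrachoric estimate (hypothetical helper, assumes nonzero margins)."""
    n = a + b + c + d
    h = _STD.inv_cdf((a + b) / n)  # latent threshold for x, from P(x = 0)
    k = _STD.inv_cdf((a + c) / n)  # latent threshold for y, from P(y = 0)
    target = a / n                 # observed P(x = 0, y = 0)
    lo, hi = -0.999, 0.999
    # The bivariate normal CDF increases with rho, so bisection is enough.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if _bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Made-up counts, for illustration only.
table = (40, 10, 10, 40)
print(round(phi_coefficient(*table), 3), round(tetrachoric(*table), 3))
```

For the same table the tetrachoric estimate is larger in magnitude than phi, because it estimates the correlation of the assumed underlying continuous variables rather than of the observed 0/1 codes. Systematically larger entries could plausibly contribute to a denser partial correlation network, though that is a guess and not something the example demonstrates. The numerical integration loses accuracy for latent correlations very close to ±1.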

One immediate comment might arrive: why are you speaking of, say, tetrachoric correlation, which is a crutch to "revive" former unbinned variables when you, of a sort, have those unbinned variables at hand? — ttnphns, Sep 10 '17 at 13:36
Thank you for your comment, but would you be able to clarify what you mean? I think that the major issue here may be a fundamental misunderstanding on my part of what a tetrachoric correlation is used for. — AtMac, Sep 11 '17 at 13:51
"A third group are based on physical measures, and are transformed into binary variables": If you have those original unbinned scale variables, why wouldn't you just compute Pearson correlation for that group of variables, without binning them? Tetrachoric correlation is the (inferred) Pearson correlation, after all. https://stats.stackexchange.com/a/186026/3277 — ttnphns, Sep 11 '17 at 13:59
I see the confusion, this is totally my fault. My question does not match my description. I guess really what I need to know is why use phi over tetrachoric for a matrix of all binary variables. I had yet to find an explanation that I could really understand until I read what you posted above. That really cleared things up for me, thank you very much! — AtMac, Sep 12 '17 at 15:01
