Comparing three clusters with only binary variables

Question

I have three clusters of patients, and for each patient I have 189 binary variables, which are diagnosis codes (0 indicates the patient doesn't have the diagnosis and 1 indicates they do).

I want to compare the three groups and examine whether any variables differ significantly between them (e.g., whether one cluster has a high proportion of a specific diagnosis). Which test should I use? If it is chi-square, what is the expected value for each variable?

It sounds like you should be exploring and characterizing the clusters, not testing them. To determine whether what you are seeing is real, you can cross-validate your procedure or, more simply, use your held-out data to see which hypotheses hold up. – whuber, Jun 23 '23 at 15:40
Thank you. I did some exploratory analysis on the clusters and indeed they differ (one cluster is sick and the others are healthy, but they differ in demographics). – Solomon123, Jun 24 '23 at 13:31

score 0 · Answer 1 · answered Jun 23 '23 at 12:59

For the general problem of comparing three proportions, see this thread.
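As a rough illustration of that comparison for a single diagnosis, here is a minimal sketch of a Pearson chi-square test on a 3 x 2 table (three clusters by diagnosis present/absent). It also shows where the expected values come from: each expected count is the cluster size times the overall proportion of patients with (or without) the diagnosis. The function name `chi_square_3x2` and the counts are made up for illustration. The p-value shortcut `exp(-stat / 2)` holds only because this table has 2 degrees of freedom.

```python
import math

def chi_square_3x2(with_dx, cluster_sizes):
    """Pearson chi-square test of one binary diagnosis against three clusters.

    with_dx[i] is the number of patients in cluster i who have the diagnosis;
    cluster_sizes[i] is the total number of patients in cluster i.
    (Hypothetical helper, written only for this illustration.)
    """
    if len(with_dx) != 3 or len(cluster_sizes) != 3:
        raise ValueError("this sketch handles exactly three clusters")
    n = sum(cluster_sizes)
    col_totals = [sum(with_dx), n - sum(with_dx)]
    if 0 in col_totals:
        raise ValueError("diagnosis must be present in some patients but not all")

    # Observed 3 x 2 table: [has diagnosis, does not have diagnosis] per cluster.
    observed = [[k, size - k] for k, size in zip(with_dx, cluster_sizes)]
    # Expected count under independence = row total * column total / grand total.
    expected = [[size * col / n for col in col_totals] for size in cluster_sizes]

    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # df = (3 - 1) * (2 - 1) = 2, and the chi-square upper tail
    # for df = 2 is exactly exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return expected, stat, p_value

# Made-up counts: 30 of 100, 12 of 80, and 5 of 60 patients have the diagnosis.
expected, stat, p_value = chi_square_3x2([30, 12, 5], [100, 80, 60])
print(expected, stat, p_value)
```

Repeating this for each of the 189 diagnoses means running many tests, so some correction for multiple comparisons would be needed, and small expected counts make the chi-square approximation unreliable.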

But why do a test? It seems to me that if you want a cluster analysis, you don't want a test on the variables that went into the clustering. You want to look at the statistics that are output by the clustering procedure. The goal of a clustering procedure is to look at how the observations (here, patients) go together and, hopefully, to gain some insight, either on that specific question (e.g., "Huh! I knew that people who had A were likely to have B, but I didn't know they were likely to have C.") or on other variables (e.g., "Huh! People who have A, B, and C are really likely to be XXX," where XXX is something other than the conditions, like a demographic variable).

My intuition also says that this is a variation on the "Texas sharpshooter fallacy", where the guy shoots at a barn door and then paints bullseyes around the bullet holes.

Peter Flom

Thank you, Peter. I did the clustering using kernel PCA (so the clustering was not directly on the observed variables), and now I want to discover the characteristics of each cluster. I looked at 10 patients from each cluster, and it seems to me that one cluster consists of healthy patients and the other clusters consist of sick patients. Now I also want to see which diagnoses go together (the correlation between variables within each cluster; which test is useful?). Is the chi-square test not helpful here? (It tests whether there is an association between cluster status and diagnosis status.) – Solomon123 Jun 23 '23 at 13:24
Maybe choose a different clustering method that does use the observed variables? Most are designed to answer the exact question of which dx go together. – Peter Flom Jun 23 '23 at 13:30
I got the best clustering metrics using kernel PCA. Two clusters are sick, and I want to discover how they differ (do they have different correlations between variables? Which diagnoses are dominant in each cluster?). What test can answer this question? (The professor here at the Hebrew University suggested chi-square to see whether cluster status depends on diagnosis status.) – Solomon123 Jun 23 '23 at 13:42
