
I have three clusters of patients, and for each patient I have 189 binary variables representing diagnosis codes (0 indicates that the patient does not have the diagnosis, 1 that they do).

I want to compare the three groups and find out whether any variables differ significantly between them (e.g. whether one cluster has a high proportion of a specific diagnosis). Which test should I use? If it is the chi-square test, what is the expected value for each variable?

  • It sounds like you should be exploring and characterizing the clusters, not testing them. To determine whether what you are seeing is real, you can cross-validate your procedure or, more simply, use your held-out data to see which hypotheses hold up. – whuber Jun 23 '23 at 15:40
  • Thank you. I did some exploratory analysis on the clusters and indeed they differ (one cluster is sick and the others are healthy, but they differ in demographics). – Solomon123 Jun 24 '23 at 13:31

1 Answer


For the general problem of comparing three proportions, see this thread
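As a rough sketch of what that test looks like for a single diagnosis: under the null hypothesis that all three clusters share the same proportion, the expected count in each cell is the cluster size times the pooled proportion with (or without) the diagnosis. The function name and the counts below are hypothetical, made up purely for illustration. With 189 diagnoses you would also need some multiplicity correction, and the p-values are hard to interpret if the clusters were derived from these same variables.

```python
import math

def chi_square_three_groups(with_dx, totals):
    """Chi-square test of equal proportions for one binary diagnosis.

    with_dx[i]: patients in cluster i who have the diagnosis
    totals[i]:  all patients in cluster i
    Assumes exactly three clusters, so df = (3 - 1) * (2 - 1) = 2.
    """
    if len(with_dx) != 3 or len(totals) != 3:
        raise ValueError("this sketch only handles three clusters")
    pooled = sum(with_dx) / sum(totals)  # overall proportion with the diagnosis
    expected, stat = [], 0.0
    for k, n_k in zip(with_dx, totals):
        e_yes = n_k * pooled         # expected count with the diagnosis
        e_no = n_k * (1.0 - pooled)  # expected count without it
        expected.append((e_yes, e_no))
        stat += (k - e_yes) ** 2 / e_yes + ((n_k - k) - e_no) ** 2 / e_no
    # With 2 degrees of freedom the chi-square upper tail is exactly exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return expected, stat, p_value

# Hypothetical counts: 30, 10 and 20 patients with the diagnosis out of 100 per cluster.
expected, stat, p_value = chi_square_three_groups([30, 10, 20], [100, 100, 100])
print(expected, stat, p_value)
```

The closed-form p-value only holds for two degrees of freedom, which is what three clusters and a binary variable give; for any other table shape a library routine would be the safer choice.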

But why do a test? It seems to me that if you want a cluster analysis, you don't want a test on the variables that went into the clustering. You want to look at the statistics that are output by the clustering procedure. The goal of a clustering procedure is to look at how the observations (here, patients) go together and, hopefully, to gain some insight, either on that specific question (e.g. "Huh! I knew that people who had A were likely to have B, but I didn't know they were likely to have C.") or on other variables (e.g. "Huh! People who have A, B, and C are really likely to be XXX", where XXX is something other than the conditions, like a demographic variable).

My intuition also says that this is a variation on the "Texas sharpshooter fallacy", where the guy shoots at a barn door and then paints bullseyes around the bullet holes.

Peter Flom
  • Thank you, Peter. I did the clustering using Kernel PCA (so the clustering was not done directly on the observed variables), and now I want to discover the characteristics of each cluster. I looked at 10 patients from each cluster, and it seems to me that one cluster consists of healthy patients and the other clusters consist of sick patients. Now I also want to see which diagnoses go together (the correlation between variables within each cluster; which test is useful?). Is the chi-square test not helpful here? (It tests whether there is an association between cluster status and diagnosis status.) – Solomon123 Jun 23 '23 at 13:24
  • Maybe choose a different clustering method that does use the observed variables? Most are designed to answer the exact question of which dx go together. – Peter Flom Jun 23 '23 at 13:30
  • I got the best clustering metrics using Kernel PCA. Two clusters are sick, and I want to discover how they differ (do they have different correlations between variables? Which diagnoses are dominant in each cluster?). What test can answer this question? (The professor here at the Hebrew University suggested chi-square to see whether cluster status depends on diagnosis status.) – Solomon123 Jun 23 '23 at 13:42