
I was wondering how one would properly validate the reproducibility of a trained clustering model on a completely new set of data.

Imagine you are clustering a patient population in hospital A. You find three clusters, one cluster with mainly cancer patients, one cluster with mainly cardiovascular patients, and one cluster with all other types of diagnoses. Such a clustering model could be meaningful, as the cluster in which a patient is placed tells you something about their condition. However, if you were to apply this clustering model to the population of another hospital B, how can you know that these clusters have the same meaning? Maybe the first cluster, which contains cancer patients in hospital A, contains completely different patient types in hospital B.

So far I have been unable to find any literature on this form of cluster validation, where you evaluate the same clustering model on new data, much as you would externally validate a supervised prediction model on a new set of data.

If anyone knows of such literature, or has ideas on how this should be done, or why they think this is a ridiculous idea, please do share.

  • "the first cluster, which contains cancer patients in hospital A, contains completely different patient types in hospital B": how do you establish which cluster is the "1st" one here and there? You sound as if you've already established the correspondence. – ttnphns Aug 24 '22 at 12:20
  • Let's assume we use K-Means clustering, so each cluster has a unique centroid in the feature space. I'd look up the position of the centroid of cluster 1 in hospital A. Then, in hospital B, you'd check which centroid is closest to centroid 1 from hospital A. The cluster belonging to that centroid should then correspond to cluster 1, under the assumption, of course, that the clustering model identifies 'the same' clusters in both hospitals. That is also the assumption we'd like to test. – The Jipsess Aug 25 '22 at 09:19
  • Please check point 5 here, just to start with the basic ideas, then search this site with a two-keyword query: clustering and cross-validation. You are likely to find good advice. – ttnphns Aug 25 '22 at 10:30
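
The centroid-matching idea from the comments can be sketched roughly as below. This is only an illustration under stated assumptions: the centroids are made-up numbers, both hospitals are assumed to use the same features on the same scale, and `match_centroids` is a hypothetical helper name, not part of any library. In practice the centroids might come from something like scikit-learn's `KMeans(...).cluster_centers_` fitted separately on each hospital.

```python
import numpy as np


def match_centroids(centroids_a, centroids_b):
    """For each centroid from hospital A, find the nearest centroid from hospital B.

    Hypothetical helper for illustration. Returns the index of the nearest
    B centroid per A centroid, the corresponding Euclidean distances, and a
    flag saying whether the nearest-neighbour mapping is one-to-one.
    """
    a = np.asarray(centroids_a, dtype=float)
    b = np.asarray(centroids_b, dtype=float)

    # Pairwise Euclidean distances, shape (n_clusters_a, n_clusters_b).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(a)), nearest]

    # If two A clusters map to the same B cluster, the correspondence is ambiguous.
    one_to_one = len(set(nearest.tolist())) == len(a)
    return nearest, nearest_dist, one_to_one


# Made-up centroids in a 2-D feature space (assumption, not real patient data).
centroids_a = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
centroids_b = np.array([[9.5, 0.5], [0.2, -0.1], [5.3, 4.8]])

mapping, distances, one_to_one = match_centroids(centroids_a, centroids_b)
for i, (j, d) in enumerate(zip(mapping, distances)):
    print(f"cluster {i} in A is closest to cluster {j} in B (distance {d:.3f})")
print("mapping is one-to-one:", one_to_one)
```

A small distance and a one-to-one mapping only say that the centroids sit in similar places. Whether the matched clusters carry the same meaning (for example, a similar mix of diagnoses) is still the assumption under test and would need to be checked separately.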
