I was wondering how one would properly validate the reproducibility of a trained clustering model on a completely new set of data.
Imagine you are clustering a patient population in hospital A. You find three clusters, one cluster with mainly cancer patients, one cluster with mainly cardiovascular patients, and one cluster with all other types of diagnoses. Such a clustering model could be meaningful, as the cluster in which a patient is placed tells you something about their condition. However, if you were to apply this clustering model to the population of another hospital B, how can you know that these clusters have the same meaning? Maybe the first cluster, which contains cancer patients in hospital A, contains completely different patient types in hospital B.
So far I have been unable to find any literature on this form of cluster validation, where you evaluate the same clustering model on new data, much as you would externally validate a supervised prediction model on a new set of data.
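To make the idea concrete, here is a minimal sketch of one check I could imagine, assuming the model is centroid-based and that diagnosis labels are available in hospital B. The centroids, features, and patients are all made up for illustration, and `assign` and `composition` are hypothetical helper names. It applies the hospital A centroids to hospital B and tabulates the diagnosis mix per cluster.

```python
import math
from collections import Counter

def assign(point, centroids):
    """Return the index of the nearest centroid (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def composition(points, diagnoses, centroids):
    """Per-cluster share of each diagnosis: {cluster: {diagnosis: fraction}}."""
    counts = {k: Counter() for k in range(len(centroids))}
    for point, diagnosis in zip(points, diagnoses):
        counts[assign(point, centroids)][diagnosis] += 1
    shares = {}
    for k, counter in counts.items():
        total = sum(counter.values())
        shares[k] = {d: n / total for d, n in counter.items()} if total else {}
    return shares

# Assumption: centroids learned on hospital A, over two made-up features.
centroids_a = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]

# Assumption: made-up hospital B patients with known diagnoses.
points_b = [(0.2, -0.1), (0.1, 0.3), (4.8, 5.1), (5.2, 4.9), (0.1, 4.7), (-0.2, 5.3)]
diagnoses_b = ["cancer", "cancer", "cardio", "cardio", "other", "cancer"]

# Diagnosis mix per cluster when hospital A's model is applied to hospital B.
comp_b = composition(points_b, diagnoses_b, centroids_a)
for cluster, shares in comp_b.items():
    print(cluster, shares)
```

Comparing this table with the same table computed on hospital A would show whether each cluster keeps roughly the same diagnosis mix. Whether that counts as proper validation is exactly what I am unsure about.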
If anyone knows of such literature, or has ideas on how this should be done, or why they think this is a ridiculous idea, please do share.
"the first cluster, which contains cancer patients in hospital A, contains completely different patient types in hospital B"
How do you establish which cluster is the "1st" cluster here and there? You sound as if you've already established the correspondence. – ttnphns Aug 24 '22 at 12:20