Suppose you have a training set and a testing set. You learn a clustering model (e.g., k-means) from the training set, and then assign each observation in the testing set to its nearest cluster.
If the clustering you have learned is meaningful for the population that generated the training and testing sets, you would expect the proportion of observations allocated to each cluster to be fairly similar between the sets. For example, suppose that, for the training set, cluster 1 has 20% of the data, cluster 2 has 30% of the data, and cluster 3 has 50% of the data. If, for the testing set, cluster 1 had 80% of the data and cluster 2 and 3 each had 10%, that would suggest that the clustering was not meaningful.
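To make the setup concrete, here is a minimal sketch of the comparison I mean. The data, centroids, and cluster sizes are all made up for illustration; the nearest-centroid assignment stands in for a fitted k-means model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three well-separated 2-D Gaussian blobs,
# with 20% / 30% / 50% of the training points in each.
train = np.vstack([rng.normal(c, 0.3, size=(n, 2))
                   for c, n in [((0, 0), 20), ((4, 0), 30), ((0, 4), 50)]])
test = np.vstack([rng.normal(c, 0.3, size=(n, 2))
                  for c, n in [((0, 0), 10), ((4, 0), 15), ((0, 4), 25)]])

# Stand-in for the centroids a k-means model would learn on the training set.
centroids = np.array([(0, 0), (4, 0), (0, 4)], dtype=float)

def assign(X, centroids):
    """Assign each row of X to its nearest centroid (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

train_prop = np.bincount(assign(train, centroids), minlength=3) / len(train)
test_prop = np.bincount(assign(test, centroids), minlength=3) / len(test)
print(train_prop)  # here: [0.2 0.3 0.5]
print(test_prop)
```

The question is then how to judge, formally, whether `train_prop` and `test_prop` are "similar enough".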
Is there a statistical test or measure one can use to evaluate whether a clustering is meaningful in the sense described above?
I found a similar question that addresses the issue of ascertaining the quality of a clustering using a test set. But it does not consider the issue of cluster proportions.
If the learned cluster structure is meaningful for the population, we would expect the training-set proportion $p_i$ to be close to the test-set proportion $\hat{p}_i$, for every cluster $i$. Is there a statistical test or measure of how similar the sets of cluster proportions from a training set and a test set are?
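One candidate I have considered (just to illustrate the kind of comparison I mean; whether it is appropriate is exactly my question) is a chi-square goodness-of-fit test, treating the training proportions as the expected distribution and the test-set allocations as the observed counts. The counts below are made up; note this treats the training proportions as fixed and ignores their own sampling variability:

```python
import numpy as np
from scipy.stats import chisquare

train_prop = np.array([0.2, 0.3, 0.5])   # p_i estimated from the training set
test_counts = np.array([12, 14, 24])     # observed test-set allocations, n = 50

# Expected test counts if the test set followed the training proportions.
expected = train_prop * test_counts.sum()

stat, pval = chisquare(f_obs=test_counts, f_exp=expected)
print(stat, pval)  # a large p-value means no evidence of a proportion shift
```

A two-sample test of homogeneity on the $2 \times k$ table of train/test counts would be the natural way to account for sampling variability in both sets, if that turns out to matter here.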
– ostrichgroomer Jan 16 '17 at 23:51