Equivalent of gap statistic for k-medoids on non-Euclidean spaces

Question

I have a connected graph with weighted edges. I want to partition the graph into communities with a clustering algorithm. I chose K-medoids and I run it on a distance matrix, where distance = 1 / (1 + edge weight).

If the feature space were Euclidean, I could simulate a uniform distribution over features and compute the "gap statistic" as the difference between the loss from k-medoids on the real data and the loss of k-medoids on simulated data. But the space is not Euclidean and I don't think I can simulate a uniform distribution and compute the gap statistic.

How can I select the optimal number of clusters?

Is after all your question about gap statistic or about internal cluster validity indices in general (for there are 100+ such indices, the Gap being just one)? — ttnphns, Mar 26 '24 at 20:57
Here is a short overview locally, with some popular indices and their behaviour mentioned https://stats.stackexchange.com/a/358937/3277. — ttnphns, Mar 26 '24 at 21:05

Christian Hennig · Accepted Answer · 2024-03-27T01:19:34.643

A method that is often used together with K-medoids is maximising the Average Silhouette Width (Silhouette method).

In general, estimating the number of clusters is a hard problem, particularly because there is no unique definition of what a "correct/true" clustering actually is, and neither a correct or true number of clusters. What is good will generally depend on background information such as the meaning of the data, the aim of clustering, and how the clustering will be used. (You may be aware that the gap statistic requires tuning decisions, see the documentation of clusGap in R, and different tunings that are in use give different numbers of clusters on many data sets, which illustrates the difficulty of the problem.)

A more general approach similar to the gap statistic is in Hennig, C., Lin, CJ. Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters. Stat Comput 25, 821–833 (2015). https://doi.org/10.1007/s11222-015-9566-5. This requires the user to come up with a homogeneity null model (similar to the uniform in the gap statistic) though.

Equivalent of gap statistic for k-medoids on non-Euclidean spaces

1 Answers1