1

I have a connected graph with weighted edges. I want to partition the graph into communities with a clustering algorithm. I chose K-medoids and I run it on a distance matrix, where distance = 1 / (1 + edge weight).

If the feature space were Euclidean, I could simulate a uniform distribution over features and compute the "gap statistic" as the difference between the loss from k-medoids on the real data and the loss of k-medoids on simulated data. But the space is not Euclidean and I don't think I can simulate a uniform distribution and compute the gap statistic.

How can I select the optimal number of clusters?

  • 2
    Is after all your question about gap statistic or about internal cluster validity indices in general (for there are 100+ such indices, the Gap being just one)? – ttnphns Mar 26 '24 at 20:57
  • 2
    Here is a short overview locally, with some popular indices and their behaviour mentioned https://stats.stackexchange.com/a/358937/3277. – ttnphns Mar 26 '24 at 21:05

1 Answers1

3

A method that is often used together with K-medoids is maximising the Average Silhouette Width (Silhouette method).

In general, estimating the number of clusters is a hard problem, particularly because there is no unique definition of what a "correct/true" clustering actually is, and neither a correct or true number of clusters. What is good will generally depend on background information such as the meaning of the data, the aim of clustering, and how the clustering will be used. (You may be aware that the gap statistic requires tuning decisions, see the documentation of clusGap in R, and different tunings that are in use give different numbers of clusters on many data sets, which illustrates the difficulty of the problem.)

A more general approach similar to the gap statistic is in Hennig, C., Lin, CJ. Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters. Stat Comput 25, 821–833 (2015). https://doi.org/10.1007/s11222-015-9566-5. This requires the user to come up with a homogeneity null model (similar to the uniform in the gap statistic) though.