Estimating the number of clusters using the gap statistic

Question

Since my application works on streaming data, I chose BIRCH to create the clusters. On its own, BIRCH doesn't produce high-quality results, so it requires a "global clustering step" to improve the output clusters. This global step is often performed using agglomerative clustering or K-Means.

I am trying to use the BIRCH clustering results as input to the gap statistic in order to estimate the number of clusters (K), which would then be the input to K-Means as the global step in BIRCH.

Instead of the whole dataset, I feed the gap statistic the BIRCH subcluster centers as a new dataset. I am also testing this approach with the Pham method, which seems to give better results than the gap statistic.
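
For concreteness, here is a minimal sketch of how the subcluster centers can be pulled out of scikit-learn's `Birch` so they can serve as the reduced dataset. The data (`make_blobs` with 10 centers), the chunk count, and the `threshold` value are assumptions made for illustration, not the settings used in my tests.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Small synthetic stand-in for the streaming data (assumed, for illustration).
X, _ = make_blobs(n_samples=10_000, centers=10, cluster_std=0.5, random_state=0)

# n_clusters=None skips BIRCH's built-in global clustering step,
# so only the CF-tree subclusters are built.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=None)

# Feed the data in chunks to mimic a stream.
for chunk in np.array_split(X, 10):
    birch.partial_fit(chunk)

# One row per leaf subcluster; this array is the "new dataset"
# handed to the method that estimates K.
centers = birch.subcluster_centers_
print(centers.shape)
```

The number of subclusters depends strongly on `threshold`, so the reduced dataset can look quite different from the original point cloud.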

One of the datasets I am using for testing comes from the sklearn BIRCH examples: 100K points around 100 centers. In Fig. 1, the Pham method correctly guessed the number of clusters in this dataset (BIRCH produced 148 clusters; the centers of those 148 clusters were the input points for the Pham method).
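
As a reference for what I mean by the Pham method, below is a rough sketch of the evaluation function f(K) as I understand it from Pham, Dimov and Nguyen's paper on selecting K in K-Means. The function name `pham_f` and the distortion values in the usage lines are made up for illustration; in practice the distortions S_K would come from running K-Means (e.g. its inertia) for K = 1, 2, ... on the subcluster centers.

```python
def pham_f(distortions, n_dims):
    """Compute Pham's f(K) for K = 1..len(distortions).

    distortions[k - 1] is S_K, the K-Means distortion for K = k clusters.
    n_dims is the number of features (the weight is defined for n_dims > 1).
    """
    if n_dims < 2:
        raise ValueError("the weight alpha_K is defined for n_dims > 1")

    f_values = []
    alpha = None
    for k, s_k in enumerate(distortions, start=1):
        if k == 1:
            f_values.append(1.0)  # f(1) = 1 by definition
            continue
        if k == 2:
            alpha = 1.0 - 3.0 / (4.0 * n_dims)
        else:
            alpha = alpha + (1.0 - alpha) / 6.0
        s_prev = distortions[k - 2]
        # f(K) = S_K / (alpha_K * S_{K-1}), or 1 when S_{K-1} is zero
        f_values.append(1.0 if s_prev == 0 else s_k / (alpha * s_prev))
    return f_values


# Made-up distortions for K = 1, 2, 3 in two dimensions.
f = pham_f([10.0, 5.0, 2.5], n_dims=2)
# The K with the smallest f(K) is taken as the estimate.
best_k = min(range(1, len(f) + 1), key=lambda k: f[k - 1])
print(f, best_k)
```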

Using the gap statistic, I always get K=1 as a result. Following this post, I tried changing the scale, but I am still unable to get good results. The results and the dataset are shown in Fig. 2 (the dataset again consists of the subcluster centers produced by BIRCH).
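
To make the procedure explicit, this is a minimal sketch of the gap statistic computation, assuming a uniform reference distribution over the bounding box of the data and the usual "smallest k with Gap(k) >= Gap(k+1) - s_{k+1}" selection rule. The helper name `gap_statistic`, the parameter defaults, and the three-blob test data are assumptions for illustration and not my actual code or data.

```python
import numpy as np
from sklearn.cluster import KMeans


def gap_statistic(X, k_max=10, n_refs=10, seed=0):
    """Return (gaps, s_k, k_hat) for k = 1..k_max."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_wk(data, k):
        # KMeans inertia is the within-cluster sum of squares W_k.
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        return np.log(km.inertia_)

    gaps = np.empty(k_max)
    s_k = np.empty(k_max)
    for k in range(1, k_max + 1):
        # log(W_k) on n_refs uniform reference datasets of the same shape as X
        ref = np.array([
            log_wk(rng.uniform(lo, hi, size=X.shape), k) for _ in range(n_refs)
        ])
        gaps[k - 1] = ref.mean() - log_wk(X, k)
        s_k[k - 1] = ref.std() * np.sqrt(1.0 + 1.0 / n_refs)

    # Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}; fall back to k_max.
    k_hat = k_max
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s_k[k]:
            k_hat = k
            break
    return gaps, s_k, k_hat


# Three well-separated blobs as a small test input (assumed data).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

gaps, s_k, k_hat = gap_statistic(X, k_max=5, n_refs=5)
print(np.round(gaps, 3), k_hat)
```

Because the rule picks the smallest k that satisfies the inequality, it stops at K=1 whenever the gap curve does not rise by more than s_2 between k=1 and k=2, so printing the whole `gaps` array is more informative than looking at `k_hat` alone.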

Do you have any suggestions for how I can improve the results of the gap statistic?

Don't try to draw any conclusions from toy data. – Has QUIT--Anony-Mousse, Nov 21 '15 at 08:58
