Is this expected behavior from a Gaussian Mixture Model? Given a perfectly homogeneous dataset, why is the cluster center not exactly the same as the data point?
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Create a vector (180, 3)
val v = Vectors.dense(180.0, 3.0)
// Create an RDD with one million copies of 'v'
val tVrdd = sc.parallelize(Seq.fill(1000000)(v))
// Cluster the dataset into 10 clusters
val gmm = new GaussianMixture().setK(10).run(tVrdd)
// What's the cluster center?
scala> gmm.gaussians(0).mu
res11: org.apache.spark.mllib.linalg.Vector = [180.0000000000454,3.000000000001699]
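The small offset is expected: EM stops iterating once the log-likelihood improvement falls below a convergence tolerance, so the estimated means carry residual numerical error. A hedged sketch of tightening that tolerance, using the `setConvergenceTol` and `setMaxIterations` setters on the mllib `GaussianMixture` builder (this assumes a live `SparkContext` and the `tVrdd` defined above):

```scala
import org.apache.spark.mllib.clustering.GaussianMixture

// A tighter tolerance and more iterations shrink (but never fully remove)
// the numerical error in the estimated means.
val gmmTight = new GaussianMixture()
  .setK(10)
  .setConvergenceTol(1e-9)  // default is 0.01
  .setMaxIterations(500)    // default is 100
  .run(tVrdd)

gmmTight.gaussians(0).mu    // should land closer to [180.0, 3.0]
```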
As a note, I figured I could use KMeans to determine the number of clusters and then use that to set `k` for the Gaussian mixture:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
val numClusters = 10
val numIterations = 20
val clusters = KMeans.train(tVrdd, numClusters, numIterations)
scala> clusters.k
res12: Int = 1
val k = clusters.k
val gmm = new GaussianMixture().setK(k).run(tVrdd)
// What's the cluster center now?
gmm.gaussians(0).mu
scala> gmm.gaussians(0).mu
res13: org.apache.spark.mllib.linalg.Vector = [180.0,3.0]
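With `k = 1`, the single Gaussian's mean reduces to the plain sample average, which is exact for identical points (180.0 and 3.0 sum without rounding error in double precision). A minimal illustration of that arithmetic in plain Scala, no Spark required:

```scala
// Sample mean of a perfectly homogeneous dataset: every point is (180.0, 3.0),
// so the component-wise average is exactly the data point itself.
val data = Seq.fill(1000)(Array(180.0, 3.0))
val sums = data.reduce((a, b) => Array(a(0) + b(0), a(1) + b(1)))
val mean = sums.map(_ / data.length)
// mean is Array(180.0, 3.0) exactly, with no EM iteration error
```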
What's happening is: I have a large dataset that is mostly heterogeneous, except for occasional subsets that are perfectly homogeneous. When GMM clusters these homogeneous subsets, it places the cluster center slightly off from where it should be, and that has other implications.
– Joe Nate Jul 12 '17 at 00:18

[180.0000000000454,3.000000000001699] vs [180.0,3.0]: that's a result of the numerical methods used in GMM. It estimates what the centers should be, and that estimate will contain some slight error regardless of how homogeneous the data may be. – Jon Jul 12 '17 at 16:58
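Given that estimation error, a practical downstream check is to compare centers within a tolerance rather than exactly. A minimal sketch in plain Scala (the helper name `approxEqual` and the epsilon are illustrative choices, not part of any Spark API):

```scala
// Compare an estimated center to the expected one within a tolerance,
// since EM only converges to within its convergence threshold.
def approxEqual(a: Array[Double], b: Array[Double], eps: Double = 1e-6): Boolean =
  a.length == b.length && a.zip(b).forall { case (x, y) => math.abs(x - y) <= eps }

val estimated = Array(180.0000000000454, 3.000000000001699)
val expected  = Array(180.0, 3.0)
approxEqual(estimated, expected)  // true: the centers agree to well within 1e-6
```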