I'm trying to fit a 2-component 2D Gaussian Mixture Model to some data. I know that there are only two components. The distribution can be seen below in the left plot:
By eye it's effortless to pick out a smaller, tightly clustered distribution centered at around (1, 1) and a much more dispersed distribution acting as "noise". The sklearn.mixture.GaussianMixture class fails miserably though, as seen in the right plot (code below).
I've tried increasing the number of iterations and every covariance_type option ('spherical', 'diag', 'tied', 'full') to no avail (a sketch of that sweep follows the code below). Is GMM not the right tool for this problem? Is there a better approach?
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture
# Number of samples per component
N_field, N_clust = 5000, 200

# Generate random sample, two components
C = np.array([[0., 0.1], [.1, .0]])
X = np.r_[np.dot(np.random.randn(N_clust, 2), C) + 1.,
          np.random.randn(N_field, 2)]

# Fit a 2-component Gaussian mixture with EM
gmm = mixture.GaussianMixture(n_components=2, max_iter=100, n_init=10)
clf = gmm.fit(X)
Y_ = clf.predict(X)

# Plot
plt.subplot(121)
plt.scatter(*X.T, 1, color='k')
colors = ('blue', 'orange')
splot = plt.subplot(122)
for i, (mean, cov) in enumerate(zip(clf.means_, clf.covariances_)):
    v, w = linalg.eigh(cov)
    if not np.any(Y_ == i):
        continue
    plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], 1, color=colors[i], alpha=.5)

    # Plot an ellipse to show the Gaussian component
    angle = np.arctan2(w[0][1], w[0][0])
    angle = 180. * angle / np.pi  # convert to degrees
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    ell = mpl.patches.Ellipse(mean, v[0], v[1], angle=180. + angle, color=colors[i])
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(.3)
    splot.add_artist(ell)
plt.show()
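
For completeness, the sweep I mentioned above looked roughly like this (an illustrative sketch appended to the script above, so it reuses X and the imports; the max_iter value here is just an example of "more iterations", and the print is only there to compare the fits). Every setting gives essentially the same result as the right plot.

# Sketch of the settings I've tried: every covariance_type plus more
# EM iterations. None of these isolates the small cluster at (1, 1).
for cov_type in ('spherical', 'diag', 'tied', 'full'):
    gmm = mixture.GaussianMixture(n_components=2,
                                  covariance_type=cov_type,
                                  max_iter=1000,  # "more iterations"; exact value is arbitrary
                                  n_init=10)
    labels = gmm.fit_predict(X)
    # Compare the fitted means and how many points land in each component
    print(cov_type, gmm.means_.round(2), np.bincount(labels))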