5

I am using the EM algorithm to fit my measured data ($y$) as a sum of two weighted Gaussian distributions:

$$\text{model} = \sum \limits_{i=1}^{L=2} w_i\, \phi(\theta_i)$$

where $\theta_i = (\mu_i, \sigma_i)$. The problem is that the algorithm does not converge properly if the initialization is far from the true values (I generate the observed data as a sum of two Gaussians plus noise). So my question is about a good approach to estimating the initial values before running EM.

For the initialization, I start with an empty model and then, in a loop, I compute the following:

Initialization:

$$r = y - \text{model} \qquad (r = \text{residual})$$

$$\hat\theta_i = \arg\max_{\theta} \left|r^T\phi(\theta)\right|^2 \qquad \text{(solved with Newton optimization)}$$

$$\hat w_i = \frac{\phi(\hat\theta_i)^T r}{\phi(\hat\theta_i)^T\phi(\hat\theta_i)}$$

$$\text{model} = \text{model} + \hat w_i\, \phi(\hat\theta_i)$$
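A minimal R sketch of that loop (simplified from what I actually run; optim() stands in for the Newton step, and the toy data are generated the way I described, two Gaussian bumps on a grid plus noise):

    # Minimal sketch of the initialization loop above: phi(theta) is a Gaussian
    # curve evaluated on the grid x, and optim() stands in for the Newton step.
    x <- seq(-10, 10, by = 0.01)
    y <- 0.7 * dnorm(x, -2, 1) + 0.3 * dnorm(x, 3, 0.5) +   # two Gaussians
         rnorm(length(x), sd = 0.005)                        # plus noise

    phi <- function(theta) dnorm(x, mean = theta[1], sd = abs(theta[2]))

    L     <- 2
    model <- rep(0, length(y))
    est   <- matrix(NA, L, 3, dimnames = list(NULL, c("w", "mu", "sigma")))

    for (i in 1:L) {
      r   <- y - model                                # residual
      obj <- function(theta) -sum(r * phi(theta))^2   # -|r^T phi(theta)|^2
      th  <- optim(c(0, 3), obj)$par                  # crude starting point
      p   <- phi(th)
      w   <- sum(p * r) / sum(p * p)                  # least-squares weight
      model    <- model + w * p
      est[i, ] <- c(w, th[1], abs(th[2]))
    }
    est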

But I am not getting good initial values: they are quite different from the real ones, so my EM algorithm doesn't converge.

Edit 1: When I try to find the initial values as described above (computing the residual, correlating it with the Gaussian functions to estimate $\mu$ and $\sigma$, and then computing the weights), the first iteration finds a correct value of $\mu$, but the estimated $\sigma$ is really small, so the factor $\frac{1}{\sigma \sqrt{2\pi}}$ is huge and so is the correlation. That leads to wrong values for the weights and for the remaining parameters in the next iterations. This is the point where I am stuck.
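To make the effect concrete, here is a toy check (not my real data; the residual is replaced by a unit Gaussian bump on a grid). The raw correlation keeps growing as $\sigma$ drops below the grid spacing, while dividing by $\|\phi\|$ (only an experiment of mine, not part of the procedure above) peaks near the correct $\sigma$:

    # Toy illustration of the scale problem: r is a stand-in residual (a unit
    # Gaussian bump on a grid). The raw correlation r^T phi keeps growing as
    # sigma shrinks below the grid step, while dividing by ||phi|| (only an
    # experiment, not part of my procedure) peaks near the correct sigma.
    x <- seq(-5, 5, by = 0.01)
    r <- dnorm(x, 0, 1)
    for (s in c(2, 1, 0.1, 0.001)) {
      p <- dnorm(x, 0, s)
      cat(sprintf("sigma = %6.3f   raw = %8.2f   normalised = %5.2f\n",
                  s, sum(r * p), sum(r * p) / sqrt(sum(p * p))))
    }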

xDazz
  • 51
  • As written in my answer, running EM with a large number of initial values should spot the major modes and then you can compare the likelihood values at those modes to decide which one is the global mode. – Xi'an Feb 22 '16 at 21:02

2 Answers

5

Here is a self-contained exercise from our book Introducing Monte Carlo Methods with R:

Consider a sample of size $n$ from a mixture distribution with unknown weights, $$ X_i \sim \theta g(x) + (1-\theta) h(x),\quad i=1, \ldots, n, $$ where $g(\cdot)$ and $h(\cdot)$ are known.

  1. Introduce $Z_1, \ldots, Z_n$, where $Z_i$ indicates the distribution from which $X_i$ has been drawn, so $$ X_i | Z_i=1 \sim g(x),\quad X_i | Z_i=0 \sim h(x)\,. $$ Show that the complete-data likelihood can be written as $$ L^c (\theta|{\mathbf x}, {\mathbf z})=\prod_{i=1}^n\left[z_ig(x_i)+(1-z_i)h(x_i)\right] \theta^{z_i}(1-\theta)^{1-z_i}. $$
  2. Show that $$\mathbb{E}[Z_i| \theta, x_i] = \theta g(x_i)/[\theta g(x_i)+(1-\theta)h(x_i)]$$ and deduce that the EM sequence is given by $$ \hat{\theta}_{(j+1)}=\frac{1}{n}\sum_{i=1}^n \frac{\hat{\theta}_{(j)} g(x_i)} {\hat{\theta}_{(j)} g(x_i)+(1-\hat{\theta}_{(j)})h(x_i)}. $$
  3. Examine the convergence properties of this EM algorithm on a simulated dataset with $n=25$, $\theta=0.3$, $h(x)=\varphi(x)$, and $g(x)=\varphi((x-2)/2)/2$, where $\varphi$ denotes the $\mathcal{N}(0,1)$ density.

The EM algorithm has the property that the observed-data likelihood is non-decreasing at each iteration. This means that "an EM sequence $\{\hat{\theta}_{(j)}\}_j$ converges to a stationary point of $L(\theta| {\mathbf x})$, but not necessarily to the maximum likelihood estimator or even to a local maximum. In practice, running the EM algorithm several times with different, randomly chosen starting points is thus recommended if one wants to avoid using a poor approximation to the true maximum."

[Figure: different starting points lead to three possible endpoints in an example where only the two means are unknown.]
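A minimal R sketch of the EM recursion in item 2, applied to the simulated dataset of item 3 and started from several initial values (my own quick code, not the book's solution):

    # Minimal sketch of the EM recursion in item 2, applied to the simulated
    # dataset of item 3 and started from several initial values of theta.
    set.seed(1)
    n     <- 25
    theta <- 0.3
    g <- function(x) dnorm(x, mean = 2, sd = 2)    # g(x) = phi((x-2)/2)/2
    h <- function(x) dnorm(x)                      # h(x) = phi(x)

    z <- rbinom(n, 1, theta)                       # latent component labels
    x <- ifelse(z == 1, rnorm(n, 2, 2), rnorm(n))  # the mixture sample

    em <- function(x, theta0, iters = 500) {
      th <- theta0
      for (j in 1:iters)
        th <- mean(th * g(x) / (th * g(x) + (1 - th) * h(x)))
      th
    }
    sapply(c(0.05, 0.30, 0.50, 0.95), function(t0) em(x, t0))

Since only the weight $\theta$ is unknown in this exercise, the observed log-likelihood is concave in $\theta$ and every interior starting value ends up at essentially the same estimate; the multiple endpoints shown in the figure occur in the harder example where the two means are unknown.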

For mixtures in particular, the number of local modes and saddle-points on the likelihood surface depends on the number of components (2 in your case) and on the number of clusters in your data. By clusters, I mean that if you generate a large enough sample from the mixture $$\sum_{i=1}^{10} 10^{-1}\mathcal{N}(i,10^{-1})$$ and apply EM to that sample, there will most likely be a multiple of 10 (local) modes on the likelihood surface.

This leads to the proposal of using subgroups of the sample to initialise the parameters at realistic values. Clustering algorithms can be used for this, for instance, but they are not necessary: a random grouping can be considered as well.
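For instance, a quick R sketch of the random grouping option (variable names and toy data are mine, purely for illustration):

    # Random-grouping initialisation for a two-component Gaussian mixture:
    # split the sample at random into k groups and use each group's share,
    # mean and standard deviation as starting values for EM.
    init_random_groups <- function(x, k = 2) {
      grp <- sample.int(k, length(x), replace = TRUE)
      list(w     = as.numeric(table(factor(grp, levels = 1:k))) / length(x),
           mu    = tapply(x, grp, mean),
           sigma = tapply(x, grp, sd))
    }

    x <- c(rnorm(150, 0, 1), rnorm(50, 4, 0.5))    # toy mixture sample
    init_random_groups(x)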

Xi'an
  • 105,342
2

You're running a special case of EM for Gaussian mixture models, in which case the most frequently used initialization is k-means. That can be viewed as a kind of "hard-assignment" version of GMM clustering, and so starting from there is typically reasonable.

Since you're doing this in one dimension, there are efficient k-means algorithms available that are guaranteed to find the global optimum, e.g. Ckmeans.1d.dp if you happen to use R. But if speed isn't a huge concern, standard k-means implementations will probably be fine.
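For concreteness, here's a rough sketch in R using base kmeans(); the clustering step could be swapped for Ckmeans.1d.dp::Ckmeans.1d.dp(x, 2) if exact 1-D k-means matters:

    # k-means initialisation for a two-component, one-dimensional Gaussian
    # mixture: cluster the sample, then hand each cluster's share, mean and
    # standard deviation to EM as starting values.
    set.seed(42)
    x  <- c(rnorm(200, 0, 1), rnorm(100, 5, 2))    # toy data
    cl <- kmeans(x, centers = 2)$cluster
    list(w     = as.numeric(table(cl)) / length(x),
         mu    = tapply(x, cl, mean),
         sigma = tapply(x, cl, sd))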

Danica
  • 24,685