
I am currently learning hierarchical Bayesian models using JAGS in R, and also PyMC in Python ("Bayesian Methods for Hackers").

I can get some intuition from this post: "you will end up with a pile of numbers that looks "as if" you had somehow managed to take independent samples from the complicated distribution you wanted to know about." My understanding is something like this: if I can write down the conditional probabilities, I can generate a memoryless (Markov) process from them. If I run the process long enough, the joint probability converges, and then I can take a pile of numbers from the end of the generated sequence, just as if I had taken independent samples from the complicated joint distribution. For example, I can make a histogram of them and it approximates the density function.
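For instance, here is a toy random-walk Metropolis sampler I tried in R (the target is just a standard normal, so it is only meant to illustrate the intuition, not a real hierarchical model):

```r
# Toy random-walk Metropolis: the target is a standard normal density.
set.seed(42)
n_iter <- 10000
draws <- numeric(n_iter)                                # chain starts at 0
for (t in 2:n_iter) {
  proposal <- draws[t - 1] + rnorm(1, sd = 1)           # memoryless: depends only on the current state
  accept_prob <- min(1, dnorm(proposal) / dnorm(draws[t - 1]))
  draws[t] <- if (runif(1) < accept_prob) proposal else draws[t - 1]
}
hist(draws[-(1:1000)], breaks = 50, freq = FALSE)       # drop early draws; histogram approximates the target
curve(dnorm(x), add = TRUE)                             # overlay the true N(0, 1) density
```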

My question is: do I need to prove that MCMC converges for a particular model? I ask because I previously learned the EM algorithm for GMMs and LDA (graphical models). If I could just use an MCMC algorithm without proving that it converges, it would save much more time than EM, since with EM I have to compute the expected log-likelihood (which requires the posterior probability of the latent variables) and then maximize it. That is apparently more cumbersome than MCMC, where I only need to formulate the conditional probabilities.

I am also wondering: if the likelihood function and the prior distribution are conjugate, does that mean the MCMC must converge? More generally, I am wondering about the limitations of MCMC compared with EM.

DQ_happy
  • MCMC converges as $n \rightarrow \infty$ by definition. Rather than proving it, you diagnose the convergence to check whether your model has converged, e.g. http://www.math.pku.edu.cn/teachers/xirb/Courses/QR2013/ReadingForFinal/MCMC/2291683.pdf or http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf – Tim Mar 26 '15 at 07:30
  • @Tim Thanks for the reply! I think I previously did not fully understand MCMC; I will edit my question later. But I still have a question: if MCMC must converge, what are MCMC's drawbacks compared with EM? It seems EM is more cumbersome. – DQ_happy Mar 26 '15 at 07:56
  • For example, why do most papers choose EM for LDA? Also, when we take machine learning classes we learn EM first, but we seldom learn MCMC. – DQ_happy Mar 26 '15 at 07:58
  • EM is faster, it is non-Bayesian (not everyone loves Bayesian statistics), and in some cases it has fewer identifiability issues (it converges to a single maximum value, while with the MCMC approach you have a whole distribution that can be more complicated than a point estimate), etc. – Tim Mar 26 '15 at 08:28
  • @Tim Ah, I do not see why EM is non-Bayesian; up till now I have only used the EM algorithm for graphical-model problems and I thought they were all Bayesian-related. Can you give me a common non-Bayesian problem that uses the EM algorithm? Many thanks! – DQ_happy Mar 26 '15 at 08:33
  • EM is used for maximum likelihood or maximum a posteriori estimation, but it was initially described as an ML algorithm and is commonly used in the ML approach (see http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm). – Tim Mar 26 '15 at 08:46
  • Even if you use EM for MAP estimation rather than ML, it is non-Bayesian for me because it does not try to characterise the posterior distribution but only gets you a local mode of it. – Luca Mar 26 '15 at 08:54
  • @Luca Thanks for the explanation! Just to make it clear: does that mean the biggest difference between MCMC and EM is that with MCMC we must have a prior distribution on the parameters, while for EM we only need the conditional probability? (That would mean EM is not Bayesian, since we do not assume a prior and only get point estimates of the parameters.) – DQ_happy Mar 26 '15 at 22:01
  • For me using EM is non-Bayesian because it gives you a point estimate of your parameters of interest and does not quantify the full posterior distribution. With both EM and MCMC one can have a full probabilistic model with priors, latent and observed random variables, but the inference is different: MCMC aims to characterise the full posterior distribution, while EM does not convey the information in the full posterior distribution. For me a Bayesian is someone who uses the posterior distribution for decision making. However, this might be simplistic; I am also learning this stuff. – Luca Mar 26 '15 at 22:24

1 Answer


EM is an optimisation technique: given a likelihood with useful latent variables, it returns a local maximum of that likelihood, which may be a global maximum depending on the starting value.
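For concreteness, a minimal EM sketch for a two-component Gaussian mixture in one dimension (toy simulated data and arbitrary starting values) alternates an E-step and an M-step:

```r
# Toy EM for a two-component 1D Gaussian mixture.
set.seed(1)
y <- c(rnorm(150, -2, 1), rnorm(150, 3, 1))        # simulated data
pi1 <- 0.5; mu <- c(-1, 1); sigma <- c(1, 1)       # starting values
for (iter in 1:100) {
  # E-step: posterior responsibilities of component 1 under the current parameters
  d1 <- pi1 * dnorm(y, mu[1], sigma[1])
  d2 <- (1 - pi1) * dnorm(y, mu[2], sigma[2])
  r <- d1 / (d1 + d2)
  # M-step: maximise the expected complete-data log-likelihood
  pi1   <- mean(r)
  mu    <- c(sum(r * y) / sum(r), sum((1 - r) * y) / sum(1 - r))
  sigma <- c(sqrt(sum(r * (y - mu[1])^2) / sum(r)),
             sqrt(sum((1 - r) * (y - mu[2])^2) / sum(1 - r)))
}
round(c(weight1 = pi1, mu1 = mu[1], mu2 = mu[2]), 2)
```

Different starting values may lead to different local maxima, hence the dependence on initialisation.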

MCMC is a simulation method: given a likelihood with or without latent variables, and a prior, it produces a sample that is approximately distributed from the posterior distribution. The first values of that sample usually depend on the starting value, which is why they are often discarded as a burn-in (or warm-up) stage.

When this sample is used to evaluate integrals associated with the posterior distribution [the overwhelming majority of the cases], the convergence properties are essentially the same as those of an iid Monte Carlo approximation, by virtue of the ergodic theorem.
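In practice, this means posterior quantities reduce to plain averages over the post-burn-in draws; a minimal sketch, with a placeholder vector standing in for real MCMC output:

```r
# Placeholder draws standing in for the post-burn-in output of an actual sampler.
set.seed(1)
draws <- rnorm(5000, mean = 1, sd = 0.5)
mean(draws^2)                       # Monte Carlo estimate of E[theta^2 | data]
quantile(draws, c(0.025, 0.975))    # approximate 95% posterior interval
```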

If more is needed, i.e., a guarantee that $(x_t,\ldots,x_{t+T})$ is a sample from the posterior $\pi(x|\mathfrak{D})$, some convergence assessment techniques are available, for instance in the R package CODA. Tools that theoretically ensure convergence, such as perfect sampling or renewal methods, do exist but are presumably beyond your reach.
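A minimal sketch of such an assessment with CODA, assuming two independent chains stored as matrices of draws (placeholder values below), could look like this:

```r
# Hypothetical convergence diagnostics with the coda package;
# chain1 / chain2 stand in for two independent runs of the same sampler.
library(coda)
set.seed(1)
chain1 <- matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("alpha", "beta")))
chain2 <- matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("alpha", "beta")))
chains <- mcmc.list(mcmc(chain1), mcmc(chain2))
gelman.diag(chains)     # Gelman-Rubin potential scale reduction factor (close to 1 is good)
geweke.diag(chains)     # Geweke z-scores comparing early and late parts of each chain
effectiveSize(chains)   # effective sample size after accounting for autocorrelation
```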

Xi'an