OK, let's start by addressing your questions piecemeal. First, how is $q$, called the jumping (or proposal) distribution, chosen? It's up to you, the modeler. A reasonable default, as always, is a Gaussian centered at the current point, but this may change according to the problem at hand. The choice of jumping distribution will change how you walk around, of course, but it is largely an arbitrary choice.
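For concreteness, here is a minimal sketch of a Gaussian random-walk proposal in Python (the name `proposal_scale` and the function itself are my own illustrative choices, not from your question):

```python
import numpy as np

def propose(x, proposal_scale=1.0):
    # Symmetric Gaussian jumping distribution centered on the current point:
    # q(x' | x) = Normal(x, proposal_scale^2).
    # A larger scale takes bigger jumps (more rejections);
    # a smaller scale explores the distribution more slowly.
    return x + proposal_scale * np.random.randn()
```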
Now, the core of Metropolis-Hastings is the choice of $\alpha$. You can think of $\alpha$ as the way you control the sampling procedure. The main idea behind MCMC is that, in order to estimate an unknown distribution, you 'walk around' it such that the amount of time spent in each location is proportional to the height of the distribution. What $\alpha$ does is ask, 'compared to our previous location, how much higher or lower are we?' If the new point is higher, we are more likely to move there; if it is lower, we are more likely to stay where we are (this is Step 3 of the algorithm you reference). The precise functional form of $\alpha$ can be derived; fundamentally, it comes from requiring that the target distribution be the stationary distribution of the chain.
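For reference, the standard form is: with target density $p$ and a proposal $x'$ drawn from $q(x' \mid x)$,

$$\alpha(x' \mid x) = \min\left(1,\ \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right),$$

which reduces to $\min\left(1,\ p(x')/p(x)\right)$ when $q$ is symmetric (the original Metropolis algorithm). This ratio is exactly what makes the chain satisfy detailed balance with respect to $p$, and note that it only needs $p$ up to a normalizing constant.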
Next, let's discuss your final question. Generally speaking, this accept/reject idea goes beyond Metropolis-Hastings; you should google 'rejection sampling.' If you've heard of that, the mechanism here is closely related (with the twist that a rejected proposal means the chain repeats its current point rather than discarding a draw). The point is to ensure that you fully explore the distribution and don't get 'stuck' in one place.
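Putting the pieces together, here is a minimal sketch of a Metropolis sampler in Python (the target, an unnormalized standard Gaussian, and all names are my own illustrative choices):

```python
import numpy as np

def target(x):
    # Unnormalized target density: MCMC only needs p up to a constant.
    return np.exp(-0.5 * x**2)

def metropolis(n_samples, x0=0.0, proposal_scale=1.0):
    samples = np.empty(n_samples)
    x = x0
    for i in range(n_samples):
        # Step 1: propose a move from the symmetric Gaussian jumping distribution.
        x_new = x + proposal_scale * np.random.randn()
        # Step 2: acceptance probability (q is symmetric, so it cancels out).
        alpha = min(1.0, target(x_new) / target(x))
        # Step 3: accept with probability alpha; otherwise stay where we are.
        if np.random.rand() < alpha:
            x = x_new
        samples[i] = x
    return samples

draws = metropolis(10_000)
print(draws.mean(), draws.std())  # should be close to 0 and 1
```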
Hopefully this has given you some greater intuition behind the algorithm. I do recommend spending some time delving into the math; my approach here is very casual and focused on interpretability. Though the math can be intimidating, it's the best way to build intuition. Working through a software implementation, like the sketch above, may also help. As always, The Elements of Statistical Learning and Bishop's Pattern Recognition and Machine Learning are great references, and there are a plethora of online resources you can find to further your understanding. Cheers!