Is MCMC really better than raw MC to Sample a region?

Question

I have implemented Markov Chain Monte Carlo (MCMC) with the Metropolis-Hastings sample selection criterion.

Basically, as I understand it and have implemented it, the algorithm is:

 n = number of accepted samples
 current_x = 0.5
 i = 0        # accepted samples so far
 trials = 0   # proposals evaluated so far

 while i < n:
     new_x = random()
     new_f = CDF(new_x)
     current_f = CDF(current_x)
     a = abs(new_f / current_f)
     transition_prob = min(1, a)
     trials += 1

     if random() <= transition_prob:
         sampled_x[i] = new_x
         sampled_data[i] = new_f
         current_x = new_x
         i += 1

 acceptance_rate = i / trials

Using n=20 and sampling a N(0,1) CDF, I get this picture, which looks fine, with an acceptance rate of 51%. (I know I should be drawing millions of samples to get an accurate Monte Carlo estimate, but for the sake of the example let's use n=20.)

I would like to ask the following:

Is my understanding of the MCMC implementation correct?
In the sampling I have rejected 49% of actual function evaluations. Those rejected evaluations could have provided me a bigger insight of the objective function. Is that so?
Is MCMC better than MC for "mapping" a very complicated black box function, given that I would reject many costly function evaluations?

I hope my questions are clear enough.

EDIT

In my conceptual code, CDF refers to a function that, given a probability x, returns the value of a "real life" variable in "real life" units.

For instance, if my CDF represented the electric consumption of a house over the year, I would call CDF_house(x=0.1) and it would return 0.4 kWh.

I construct the CDF by sorting the real life measurements and linking the resulting array (P(x)) with another array of equal number of elements ranging from 0 to 1 (x). I build an interpolation function with the two arrays and I call it CDF, but probably it is an inverse CDF as I have been told in the answers below.

@Glen_b even if it is about the code then still I'd say that there are statistical issues that are core of the question rather the coding issues, so I'd say that this is on-topic. — Tim, Jul 24 '16 at 10:25
@Tim The question could be asked in terms of a plain algorithm, though, which would avoid the problem that the on-topic page seems to exclude it as it stands. Indeed, I'd like to encourage the OP to take steps to avoid the risk of closure because I think it would otherwise be a good question — Glen_b, Jul 24 '16 at 10:29
I don't think the code here is particularly obtuse so don't think the question would greatly benefit from changing it to pseudo-code, though annotation with comments might be a good idea. — Silverfish, Jul 24 '16 at 11:35
Indeed the code is just illustrative, I am asking about conceptual issues. — Santi Peñate-Vera, Jul 25 '16 at 09:24
Your CDF is the quantile function, hence the inverse of the actual cumulative distribution function. — dv_bn, Jul 25 '16 at 11:54

score 5 · Accepted Answer · edited May 23 '17 at 12:39

TL;DR No, it is not correct, and a few things can be improved. The poor performance is due to bugs in the code: using the CDF rather than the PDF, and drawing new values independently of previous ones. Your implementation would return wrong results.

 while i < n:
     new_x = random()

Do you know which algorithm Python uses for random number generation in the random() function? In many cases you should not use the default pseudo-random generators for statistical purposes (e.g. this comment about C++ rand()). Fortunately, Python uses the Mersenne Twister algorithm, which is pretty good and widely used, but this is worth checking in advance.

Moreover, what you implemented is closer to an independence Metropolis-Hastings sampler (check e.g. the Monte Carlo Statistical Methods book by Robert and Casella), where each new value is drawn independently of the previous draw, rather than a random-walk Metropolis algorithm, where the proposal depends on the current value. This is also a reason for the poor performance. You should rather be using something like

    new_x = current_x + random.uniform(-eps, eps)

where eps is some small constant. An independence sampler whose proposal distribution is poorly matched to the target typically has a higher rejection rate than a random-walk sampler with a suitably small step. There is also a problem in the next lines:

     new_f = CDF(new_x)
     current_f = CDF(current_x)

These should be probability density functions or probability mass functions, not cumulative distribution functions. See "What is the equivalent for cdfs of MCMC for pdfs?" to learn more about MCMC-like algorithms for CDFs.

Notice that when using the Metropolis algorithm with a CDF you get a sample that is strongly biased towards higher values, since $\Pr(X \le x)$ is always higher for larger $x$ (with probability equal to $1$ at $\infty$). If you used the PDF, you would be drawing values with higher associated probability more often than those with lower associated probability (so $-\infty$ and $\infty$ would both be equally unlikely). This is illustrated in the plots below, where the Metropolis algorithm is used to draw from the standard normal distribution using the PDF (left) or the CDF (right). As you can see in the plots, when using the CDF the chain drifts upwards and accepts even very unlikely values of $X \ge 4$.
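As a rough illustration of the PDF-based version described above, here is a minimal sketch of a random-walk Metropolis sampler for the standard normal distribution. The names `normal_pdf` and `random_walk_metropolis`, the step size `eps=2.5`, and the seed are assumptions made for this example and are not taken from the question's code. Note that the sketch also repeats the current value in the chain whenever a proposal is rejected, which is part of the standard algorithm.

```python
import math
import random


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def random_walk_metropolis(pdf, n_steps, x0=0.0, eps=1.0, seed=0):
    """Run n_steps Metropolis iterations with a symmetric uniform proposal."""
    rng = random.Random(seed)
    x, fx = x0, pdf(x0)
    chain, accepted = [], 0
    for _ in range(n_steps):
        # The proposal depends on the current state (random walk).
        proposal = x + rng.uniform(-eps, eps)
        f_prop = pdf(proposal)
        # Equivalent to accepting with probability min(1, f_prop / fx).
        if rng.random() * fx < f_prop:
            x, fx = proposal, f_prop
            accepted += 1
        # A rejected proposal repeats the current state in the chain.
        chain.append(x)
    return chain, accepted / n_steps


chain, acceptance_rate = random_walk_metropolis(normal_pdf, 50_000, eps=2.5)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(f"acceptance={acceptance_rate:.2f} mean={mean:.3f} var={var:.3f}")
```

With enough iterations the sample mean and variance should come out close to 0 and 1; replacing `normal_pdf` with a CDF would instead push the chain towards ever larger values.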

Besides that, you could improve a few things.

     a = abs(new_f / current_f)

Why are you using abs() here? You are dividing something positive by something that is also positive, so the result cannot be negative. There is no need to take the absolute value here.

     transition_prob = min(1, a)

By the definition of the Metropolis algorithm there is a min() here, but notice that in the next step you compare the result to a $\mathcal{U}(0,1)$ random variable that cannot be greater than $1$, so min() does not change the outcome for values of a greater than $1$: they are accepted either way. See this entry from Darren Wilkinson's blog to learn more about examples of implementing the Metropolis algorithm.
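The redundancy of min() can be checked directly. The following small sketch compares the two acceptance rules for a handful of ratios and many uniform draws; the helper names, the chosen ratios, and the seed are made up for the illustration.

```python
import random

rng = random.Random(42)


def accept_with_min(a, u):
    """Textbook rule: accept when u <= min(1, a)."""
    return u <= min(1.0, a)


def accept_without_min(a, u):
    """Simplified rule: accept when u <= a."""
    return u <= a


ratios = [0.0, 0.3, 1.0, 1.7, 25.0]
# random() returns values in [0, 1), so u never exceeds 1.
draws = [rng.random() for _ in range(10_000)]
same = all(
    accept_with_min(a, u) == accept_without_min(a, u)
    for a in ratios
    for u in draws
)
print(same)
```

Both rules make the same decision for every pair, so the min() is harmless but not needed in code.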

The rest is fine.

In the sampling I have rejected 49% of actual function evaluations. Those rejected evaluations could have provided me a bigger insight of the objective function. Is that so?

You reject those values because in the end you want a sample in which each value appears with roughly the same probability as in your target distribution. If you accepted everything, the values would not appear with the correct probabilities unless you were already sampling from the target distribution.

Is MCMC better than MC for "mapping" a very complicated black box function, given that I would reject many costly function evaluations?

What do you mean by MC? Monte Carlo? There are a number of Monte Carlo algorithms, and MCMC algorithms belong to this group. Yes, there are Monte Carlo algorithms that are more efficient than the Metropolis algorithm, but I'm not sure what you mean here. Also, if you can sample directly from your target distribution, then obviously this is more efficient than using the Metropolis algorithm, if this is what you meant.

EDIT

It seems that in your code the function CDF() is in fact the inverse of the cumulative distribution function (a.k.a. the quantile function). In this case there is no reason to use the Metropolis algorithm at all, since you can simply use inverse transform sampling to generate samples from your distribution directly. What you need to do is take a uniformly distributed random variable $U$ and pass it through the inverse CDF. This is computationally efficient and one of the simplest and most basic ways of generating draws from non-uniform random variables. We do not use the quantile function in the Metropolis algorithm, just as we do not use the CDF. If you used the quantile function in the Metropolis algorithm, you would be telling the algorithm "give me larger values of $X$ with higher probability", rather than drawing more probable values with higher probability.
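Inverse transform sampling with an empirical quantile function might look roughly like the sketch below. It assumes the quantile function is built the way the question describes (sorted measurements paired with an equally spaced grid on [0, 1] and linearly interpolated); the function name `make_quantile_function` and the measurement values are hypothetical.

```python
import random


def make_quantile_function(measurements):
    """Build a piecewise-linear empirical quantile function from raw data."""
    values = sorted(measurements)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")

    def quantile(p):
        # Map p in [0, 1] onto the sorted values by linear interpolation.
        if not 0.0 <= p <= 1.0:
            raise ValueError("p must lie in [0, 1]")
        position = p * (n - 1)
        lower = int(position)
        upper = min(lower + 1, n - 1)
        weight = position - lower
        return values[lower] * (1.0 - weight) + values[upper] * weight

    return quantile


# Hypothetical measurements, e.g. household consumption in kWh.
measurements = [0.4, 1.2, 0.9, 2.5, 1.7, 0.6, 3.1, 1.1]
quantile = make_quantile_function(measurements)

rng = random.Random(1)
# Every uniform draw becomes one sample: nothing is rejected.
samples = [quantile(rng.random()) for _ in range(1000)]
print(min(samples), max(samples))
```

Each call to the quantile function yields a usable sample, so no costly evaluation is thrown away.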

I used abs because 'a' can be negative. Maybe I am misinterpreting things. My CDF object contains an experimentally obtained CDF ( a PDF is super hard to build). I call the CDF with a probability and it returns a variable value in the real variable range. ie. if it was wind speed I would enter a probability and it would return the wind speed associated. — Santi Peñate-Vera, Jul 25 '16 at 09:29
@SantiPeñate-Vera if you enter a probability and it returns real values then this is an inverse CDF and in fact you can use inverse transform sampling directly! A CDF is a function that takes as input some $x$ and returns $\Pr(X \le x)$. What is it? If it is an inverse CDF then you are using the Metropolis algorithm totally wrong and in fact there is no need to use it at all... — Tim, Jul 25 '16 at 09:40

Greenparker · Answer 2 · 2016-07-24T17:16:39.523

In the sampling I have rejected 49% of actual function evaluations. Those rejected evaluations could have provided me a bigger insight of the objective function. Is that so?

In addition to what Tim said about this, MCMC produces correlated samples whereas MC produces independent samples. If the rejection rate is low and you accept almost all proposals, that probably means each proposal is very close to the previous value. This increases the autocorrelation in your samples, so each sample carries less information. For this reason, a commonly cited rule of thumb for MCMC is an acceptance probability of around 20-40%.
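The link between acceptance rate and autocorrelation can be explored with a small experiment. The sketch below runs a random-walk Metropolis chain on a standard normal target for three step sizes and reports the acceptance rate and lag-1 autocorrelation of each chain; the step sizes, chain length, seed, and function names are assumptions chosen for illustration.

```python
import math
import random


def log_target(x):
    """Log-density of N(0, 1) up to an additive constant."""
    return -0.5 * x * x


def run_chain(eps, n_steps=20_000, seed=0):
    """Random-walk Metropolis with a uniform(-eps, eps) proposal."""
    rng = random.Random(seed)
    x, chain, accepted = 0.0, [], 0
    for _ in range(n_steps):
        proposal = x + rng.uniform(-eps, eps)
        # Compare on the log scale; 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
            accepted += 1
        chain.append(x)
    return chain, accepted / n_steps


def lag1_autocorrelation(chain):
    """Sample correlation between consecutive states of the chain."""
    mean = sum(chain) / len(chain)
    num = sum((a - mean) * (b - mean) for a, b in zip(chain, chain[1:]))
    den = sum((a - mean) ** 2 for a in chain)
    return num / den


results = {}
for eps in (0.1, 2.5, 50.0):
    chain, rate = run_chain(eps)
    results[eps] = (rate, lag1_autocorrelation(chain))
    print(f"eps={eps:5.1f} acceptance={rate:.2f} lag-1 autocorr={results[eps][1]:.2f}")
```

Very small steps are accepted almost always but barely move the chain, while very large steps are mostly rejected so the chain repeats itself; both extremes give higher autocorrelation than the intermediate step size.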

Is MCMC better than MC for "mapping" a very complicated black box function, given that I would reject many costly function evaluations?

In theory, if you can do Monte Carlo (MC), there is no reason to do Markov chain Monte Carlo (MCMC). Why? Because MC produces independent samples from the exact target distribution, whereas MCMC produces correlated samples that come only approximately from the target distribution. MC trumps MCMC. So if the target distribution is a known distribution (like Normal, t, $\chi^2$ etc.), you should just use Monte Carlo.

However, if a distribution is unknown (as in most Bayesian settings), you can't just use off-the-shelf software techniques. So for MC you generally use Rejection Sampling (RS). This Monte Carlo technique also proposes a value to be accepted or rejected; the difference is that the accepted values are iid samples from the target distribution. In RS, you want as high an acceptance rate as possible.
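A minimal sketch of rejection sampling, assuming a target known only up to a constant: here the unnormalized Beta(2, 5) density $x(1-x)^4$ on $[0, 1]$, with a uniform proposal and an envelope constant slightly above the density's maximum. The target, the bound, and all names are chosen for this illustration only.

```python
import random


def unnormalized_target(x):
    """Beta(2, 5) density without its normalizing constant."""
    return x * (1.0 - x) ** 4


# The maximum of x * (1 - x)^4 on [0, 1] is 0.08192, attained at x = 0.2.
BOUND = 0.082


def rejection_sample(n_proposals, seed=0):
    """Propose uniformly on [0, 1]; keep x when u * BOUND < target(x)."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        x = rng.random()
        if rng.random() * BOUND < unnormalized_target(x):
            accepted.append(x)
    return accepted


n_proposals = 50_000
samples = rejection_sample(n_proposals)
acceptance_rate = len(samples) / n_proposals
sample_mean = sum(samples) / len(samples)
print(f"acceptance={acceptance_rate:.3f} mean={sample_mean:.3f}")
```

The accepted values are independent draws from Beta(2, 5), whose mean is 2/7, and the acceptance rate is the area under the target divided by the area under the envelope, which is why a tight envelope matters.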

You run into trouble when the target distribution is high dimensional (even as high as 10), because in RS a whole vector of values is proposed and the whole vector is accepted or rejected. This reduces the acceptance rate, sometimes to the extent of a 0.0000001 acceptance rate, so you will have to wait a long time for even 1 acceptance. In MCMC, you can use variable-at-a-time Metropolis-Hastings to propose a new value for each variable, and this maintains a good acceptance rate for each value.
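To get a feel for how quickly whole-vector acceptance can collapse, consider a toy case (not from the answer): rejection sampling a uniform distribution on the unit ball by proposing uniformly from the enclosing cube $[-1, 1]^d$. The acceptance probability is the ratio of the two volumes, which the sketch below computes exactly; the function name is hypothetical.

```python
import math


def ball_in_cube_acceptance(d):
    """Exact probability that a uniform point in [-1, 1]^d lies in the unit ball."""
    ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube_volume = 2.0 ** d
    return ball_volume / cube_volume


rates = {d: ball_in_cube_acceptance(d) for d in (1, 2, 5, 10, 20)}
for d, rate in rates.items():
    print(f"d={d:2d} acceptance probability={rate:.3g}")
```

In this toy case the acceptance probability falls below 1% by dimension 10 and keeps shrinking rapidly, which is the kind of behaviour that makes rejection sampling impractical for high-dimensional targets.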

The reason MCMC methods are used so much is that most Bayesian posterior distributions are high dimensional, so even if RS can be implemented, it will be very slow.
