I am looking for methods for estimating the information entropy of a distribution when the only practical way to sample from it is a Monte Carlo method.
My problem is not unlike the standard Ising model that typically serves as the introductory example for Metropolis–Hastings sampling. I have a probability distribution over a set $A$, i.e. I have $p(a)$ for each $a \in A$. The elements $a \in A$ are combinatorial in nature, like Ising states, and there are very many of them, so in practice I never get the same sample twice when sampling from this distribution on a computer. $p(a)$ cannot be computed directly, because the normalization factor is unknown, but the ratio $p(a_1)/p(a_2)$ of any two probabilities is easy to calculate.
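For concreteness, here is a minimal sketch of the kind of sampler I have in mind, using a toy 1D Ising chain as a stand-in; the energy function, the size $N$, and the inverse temperature $\beta$ are placeholders for my actual model. The point is that the acceptance step only ever uses the ratio $p(a')/p(a)$, never $p(a)$ itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for my actual model: a 1D Ising chain with periodic
# boundaries. N and beta are placeholders, not my real parameters.
N = 100
beta = 0.5

def delta_energy(state, i):
    """Energy change from flipping spin i (nearest-neighbour coupling J = 1)."""
    left, right = state[(i - 1) % N], state[(i + 1) % N]
    return 2.0 * state[i] * (left + right)

def metropolis_step(state):
    """One Metropolis-Hastings update.

    The acceptance probability min(1, p(a')/p(a)) = min(1, exp(-beta * dE))
    uses only the ratio of probabilities; the normalization never appears.
    """
    i = rng.integers(N)
    dE = delta_energy(state, i)
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        state[i] *= -1
    return state

# Collect samples: one recorded configuration per sweep of N updates.
state = rng.choice([-1, 1], size=N)
samples = []
for sweep in range(1000):
    for _ in range(N):
        state = metropolis_step(state)
    samples.append(state.copy())
```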
I want to estimate the information entropy of this distribution, $$ S = -\sum_{a \in A} p(a) \ln p(a). $$
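To spell out where the difficulty lies: assuming, as in my case, that the ratios come from an explicitly computable unnormalized weight $w(a) \propto p(a)$ with unknown normalization $Z = \sum_{a \in A} w(a)$, the entropy splits as $$ S = -\langle \ln p(a) \rangle_p = \ln Z - \langle \ln w(a) \rangle_p. $$ The second term is an ordinary average that Metropolis–Hastings samples estimate directly; the obstruction is entirely in the unknown $\ln Z$.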
Alternatively, I want to estimate the entropy difference between this distribution and the one obtained by restricting it to a subset $A_1 \subset A$ (and, of course, re-normalizing).
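To make the restricted version precise (this is just the definition spelled out): writing $P_1 = \sum_{a \in A_1} p(a)$ for the total probability of the subset, the restricted distribution is $p_1(a) = p(a)/P_1$ for $a \in A_1$, and the quantity I am after is $$ \Delta S = S - S_1, \qquad S_1 = -\sum_{a \in A_1} p_1(a) \ln p_1(a) = \ln P_1 - \frac{1}{P_1} \sum_{a \in A_1} p(a) \ln p(a). $$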