4

We consider a discrete random variable X, and we want to know how much information we receive each time we observe a value of this random variable. We quantify this information content as h(x), a monotonically decreasing function of the probability p(x) of the observed value:

$$h(x) = f(p(x))~~~~~~~~(1)$$

such that observing a highly improbable value of X conveys more information, whereas observing a nearly certain event conveys less information.

Further, the information gained from two independent observations of the random variable X is additive. That is, if:

$$p(x_1,x_2)=p(x_1)p(x_2)~~~~~~~~(2)$$

then:

$$h(x_1, x_2) = h(x_1) + h(x_2)~~~~~~~(3)$$

Show that from (1), (2), and (3) we can infer that h(x) is defined as:

$$h(x) = - log_2(p(x))$$

Mr Farad
  • 161

2 Answers

4

This answer gives an elementary solution in five easy steps. It requires minimal knowledge of properties of real numbers and none of Calculus. The most advanced number fact it relies on is that the set of numbers $2^n$ for integers $n$ has $0$ for its greatest lower bound and is unbounded above. This is used in the fifth step of the solution.


The crux of the matter is to find all real-valued functions $h$ of positive numbers that satisfy the following conditions:

  1. $h(xy) = h(x) + h(y)$ for all positive $x$ and $y.$

  2. $h$ is strictly decreasing: that is, $y \gt x$ implies $h(y) \lt h(x).$

One approach is to parallel the construction of the real numbers, as follows.

First, $h(y) = h(y\cdot 1) = h(y)+h(1)$ for all $y$ implies $h(1)=0.$

Second, use Mathematical Induction to prove that for any non-negative integer $n,$ $h(x^n)=nh(x).$ The first step is the case $n=0.$ The induction step requires you to show $h(x^{n+1})=(n+1)h(x),$ but that follows by applying rule $(1)$ to the case $y=x^n.$

Third, the relation $0 = h(x\cdot 1/x) = h(x) + h(1/x)$ shows $h(1/x)=-h(x)$ for all $x.$ Coupled with the second result, this proves $h(x^{-n})=-nh(x)$ for positive integers $n.$

Fourth, for any positive integers $p$ and $q,$ apply the preceding results to the relation $(x^{p/q})^q = x^p$ to obtain $q h(x^{p/q}) = p h(x),$ whence $h(x^{p/q}) = (p/q)h(x)$ for all $x.$
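The fourth step can be checked numerically. This sketch takes $h(x)=-\log_2(x)$, one admissible solution (the general form is derived below), purely to illustrate the rational-exponent identity:

```python
import math

# Illustrative check of the fourth step, using h(x) = -log2(x)
# as one admissible solution of conditions 1 and 2.
h = lambda x: -math.log2(x)

x, p, q = 3.0, 5, 7
# Since (x**(p/q))**q == x**p, we expect q*h(x**(p/q)) == p*h(x),
# whence h(x**(p/q)) == (p/q)*h(x).
assert abs(q * h(x ** (p / q)) - p * h(x)) < 1e-9
assert abs(h(x ** (p / q)) - (p / q) * h(x)) < 1e-9
```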

The fifth step is the most delicate. We need to show $h$ is now uniquely defined for all positive real numbers, not just the rational ones. This is where the monotonicity of $h$ is required.

Let $x\gt 0.$ We can always find rational numbers $a \lt b$ for which $2^a \le x \lt 2^b,$ giving (from the preceding results) $ah(2) \ge h(x) \gt bh(2).$ Consider the rational number $(a+b)/2.$ If $2^{(a+b)/2}\le x,$ then $(a+b)h(2)/2 \ge h(x)$ and otherwise $h(x) \gt (a+b)h(2)/2.$ By narrowing the interval in this manner always to contain $x$ we obtain, recursively, a nested sequence of rational intervals $[a_i,b_i]$ of widths $(b-a)2^{-i}$ for which $x$ lies in every interval $[2^{a_i}, 2^{b_i}].$ The corresponding sequence of nested closed intervals $[a_i,b_i]$ converges to a unique real number $\xi$ for which (1) $2^\xi = x$ and (2) $h(x) = \xi h(2).$
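The interval-halving argument of the fifth step can be sketched as code. This illustrative routine brackets $x$ between powers of $2$ and bisects, converging to the unique $\xi$ with $2^\xi = x$ (so that $h(x) = \xi h(2)$ for any solution $h$):

```python
# Numerical sketch of the fifth step: nest intervals [a, b] with
# 2**a <= x < 2**b, halving the width each iteration, to locate
# the unique xi satisfying 2**xi == x.

def exponent_by_bisection(x, iterations=60):
    # Bracket x between consecutive integer powers of 2.
    a = 0
    while 2.0 ** a > x:
        a -= 1
    while 2.0 ** (a + 1) <= x:
        a += 1
    b = a + 1
    # Repeatedly test the midpoint, always keeping x in [2**a, 2**b].
    for _ in range(iterations):
        m = (a + b) / 2
        if 2.0 ** m <= x:
            a = m
        else:
            b = m
    return (a + b) / 2

xi = exponent_by_bisection(10.0)
# 2**xi reproduces x to high precision
```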

This completes the solution of (1) and (2) for the unknown function $h.$ Evidently, all the values of $h$ are determined by the unknown value $h(2).$


From this point on we are concerned with relating $h$ to the logarithm, so knowledge of the basic properties of logarithms is assumed.

Consider the function $$g(x) = h(x)/h(2) - \log(x)/\log(2).$$ The foregoing results and the usual properties of logarithms show that $g(1)=0,$ and then (following the preceding analysis) that $g(2)=0,$ $g(2^n)=0,$ $g(2^{p/q})=0,$ and $g(x)=0$ for all positive $x.$ Consequently, solving for $h$ we find $$h(x) = \frac{h(2)}{\log(2)}\log(x) = h(2)\log_2(x).$$

Since $2\gt 1,$ $h(1)=0,$ and $h$ is strictly decreasing, $h(2)\lt 0.$ Consequently, writing $h(2) = -C$ for some positive number $C,$ let $b=2^{1/C} \gt 1$ so that $C = \log_b(2)$ and hence $C\log_2(x) = \log_b(x):$

The most general solution is $$h(x) = h(2)\log_2(x)= -C\log_2(x) = -\log_b(x)$$ where $h(2)\lt 0,$ $C\gt 0,$ and $b \gt 1.$
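A quick numerical confirmation that every member of this family satisfies the original conditions (the base $b$ being the only remaining freedom):

```python
import math

# Check that h(x) = -log_b(x) with b > 1 satisfies both conditions.
def make_h(b):
    return lambda x: -math.log(x, b)

for b in (2.0, math.e, 10.0):
    h = make_h(b)
    # Condition 1: h(xy) = h(x) + h(y)
    assert abs(h(3.0 * 5.0) - (h(3.0) + h(5.0))) < 1e-9
    # Condition 2: strictly decreasing (y > x implies h(y) < h(x))
    assert h(4.0) < h(2.0)
```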

whuber
  • 322,774
0

The equation $h(x) = f(p(x))$ above means that the information we obtain from a specific value $x$ of a random variable $X$ is determined by its probability of occurrence $p(x)$, with the relationship given by a function $f(\cdot)$.


The probability of two independent observations of the random variable $X$ is:

$$P(x,x) = P(x)P(x)$$ $$f(P(x)P(x)) = f(P(x)) + f(P(x))$$ $$h(x^2) = h(x) + h(x) = 2h(x)$$


The probability of 3 independent observations of the random variable $X$ is:

$$P(x,x,x) = P(x)P(x)P(x)$$ $$f(P(x)P(x)P(x)) = f(P(x)) + f(P(x)) + f(P(x))$$ $$h(x^3) = h(x) + h(x) + h(x) = 3h(x)$$


Thus, the general property is:

$$h(x^N) = N h(x)$$
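This general property can be verified numerically. The sketch below assumes $h = -\log_2$, the form this answer is heading toward; the joint probability of $N$ independent identical observations is $p^N$, and the information is $N$ times that of a single observation:

```python
import math

# Illustrative check of h(x^N) = N h(x), assuming h(p) = -log2(p).
h = lambda prob: -math.log2(prob)

p, N = 0.25, 5
# N independent observations: joint probability p**N.
assert abs(h(p ** N) - N * h(p)) < 1e-9
```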


A well known property of the log function is:

$$\log(A^N) = N \log(A)$$

so we could guess that $f(\cdot)$ is a log function.

However, the requirement is that $f(\cdot)$ be monotonically decreasing; that is, a small probability yields high information, and a large probability yields low information.

If we look at the graph of our guess for $f(\cdot)$:

$$h(x) = log(p(x))$$

we see that it is a monotonically increasing function instead of monotonically decreasing.

However, not all is lost, because we can turn the log into a monotonically decreasing function by applying it to the reciprocal of the probability:

$$h(x) = \log\bigg(\frac{1}{p(x)}\bigg)$$

Further, the property $h(x^N)=Nh(x)$ still holds:

$$h(x^N) = \log\bigg(\frac{1}{p(x)^N}\bigg) = \log\Bigg(\bigg(\frac{1}{p(x)}\bigg)^N\Bigg)$$

$$h(x^N) = N \log\bigg(\frac{1}{p(x)}\bigg) $$

Finally, we can write our guess for $f(\cdot)$ more tidily as:

$$h(x) = \log\bigg(\frac{1}{p(x)}\bigg)$$ $$h(x) = -\log p(x)$$
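The two final forms agree, as a quick check confirms; for instance, with base-2 logs an event of probability $1/8$ yields 3 bits of information:

```python
import math

# log(1/p) == -log(p): the two final forms are the same function.
p = 0.125
assert abs(math.log2(1 / p) - (-math.log2(p))) < 1e-12
assert abs(math.log2(1 / p) - 3.0) < 1e-12  # 1/8 -> 3 bits
```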

Mr Farad
  • 161
  • 1
    The notation at the outset is so ambiguous and confusing--the poor "$X$" and "$x$" sometimes seem equivalent and sometimes not and eventually, as far as I can tell, are made to represent six different quantities at once! It made no sense to follow the rest of this answer. Maybe you could clarify your reasoning? – whuber Oct 04 '19 at 15:43
  • sorry, i was being a little bit sloppy with caps... i just fixed it... X is a random variable, x is an observation of a random variable. – Mr Farad Oct 04 '19 at 19:06
  • 1
    Okay--but the problem remains that you appear to be using "$X$" to represent two or three different random variables in the same equation! Nothing in the first half of your post makes mathematical sense or is correct. – whuber Oct 04 '19 at 19:08
  • two or more observation of an independent random variable X, $P(X=x_1, X=x_2, X=x_3)=P(X=x_1)P(X=x_2)P(X=x_3)$ i'm assuming that all observations are the same event, example rolling a 6 on a die... thus $x_1$=$x_2$=$x_3$=>rolling a 6 on a die=> i just call them all x. anyways... if anybody can do a better job..feel free to post... that's why i posted it...because i though my proof was somewhat lacking... – Mr Farad Oct 04 '19 at 19:13
  • Okay, that starts to make sense: but what do you mean by "$h(x^2)$"? Perhaps $h(x)^2$? Note, too, that guessing the log works and then showing it does is not what you were asked. You need to demonstrate that nothing else works. (In this sense the question is incorrectly formulated, because the base-2 logarithm is not the only solution.) – whuber Oct 04 '19 at 19:16
  • abuse of syntax... why $h(x^2) = h(x,x)$.... well... not entirely sure about that part... – Mr Farad Oct 04 '19 at 19:17
  • 1
    i think you found the skipping of steps in the proof that I wasn't sure about... – Mr Farad Oct 04 '19 at 19:18