
In "Probability Theory: The Logic of Science", E. T. Jaynes writes the following to motivate the derivation of the entropy $H$:

Suppose the robot perceives two alternatives, to which it assigns probabilities $p_1$ and $q := 1 - p_1$. Then the 'amount of uncertainty' represented by this distribution is $H_2(p_1, q)$. But now the robot learns that the second alternative really consists of two possibilities, and it assigns probabilities $p_2, p_3$ to them, satisfying $p_2 + p_3 = q$. What is now the robot's full uncertainty $H_3(p_1, p_2, p_3)$ as to all three possibilities? Well, the process of choosing one of the three can be broken down into two steps. Firstly, decide whether the first possibility is or is not true; the uncertainty removed by this decision is the original $H_2(p_1, q)$. Then, with probability $q$, the robot encounters an additional uncertainty as to events $2, 3$, leading to

$$H_3(p_1, p_2, p_3) = H_2(p_1, q) + q\, H_2\!\left(\frac{p_2}{q}, \frac{p_3}{q} \right)$$

as the condition that we shall obtain the same net uncertainty for either method of calculation.

I don't find this reasoning compelling enough, especially the way $q$ appears in front of the second $H_2$. Perhaps there's a better way to put what Jaynes meant to say, through an explicit example of a coin flip where the $T$ outcome turns out to be governed by another coin flip?

  • It's not clear to me what you're asking. Is it obvious to you that the left hand side and right hand side are in fact equal? Are you asking why the $q$ appears in front of the right $H_2$ term, and why it's in the denominator of the two probabilities in that term? Are you asking if there's a clearer way to write a paragraph with similar content? Or if the general example would be superior if replaced by a specific example, where 'alternatives' are 'results of coin flips'? – Matthew Gray Oct 20 '15 at 22:13
  • A better way to explain why such a requirement on the entropy is desirable is what I am after. Perhaps a concrete example would be a good way to do it. – Tom Artiom Fiodorov Oct 20 '15 at 22:16

1 Answer


Instead of Shannon entropy, let's talk about scoring rules. You assess a probability distribution over a set of mutually exclusive and exhaustive outcomes, and then you win points based on the probability you assigned to the outcome that turns out to be correct.

Scoring rules, if they're any good, have a neat property called being "proper." That is, if I have some probability distribution $(p_1, 1-p_1)$ on two outcomes $1$ and $2$, under a proper scoring rule the best I can do, in expectation, is to report my probability distribution as $(x_1,x_2)=(p_1, 1-p_1)$.

This isn't always the case! Suppose my scoring rule is linear: if $i$ happens, you get $x_i$ points. This is what would happen with, say, normal betting. Well, then if you think it's 60% likely that $1$ will happen and 40% likely that $2$ will happen, you win the most points in expectation by reporting $(1.0, 0)$, not $(0.6, 0.4)$. Calculate it: $1 \cdot 0.6 + 0 \cdot 0.4 = 0.6$, vs. $0.6 \cdot 0.6 + 0.4 \cdot 0.4 = 0.52$. If a roulette wheel lands on red 60% of the time, bet on red every time, not 60% of the time.
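One way to see why, in general: with true distribution $p$ and report $x$, the expected linear score is

$$\sum_i p_i x_i,$$

which is linear in the report $x$. A linear function on the simplex $\{x_i \ge 0,\ \sum_i x_i = 1\}$ is maximized at a vertex, i.e. by putting all of the weight on the single most probable outcome, so truthful reporting is only optimal when that outcome already has probability $1$.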

So there are many proper scoring rules. If we use the spherical score, then if outcome $i$ happens we get $x_i/\sqrt{\sum_j x_j^2}$ points, and this rule can be shown to be proper using the Cauchy–Schwarz inequality.
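Here is a sketch of that propriety argument: with true distribution $p$ and report $x$, the expected spherical score is

$$\sum_i p_i \frac{x_i}{\sqrt{\sum_j x_j^2}} = \frac{p \cdot x}{\lVert x \rVert} \le \lVert p \rVert$$

by the Cauchy–Schwarz inequality, with equality exactly when $x$ is proportional to $p$; since the report has to sum to $1$, that means $x = p$. In particular, the best achievable expected score is $\lVert p \rVert$, which is the quantity used in the coin example below.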

All of these scoring rules work for sets with more than $2$ alternatives. But are they consistent across multiple perspectives? That is, suppose I flip a coin, and if it's heads, I flip again. There are three possible sequences: T, HH, and HT. If the coin is fair, then the best probability distribution to report is $(0.5, 0.25, 0.25)$, and the expected spherical score is $\lVert p \rVert = \sqrt{0.5^2 + 0.25^2 + 0.25^2} = \sqrt{3/8}$.

But what if we decompose this into two different flips, each with distribution $(0.5, 0.5)$, where the second flip only happens half the time? Then the expected spherical score is $\sqrt{1/2}$ for the first flip and $\sqrt{1/2}$ for the second flip, which only occurs with probability $0.5$, so the total expected score is $\sqrt{1/2} + \tfrac{1}{2}\sqrt{1/2} = 3\sqrt{1/8}$, which is... different from $\sqrt{3/8}$ by a factor of $\sqrt{3}$.

So the spherical scoring rule is proper, in that it gets us to report the right probabilities. But it isn't additive: I can't decompose a set of probabilities however I want and have the score necessarily work out to be the same. As it turns out, the only rule that has both properties is the log scoring rule, which is the same as measuring the entropy.
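As a quick check of that additivity on the same coin example: under truthful reporting the expected log score is $\sum_i p_i \log p_i$, i.e. minus the Shannon entropy, so additivity of the score is exactly Jaynes' grouping condition for $H$. Indeed,

$$H\!\left(\tfrac12, \tfrac14, \tfrac14\right) = \tfrac12 \log_2 2 + \tfrac14 \log_2 4 + \tfrac14 \log_2 4 = \tfrac32 \text{ bits},$$

while decomposing into two fair flips gives $H\!\left(\tfrac12, \tfrac12\right) + \tfrac12\, H\!\left(\tfrac12, \tfrac12\right) = 1 + \tfrac12 = \tfrac32$ bits as well: this is the condition $H_3(p_1, p_2, p_3) = H_2(p_1, q) + q\, H_2(p_2/q, p_3/q)$ from the question, with $p_1 = q = \tfrac12$.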

Matthew Gray
  • Thanks for the good answer, it clears some things up. However, I just realized that when you said the total expected score is $3\sqrt{1/8} = \sqrt{1/2} + \frac{1}{2}\sqrt{1/2}$, you assumed that under such a decomposition entropies can be added (like we add expectations). Is there a reason why entropies should be added rather than multiplied? Perhaps we could use some other binary operation? – Tom Artiom Fiodorov Oct 23 '15 at 16:53
  • @TomArtiomFiodorov The silly and simple answer is that we're looking for addition because the property is the additive property. But why addition, not some other binary operation? First, addition is probably the simplest and most natural operation (probability-weighted sums show up in many other places). Second, I don't know whether other binary operations work: for the spherical rule in particular, it looks like we need to carry around both "score" and "normalizer," so that we can change the normalizer as we decompose options. And even that doesn't look clean, but maybe I'm missing something. – Matthew Gray Oct 23 '15 at 17:10
  • Sorry, I de-accepted your answer despite it being very helpful. I am still thinking about how to incorporate summation into it, and hopefully one day I will leave an insightful comment here and accept your answer back. – Tom Artiom Fiodorov Nov 03 '15 at 08:56