
Based on the given definition of entropy, $H(P(X)) = -\sum_i P(x_i)\log_2 P(x_i)$, it appears that if I have a distribution $P_1(x) = [\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4}]$ and another distribution $P_2(x) = [\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4},0,0,0,0,\ldots]$, then they have the same entropy?
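A quick numerical check seems to confirm this (a minimal Python sketch; the `entropy` helper below is just for illustration, and the $0\log_2 0$ terms are dropped by the usual convention):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; 0 * log2(0) terms are dropped, i.e. taken as 0 by convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

p1 = [1/4, 1/4, 1/4, 1/4]
p2 = [1/4, 1/4, 1/4, 1/4, 0, 0, 0, 0]

print(entropy(p1))  # 2.0 bits
print(entropy(p2))  # 2.0 bits -- the zero-probability entries contribute nothing
```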

That just doesn't seem to make sense to me based on my notion of entropy. In $P_1$ we are quite uncertain whereas in $P_2$ we are relatively more certain...

Chet

1 Answer


In $P_1$ we are quite uncertain whereas in $P_2$ we are relatively more certain...

Obviously our issue here is not one of formal treatment ($P_1(x)$ and $P_2(x)$ are formally equivalent representations of the exact same distribution; in fact, $P_2(x)$ is the more "correct" one, in a sense), but of intuition, and I will try to provide some.

The expression $$P_2(x) = \left\{\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4},0,0,0,0...\right\}$$ includes the probabilities of the distribution, each corresponding to some value that the related random variable takes. So what you are saying is

"If I know that the random variable $X$ takes the values $\{a,b,c,d\}$, each with probability $1/4$, and also an infinite number of values, each with probability zero, then I am relatively more certain compared to if I knew that $X$ takes "only" the values $\{a,b,c,d\}$, each with probability $1/4$." Why, really? Why does introducing zero-probability events into the picture make you relatively more certain? They were always there in the first place.

Maybe you are thinking that $P_1(x)$ describes a situation where we don't know what other "values" of $X$ may be out there, and so we feel "uncertain", while, if we look at $P_2(x)$, we "know" that these other values have probability zero, and so we are certain that our uncertainty is contained in the four values with positive probability?

This is a very realistic description of a real-world situation, but such a real-world case would not be described by the juxtaposition of $P_1(x)$ and $P_2(x)$. Consider a new social phenomenon that can be quantified. At the beginning, do we know what values it can take? No. We will have to see it evolve through time, and even then we will never be certain of all its possible values. Assume that after some time we have observed that this phenomenon takes the values $\{a,b,c,d\}$ with practically equal probabilities. We have not observed the phenomenon acquiring any other value, not even once. Say we want to model this phenomenon and include it in some more general model of ours. What are we to do?

As I wrote in the beginning, specifying either $P_1(x)$ or $P_2(x)$ is the exact same thing. In other words, $P_1(x)$ is not a way to express any uncertainty regarding the "possible full support of $X$"; it is just a shorthand for $P_2(x)$ that hurts neither the mathematics nor the inference. The very real concern that our model may be inaccurate, also regarding the support of $X$ (let alone the allocation of probabilities), is not the kind of uncertainty that is captured by the concept of Shannon's entropy. Shannon's entropy summarizes a fully defined and fully described uncertainty, not uncertainty that we merely know or suspect exists but are unable to say or do anything about.
(This is reminiscent of the old Knightian distinction in Economics between "risk" (described uncertainty) and "uncertainty" (which simply means "we are at a loss").)
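To tie this back to the formula: under the standard convention $0\cdot\log_2 0 := 0$ (justified by $\lim_{p\to 0^+} p\log_2 p = 0$), the zero-probability values contribute nothing to the sum, $$H(P_2) = -\sum_i P_2(x_i)\log_2 P_2(x_i) = -4\cdot\tfrac{1}{4}\log_2\tfrac{1}{4} \;-\; \sum 0\cdot\log_2 0 = 2 \text{ bits} = H(P_1).$$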

I hope all these are not irrelevant to your concerns.

  • I see where you're coming from. To be more formal: I am considering two different distributions, $P(x) = \{\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}\}$ and $P(y) = \{\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}, 0, 0, 0, 0\}$. What's confusing in this situation is that $P(x)$ is in its maximum-entropy state. The notion being, you have no idea what $x$ is. $P(y)$ is not in its maximum-entropy state. In fact, by some notion, it seems you are twice as sure what $y$ is as what $x$ is, because you know $y$ takes one of only four of its eight possibilities. – Chet Oct 02 '14 at 00:20
  • Although I suppose from a different perspective, you could say that $P(y)$ is capable of more entropy than $P(x)$. – Chet Oct 02 '14 at 00:21
  • Why do you say that $P(x)$ and $P(y)$ are two different distributions? In what sense do "possibilities" that have probability zero "count"? – Alecos Papadopoulos Oct 02 '14 at 00:54
  • Suppose you are doing object recognition. At first, the distribution over possible objects is flat. Then you observe an image of the object. The distribution changes. Then perhaps you see another image of the object and the distribution changes again. If you're looking at a baseball, then the probability the object is a cow is zero. – Chet Oct 02 '14 at 22:17
  • Certainly Chet, but you are offering examples to the effect that we are not certain about the support of the distribution of the variable, something that I have covered in my answer, and something that the formal concept of Shannon's Entropy is not equipped to handle. You will need to search for other concepts, or create your own. – Alecos Papadopoulos Oct 03 '14 at 00:01
  • @Chet It seems that you are looking for information gain. That is, if you assume that you start with 8 equally probable elements and then learn that only 4 actually occur, your gain is $H\left(\left\{\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8}\right\}\right) - H\left(\left\{\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4},0,0,0,0\right\}\right)$ (see the sketch after these comments). But for "pure" entropy, it makes sense to treat options with probability 0 the same as if they were missing. – Piotr Migdal Oct 17 '14 at 12:00
  • @PiotrMigdal Very good suggestion. Why don't you expand it into an answer, with some expository example? – Alecos Papadopoulos Oct 17 '14 at 13:18
  • @AlecosPapadopoulos Because your answer (which is good, BTW) is already accepted, and it's unlikely that people will even see mine. If you like it, feel free to incorporate it in your answer. – Piotr Migdal Oct 17 '14 at 14:04
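Following up on the information-gain suggestion in the comments, here is a minimal Python sketch (the helper name is illustrative, not from the thread):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are dropped via the 0 * log2(0) = 0 convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

prior = [1/8] * 8                              # eight equally probable outcomes
posterior = [1/4, 1/4, 1/4, 1/4, 0, 0, 0, 0]   # we learn that only four can occur

gain = entropy(prior) - entropy(posterior)
print(gain)  # 1.0 -- learning that half the outcomes are impossible is worth exactly one bit
```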