117

Shannon's entropy is the negative of the sum, over all outcomes, of the probability of each outcome multiplied by the logarithm of that probability. What purpose does the logarithm serve in this equation?

An intuitive or visual answer (as opposed to a deeply mathematical answer) will be given bonus points!

histelheim
  • 2,993

17 Answers

73

Shannon entropy is a quantity satisfying a set of relations.

In short, the logarithm makes entropy grow linearly with system size and "behave like information".

The first means that the entropy of tossing a coin $n$ times is $n$ times the entropy of tossing a coin once:

$$ - \sum_{i=1}^{2^n} \frac{1}{2^n} \log\left(\tfrac{1}{2^n}\right) = - \sum_{i=1}^{2^n} \frac{1}{2^n} n \log\left(\tfrac{1}{2}\right) = n \left( - \sum_{i=1}^{2} \frac{1}{2} \log\left(\tfrac{1}{2}\right) \right) = n. $$

Or, to see how it works when tossing two different coins (perhaps unfair ones, with heads probability $p_1$ and tails probability $p_2$ for the first coin, and $q_1$ and $q_2$ for the second): $$ -\sum_{i=1}^2 \sum_{j=1}^2 p_i q_j \log(p_i q_j) = -\sum_{i=1}^2 \sum_{j=1}^2 p_i q_j \left( \log(p_i) + \log(q_j) \right) $$ $$ = -\sum_{i=1}^2 \sum_{j=1}^2 p_i q_j \log(p_i) -\sum_{i=1}^2 \sum_{j=1}^2 p_i q_j \log(q_j) = -\sum_{i=1}^2 p_i \log(p_i) - \sum_{j=1}^2 q_j \log(q_j) $$ so the properties of the logarithm (the logarithm of a product is the sum of the logarithms) are crucial.
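
As a minimal numeric check (a Python sketch; the two biased coins below are chosen arbitrarily), the entropy of the joint distribution of two independent coins equals the sum of their individual entropies:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: -sum p * log2(p)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Two independent (possibly unfair) coins -- probabilities chosen arbitrarily.
p = [0.7, 0.3]   # first coin:  P(heads), P(tails)
q = [0.4, 0.6]   # second coin: P(heads), P(tails)

# Joint distribution of the pair: products of the marginal probabilities.
joint = [pi * qj for pi in p for qj in q]

print(entropy(joint))            # entropy of the pair of coins
print(entropy(p) + entropy(q))   # sum of the individual entropies -- same value
```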

However, the Rényi entropy also has this property (it is an entropy parametrized by a real number $\alpha$, which becomes the Shannon entropy as $\alpha \to 1$).

However, here comes the second property - Shannon entropy is special, as it is related to information. To get some intuitive feeling, you can look at $$ H = \sum_i p_i \log \left(\tfrac{1}{p_i} \right) $$ as the average of $\log(1/p)$.

We can call $\log(1/p)$ information. Why? Because if all events happen with probability $p$, it means that there are $1/p$ events. To tell which event has happened, we need $\log(1/p)$ bits (each bit doubles the number of events we can tell apart).
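
A tiny sketch of that counting argument (Python, with an illustrative choice of eight equally likely events): with $p = 1/8$ there are $1/p = 8$ possible outcomes, and $\log_2(1/p) = 3$ bits are enough to give each one its own label.

```python
from math import log2

p = 1 / 8                      # eight equally likely events
n_events = int(1 / p)          # 1/p = 8 possible events
bits = log2(1 / p)             # log2(1/p) = 3.0 bits of information per event

# 3 bits are exactly enough to give each of the 8 events its own label:
labels = [format(i, "03b") for i in range(n_events)]
print(bits, labels)            # 3.0 and the labels '000' through '111'
```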

You may feel anxious: "OK, if all events have the same probability it makes sense to use $\log(1/p)$ as a measure of information. But if they are not, why does averaging information make any sense?" - and it is a natural concern.

But it turns out that it makes sense - Shannon's source coding theorem says that a string of length $n$ with uncorrelated letters occurring with probabilities $\{p_i\}_i$ cannot be compressed (on average) into a binary string shorter than $n H$. And in fact, we can use Huffman coding to compress the string and get very close to $n H$.
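
For a hedged illustration of that point, here is a small Python sketch (the four-letter alphabet and its probabilities are made up) that builds a Huffman code and compares its expected codeword length per letter with the entropy $H$:

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Return the codeword length of each symbol in a Huffman code."""
    # Heap items: (probability, tie-breaker, symbols contained in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:          # each merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
        tie += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]        # made-up letter probabilities
lengths = huffman_lengths(probs)         # -> [1, 2, 3, 3]
avg_length = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * log2(p) for p in probs)
print(avg_length, H)                     # 1.75 1.75
```

With dyadic probabilities like these the two coincide exactly; in general the Huffman average length per letter lies between $H$ and $H + 1$.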


tinlyx
  • 107
Piotr Migdal
  • 5,776
  • This answer has a lot of nice details - but from a layman's perspective it still skirts the issue - what is the role of the logarithm? Why can't we calculate entropy without the logarithm? – histelheim Feb 19 '14 at 19:40
  • @histelheim What do you mean by "without the logarithm"? $\sum_i p_i$ is just one. If you want another measure of diversity without $\log$, look at diversity indices - e.g. the so-called Inverse Simpson index $1/\sum_i p_i^2$, which tells you the effective number of choices (one over the average probability), or the Gini–Simpson index $1-\sum_i p_i^2$, which is always between 0 and 1. And if you don't care about the subtle information-related properties of Shannon entropy, you can use any of them (though they weight low and high probabilities differently). – Piotr Migdal Feb 19 '14 at 19:51
  • I am baffled by your last comment, Histelheim: what could "entropy without the logarithm" possibly refer to? That suggests you haven't yet clearly articulated your question, because it sounds like you have some unstated concept of "entropy" in mind. Please don't keep us guessing--edit your question so that your readers can provide the kinds of answers you are looking for. – whuber Feb 19 '14 at 21:23
  • @ Piotr Migdal - you write that "the logarithm makes entropy grow linearly with system size and 'behave like information'" - this seems crucial for me to understand the role of the logarithm; however, I'm not quite clear what it means. – histelheim Feb 21 '14 at 02:51
  • @ Piotr Migdal - further, your explanation following "We can call log(1/p) information. Why?" seems to make sense to me. Is it that the logarithm essentially moves us from a diversity index to an information index, measuring the number of bits we need to tell the events apart? – histelheim Feb 21 '14 at 02:57
  • whuber - I think @PiotrMigdal's first comment tells what "entropy without the logarithm" is - diversity. Maybe this doesn't make mathematical sense, but it makes intuitive sense - the measures are similar, but not the same. – histelheim Feb 21 '14 at 03:01
  • @whuber I have pretty much the same concern about entropy. Why take a log transformation? How does the logarithm quantify diversity, and why can't variance be used in place of entropy? Large variance means large disorder, and variance uses a square transformation, not a log transform. I am quite intrigued by having to define entropy by the number of bits we need to tell the events apart. Why not simply count the total number of events, or take the average "amount" of events in a system, to quantify diversity? So by that definition a coin has "2 diversity" and a die has "6 diversity". – GENIVI-LEARNER Apr 29 '20 at 19:50
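
To make the comparison in the comments above concrete, here is a small numeric sketch (Python, with an arbitrary example distribution) computing the Shannon entropy and its "effective number of choices" $2^H$ alongside the Inverse Simpson and Gini–Simpson indices:

```python
from math import log2

p = [0.5, 0.25, 0.125, 0.125]   # an arbitrary example distribution

shannon = -sum(pi * log2(pi) for pi in p)       # Shannon entropy in bits
effective = 2 ** shannon                        # 2**H, an "effective number of choices"
inv_simpson = 1 / sum(pi ** 2 for pi in p)      # Inverse Simpson index
gini_simpson = 1 - sum(pi ** 2 for pi in p)     # Gini-Simpson index, always in [0, 1]

print(shannon, effective, inv_simpson, gini_simpson)
# ~1.75  ~3.36  ~2.91  ~0.66 -- all measure spread, but weight rare and common
# outcomes differently
```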
36

This is the same as the other answers, but I think the best way to explain it is to see what Shannon says in his original paper.

The logarithmic measure is more convenient for various reasons:

  1. It is practically more useful. Parameters of engineering importance such as time, bandwidth, number of relays, etc., tend to vary linearly with the logarithm of the number of possibilities. For example, adding one relay to a group doubles the number of possible states of the relays. It adds 1 to the base 2 logarithm of this number. Doubling the time roughly squares the number of possible messages, or doubles the logarithm, etc.
  2. It is nearer to our intuitive feeling as to the proper measure. This is closely related to (1) since we intuitively measure entities by linear comparison with common standards. One feels, for example, that two punched cards should have twice the capacity of one for information storage, and two identical channels twice the capacity of one for transmitting information.
  3. It is mathematically more suitable. Many of the limiting operations are simple in terms of the logarithm but would require clumsy restatement in terms of the number of possibilities.

Source: Shannon, A Mathematical Theory of Communication (1948) [pdf].


Note that the Shannon entropy coincides with the Gibbs entropy of statistical mechanics, and there is also an explanation for why the log occurs in Gibbs entropy. In statistical mechanics, entropy is supposed to be a measure of the number of possible states $\Omega$ in which a system can be found. The reason why $\log \Omega$ is better than $\Omega$ is that $\Omega$ is usually a very fast-growing function of its arguments, and so cannot be usefully approximated by a Taylor expansion, whereas $\log \Omega$ can be. (I don't know whether this was the original motivation for taking the log, but it is explained this way in a lot of introductory physics books.)
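
As a rough numeric illustration (a Python sketch with a toy system of $n$ two-state particles; the numbers are arbitrary): the number of microstates $\Omega = \binom{n}{k}$ grows explosively with system size, while $\log \Omega$ grows roughly linearly and is captured to leading order by the smooth Stirling-type expression $n\,H(k/n)$ (natural log).

```python
from math import comb, log

def binary_entropy(x):
    """Natural-log entropy of a two-state distribution (x, 1 - x)."""
    return -(x * log(x) + (1 - x) * log(1 - x))

for n in (10, 100, 1000):
    k = n // 4                               # say a quarter of the particles are "up"
    omega = comb(n, k)                       # number of microstates
    stirling = n * binary_entropy(k / n)     # leading-order estimate of log(omega)
    print(f"n={n}: omega has {len(str(omega))} digits, "
          f"log(omega)={log(omega):.1f}, n*H(k/n)={stirling:.1f}")
```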

Richard Hardy
  • 67,272
Flounderer
  • 10,518
  • This answer seems to be the most focused yet informative. – bright-star Feb 20 '14 at 01:41
  • This isn't why the log appears in the entropy calculation. This is why the information reported is reported as such. There is an alternative quantity: the "perplexity" that reports information without the log. In this part of his paper, Shannon is arguing in favor of bits/nats/hartleys, and against perplexity. – Neil G Feb 25 '14 at 15:16
  • @NeilG I really like what Flounderer says, that entropy is supposed to measure the different states of the system, i.e. $\Omega$, so why can't we simply define entropy as the number of states the system can take? I.e., the entropy of a die would simply be 6 and the entropy of a coin 2, etc. It makes intuitive sense. Why the log transform? This transform makes sense if you use a certain "medium" to tell the events apart, say the binary-digit medium, or bits, but essentially why should we define entropy by creating some arbitrary medium and using that medium to tell the events apart? In a binary medium 1 bit will [continued] – GENIVI-LEARNER Apr 29 '20 at 20:39
  • @NeilG [continued] distinguish the states of the coin toss. 1 bit has two states. So I can create an arbitrary medium, say a Sixt = 6 states; then 1 Sixt of information is present in a die just like 1 bit of information is in a coin toss. It looks like we are taking a long-distance trip just to count the states by introducing the log. – GENIVI-LEARNER Apr 29 '20 at 20:46
  • @GENIVI-LEARNER I explained why entropy is defined the way it is (instead of your definition, for example) in my answer. – Neil G Apr 29 '20 at 23:27
  • @NeilG I just saw your answer and commented there as it is more appropriate. – GENIVI-LEARNER Apr 30 '20 at 11:06
23

Another way of looking at this is from an algorithmic point of view. Imagine that you're going to guess a number $x$, and the only information you have is that this number is in the interval $1 \leq x \leq N$. In this situation, the optimal algorithm for guessing the number is a simple binary search, which finds $x$ in $O(\log_2 N)$ steps. This formula intuitively says how many questions you need to ask to find out what $x$ is. For example, if $N=8$, you need to ask at most 3 questions to find the unknown $x$.

From the probabilistic perspective, when you declare $x$ as being equally likely to be any value in the range $1 \leq x \leq N$, it means $p(x) = 1/N$ for $1 \leq x \leq N$. Claude Shannon nicely showed that the information content of an outcome $x$ is defined as:

\begin{equation} h(x) = \log_2 \frac{1}{p(x)} \end{equation}

The reason for the base 2 in the logarithm is that here we're measuring the information in bits. You can also use the natural logarithm, which measures the information in nats. As an example, the information content of the outcome $x=4$ is $h(4) = 3$. This value is precisely equal to the number of steps in the binary search algorithm (or the number of IF statements in the algorithm). Therefore, the number of questions you need to ask to find out that $x$ equals $4$ is exactly the information content of the outcome $x=4$.
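
To see that correspondence concretely, here is a minimal Python sketch (the function name and the $N = 8$ example are just for illustration) that counts the yes/no questions a binary search asks before it pins down $x$:

```python
def questions_needed(x, N):
    """Count the yes/no questions a binary search asks to locate x in 1..N."""
    low, high, questions = 1, N, 0
    while low < high:
        mid = (low + high) // 2
        questions += 1                  # ask: "Is x <= mid?"
        if x <= mid:
            high = mid
        else:
            low = mid + 1
    return questions

print(questions_needed(4, 8))           # 3 questions, matching h(4) = log2(8) = 3
```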

We can also analyze the performance of the binary search algorithm over all possible outcomes. One way of doing that is to find the expected number of questions over all values of $x$. Since the number of questions required to guess a value of $x$, as discussed above, is $h(x)$, the expected number of questions is by definition:

\begin{equation} \langle h(x) \rangle = \sum_{1 \leq x \leq N} p(x) h(x) \end{equation}

The expected number of questions $\langle h(x) \rangle$ is just the same as the entropy of the ensemble, $H(X)$, or entropy for short. Therefore, we can conclude that the entropy $H(X)$ quantifies the expected (or average) number of questions one needs to ask in order to guess an outcome, which matches the computational complexity of the binary search algorithm.
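
And averaging those question counts over all outcomes reproduces the entropy. A short sketch, again in Python with the uniform $p(x) = 1/N$, $N = 8$ setup (the helper is repeated so the snippet runs on its own):

```python
from math import log2

def questions_needed(x, N):
    """Yes/no questions a binary search asks to locate x in 1..N (as above)."""
    low, high, questions = 1, N, 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        low, high = (low, mid) if x <= mid else (mid + 1, high)
    return questions

N = 8
avg_questions = sum((1 / N) * questions_needed(x, N) for x in range(1, N + 1))
H = -sum((1 / N) * log2(1 / N) for _ in range(N))   # entropy of the uniform ensemble
print(avg_questions, H)                             # both equal 3.0
```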

omidi
  • 1,039
  • This is one of my favorite applications of information theory - algorithm analysis. If you have decision points with >2 outcomes, such as when you index an array, that's the principle behind hash coding and O(n) sorts. – Mike Dunlavey Feb 20 '14 at 01:07
  • This argument is fine for discrete entropy, but doesn't easily generalize to continuous entropy. – Neil G Feb 25 '14 at 15:25
  • @NeilG It does but the number of questions just becomes infinite. This interpretation is for me the most sensible as it grounds information theory into a real-world application : "how many yes/no questions do I need to answer X?". – cvanelteren Dec 19 '19 at 10:13
  • @cvanelteren I agree that this interpretation is intuitive, but I disagree that the number of questions "becomes infinite" in the continuous case. Continuous entropy is usually finite. That's the problem with this answer: It gave you some intuition, but it fails you in reasoning about the general case. – Neil G Dec 19 '19 at 11:37