117

Shannon's entropy is the negative of the sum, over all outcomes, of the probability of each outcome multiplied by the logarithm of that probability. What purpose does the logarithm serve in this equation?

An intuitive or visual answer (as opposed to a deeply mathematical answer) will be given bonus points!

histelheim
  • 2,993

17 Answers

73

Shannon entropy is a quantity satisfying a set of relations.

In short, the logarithm makes entropy grow linearly with system size and "behave like information".

The first means that the entropy of tossing a coin $n$ times is $n$ times the entropy of tossing a coin once:

$$ - \sum_{i=1}^{2^n} \frac{1}{2^n} \log\left(\tfrac{1}{2^n}\right) = - \sum_{i=1}^{2^n} \frac{1}{2^n} n \log\left(\tfrac{1}{2}\right) = n \left( - \sum_{i=1}^{2} \frac{1}{2} \log\left(\tfrac{1}{2}\right) \right) = n. $$

Or, to see how it works when tossing two different coins (perhaps unfair ones, with heads probability $p_1$ and tails probability $p_2$ for the first coin, and $q_1$ and $q_2$ for the second): $$ -\sum_{i=1}^2 \sum_{j=1}^2 p_i q_j \log(p_i q_j) = -\sum_{i=1}^2 \sum_{j=1}^2 p_i q_j \left( \log(p_i) + \log(q_j) \right) $$ $$ = -\sum_{i=1}^2 \sum_{j=1}^2 p_i q_j \log(p_i) -\sum_{i=1}^2 \sum_{j=1}^2 p_i q_j \log(q_j) = -\sum_{i=1}^2 p_i \log(p_i) - \sum_{j=1}^2 q_j \log(q_j) $$ so the properties of the logarithm (the logarithm of a product is the sum of the logarithms) are crucial.
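
As a minimal numeric check (a Python sketch; the two biased coins below are chosen arbitrarily), the entropy of the joint distribution of two independent coins equals the sum of their individual entropies:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: -sum p * log2(p)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Two independent (possibly unfair) coins -- probabilities chosen arbitrarily.
p = [0.7, 0.3]   # first coin:  P(heads), P(tails)
q = [0.4, 0.6]   # second coin: P(heads), P(tails)

# Joint distribution of the pair: products of the marginal probabilities.
joint = [pi * qj for pi in p for qj in q]

print(entropy(joint))            # entropy of the pair of coins
print(entropy(p) + entropy(q))   # sum of the individual entropies -- same value
```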

However, the Rényi entropy also has this property (it is an entropy parametrized by a real number $\alpha$, which becomes the Shannon entropy as $\alpha \to 1$).

However, here comes the second property - Shannon entropy is special, as it is related to information. To get some intuitive feeling, you can look at $$ H = \sum_i p_i \log \left(\tfrac{1}{p_i} \right) $$ as the average of $\log(1/p)$.

We can call $\log(1/p)$ information. Why? Because if all events happen with probability $p$, it means that there are $1/p$ events. To tell which event has happened, we need $\log(1/p)$ bits (each bit doubles the number of events we can tell apart).
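
A tiny sketch of that counting argument (Python, with an illustrative choice of eight equally likely events): with $p = 1/8$ there are $1/p = 8$ possible outcomes, and $\log_2(1/p) = 3$ bits are enough to give each one its own label.

```python
from math import log2

p = 1 / 8                      # eight equally likely events
n_events = int(1 / p)          # 1/p = 8 possible events
bits = log2(1 / p)             # log2(1/p) = 3.0 bits of information per event

# 3 bits are exactly enough to give each of the 8 events its own label:
labels = [format(i, "03b") for i in range(n_events)]
print(bits, labels)            # 3.0 and the labels '000' through '111'
```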

You may feel anxious: "OK, if all events have the same probability it makes sense to use $\log(1/p)$ as a measure of information. But if they are not, why does averaging information make any sense?" - and it is a natural concern.

But it turns out that it makes sense - Shannon's source coding theorem says that a string of length $n$ with uncorrelated letters occurring with probabilities $\{p_i\}_i$ cannot be compressed (on average) into a binary string shorter than $n H$. And in fact, we can use Huffman coding to compress the string and get very close to $n H$.
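
For a hedged illustration of that point, here is a small Python sketch (the four-letter alphabet and its probabilities are made up) that builds a Huffman code and compares its expected codeword length per letter with the entropy $H$:

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Return the codeword length of each symbol in a Huffman code."""
    # Heap items: (probability, tie-breaker, symbols contained in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:          # each merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, syms1 + syms2))
        tie += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]        # made-up letter probabilities
lengths = huffman_lengths(probs)         # -> [1, 2, 3, 3]
avg_length = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * log2(p) for p in probs)
print(avg_length, H)                     # 1.75 1.75
```

With dyadic probabilities like these the two coincide exactly; in general the Huffman average length per letter lies between $H$ and $H + 1$.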


tinlyx
  • 107
Piotr Migdal
  • 5,776
  • This answer has a lot of nice details - but from a layman's perspective it still skirts the issue - what is the role of the logarithm? Why can't we calculate entropy without the logarithm? – histelheim Feb 19 '14 at 19:40
  • @histelheim What do you mean by "without the logarithm"? $\sum_i p_i$ is just one. If you want another measure of diversity without $\log$, look at diversity indices - e.g. the so-called Inverse Simpson index $1/\sum_i p_i^2$, which tells you the effective number of choices (one over the average probability), or the Gini–Simpson index $1-\sum_i p_i^2$, which is always between 0 and 1. And if you don't care about the subtle information-related properties of Shannon entropy, you can use any of them (though they weight low and high probabilities differently). – Piotr Migdal Feb 19 '14 at 19:51
  • I am baffled by your last comment, Histelheim: what could "entropy without the logarithm" possibly refer to? That suggests you haven't yet clearly articulated your question, because it sounds like you have some unstated concept of "entropy" in mind. Please don't keep us guessing--edit your question so that your readers can provide the kinds of answers you are looking for. – whuber Feb 19 '14 at 21:23
  • @ Piotr Migdal - you write that "the logarithm makes entropy grow linearly with system size and 'behave like information'" - this seems crucial for me to understand the role of the logarithm; however, I'm not quite clear what it means. – histelheim Feb 21 '14 at 02:51
  • @ Piotr Migdal - further, your explanation following "We can call log(1/p) information. Why?" seems to make sense to me. Is it that the logarithm essentially moves us from a diversity index to an information index, measuring the number of bits we need to tell the events apart? – histelheim Feb 21 '14 at 02:57
  • whuber - I think @PiotrMigdal's first comment tells what "entropy without the logarithm" is - diversity. Maybe this doesn't make mathematical sense, but it makes intuitive sense - the measures are similar, but not the same. – histelheim Feb 21 '14 at 03:01
  • @whuber I have pretty much the same concern about entropy. Why take a log transformation? How does the logarithm quantify diversity, and why can't variance be used in place of entropy? Large variance means large disorder, and variance uses a square transformation, not a log transform. I am quite intrigued by having to define entropy by the number of bits we need to tell the events apart. Why not simply count the total number of events, or take the average "amount" of events in a system, to quantify diversity? So by that definition a coin has "2 diversity" and a die has "6 diversity". – GENIVI-LEARNER Apr 29 '20 at 19:50
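
To make the comparison in the comments above concrete, here is a small numeric sketch (Python, with an arbitrary example distribution) computing the Shannon entropy and its "effective number of choices" $2^H$ alongside the Inverse Simpson and Gini–Simpson indices:

```python
from math import log2

p = [0.5, 0.25, 0.125, 0.125]   # an arbitrary example distribution

shannon = -sum(pi * log2(pi) for pi in p)       # Shannon entropy in bits
effective = 2 ** shannon                        # 2**H, an "effective number of choices"
inv_simpson = 1 / sum(pi ** 2 for pi in p)      # Inverse Simpson index
gini_simpson = 1 - sum(pi ** 2 for pi in p)     # Gini-Simpson index, always in [0, 1]

print(shannon, effective, inv_simpson, gini_simpson)
# ~1.75  ~3.36  ~2.91  ~0.66 -- all measure spread, but weight rare and common
# outcomes differently
```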
36

This is the same as the other answers, but I think the best way to explain it is to see what Shannon says in his original paper.

The logarithmic measure is more convenient for various reasons:

  1. It is practically more useful. Parameters of engineering importance such as time, bandwidth, number of relays, etc., tend to vary linearly with the logarithm of the number of possibilities. For example, adding one relay to a group doubles the number of possible states of the relays. It adds 1 to the base 2 logarithm of this number. Doubling the time roughly squares the number of possible messages, or doubles the logarithm, etc.
  2. It is nearer to our intuitive feeling as to the proper measure. This is closely related to (1) since we intuitively measure entities by linear comparison with common standards. One feels, for example, that two punched cards should have twice the capacity of one for information storage, and two identical channels twice the capacity of one for transmitting information.
  3. It is mathematically more suitable. Many of the limiting operations are simple in terms of the logarithm but would require clumsy restatement in terms of the number of possibilities.

Source: Shannon, A Mathematical Theory of Communication (1948) [pdf].


Note that the Shannon entropy coincides with the Gibbs entropy of statistical mechanics, and there is also an explanation for why the log occurs in Gibbs entropy. In statistical mechanics, entropy is supposed to be a measure of the number of possible states $\Omega$ in which a system can be found. The reason why $\log \Omega$ is better than $\Omega$ is that $\Omega$ is usually a very fast-growing function of its arguments, and so cannot be usefully approximated by a Taylor expansion, whereas $\log \Omega$ can be. (I don't know whether this was the original motivation for taking the log, but it is explained this way in a lot of introductory physics books.)
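
As a rough numeric illustration (a Python sketch with a toy system of $n$ two-state particles; the numbers are arbitrary): the number of microstates $\Omega = \binom{n}{k}$ grows explosively with system size, while $\log \Omega$ grows roughly linearly and is captured to leading order by the smooth Stirling-type expression $n\,H(k/n)$ (natural log).

```python
from math import comb, log

def binary_entropy(x):
    """Natural-log entropy of a two-state distribution (x, 1 - x)."""
    return -(x * log(x) + (1 - x) * log(1 - x))

for n in (10, 100, 1000):
    k = n // 4                               # say a quarter of the particles are "up"
    omega = comb(n, k)                       # number of microstates
    stirling = n * binary_entropy(k / n)     # leading-order estimate of log(omega)
    print(f"n={n}: omega has {len(str(omega))} digits, "
          f"log(omega)={log(omega):.1f}, n*H(k/n)={stirling:.1f}")
```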

Richard Hardy
  • 67,272
Flounderer
  • 10,518
  • This answer seems to be the most focused yet informative. – bright-star Feb 20 '14 at 01:41
  • This isn't why the log appears in the entropy calculation. This is why the information reported is reported as such. There is an alternative quantity: the "perplexity" that reports information without the log. In this part of his paper, Shannon is arguing in favor of bits/nats/hartleys, and against perplexity. – Neil G Feb 25 '14 at 15:16
  • @NeilG I really like what Flounderer says, that entropy is supposed to measure the different states of the system, i.e. $\Omega$, so why can't we simply define entropy as the number of states the system can take? I.e., the entropy of a die would simply be 6 and the entropy of a coin 2, etc. It makes intuitive sense. Why the log transform? This transform makes sense if you use a certain "medium" to tell the events apart, say the binary-digit medium, or bits, but essentially why should we define entropy by creating some arbitrary medium and using that medium to tell the events apart? In a binary medium 1 bit will [continued] – GENIVI-LEARNER Apr 29 '20 at 20:39
  • @NeilG [continued] distinguish the states of the coin toss. 1 bit has two states. So I can create an arbitrary medium, say a Sixt = 6 states; then 1 Sixt of information is present in a die just like 1 bit of information is in a coin toss. It looks like we are taking a long-distance trip just to count the states by introducing the log. – GENIVI-LEARNER Apr 29 '20 at 20:46
  • @GENIVI-LEARNER I explained why entropy is defined the way it is (instead of your definition, for example) in my answer. – Neil G Apr 29 '20 at 23:27
  • @NeilG I just saw your answer and commented there as it is more appropriate. – GENIVI-LEARNER Apr 30 '20 at 11:06
23

Another way of looking at this is from an algorithmic point of view. Imagine that you're going to guess a number $x$, and the only information you have is that this number is in the interval $1 \leq x \leq N$. In this situation, the optimal algorithm for guessing the number is a simple binary search, which finds $x$ in $O(\log_2 N)$ steps. This formula intuitively says how many questions you need to ask to find out what $x$ is. For example, if $N=8$, you need to ask at most 3 questions to find the unknown $x$.

From the probabilistic perspective, when you declare $x$ as being equally likely to be any value in the range $1 \leq x \leq N$, it means $p(x) = 1/N$ for $1 \leq x \leq N$. Claude Shannon nicely showed that the information content of an outcome $x$ is defined as:

\begin{equation} h(x) = \log_2 \frac{1}{p(x)} \end{equation}

The reason for the base 2 in the logarithm is that here we're measuring the information in bits. You can also use the natural logarithm, which measures the information in nats. As an example, the information content of the outcome $x=4$ is $h(4) = 3$. This value is precisely equal to the number of steps in the binary search algorithm (or the number of IF statements in the algorithm). Therefore, the number of questions you need to ask to find out that $x$ equals $4$ is exactly the information content of the outcome $x=4$.
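
To see that correspondence concretely, here is a minimal Python sketch (the function name and the $N = 8$ example are just for illustration) that counts the yes/no questions a binary search asks before it pins down $x$:

```python
def questions_needed(x, N):
    """Count the yes/no questions a binary search asks to locate x in 1..N."""
    low, high, questions = 1, N, 0
    while low < high:
        mid = (low + high) // 2
        questions += 1                  # ask: "Is x <= mid?"
        if x <= mid:
            high = mid
        else:
            low = mid + 1
    return questions

print(questions_needed(4, 8))           # 3 questions, matching h(4) = log2(8) = 3
```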

We can also analyze the performance of the binary search algorithm over all possible outcomes. One way of doing that is to find the expected number of questions over all values of $x$. Since the number of questions required to guess a value of $x$, as discussed above, is $h(x)$, the expected number of questions is by definition:

\begin{equation} \langle h(x) \rangle = \sum_{1 \leq x \leq N} p(x) h(x) \end{equation}

The expected number of questions $\langle h(x) \rangle$ is just the same as the entropy of the ensemble, $H(X)$, or entropy for short. Therefore, we can conclude that the entropy $H(X)$ quantifies the expected (or average) number of questions one needs to ask in order to guess an outcome, which matches the computational complexity of the binary search algorithm.
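
And averaging those question counts over all outcomes reproduces the entropy. A short sketch, again in Python with the uniform $p(x) = 1/N$, $N = 8$ setup (the helper is repeated so the snippet runs on its own):

```python
from math import log2

def questions_needed(x, N):
    """Yes/no questions a binary search asks to locate x in 1..N (as above)."""
    low, high, questions = 1, N, 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        low, high = (low, mid) if x <= mid else (mid + 1, high)
    return questions

N = 8
avg_questions = sum((1 / N) * questions_needed(x, N) for x in range(1, N + 1))
H = -sum((1 / N) * log2(1 / N) for _ in range(N))   # entropy of the uniform ensemble
print(avg_questions, H)                             # both equal 3.0
```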

omidi
  • 1,039
  • This is one of my favorite applications of information theory - algorithm analysis. If you have decision points with >2 outcomes, such as when you index an array, that's the principle behind hash coding and O(n) sorts. – Mike Dunlavey Feb 20 '14 at 01:07
  • This argument is fine for discrete entropy, but doesn't easily generalize to continuous entropy. – Neil G Feb 25 '14 at 15:25
  • @NeilG It does but the number of questions just becomes infinite. This interpretation is for me the most sensible as it grounds information theory into a real-world application : "how many yes/no questions do I need to answer X?". – cvanelteren Dec 19 '19 at 10:13
  • @cvanelteren I agree that this interpretation is intuitive, but I disagree that the number of questions "becomes infinite" in the continuous case. Continuous entropy is usually finite. That's the problem with this answer: It gave you some intuition, but it fails you in reasoning about the general case. – Neil G Dec 19 '19 at 11:37