
So Shannon's information is a way to quantify "distinct knowledge" by counting combinations of microstates. Say 1 bit of information in a binary system conveys one of 2 distinct messages, because there are two possible microstates:

$$\begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

Similarly, 2 bits of information in binary convey one of $2^2 = 4$ distinct messages, because there are 4 possible microstates:

$$\begin{pmatrix} 0\ 0 \\ 0\ 1 \\ 1\ 0 \\ 1\ 1 \end{pmatrix}$$

Now how do we understand this concept using probabilities? If a particular outcome of a random variable, say heads in a biased coin flip, has a probability of 0.3, what does it really mean to say that it conveys $-\log_2(0.3) \approx 1.74$ bits of information? How does the outcome of the coin correspond to 1.74 microstates?
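For concreteness, here is a small Python sketch (my own illustration, not from any reference) of the two calculations above: counting equally likely microstates, and the surprisal $-\log_2 p$ of a single outcome.

```python
import math
from itertools import product

# Equally likely case: n binary digits give 2**n microstates,
# and log2 of that count recovers n bits.
for n in (1, 2):
    microstates = list(product("01", repeat=n))
    print(n, "->", len(microstates), "microstates,",
          math.log2(len(microstates)), "bits of information")

# A single outcome with probability 0.3: its surprisal -log2(p).
p_heads = 0.3
print("surprisal of heads:", -math.log2(p_heads), "bits")  # ~1.74
```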

1 Answer


There is a difference between entropy and the number of microstates when dealing with a random process whose outcomes are not equally probable. In the example of a single coin flip there are only two microstates regardless of coin bias: the coin can come up heads or tails. The entropy, however, differs between a fair coin and a biased coin. For a fair coin the entropy can be calculated the usual way,

$$H(X) = -\sum_{i \in \{h,t\}} p_i \log_2 p_i = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1\;\mathrm{bit}$$

or, because each microstate is equally probable, simply $H(X) = \log_2 2 = 1\;\mathrm{bit}$.

For the biased coin where heads has a probability $p_h = 0.3$ the entropy is,

$$H(X) = -(0.3 \log_2 0.3 + 0.7 \log_2 0.7) \approx 0.88\;\mathrm{bits}$$
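As a quick check, here is a minimal Python sketch (not part of the original answer, just a direct implementation of the formula above) that reproduces both numbers:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(X) = -sum_i p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # fair coin:   1.0 bit
print(entropy_bits([0.3, 0.7]))  # biased coin: ~0.88 bits
```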

The entropy for the biased case is lower because we are less uncertain about the outcome of a coin flip (our intuition tells us that tails is more likely to occur). Another simple example: if we have a random process where we take two coins and flip them, there are four possible microstates, $X = \{hh, ht, th, tt\}$.

For fair coins, where each microstate is equally probable, the entropy is $H(X) = \log_2 4 = 2\;\mathrm{bits}$,

and for two biased coins the entropy is $H(X) \approx 1.76\;\mathrm{bits}$.

Again, the entropy of the biased coins is less than in the equiprobable case because we know that the coins are weighted towards tails.
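Here is a sketch of the two-coin case, assuming (as above) that heads has probability 0.3 and that the two coins are flipped independently, so the microstate probabilities multiply:

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_h, p_t = 0.3, 0.7
# Joint distribution over the four microstates {hh, ht, th, tt}.
joint = [p_h * p_h, p_h * p_t, p_t * p_h, p_t * p_t]

print(entropy_bits([0.25] * 4))  # fair coins:   2.0 bits
print(entropy_bits(joint))       # biased coins: ~1.76 bits
```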

Entropy is genuinely tricky to understand because it is used in chemistry, statistical mechanics, and information theory. In my opinion the best and clearest treatment of entropy is "Where Do We Stand on Maximum Entropy?" [pp. 12-27] by E. T. Jaynes.

– dtg67
  • Yes, entropy as a concept is easy to understand when you bring in equally probable microstates. But what I don't understand is why $\log_b$ is used as a measure or quantification of information, and how, say, a probability of 0.3 tells us anything about a microstate. Or should we not think of it in that way? – GENIVI-LEARNER Apr 06 '20 at 12:59
  • The post here explains it like this: we can call $\log(1/p)$ information. Why? Because if all events happen with probability $p$, it means that there are $1/p$ events. To tell which event has happened, we need to use $\log(1/p)$ bits (each bit doubles the number of events we can tell apart). But how do we understand this if the events don't all happen with probability $p$, meaning different microstates have different probabilities? – GENIVI-LEARNER Apr 06 '20 at 13:03
  • So I totally agree with you: if all microstates are equally probable then entropy indeed reflects the microstates. Moreover, we don't even need the log in that case; we can quantify entropy in terms of microstates. For an unbiased coin it would be $1/p = 1/0.5 = 2$, and for a fair die $1/(1/6) = 6$, which is a fair quantification of the disorder in the system. But when bias occurs we have to use the $\log$, and that's what confuses me: how exactly does the log mitigate the issue of bias? No other SE post actually tackles this point, which I believe is the heart of understanding entropy. – GENIVI-LEARNER Apr 06 '20 at 13:08
  • The entropy of a die is not 6 bits, it's 2.585 bits; you need the logarithm. The probability distribution tells us about the entropy, and about the probability of a microstate. If the entropy is less than the entropy of the equally probable distribution, then one microstate must be more probable and another must be less probable (probabilities must sum to 1). The logarithm in the entropy formula does not mitigate the issue of bias. To me "the heart of understanding entropy" is expanded in "Where Do We Stand on Maximum Entropy?". – dtg67 Apr 06 '20 at 16:39
  • OK, I shall look into the reference you provided. I did not get what you meant by "if the entropy is less than the entropy of the equally probable distribution, then one microstate must be more probable and another must be less probable": what did you mean by one microstate being more probable and another less probable? – GENIVI-LEARNER Apr 06 '20 at 17:06
  • +1 and kudos for referencing Jaynes. @GENIVI-LEARNER the logarithmic relationship between microstate probabilities and (thermodynamic) entropy was already part of Boltzmann's formulation in the 1800s. Jaynes demonstrated the fundamental connection between Shannon entropy and the entropy that had been developed earlier in thermodynamics and statistical mechanics. See this answer and the links from it for further details. – EdM Apr 06 '20 at 18:38
  • @EdM I was looking for a bit more intuition on why the logarithm. To me the more intuitive view would have been that the log measures the permutations a microstate could take, and I just looked at the Boltzmann version of the equation in your reference, which measures exactly that with a logarithm. There $W$ is defined in terms of permutations, which is more intuitive. What I can't see is how the inverse of a probability is linked to permutations. – GENIVI-LEARNER Apr 06 '20 at 21:02
  • @EdM if you have a given number of permutations, say 8, and you know a certain base was used to produce them, say base 2, then $\log_2 8$ tells you how many base symbols were used to produce those 8 permutations; in this case the answer is 3. This is what I understand from Boltzmann's formula for entropy, and it makes more sense. But in terms of the inverse of a probability I can't get a direct intuition or understanding :| – GENIVI-LEARNER Apr 06 '20 at 21:04
  • @GENIVI-LEARNER Boltzmann wanted to find the most probable distribution among states, subject to fixed energy and number of molecules. Maximizing the log of (something proportional to a) probability was more tractable. See page 15 of the Jaynes paper linked in this answer for how the Stirling approximation for factorials leads to a form similar to Shannon entropy. If all microstates are not equally probable then you must generalize to Gibbs entropy. Changing the log base only introduces a constant scale factor. – EdM Apr 06 '20 at 21:40
  • @GENIVI-LEARNER I'm convinced that you have an understanding of what microstates are. A microstate is a unique and specific configuration of the system you are studying. For a coin it's simple: the microstates are H or T. For a box of gas with indistinguishable particles, in Boltzmann's case, it is every single possible arrangement of the atoms, which gives $N!$ permutations, and each permutation is a specific microstate. The logarithm does not "measure the permutations a microstate could take" because there is only one permutation for a microstate: the microstate itself. – dtg67 Apr 06 '20 at 21:56
  • Yes, exactly. Each permutation is a specific microstate and each specific microstate is distinct, so what I understood is that each distinct microstate is considered distinct information, if each microstate is equally likely. – GENIVI-LEARNER Apr 06 '20 at 22:19
  • Also, I acknowledge that what I mentioned earlier, that the logarithm measures the permutations of a microstate, was wrong. I meant a macrostate, because the macrostate determines the permutations, i.e. the distinct microstates. – GENIVI-LEARNER Apr 06 '20 at 22:25
  • I'm not sure what you mean when you say "each distinct microstate is considered distinct information", but the existence of a specific microstate is important for information theory; to quantify information/entropy you need a probability distribution. – dtg67 Apr 06 '20 at 22:30
  • All right, I shall read the reference you provided to get more understanding. What I meant when I said "each distinct microstate can be considered distinct information" is that, in the coin example, a head is a microstate and so is a tail, so if we know a head resulted from a coin toss, we get a distinct piece of information. Actually I got more intuition after discussing in these comments, so kudos for that. – GENIVI-LEARNER Apr 06 '20 at 22:52