5

Shannon's notion of information is that if the probability of a random variable's outcome is close to 1, that variable carries little information: we are already nearly certain of the outcome, so observing it tells us little.

Contrast this with Fisher information, which (as I understand it) is the inverse of the covariance matrix. By that definition, when the variance is high, i.e. the uncertainty is high, we have little information, and when the uncertainty is low (the probability of the outcome is close to 1), the information is high.

The two notions of information seem to conflict, and I would like to know if I have understood something wrong.


From one of the references provided by @doubllle, the following plot shows the Shannon entropy for the coin-flip model, a Bernoulli distribution parametrized by $\theta$, compared with the Fisher information for the same model.

[Figure: Shannon entropy of the Bernoulli($\theta$) coin-flip model as a function of $\theta$]
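To make the comparison in the plot concrete, here is a minimal numerical sketch (assuming NumPy is available) using the standard closed forms for a Bernoulli($\theta$) variable, $H(\theta) = -\theta\log\theta - (1-\theta)\log(1-\theta)$ and $I(\theta) = 1/(\theta(1-\theta))$:

```python
import numpy as np

# Shannon entropy of a Bernoulli(theta) variable (in nats)
def bernoulli_entropy(theta):
    return -(theta * np.log(theta) + (1 - theta) * np.log(1 - theta))

# Fisher information of a Bernoulli(theta) observation about its parameter:
# I(theta) = 1 / (theta * (1 - theta))
def bernoulli_fisher(theta):
    return 1.0 / (theta * (1 - theta))

for theta in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"theta={theta:.2f}  H={bernoulli_entropy(theta):.3f}  I={bernoulli_fisher(theta):.3f}")

# Entropy peaks at theta = 0.5 and vanishes near 0 or 1,
# while Fisher information is smallest at 0.5 and blows up near 0 or 1.
```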

  • 2
    Cramér–Rao theorem states that the covariance matrix of any unbiased estimator is bounded below by the inverse of the Fisher information. And Fisher information is defined as the information carried in $X$ about the parameter $\theta$. When the uncertainty is low (the observations are not widely spread), we are naturally more certain about $\theta$. See here and here – doubllle Mar 31 '20 at 21:06
  • Good references, I shall look at them and revert back. So essentially, what I mentioned about Fisher information in my question is wrong? I am struggling to relate Shannon's notion of information to Fisher information. – GENIVI-LEARNER Apr 01 '20 at 15:51
  • If I were you, I'd rephrase the second paragraph. Please also check this: https://stats.stackexchange.com/questions/196576/what-kind-of-information-is-fisher-information/197471#197471 Your question is somewhat covered there – doubllle Apr 01 '20 at 20:40
  • @doubllle thanks a lot for the effort. Your references are really "informative", pun intended :) It will take me some time to crunch them. – GENIVI-LEARNER Apr 02 '20 at 16:05

4 Answers

4

Fisher information and Shannon/Jaynes entropy are very different. For a start, the entropy $\DeclareMathOperator{\E}{\mathbb{E}} H(X) =-\E \log f(X)$ (using this expression to have a common definition for the continuous/discrete case ...) shows that the entropy is the expected negative log-likelihood. It relates only to the distribution of a single random variable $X$; there is no need for $X$ to be embedded in some parametric family. It is, in a sense, the expected informational value from observing $X$, calculated before the experiment. See Statistical interpretation of Maximum Entropy Distribution.
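To illustrate the "expected negative log-likelihood" reading of entropy, here is a minimal Monte Carlo sketch (assuming NumPy; a Gaussian is used only because its differential entropy has a known closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# log-density of N(0, sigma^2)
def log_f(x):
    return -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)

# H(X) = -E[log f(X)], estimated by Monte Carlo over draws from f
x = rng.normal(0.0, sigma, size=200_000)
h_mc = -np.mean(log_f(x))

# Closed form for a Gaussian: 0.5 * log(2 * pi * e * sigma^2)
h_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(h_mc, h_exact)   # the two agree to a couple of decimals
```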

Fisher information, on the other hand, is only defined for a parametric family of distributions. Suppose the family $f(x; \theta)$ for $\theta\in\Theta \subseteq \mathbb{R}^n$, and say $X \sim f(x; \theta_0)$. Then the Fisher information is $\DeclareMathOperator{\V}{\mathbb{V}} \mathbb{I}_{\theta_0} = \V S(\theta_0)$, where $S$ is the score function $S(\theta)=\frac{\partial}{\partial \theta} \log f(x;\theta)$. So the Fisher information is the variance of the gradient of the log-likelihood (the score). The intuition is that where the variance of the gradient of the loglik is "large", it will be easier to discriminate between neighboring parameter values. See What kind of information is Fisher information?.

It is not clear that we should expect any relationship between these two quantities, and I do not know of any. They are also used for different purposes: entropy could be used for design of experiments (maxent), Fisher information for parameter estimation. If there are relationships, maybe look at examples where both can be used?
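As a concrete check that the Fisher information is the variance of the score, here is a small simulation sketch (assuming NumPy) for a single Bernoulli($\theta$) observation, where $S(\theta) = x/\theta - (1-x)/(1-\theta)$ and $\mathbb{I}_\theta = 1/(\theta(1-\theta))$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 0.3

# Score of a single Bernoulli observation: d/dtheta log f(x; theta)
def score(x, theta):
    return x / theta - (1 - x) / (1 - theta)

x = rng.binomial(1, theta0, size=500_000)
fisher_mc = np.var(score(x, theta0))          # variance of the score at theta0
fisher_exact = 1.0 / (theta0 * (1 - theta0))  # known Fisher information

print(fisher_mc, fisher_exact)  # both close to 4.76
```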

  • Quite comprehensive answer. I did not understand what you meant by "where the gradient of the loglik is large, it will be easier to discriminate between neighboring parameter values". Why should it be easier to discriminate between neighboring parameter values, and what exactly is "discriminate"? – GENIVI-LEARNER Apr 05 '20 at 12:39
  • See the edit, was missing "variance of". Estimating by maximum likelihood, we search where the gradient is zero, if the gradient is varying more with the data, the maximum will be more precisely located. See the linked post. – kjetil b halvorsen Apr 06 '20 at 18:50
  • ok it makes sense. So second derivative measures how fast the gradient is varying, right? – GENIVI-LEARNER Apr 06 '20 at 21:11
1

They are both information but are informing you about different things. Fisher information is related to estimating the value of a parameter $\theta$:

$$I_\theta = \mathbb{E}\left [ \nabla_\theta \log p_\theta(X)\,\nabla_\theta \log p_\theta(X)^T \right ] $$

What Fisher information measures is the variability of the score function, $\nabla_\theta \log p_\theta(X)$, i.e. the gradient of the log-likelihood. An easy way to think about this: if the variability of the score is high, the data discriminate sharply between nearby parameter values, and estimation of the parameter $\theta$ is easier.
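To see the outer-product form of the definition numerically, here is a minimal Monte Carlo sketch (assuming NumPy) for a Gaussian with parameters $(\mu, \sigma^2)$, whose Fisher information matrix is known to be $\operatorname{diag}(1/\sigma^2,\, 1/(2\sigma^4))$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 1.0, 4.0
x = rng.normal(mu, np.sqrt(sigma2), size=500_000)

# Gradient of log N(x | mu, sigma2) with respect to (mu, sigma2)
g = np.stack([(x - mu) / sigma2,
              ((x - mu) ** 2 - sigma2) / (2 * sigma2 ** 2)])

# Monte Carlo estimate of E[grad grad^T]
fisher_mc = g @ g.T / x.size

# Known Fisher matrix for (mu, sigma2): diag(1/sigma2, 1/(2*sigma2^2))
fisher_exact = np.diag([1 / sigma2, 1 / (2 * sigma2 ** 2)])

print(np.round(fisher_mc, 4))
print(fisher_exact)
```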

Shannon information is related to the probability distribution of possible outcomes. In your coin example, there is little information in the distribution at the extreme cases $P(X = 1) = 0$ and $P(X = 1) = 1$: if you knew the probability distribution, you would not be surprised or uncertain about any observation in these cases. The entropy is highest at $P(X = 1) = 0.5$, the point of maximum uncertainty.

dtg67
  • 151
  • I am just a little confused. So take the coin example. Shannon information measures the information we will get "after" the coin is tossed, keeping the parameter constant, while Fisher information concerns the variability of the parameter itself, so the parameter of a biased coin could be, say, 0.6, 0.65, 0.7, etc. Does Fisher information measure that? – GENIVI-LEARNER Apr 02 '20 at 18:44
  • 1
    Fisher information requires observations of a random variable and models their distribution using a parameter $\theta$. In Shannon information there is no parameter, because it is not modeling a distribution given observations of a random variable; Shannon information measures the uncertainty of a given process. This is why we have zero uncertainty about which coin side will be observed at the extremes in the coin example and maximum uncertainty when both sides are equally likely. – dtg67 Apr 02 '20 at 19:45
  • Well, in that case, in Shannon's entropy the information pertains to uncertainty? So the higher the uncertainty, the more information we obtain once the outcome is observed. That makes sense. And in the Fisher scenario, the concept of information is how much the score function varies? – GENIVI-LEARNER Apr 02 '20 at 20:06
  • Also, why do you say that when the variability of the score function is high, estimation of the parameter is easier? What does high variability of the score function have to do with estimation of the parameter being easier? And if the variability is low, is parameter estimation difficult? – GENIVI-LEARNER Apr 02 '20 at 20:09
1

The Fisher information in a statistic computed on sample data describes a parameter of the probability distribution from which the data have been sampled. An unbiased statistic's value (ignoring measurement error) is equal to that of the not-directly-observable parameter, plus a random perturbation. The random discrepancy between estimate and parameter, which arises from the sampling process itself (i.e., from the fact that not all population members are in the sample), is called sampling error.

A statistic's Fisher information is inversely related to its error, so that greater error means less information, and vice versa. In short, Fisher information is precision. This is why it is technically meaningless to report a point estimate without a measure of its precision, like the sample standard error: otherwise, we have no idea how informative our estimate is.
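As a small sketch of "Fisher information is precision" (assuming NumPy, and taking the textbook case of estimating a Normal mean with known $\sigma$, where the information in a sample of size $n$ is $n/\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 10.0, 2.0, 25

# Fisher information about mu in a sample of size n (sigma known): n / sigma^2
fisher = n / sigma**2
se_from_fisher = np.sqrt(1 / fisher)          # = sigma / sqrt(n)

# Sampling distribution of the sample mean, simulated
means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
se_simulated = means.std()

print(se_from_fisher, se_simulated)   # both approximately 0.4
```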

In contrast, the Shannon information in a measure describes a message, not a parameter. Unlike a parameter, a message is completely observable: if we have the message because it was composed and transmitted to us, it clearly was observed by someone at some point. Also, unlike a statistical sample, we need not assume a message has a latent structure.

Messages are usually represented as binary strings (or can be expressed as such). Now, suppose we are interested in message $X$ but all we know about $X$ is that its length is $n$. Then the $2^n$ possible binary strings of length $n$ constitute the set of all possible $X$. A measure that provides some non-zero amount of Shannon information about $X$ is anything (an observation, a rule, a function) that allows us to distinguish $X$ from at least one other member of the set. If the measure lets us precisely determine which member of the set $X$ is, it contains the total Shannon information about $X$, the expected value of which is its total Shannon entropy.
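Here is a tiny sketch of this set-narrowing view (assuming NumPy, and taking $X$ to be uniform over the $2^n$ strings, so entropy is just $\log_2$ of the number of remaining candidates):

```python
import numpy as np
from itertools import product

n = 4
messages = [''.join(bits) for bits in product('01', repeat=n)]

def entropy_bits(k):
    # Entropy of a uniform distribution over k equally likely messages
    return np.log2(k)

h_before = entropy_bits(len(messages))                 # n = 4 bits

# A "measure" that reveals the first character of X:
consistent = [m for m in messages if m[0] == '1']      # suppose we learn it is '1'
h_after = entropy_bits(len(consistent))                # n - 1 = 3 bits

print(h_before - h_after)   # 1.0 bit of Shannon information gained
```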

Shannon information, like Fisher information, is probabilistic. Suppose $X$ was sent to us over a noisy connection, so that some 0's are randomly flipped to 1's and vice versa. Call the noisy message $X'$. Shannon entropy can be used to describe the expected characteristics of $X|X'$ probabilistically. Or, if $X'$ is one message plucked from a large number of similar messages (e.g., calls going through a cell tower), the entropy of $X'$ describes the mean message $X$.

Historically, these two fields, and their respective information types, were developed and studied separately. As noted by others, there is no precise translation between them, although mathematical inequalities have been defined. They describe different (even incompatible) attributes of data.

(As an aside, this suggests to me not that the two are unrelated, but that they are in fact complementary and even mutually necessary. Consider, for example, that estimation theory conceives of data randomness in two distinct ways: as a long-run dispersion parameter, and as the sample-specific effect of that parameter. Fisher information can describe the former but not the latter. Were the latter describable by an alternative information type, one independent of the first, their respective measures would fully characterize any unique probability sample.)

virtuolie
  • 528
0

The two notions of information seem to conflict, and I would like to know if I have understood something wrong.

The conflict might arise from the idea that the two notions are related because both are simply "information". But Shannon entropy and the Fisher information matrix are not the same kind of information.

You could see Shannon entropy as the degree of concentration of the probable states/outcomes of a distribution; it is the magnitude of the information.

The Fisher information matrix tells how much the information changes when the parameters of a distribution change.


Fisher information is very much related to entropy, and more specifically to cross-entropy,

$$H(\theta_1,\theta_2) = -\mathbb{E}_{\theta_1}\left[\log p(x|\theta_2)\right] = -\sum_{\forall x} p(x|\theta_1) \log p(x|\theta_2), $$

and you can see it as the second-order rate of change of the cross-entropy as we change the distribution parameter(s) $\theta_2$.

If we define

$$H^{\prime\prime}(\theta_1,\theta_2) = \frac{\partial^2}{\partial \theta_2^2} H(\theta_1,\theta_2) = -\mathbb{E}_{\theta_1}\left[\frac{\partial^2}{\partial \theta_2^2} \log p(x|\theta_2)\right] = -\sum_{\forall x} p(x|\theta_1)\frac{\partial^2}{\partial \theta_2^2} \log p(x|\theta_2), $$

Then the Fisher information is $$\mathcal{I}(\theta) = H^{\prime\prime}(\theta,\theta)$$
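A quick numerical check of this identity for the Bernoulli model (a minimal sketch assuming NumPy; here $H(\theta_1,\theta_2) = -[\theta_1\log\theta_2 + (1-\theta_1)\log(1-\theta_2)]$ and the exact Fisher information is $1/(\theta(1-\theta))$):

```python
import numpy as np

theta = 0.3

# Cross-entropy H(theta1, theta2) for two Bernoulli distributions
def cross_entropy(t1, t2):
    return -(t1 * np.log(t2) + (1 - t1) * np.log(1 - t2))

# Second derivative in theta2, evaluated at theta2 = theta1, via central differences
eps = 1e-4
h_dd = (cross_entropy(theta, theta + eps)
        - 2 * cross_entropy(theta, theta)
        + cross_entropy(theta, theta - eps)) / eps**2

fisher_exact = 1 / (theta * (1 - theta))
print(h_dd, fisher_exact)   # both approximately 4.76
```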

(See also Connection between Fisher metric and the relative entropy)