
I'm very confused about the difference between information gain and mutual information. To make it even more confusing, I can find both sources defining them as identical and others which explain their differences:

Sources stating Information gain and Mutual information are the same

  • Feature Selection: Information Gain VS Mutual Information
  • An introduction to information retrieval: "Show that mutual information and information gain are equivalent", page 285, exercise 13.13.
  • "It is thus known as the information gain, or more commonly the mutual information between X and Y" --> CS769 Spring 2010 Advanced Natural Language Processing, "Information Theory", lecturer: Xiaojin Zhu
  • "Information gain is also called expected mutual information" --> "Feature Selection Methods for Text Classification", Nicolette Nicolosi, http://www.cs.rit.edu/~nan2563/feature_selection.pdf

Sources stating they're different:

Sources that are confused

I could still find other sources defending opposing theses, but I think these are enough. Can anyone enlighten me about the real difference / equality of these two measures?

EDIT: another related question

Information gain, mutual information and related measures


2 Answers


There are two types of Mutual Information:

  • Pointwise Mutual Information and
  • Expected Mutual Information

The pointwise Mutual Information between the values of two random variables can be defined as: $$ pMI(x;y) := \log \frac{p(x,y)}{p(x)p(y)} $$
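
For concreteness, here is a small numerical sketch (not part of the original definition) that evaluates pMI on a made-up $2 \times 2$ joint distribution; the table values are arbitrary and chosen only for illustration:

```python
from math import log2

# Hypothetical 2x2 joint distribution p(x, y); the values are made up for illustration.
p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}

# Marginals obtained by summing the joint table.
p_x = {'x0': 0.5, 'x1': 0.5}
p_y = {'y0': 0.6, 'y1': 0.4}

def pmi(x, y):
    """Pointwise mutual information: log2 of p(x,y) / (p(x) p(y))."""
    return log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

print(pmi('x0', 'y0'))  # ~0.415 bits: this pair occurs more often than under independence
```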

The expected Mutual Information between two random variables $X$ and $Y$ can be defined as the Kullback-Leibler divergence between $p(X,Y)$ and $p(X)p(Y)$: $$ eMI(X;Y) := \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} $$
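
A minimal sketch of eMI on the same made-up table (again, the numbers are assumptions for illustration), computed as the $p(x,y)$-weighted average of the pointwise terms:

```python
from math import log2

# Same hypothetical joint distribution as above.
p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}
p_x = {'x0': 0.5, 'x1': 0.5}
p_y = {'y0': 0.6, 'y1': 0.4}

# eMI(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
emi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(emi)  # ~0.125 bits for this table
```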

Sometimes you find the definition of Information Gain as $I(X; Y) := H(Y) - H(Y \mid X)$ with the entropy $H(Y)$ and the conditional entropy $H(Y\mid X)$, so

$$ \begin{align} I(X; Y) &= H(Y) - H(Y \mid X)\\ &= - \sum_y p(y) \log p(y) + \sum_{x,y} p(x) p(y\mid x) \log p(y\mid x)\\ &= \sum_{x,y} p(x, y) \log{p(y\mid x)} - \sum_{y} \left(\sum_{x}p(x,y)\right) \log p(y)\\ &= \sum_{x,y} p(x, y) \log{p(y\mid x)} - \sum_{x,y}p(x, y) \log p(y)\\ &= \sum_{x,y} p(x, y) \log \frac{p(y\mid x)}{p(y)}\\ &= \sum_{x,y} p(x, y) \log \frac{p(y\mid x)p(x)}{p(y)p(x)}\\ &= \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(y)p(x)}\\ &= eMI(X;Y) \end{align} $$

Note: $p(y) = \sum_x p(x,y)$

So expected Mutual Information and Information Gain are the same (with both definitions above).
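
If it helps, here is a short numerical check of the derivation (the joint distribution is again a made-up example, not from the answer): it computes $H(Y) - H(Y \mid X)$ and the KL form and confirms they agree.

```python
from math import log2

# Hypothetical joint distribution p(x, y); values are assumptions for illustration only.
p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}
xs, ys = ('x0', 'x1'), ('y0', 'y1')
p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}

# Information gain: H(Y) - H(Y|X)
h_y = -sum(p_y[y] * log2(p_y[y]) for y in ys)
h_y_given_x = -sum(p_xy[(x, y)] * log2(p_xy[(x, y)] / p_x[x]) for x in xs for y in ys)
info_gain = h_y - h_y_given_x

# Expected mutual information: KL( p(x,y) || p(x)p(y) )
emi = sum(p_xy[(x, y)] * log2(p_xy[(x, y)] / (p_x[x] * p_y[y])) for x in xs for y in ys)

print(info_gain, emi)  # both ~0.1245 bits, as the derivation predicts
```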


"Information gain" seems to be an overloaded name that corresponds to multiple formulas. The non-ambiguous names appear to be:

  1. The mutual information linking two random variables X and Y:

$$ MI(X,Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y) $$

where $H$ is the entropy of the random variable, and

  2. the Kullback-Leibler (KL) divergence, which measures the difference between two probability laws or probability density functions:

$$ KL(p,q) = \int p(x) \log \frac{p(x)}{q(x)}dx. $$

These two quantities are linked. After straightforward manipulations from $ MI(X,Y) = H(Y) - H(Y|X)$, we find:

$$ MI(X,Y) = \int p(x)\, KL\big(P(Y \mid X=x) \,\big\|\, P(Y)\big)\, dx, $$

where the right-hand side is the average, over $X$, of the KL divergence between the conditional distribution of $Y$ given $X=x$ and the marginal distribution of $Y$.
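
As a sanity check, here is a discrete analogue of this identity (the joint distribution is a made-up example): it averages $KL\big(P(Y \mid X=x)\,\|\,P(Y)\big)$ over $x$ with weights $p(x)$ and compares the result with $H(X) + H(Y) - H(X,Y)$.

```python
from math import log2

# Hypothetical discrete joint distribution p(x, y); values are assumptions for illustration.
p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}
xs, ys = ('x0', 'x1'), ('y0', 'y1')
p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}

def kl(p, q):
    """KL divergence between two discrete distributions given as dicts over the same keys."""
    return sum(p[k] * log2(p[k] / q[k]) for k in p)

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values())

# Average over x of KL( p(y|x) || p(y) ): the discrete version of the integral above.
avg_kl = sum(p_x[x] * kl({y: p_xy[(x, y)] / p_x[x] for y in ys}, p_y) for x in xs)

# Mutual information via entropies: H(X) + H(Y) - H(X, Y)
mi = entropy(p_x) + entropy(p_y) - entropy(p_xy)

print(avg_kl, mi)  # both ~0.1245 bits
```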
