
I'm very confused about the difference between information gain and mutual information. To make it even more confusing, I can find both sources defining them as identical and others which explain their differences:

Sources stating Information gain and Mutual information are the same

  • Feature Selection: Information Gain VS Mutual Information
  • An introduction to information retrieval: "Show that mutual information and information gain are equivalent", page 285, exercise 13.13.
  • "It is thus known as the information gain, or more commonly the mutual information between X and Y" --> CS769 Spring 2010 Advanced Natural Language Processing, "Information Theory", lecturer: Xiaojin Zhu
  • "Information gain is also called expected mutual information" --> "Feature Selection Methods for Text Classification", Nicolette Nicolosi, http://www.cs.rit.edu/~nan2563/feature_selection.pdf

Sources stating they're different:

Sources that are confused

I could still find other sources defending opposing theses, but I think these are enough. Can anyone enlighten me about the real difference / equality of these two measures?

EDIT: another related question

Information gain, mutual information and related measures


2 Answers


There are two types of Mutual Information:

  • Pointwise Mutual Information and
  • Expected Mutual Information

The pointwise Mutual Information between the values of two random variables can be defined as: $$ pMI(x;y) := \log \frac{p(x,y)}{p(x)p(y)} $$
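
For concreteness, here is a small numerical sketch (not part of the original definition) that evaluates pMI on a made-up $2 \times 2$ joint distribution; the table values are arbitrary and chosen only for illustration:

```python
from math import log2

# Hypothetical 2x2 joint distribution p(x, y); the values are made up for illustration.
p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}

# Marginals obtained by summing the joint table.
p_x = {'x0': 0.5, 'x1': 0.5}
p_y = {'y0': 0.6, 'y1': 0.4}

def pmi(x, y):
    """Pointwise mutual information: log2 of p(x,y) / (p(x) p(y))."""
    return log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

print(pmi('x0', 'y0'))  # ~0.415 bits: this pair occurs more often than under independence
```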

The expected Mutual Information between two random variables $X$ and $Y$ can be defined as the Kullback-Leibler divergence between $p(X,Y)$ and $p(X)p(Y)$: $$ eMI(X;Y) := \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} $$
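
A minimal sketch of eMI on the same made-up table (again, the numbers are assumptions for illustration), computed as the $p(x,y)$-weighted average of the pointwise terms:

```python
from math import log2

# Same hypothetical joint distribution as above.
p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}
p_x = {'x0': 0.5, 'x1': 0.5}
p_y = {'y0': 0.6, 'y1': 0.4}

# eMI(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
emi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(emi)  # ~0.125 bits for this table
```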

Sometimes you find the definition of Information Gain as $I(X; Y) := H(Y) - H(Y \mid X)$ with the entropy $H(Y)$ and the conditional entropy $H(Y\mid X)$, so

$$ \begin{align} I(X; Y) &= H(Y) - H(Y \mid X)\\ &= - \sum_y p(y) \log p(y) + \sum_{x,y} p(x) p(y\mid x) \log p(y\mid x)\\ &= \sum_{x,y} p(x, y) \log{p(y\mid x)} - \sum_{y} \left(\sum_{x}p(x,y)\right) \log p(y)\\ &= \sum_{x,y} p(x, y) \log{p(y\mid x)} - \sum_{x,y}p(x, y) \log p(y)\\ &= \sum_{x,y} p(x, y) \log \frac{p(y\mid x)}{p(y)}\\ &= \sum_{x,y} p(x, y) \log \frac{p(y\mid x)p(x)}{p(y)p(x)}\\ &= \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(y)p(x)}\\ &= eMI(X;Y) \end{align} $$

Note: $p(y) = \sum_x p(x,y)$

So expected Mutual Information and Information Gain are the same (with both definitions above).
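
If it helps, here is a short numerical check of the derivation (the joint distribution is again a made-up example, not from the answer): it computes $H(Y) - H(Y \mid X)$ and the KL form and confirms they agree.

```python
from math import log2

# Hypothetical joint distribution p(x, y); values are assumptions for illustration only.
p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}
xs, ys = ('x0', 'x1'), ('y0', 'y1')
p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}

# Information gain: H(Y) - H(Y|X)
h_y = -sum(p_y[y] * log2(p_y[y]) for y in ys)
h_y_given_x = -sum(p_xy[(x, y)] * log2(p_xy[(x, y)] / p_x[x]) for x in xs for y in ys)
info_gain = h_y - h_y_given_x

# Expected mutual information: KL( p(x,y) || p(x)p(y) )
emi = sum(p_xy[(x, y)] * log2(p_xy[(x, y)] / (p_x[x] * p_y[y])) for x in xs for y in ys)

print(info_gain, emi)  # both ~0.1245 bits, as the derivation predicts
```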


"Information gain" seems to be an overloaded name that corresponds to multiple formulas. The non-ambiguous names appear to be:

  1. The mutual information linking two random variables X and Y:

$$ MI(X,Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y) $$

where $H$ is the entropy of the random variable, and

  2. the Kullback-Leibler (KL) divergence, which measures the difference between two probability laws or probability density functions:

$$ KL(p,q) = \int p(x) \log \frac{p(x)}{q(x)}dx. $$

These two quantities are linked. After straightforward manipulations from $ MI(X,Y) = H(Y) - H(Y|X)$, we find:

$$ MI(X,Y) = \int p(x)\, KL\big(P(Y \mid X=x) \,\big\|\, P(Y)\big)\, dx, $$

where the right-hand side is the average, over $X$, of the KL divergence between the conditional distribution of $Y$ given $X=x$ and the marginal distribution of $Y$.
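
As a sanity check, here is a discrete analogue of this identity (the joint distribution is a made-up example): it averages $KL\big(P(Y \mid X=x)\,\|\,P(Y)\big)$ over $x$ with weights $p(x)$ and compares the result with $H(X) + H(Y) - H(X,Y)$.

```python
from math import log2

# Hypothetical discrete joint distribution p(x, y); values are assumptions for illustration.
p_xy = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
        ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.3}
xs, ys = ('x0', 'x1'), ('y0', 'y1')
p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}

def kl(p, q):
    """KL divergence between two discrete distributions given as dicts over the same keys."""
    return sum(p[k] * log2(p[k] / q[k]) for k in p)

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values())

# Average over x of KL( p(y|x) || p(y) ): the discrete version of the integral above.
avg_kl = sum(p_x[x] * kl({y: p_xy[(x, y)] / p_x[x] for y in ys}, p_y) for x in xs)

# Mutual information via entropies: H(X) + H(Y) - H(X, Y)
mi = entropy(p_x) + entropy(p_y) - entropy(p_xy)

print(avg_kl, mi)  # both ~0.1245 bits
```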
