I would love to get some help with this interview question, which I failed to answer correctly:
In what sense is the JS divergence preferable to the KL divergence $D_{KL}(p_{data} \parallel p_{model})$ as an objective for GANs?
Hint: Analyze the case where $p_{data}(x) > 0$ but $p_{model}(x)$ is near zero.
1 Answer
You have $$ D_{KL}(p_{data} \parallel p_{model}) = \int_{-\infty}^{\infty} p_{data}(x) \log \frac{p_{data}(x)}{p_{model}(x)} \, dx $$ and $$ D_{JS}(p_{data} \parallel p_{model}) = \frac{1}{2}D_{KL}\Big(p_{data} \parallel \tfrac{1}{2} (p_{model}+p_{data})\Big) + \frac{1}{2}D_{KL}\Big(p_{model} \parallel \tfrac{1}{2} (p_{model}+p_{data})\Big). $$ It is straightforward to show, as the hint suggests, that if $p_{model}(x) = \epsilon$ on some interval $[a,b]$ where $p_{data}(x) > 0$, then $D_{KL}(p_{data} \parallel p_{model}) \rightarrow \infty$ as $\epsilon \rightarrow 0$, but $D_{JS}$ does not: in each KL term of $D_{JS}$ the second argument is the mixture $\tfrac{1}{2}(p_{model}+p_{data})$, which is at least half of the first argument wherever that argument is positive, so each log ratio is bounded by $\log 2$ and $D_{JS} \le \log 2$.
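Here is a minimal numerical sketch of that behavior (my own toy illustration, not part of the question): discrete distributions over ten bins, with $p_{model}$ assigning probability $\epsilon$ to the five bins where $p_{data}$ has mass. The bin layout and the `kl`/`js` helpers are just assumptions for the example.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence, written directly from its definition."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.full(10, 0.1)                  # uniform over all 10 bins
for eps in [1e-2, 1e-6, 1e-12]:
    p_model = np.empty(10)
    p_model[:5] = eps                      # near-zero where p_data > 0
    p_model[5:] = (1.0 - 5 * eps) / 5      # remaining mass on the other bins
    print(f"eps={eps:.0e}  KL={kl(p_data, p_model):8.2f}  "
          f"JS={js(p_data, p_model):.4f}  (log 2 = {np.log(2):.4f})")
```

As $\epsilon$ shrinks, the printed KL value grows without bound while the JS value stays below $\log 2$.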
If for two data distributions $A$ and $B$ there are intervals $[a,b]$ and $[c,d]$ such that $$p_{data_A}(x) > 0, \; p_{model}(x) = \epsilon \text{ for } x \in [a,b],$$ $$p_{data_B}(x) > 0, \; p_{model}(x) = \epsilon \text{ for } x \in [c,d],$$ then both $D_{KL}(p_{data_A} \parallel p_{model})$ and $D_{KL}(p_{data_B} \parallel p_{model}) \rightarrow \infty$ as $\epsilon \rightarrow 0$, and you will not be able to distinguish the two data distributions relative to the model distribution, regardless of how small $[a,b]$ and $[c,d]$ are.
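To make that concrete in the same toy setting (again my own sketch, with arbitrary bin choices; the helpers are repeated so the snippet runs on its own): $p_{data_A}$ overlaps the $\epsilon$-region of $p_{model}$ on five bins, $p_{data_B}$ on a single bin. Both KL values blow up as $\epsilon$ shrinks, while the JS values stay finite and clearly different.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence, written directly from its definition."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data_A = np.full(10, 0.1)        # mass on every bin, incl. all 5 eps-bins
p_data_B = np.zeros(10)
p_data_B[4] = 0.1                  # mass on just one eps-bin ...
p_data_B[5:] = 0.18                # ... and the rest where p_model is large

for eps in [1e-3, 1e-9, 1e-15]:
    p_model = np.empty(10)
    p_model[:5] = eps                      # near-zero region of the model
    p_model[5:] = (1.0 - 5 * eps) / 5
    print(f"eps={eps:.0e}  "
          f"KL_A={kl(p_data_A, p_model):7.2f}  KL_B={kl(p_data_B, p_model):6.2f}  "
          f"JS_A={js(p_data_A, p_model):.4f}  JS_B={js(p_data_B, p_model):.4f}")
# Both KL columns diverge as eps -> 0, so they stop telling you anything useful
# about how A and B relate to the model; the JS columns stay bounded and distinct.
```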
There are other differences between the two divergences as well; the asymmetry of $D_{KL}$ comes to mind ($D_{JS}$ is symmetric in its two arguments, while $D_{KL}$ is not).