The textbook Elements of Information Theory gives us an example:
For example, if we knew the true distribution p of the random
variable, we could construct a code with average description length
H(p). If, instead, we used the code for a distribution q, we would
need H(p) + D(p||q) bits on the average to describe the random
variable.
To paraphrase the above statement: if the data really follow the distribution p but we encode them with a code built for a different distribution q, we pay D(p||q) extra bits on average.
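To make the coding interpretation concrete, here is a minimal Python sketch. The distributions p and q below are made up purely for illustration; the point is only that the average description length under the mismatched code comes out to H(p) + D(p||q).

```python
from math import log2

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average description length when data drawn from p is coded with a code built for q."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # assumed "true" distribution (for illustration only)
q = [0.4, 0.4, 0.2]     # assumed coding distribution

print(cross_entropy(p, q))               # ~1.5719 bits
print(entropy(p) + kl_divergence(p, q))  # same value: H(p) + D(p||q)
```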
An illustration
Let me illustrate this with an application in natural language processing.
Consider a large group of people, labelled B, who act as mediators. Each of them is assigned the task of choosing a noun from turkey, animal and book and transmitting it to C. There is a guy named A who may send each of them an email with some hints. If no one in the group receives such an email, they may raise their eyebrows and hesitate for a while, wondering what C needs, and each option is chosen with probability 1/3, a totally uniform distribution (if it were not uniform, the choice would reflect personal preference, and we just ignore such cases).
But if they are given a verb, like baste, 3/4 of them may choose turkey, 3/16 may choose animal, and 1/16 may choose book. How much information, in bits, has each mediator obtained on average once they know the verb? It is:
\begin{align*}
D(p(nouns|baste)||p(nouns)) &= \sum_{x\in\{turkey, animal, book\}} p(x|baste) \log_2 \frac{p(x|baste)}{p(x)} \\
&= \frac{3}{4} * \log_2 \frac{\frac{3}{4}}{\frac{1}{3}} + \frac{3}{16} * \log_2\frac{\frac{3}{16}}{\frac{1}{3}} + \frac{1}{16} * \log_2\frac{\frac{1}{16}}{\frac{1}{3}}\\
&= 0.5709 \space \space bits\\
\end{align*}
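This number is easy to reproduce numerically. A quick sketch, reusing the kl_divergence helper from the snippet above and the probabilities assumed in the story:

```python
# Distributions over (turkey, animal, book)
prior = [1/3, 1/3, 1/3]            # p(nouns): the uniform prior
given_baste = [3/4, 3/16, 1/16]    # p(nouns|baste)

print(round(kl_divergence(given_baste, prior), 4))  # 0.5709
```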
But what if the given verb is read? We may imagine that all of them would choose book without hesitation, so the average information gain for each mediator from the verb read is:
\begin{align*}
D(p(nouns|read)||p(nouns)) &= \sum_{x\in\{book\}} p(x|read) \log_2 \frac{p(x|read)}{p(x)} \\
&= 1 * \log_2 \frac{1}{\frac{1}{3}} \\
& =1.5849 \space \space bits \\
\end{align*}
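The same helper reproduces this value; the terms where p(x|read) = 0 contribute nothing to the sum:

```python
given_read = [0.0, 0.0, 1.0]   # p(nouns|read): everyone picks "book"

print(round(kl_divergence(given_read, prior), 4))  # 1.585
```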
We can see that the verb read gives the mediators more information, and that is exactly what relative entropy can measure.
Let's continue the story. Suppose C suspects that the noun may be wrong, because A told him that he might have made a mistake and sent the wrong verb to the mediators. How much information, in bits, can such a piece of bad news give C?
1) If the verb given by A was baste:
\begin{align*}
D(p(nouns)||p(nouns|baste)) &= \sum_{x\in\{turkey, animal, book\}} p(x) \log_2 \frac{p(x)}{p(x|baste)} \\
&= \frac{1}{3} * \log_2 \frac{\frac{1}{3}}{\frac{3}{4}} + \frac{1}{3} * \log_2\frac{\frac{1}{3}}{\frac{3}{16}} + \frac{1}{3} * \log_2\frac{\frac{1}{3}}{\frac{1}{16}}\\
&= 0.69172 \space \space bits\\
\end{align*}
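Swapping the arguments of the kl_divergence helper reproduces this value and already hints at the asymmetry discussed below:

```python
# Same distributions as before, but arguments reversed: D(p(nouns) || p(nouns|baste))
print(round(kl_divergence(prior, given_baste), 4))  # 0.6917, not 0.5709
```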
2) But what if the verb was read?
\begin{align*}
D(p(nouns)||p(nouns|read)) &= \sum_{x\in\{turkey, animal, book\}} p(x) \log_2 \frac{p(x)}{p(x|read)} \\
&= \frac{1}{3} * \log_2 \frac{\frac{1}{3}}{1} + \frac{1}{3} * \log_2\frac{\frac{1}{3}}{0} + \frac{1}{3} * \log_2\frac{\frac{1}{3}}{0}\\
&= \infty \space \space bits\\
\end{align*}
The divergence blows up because p(nouns|read) assigns zero probability to turkey and animal while the uniform distribution still gives each of them probability 1/3: C would never know what the other two nouns might have been, and any word in the vocabulary would seem possible.
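In code, the naive kl_divergence helper above would raise a ZeroDivisionError on this case for exactly that reason. A variant that makes the infinite case explicit (a sketch, using the usual convention that 0 · log 0 = 0):

```python
from math import inf, log2

def kl_divergence_safe(p, q):
    """D(p||q) in bits; returns inf when q assigns zero mass to an outcome p allows."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # convention: 0 * log(0/q) = 0
        if qi == 0:
            return inf        # p allows an outcome that q rules out entirely
        total += pi * log2(pi / qi)
    return total

prior = [1/3, 1/3, 1/3]         # p(nouns) over (turkey, animal, book)
given_read = [0.0, 0.0, 1.0]    # p(nouns|read): everyone picks "book"

print(kl_divergence_safe(prior, given_read))   # inf
print(kl_divergence_safe(given_read, prior))   # ~1.585, finite in the other direction
```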
We can see that the KL divergence is asymmetric: in general D(p||q) ≠ D(q||p).
I hope I have got this right; if not, please comment and help me correct it. Thanks in advance.