I'm trying to understand some machine learning theory background: specifically, the relationship between cross entropy loss and "negative log likelihood".
To start, I already fully understand these definitions:
- Entropy of a probability distribution $p$ with $K$ classes:
$$ H(p) = - \sum_{k=1}^{K} p_k \log p_k $$
- Cross entropy between two probability distributions $p$ (ground-truth) and $q$ (predicted):
$$ H(p, q) = - \sum_{k=1}^{K} p_k \log q_k $$
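To make sure I'm reading these definitions correctly, here is a tiny NumPy check I wrote (the distributions $p$ and $q$ are arbitrary toy values I made up, not from the book):

```python
import numpy as np

# Arbitrary toy distributions over K = 3 classes (my own made-up values).
p = np.array([0.7, 0.2, 0.1])   # "ground truth" p
q = np.array([0.5, 0.3, 0.2])   # "prediction" q

entropy_p = -np.sum(p * np.log(p))         # H(p)    ≈ 0.8018
cross_entropy_pq = -np.sum(p * np.log(q))  # H(p, q) ≈ 0.8869

print(entropy_p, cross_entropy_pq)
```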
My specific confusion comes from reading Kevin Murphy's 2021 book "Probabilistic Machine Learning: An Introduction". He says something like the following about the Kullback-Leibler divergence (my paraphrased summary of Sections 4.2 and 6.2):
$$ KL(p||q) = \sum_{k=1}^{K} p_k \log p_k - \sum_{k=1}^{K} p_k \log q_k $$
We recognize the first term as the negative entropy and the second term as the cross entropy. The first term is a constant with respect to our predictions $q$, so we can ignore it.
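Numerically this decomposition does seem to hold, at least in a small sanity check with the same toy $p$ and $q$ as above (again my own values, not the book's):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_direct = np.sum(p * np.log(p / q))    # KL(p||q) computed directly
neg_entropy = np.sum(p * np.log(p))      # first term:  sum_k p_k log p_k  = -H(p)
cross_entropy = -np.sum(p * np.log(q))   # second term: -sum_k p_k log q_k = H(p, q)

print(np.isclose(kl_direct, neg_entropy + cross_entropy))  # True
```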
Let us suppose the distribution $p$ is defined with a delta function $\delta$ like this: $p(y) = \frac{1}{N} \sum_{n=1}^{N} \delta(y - y_n)$.
Then the KL divergence becomes
\begin{align} KL(p||q) &= -H(p) - \frac{1}{N} \sum_{n=1}^{N} \log q(y_n) \\ &= \text{constant} + \text{NLL}. \end{align}
This is called the cross-entropy objective, and it is equal to the average negative log likelihood of $q$ on the training set.
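Here is how I currently picture this claim, as a small NumPy sketch with made-up data. I'm assuming the delta function amounts to a one-hot target for each example (which is part of what I'm asking below), so the per-example cross entropy keeps exactly one term, $\log q(y_n)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes and data: N = 5 training examples, K = 3 classes.
N, K = 5, 3
y = rng.integers(0, K, size=N)                                   # integer labels y_n
logits = rng.normal(size=(N, K))
q = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # predicted probabilities per example

# Average negative log likelihood: -(1/N) sum_n log q_n(y_n)
nll = -np.mean(np.log(q[np.arange(N), y]))

# Cross entropy against one-hot targets: -(1/N) sum_n sum_k p_{nk} log q_{nk}
p_onehot = np.eye(K)[y]
cross_entropy = -np.mean(np.sum(p_onehot * np.log(q), axis=1))

print(np.isclose(nll, cross_entropy))  # True: the one-hot p_n zeroes out every term except log q_n(y_n)
```

If that reading of the delta function is wrong, that may be exactly where my confusion comes from.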
Questions:
1. The term $\frac{1}{N} \sum_{n=1}^{N} \log q(y_n)$ mentions only one distribution, $q$. How can it be a cross-entropy term when cross entropy is defined for two distributions $p$ and $q$?
2. How does a log-likelihood expression in terms of $N$ training instances ($\frac{1}{N} \sum_{n=1}^{N}$) turn into a cross-entropy expression in terms of $K$ classes ($\sum_{k=1}^{K}$)?
3. Is the author's use of a delta function $\delta$ just another way of saying a one-hot distribution?
I'm still confused even after reading other posts like this one, this one, and this one.
A related comment captures the same confusion:

> [...] argmin of the negative log likelihood (NLL), which is equivalent to minimizing the cross-entropy loss. But I find there's a discrepancy with a $\frac{1}{N}$ term, where $N$ is the number of training instances. NLL is defined as $-\sum_{i=1}^{N} \log P(y_i | x_i)$, but the cross-entropy loss is defined as $- \frac{1}{N} \sum_{i=1}^{N} \log q_i = - \frac{1}{N} \sum_{i=1}^{N} \log P(y_i | x_i)$. Why does cross-entropy loss have that $\frac{1}{N}$ term (for computing the mean), while NLL does not? How can they be equivalent? – stackoverflowuser2010 May 25 '21 at 21:05
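If I just write both expressions out with made-up per-example probabilities, the only difference I can see is the constant factor $N$, but I'd like to confirm that this is all there is to it:

```python
import numpy as np

# Made-up per-example probabilities P(y_i | x_i) assigned by some model.
probs = np.array([0.9, 0.6, 0.4, 0.8])
N = len(probs)

nll_sum = -np.sum(np.log(probs))    # -sum_i log P(y_i | x_i)         (no 1/N)
ce_mean = -np.mean(np.log(probs))   # -(1/N) sum_i log P(y_i | x_i)   (with 1/N)

print(np.isclose(nll_sum / N, ce_mean))  # True: they differ only by the constant factor N
```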