I'm not totally sure what you mean by "its value is itself a distribution," so let me say a few things and see if they help; feel free to ask more questions if not.
The network is predicting a discrete distribution over the three entries. Letting the predictive label be the random variable $\hat Y$ and naming its possible values $a$, $b$, and $c$, it says that $\Pr(\hat Y = a) = 0.4$, $\Pr(\hat Y = b) = 0.1$, and $\Pr(\hat Y = c) = 0.5$. Note that $\hat Y$ is a function of the network's parameters $\theta$ and the feature vector $x$: we can write it as $\hat Y_\theta(x)$ to denote its dependence on $\theta$ and $x$.
Now, we want to see if that predicted distribution is any good. We only have one data point to evaluate this with: the true observed value $y$, which in this case was observed as $a$. Taking a maximum-likelihood approach, we choose to evaluate the quality of a network $\theta$ by its likelihood: the probability of $a$ under the predictive distribution $\Pr(\hat Y_\theta(x) = \cdot)$, which we can evaluate as $\Pr(\hat Y_\theta(x) = a) = 0.4$. (If the labels were continuous, we'd use the probability density instead.)
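Concretely, evaluating the likelihood of one observation is just looking up the probability the network assigned to the true label. A minimal sketch, using the numbers from the example above (the class names `a`, `b`, `c` are placeholders):

```python
# Predicted distribution Pr(Y_hat = k) for each class k, from the example.
predicted = {"a": 0.4, "b": 0.1, "c": 0.5}
y = "a"  # the observed true label

likelihood = predicted[y]  # Pr(Y_hat = a) = 0.4
print(likelihood)          # 0.4
```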
Now, the network actually predicts one of these distributions for each of the possible inputs $x_i$; our measure of the overall quality of the network as a predictor is the product of the likelihoods across the data samples (equivalently, the sum of their logs). Because we assume the samples are iid, we get
$$
\log \Pr\left( \big( \hat Y_\theta(x^{(i)}) \big)_{i=1}^n = \big( y^{(i)} \big)_{i=1}^n \right)
= \log \prod_{i=1}^n \Pr\left( \hat Y_\theta(x^{(i)}) = y^{(i)} \right)
= \sum_{i=1}^n \log \Pr\left( \hat Y_\theta(x^{(i)}) = y^{(i)} \right)
.$$
The log-likelihood of a parameter value $\theta$ under the data $\{(x^{(i)}, y^{(i)}) \}_{i=1}^n$ is then
$$
\ell(\theta) = \sum_{i=1}^n \log \Pr(\hat Y_\theta(x^{(i)}) = y^{(i)})
,$$
which is what we want to maximize.
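In code, that sum of log-probabilities is straightforward. A hedged sketch, where `predictions` stands in for the network's outputs $\Pr(\hat Y_\theta(x^{(i)}) = \cdot)$ on two hypothetical samples:

```python
import math

# Hypothetical predicted distributions for two samples, and the observed labels.
predictions = [
    {"a": 0.4, "b": 0.1, "c": 0.5},
    {"a": 0.2, "b": 0.7, "c": 0.1},
]
labels = ["a", "b"]  # observed y^(i)

# ell(theta) = sum_i log Pr(Y_hat_theta(x^(i)) = y^(i))
log_likelihood = sum(math.log(p[y]) for p, y in zip(predictions, labels))
```

Minimizing the negative of this quantity is exactly the familiar cross-entropy loss.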
Compare to the case of finding the maximum-likelihood estimator for a series of biased coin flips. There the model is $\mathrm{Bernoulli}(\theta)$, i.e. $\Pr(\hat Y_\theta = H) = \theta$, $\Pr(\hat Y_\theta = T) = 1 - \theta$. The log-likelihood is
$$\sum_{i=1}^n \begin{cases}\log(\theta) & y^{(i)} = H \\ \log(1 - \theta) & y^{(i)} = T\end{cases},$$
and we can estimate $\theta$ by maximizing $\ell(\theta)$ given the $y^{(i)}$.
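To make the coin-flip case concrete, here's a small sketch (with a made-up sample of flips) showing that maximizing the log-likelihood above, even by brute-force grid search, recovers the familiar closed-form estimate $\hat\theta = \#H / n$:

```python
import math

flips = ["H", "H", "T", "H", "T"]  # hypothetical observed flips

def log_lik(theta):
    # Log-likelihood from the piecewise formula above.
    return sum(math.log(theta) if y == "H" else math.log(1 - theta) for y in flips)

# Grid search over (0, 1); the maximizer matches #H / n = 3/5.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)
print(theta_hat)  # 0.6
```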
The only difference in our case is that there are also feature vectors $x^{(i)}$, and we're maximizing the likelihood conditional on the $x^{(i)}$.
> −log(predicted probability for true class) – Sam Aug 30 '17 at 17:19

> In this context, the outcome is x = (0.4, 0.1, 0.5), isn't it? So we're effectively looking for "the probability of (0.4, 0.1, 0.5) (given the parameters)", aren't we? (Equivalently, the likelihood of those params given that x.) If we were looking for the probability of (1.0, 0.0, 0.0) (which can be interpreted as "getting outcome 1") given that the correct distribution is (0.4, 0.1, 0.5), then I'd understand. – monk Aug 30 '17 at 18:00