I'm not totally sure what you mean by "its value is itself a distribution," so let me say a few things and see if they help; feel free to ask more questions if not.
The network is predicting a discrete distribution over the three entries. Letting the predictive label be the random variable $\hat Y$ and naming its possible values $a$, $b$, and $c$, it says that $\Pr(\hat Y = a) = 0.4$, $\Pr(\hat Y = b) = 0.1$, and $\Pr(\hat Y = c) = 0.5$. Note that $\hat Y$ is a function of the network's parameters $\theta$ and the feature vector $x$: we can write it as $\hat Y_\theta(x)$ to denote its dependence on $\theta$ and $x$.
Now, we want to see if that predicted distribution is any good. We only have one data point to evaluate this with: the true observed value $y$, which in this case was observed as $a$. Taking a maximum-likelihood approach, we choose to evaluate the quality of a network $\theta$ by its likelihood: the probability of $a$ under the predictive distribution $\Pr(\hat Y_\theta(x) = \cdot)$, which we can evaluate as $\Pr(\hat Y_\theta(x) = a) = 0.4$. (If the labels were continuous, we'd use the probability density instead.)
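Concretely, evaluating the likelihood of one observation is just looking up the probability the network assigned to the true label. A minimal sketch, using the numbers from the example above (the class names `a`, `b`, `c` are placeholders):

```python
# Predicted distribution Pr(Y_hat = k) for each class k, from the example.
predicted = {"a": 0.4, "b": 0.1, "c": 0.5}
y = "a"  # the observed true label

likelihood = predicted[y]  # Pr(Y_hat = a) = 0.4
print(likelihood)          # 0.4
```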
Now, the network actually predicts one of these distributions for each of the possible inputs $x_i$; our measure of the overall quality of the network as a predictor is the product of the likelihoods across the data samples (equivalently, the sum of their logs). Because we assume the samples are iid, we get
$$
\log \Pr\left( \big( \hat Y_\theta(x^{(i)}) \big)_{i=1}^n = \big( y^{(i)} \big)_{i=1}^n \right)
= \log \prod_{i=1}^n \Pr\left( \hat Y_\theta(x^{(i)}) = y^{(i)} \right)
= \sum_{i=1}^n \log \Pr\left( \hat Y_\theta(x^{(i)}) = y^{(i)} \right)
.$$
The log-likelihood of a parameter value $\theta$ under the data $\{(x^{(i)}, y^{(i)}) \}_{i=1}^n$ is then
$$
\ell(\theta) = \sum_{i=1}^n \log \Pr(\hat Y_\theta(x^{(i)}) = y^{(i)})
,$$
which is what we want to maximize.
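In code, that sum of log-probabilities is straightforward. A hedged sketch, where `predictions` stands in for the network's outputs $\Pr(\hat Y_\theta(x^{(i)}) = \cdot)$ on two hypothetical samples:

```python
import math

# Hypothetical predicted distributions for two samples, and the observed labels.
predictions = [
    {"a": 0.4, "b": 0.1, "c": 0.5},
    {"a": 0.2, "b": 0.7, "c": 0.1},
]
labels = ["a", "b"]  # observed y^(i)

# ell(theta) = sum_i log Pr(Y_hat_theta(x^(i)) = y^(i))
log_likelihood = sum(math.log(p[y]) for p, y in zip(predictions, labels))
```

Minimizing the negative of this quantity is exactly the familiar cross-entropy loss.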
Compare to the case of finding the maximum-likelihood estimator for a series of biased coin flips. There the model is $\mathrm{Bernoulli}(\theta)$, i.e. $\Pr(\hat Y_\theta = H) = \theta$, $\Pr(\hat Y_\theta = T) = 1 - \theta$. The log-likelihood is
$$\sum_{i=1}^n \begin{cases}\log(\theta) & y^{(i)} = H \\ \log(1 - \theta) & y^{(i)} = T\end{cases},$$
and we can estimate $\theta$ by maximizing $\ell(\theta)$ given the $y^{(i)}$.
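To make the coin-flip case concrete, here's a small sketch (with a made-up sample of flips) showing that maximizing the log-likelihood above, even by brute-force grid search, recovers the familiar closed-form estimate $\hat\theta = \#H / n$:

```python
import math

flips = ["H", "H", "T", "H", "T"]  # hypothetical observed flips

def log_lik(theta):
    # Log-likelihood from the piecewise formula above.
    return sum(math.log(theta) if y == "H" else math.log(1 - theta) for y in flips)

# Grid search over (0, 1); the maximizer matches #H / n = 3/5.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)
print(theta_hat)  # 0.6
```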
The only difference in our case is that there are also feature vectors $x^{(i)}$, and we're maximizing the likelihood conditional on the $x^{(i)}$.
> −log(predicted probability for true class) – Sam Aug 30 '17 at 17:19

> In this context, the outcome is x = (0.4, 0.1, 0.5), isn't it? So we're effectively looking for "the probability of (0.4, 0.1, 0.5) (given the parameters)", aren't we? (Equivalently, the likelihood of those params given that x.) If we were looking for the probability of (1.0, 0.0, 0.0) (which can be interpreted as "getting outcome 1") given that the correct distribution is (0.4, 0.1, 0.5), then I'd understand. – monk Aug 30 '17 at 18:00