Commonly, in statistics, we make a strong distributional assumption about where each data point comes from: we say that $P(y_i|\theta) = f(y_i|\theta)$ for some density $f$. If we also assume that the data are independent, then the joint probability of all our data is the product of the individual densities: $P(\{y_1, \ldots, y_N\}|\theta) = \prod_{i=1}^N f(y_i|\theta)$.
When it comes time to find out what the parameters $\theta$ are, a common approach is maximum likelihood: we simply maximize $P(\{y_1, \ldots, y_N\}|\theta)$, now viewed as a function of $\theta$:
$$
\hat\theta = \underset{\theta}{\textrm{argmax}} \prod_{i=1}^N f(y_i|\theta)
$$
So this is the first part of an answer to your question: if we are set on doing maximum likelihood, the product comes straight from our assumption of independence and basic probability rules.
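To make this concrete, here's a minimal R sketch (with a made-up data vector y and a normal model with known standard deviation, purely for illustration) of maximizing the likelihood numerically; in practice this is usually done on the log scale:

y <- c(4.8, 5.1, 5.3, 4.9, 5.0)   # hypothetical data, just for illustration
# log-likelihood: the log of the product is the sum of the log-densities
log_lik <- function(theta) sum(dnorm(y, mean = theta, sd = 1, log = TRUE))
# maximize numerically over a plausible range; the analytic answer here is mean(y)
optimize(log_lik, interval = c(0, 10), maximum = TRUE)$maximum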
But you ask: is it possible to consider the sum of the probabilities instead? This would involve abandoning the idea of maximum likelihood (but it could probably fall under the umbrella of M-Estimation):
$$
\hat\theta = \underset{\theta}{\textrm{argmax}} \sum_{i=1}^N f(y_i|\theta)
$$
One downside is that it's not as well motivated as the product formulation: the product is the probability of the entire sample, while the sum is just some function of the individual probabilities. Nevertheless, this objective is still monotonically increasing in each individual probability, so it's not a priori a ridiculous thing to try, and I imagine there may be circumstances in which it does reasonable things.
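Continuing the same made-up sketch from above (same y, same normal model with known standard deviation), the sum-based estimator would be computed like this:

# sum of the individual densities rather than their product
sum_crit <- function(theta) sum(dnorm(y, mean = theta, sd = 1))
optimize(sum_crit, interval = c(0, 10), maximum = TRUE)$maximum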
However, in the most common circumstances it's going to have inferior properties to the product formulation. Let's do a simple example. Imagine that we have some data and we want to fit a normal distribution to it with a fixed, known standard deviation, say $\sigma = 0.1$, so the only parameter left to find is the mean $\mu$:
$$
y_i \overset{iid}{\sim} N(\mu, 0.1^2)
$$
We can compute the individual probabilities, and then get the standard (product) likelihood versus the proposed sum functional.
Here's code for that in R:
N <- 10
x <- rnorm(N)       # simulate some data from a standard normal
sigma <- 0.1        # fixed (known) standard deviation
# vector of individual densities f(x_i | mu) for a given mu
indiv_lik <- function(mu) dnorm(x, mu, sigma)
mu_seq <- seq(-3, 3, length.out = 10000)  # grid of candidate means
par(mfrow = c(2, 1))
# top panel: sum of the individual densities; bottom panel: their product (the likelihood)
plot(mu_seq, sapply(mu_seq, function(mu) sum(indiv_lik(mu))), type = 'l')
plot(mu_seq, sapply(mu_seq, function(mu) prod(indiv_lik(mu))), type = 'l')
This is the plot produced; the top plot shows the sum formulation, the bottom the standard product formulation:

Even though summation and product-taking are conceptually similar, the plots look nothing alike! The sum criterion is highly multimodal and far from concave, while the product criterion is unimodal. And this is just for a normal likelihood with known variance!
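As a quick numerical check (reusing x, sigma, mu_seq, and indiv_lik from the code above), we can compare the grid maximizers of the two criteria; the product criterion lands essentially on the sample mean (the analytic MLE when $\sigma$ is known), while the sum criterion typically latches onto a single observation or a tight cluster of observations:

sum_vals  <- sapply(mu_seq, function(mu) sum(indiv_lik(mu)))
prod_vals <- sapply(mu_seq, function(mu) prod(indiv_lik(mu)))
mu_seq[which.max(prod_vals)]  # essentially mean(x)
mu_seq[which.max(sum_vals)]   # often near one data point rather than the mean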
The reason is that the normal likelihood is log-concave, a property that is preserved under product-taking: the log of the product is a sum of concave functions of $\mu$, which is again concave.
However, the property preserved under summation is not log-concavity but plain-ol' concavity, which the normal density does not possess (it is convex in its tails). The same is true of many other popular distributions: they are much friendlier when combined via a product than via a sum.
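One way to see the log-concavity point with the same simulated x (assuming the objects from the earlier code are still around): the log of the product criterion is a sum of terms that are each concave in $\mu$, so the whole curve is concave and has a single peak:

# log of the product criterion = sum of log-densities, each concave in mu
log_prod <- sapply(mu_seq, function(mu) sum(dnorm(x, mu, sigma, log = TRUE)))
plot(mu_seq, log_prod, type = 'l')  # a downward parabola in mu: concave, hence unimodal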
Now, all this being said, could I rule out some circumstance where summing the probabilities leads to better inference than taking their product? I certainly don't know of a proof ruling that out. But in most common situations, it will be inferior.