Classification problems have the benefit of being discrete, so it's easy to calculate how much information your model captures. For example, in LLMs your training data has inputs $\mathbf{x}_i$ (a list of previous tokens) and a distribution $\mathbf{p}_i$ over the next token. The entropy is $$H = -\sum_i \mathbf{p}_i^T\ln \mathbf{p}_i.$$ If you want to compress text using some next-token predictor (say a model $f$), then you need to minimize $$\mathbb{E}_p[-\ln f(\mathbf{x})] = -\sum_i \mathbf{p}_i^T\ln f(\mathbf{x}_i),$$ which is just the cross-entropy loss. Thus, minimizing cross-entropy loss theoretically leads to an optimal compression scheme[1]. However, it's less straightforward when the output is continuous.
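To make the compression connection concrete, here is a toy sketch (the distributions are made up for illustration): with an entropy coder driven by a model $f$, a token drawn from $\mathbf{p}$ costs about $-\log_2 f(\text{token})$ bits, so the expected bits per token is exactly the cross-entropy.

```python
import numpy as np

# Toy next-token distributions (made up for illustration).
p = np.array([0.70, 0.20, 0.10])   # "true" distribution of the next token
f = np.array([0.60, 0.25, 0.15])   # the model's predicted distribution

entropy       = -np.sum(p * np.log2(p))   # best achievable bits/token
cross_entropy = -np.sum(p * np.log2(f))   # bits/token the model's code achieves

print(f"H(p)    = {entropy:.4f} bits/token")
print(f"H(p, f) = {cross_entropy:.4f} bits/token (>= H(p); the gap is KL(p || f))")
```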
People commonly use mean-squared error, as this maximizes the likelihood if you assume the error is normally distributed (see here). This assumption is almost never true; the error distribution of a quick autoencoder I spun up, for example, was nowhere near Gaussian.
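Here is a minimal sketch of why MSE is "Gaussian maximum likelihood" and how to check the assumption; the data and the Laplace-noise "reconstruction errors" below are synthetic stand-ins, not an actual autoencoder:

```python
import numpy as np
from scipy.stats import kurtosis

# Synthetic stand-in for reconstruction errors (deliberately non-Gaussian).
rng = np.random.default_rng(0)
x     = rng.uniform(size=10_000)
x_hat = x + rng.laplace(scale=0.05, size=x.shape)

# Under x = x_hat + N(0, sigma^2) noise, the per-sample negative log-likelihood
# is just the MSE rescaled plus a constant -- that is the only sense in which
# MSE "is" maximum likelihood.
sigma = 1.0
mse       = np.mean((x - x_hat) ** 2)
gauss_nll = 0.5 * mse / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)
print(f"MSE = {mse:.5f}, Gaussian NLL = {gauss_nll:.5f}")

# Quick check of the Gaussian assumption: excess kurtosis is 0 for a Gaussian,
# ~3 for the Laplace errors above.
print("excess kurtosis of errors:", kurtosis(x - x_hat))
```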
This mismatch leads to issues like blurry images. A hacky fix is to add an adversarial loss term (e.g. AEGAN), but that isn't very theoretically grounded.
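A rough sketch of what that hack looks like in practice; `encoder`, `decoder`, and `discriminator` are hypothetical modules, and the adversarial term here is a WGAN-style critic score rather than the exact AEGAN formulation:

```python
import torch
import torch.nn.functional as F

def generator_step(encoder, decoder, discriminator, x, adv_weight=0.1):
    # Pixel-space reconstruction term: on its own this produces blurry outputs,
    # since it regresses toward the conditional mean.
    z     = encoder(x)
    x_hat = decoder(z)
    recon_loss = F.mse_loss(x_hat, x)

    # Adversarial term: push reconstructions toward the data manifold by
    # maximizing the critic's score on them (WGAN-style).
    adv_loss = -discriminator(x_hat).mean()

    return recon_loss + adv_weight * adv_loss
```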
Another approach is to assume the empirical distribution belongs to a nice class of distributions where it's easy to calculate the log-likelihood. This is the idea of normalizing flows: you deform a normal distribution and use the Jacobian of the deformation to calculate the change in probability density at your data points[3].
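A minimal sketch of the change-of-variables rule that flows are built on, using a single affine map $x = az + b$ as a stand-in for a learned deformation (real flows stack many invertible layers and learn their parameters):

```python
import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0   # parameters of the toy deformation x = a*z + b

def log_prob_x(x):
    z = (x - b) / a                      # invert the deformation
    log_det_jacobian = -np.log(abs(a))   # dz/dx = 1/a in this scalar case
    return norm.logpdf(z) + log_det_jacobian

# Training a flow amounts to maximizing this exact log-likelihood on the data.
data = np.array([0.3, 1.7, 2.5])
print(log_prob_x(data))
```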
However, most distributions are not deformations of normal ones! If one were, every moment would have to be well-defined[4], but that's rarely the case; the Cauchy distribution, for example, only has a well-defined zeroth moment.
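A quick numerical illustration of the Cauchy claim: because even the mean is undefined, running sample means never settle down no matter how much data you draw.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10**3, 10**5, 10**7):
    # For a distribution with a finite mean this would converge; for Cauchy it jumps around.
    print(n, np.mean(rng.standard_cauchy(n)))
```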
A third approach is to model the gradient of the log-likelihood, such as with denoising models. Or, very similarly, model the direction to the maximum likelihood with consistency models. Two issues with this (a sketch of the denoising objective follows the list):
- It still uses mean-squared error. Decoding through an autoencoder can make images appear sharp, but the underlying latents will not be; look at the fur in Google's video diffusion model.
- Like normalizing flows, denoisers assume a gradient exists. Worse, because neural networks are smooth, every term in the moment generating function needs to be well-defined almost everywhere. But images often have discrete pieces, like the number of fingers! This problem gets worse when using autoencoders: hands are not very prominent in most images, so they get little weight in the latent space.
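Here is the sketch referred to above: a minimal denoising score matching objective, where `score_net` is a hypothetical network taking a noisy sample and a noise level. Note that the objective is still a mean-squared regression, which is the first issue above.

```python
import torch

def dsm_loss(score_net, x, sigma=0.1):
    # Corrupt the data with Gaussian noise.
    noise   = torch.randn_like(x)
    x_noisy = x + sigma * noise

    # The score of the corruption kernel N(x_noisy; x, sigma^2 I) is
    # -(x_noisy - x) / sigma^2 = -noise / sigma, so that is the regression target.
    target = -noise / sigma
    pred   = score_net(x_noisy, sigma)

    return ((pred - target) ** 2).mean()
```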
All of these models have the same problem: they're making assumptions about the training distribution that simply aren't true. The error will almost never be normally distributed, so mean-squared error will almost always fail, and most distributions are more of a mess than some nice "analytic" function.
On the other hand, neural networks and cross-entropy loss work so well because they make no assumptions! Neural networks are universal function approximators, and cross-entropy loss is a universal measure of information.
What is a universal loss for continuous, potentially "messy" distributions?
[1] If you care about the size of the model, you have to use the Akaike information criterion. You can mostly ignore this if you:
- Fix the number of parameters.
- Have enough degrees of freedom, or train the model stochastically and for long enough: the first keeps it from getting trapped in local minima, while the latter makes lower losses exponentially more likely[2], so minimizing the loss is equivalent to minimizing the negative log-likelihood.
[2] I'm unaware of a proof of this, but in game theory there is the little-known Ellison's lemma, which captures the essential idea.
[3] If you use Neural ODEs, i.e. a continuous deformation, this actually becomes pretty easy to calculate: it involves integrating a trace (see here), which can be quickly estimated with the Hutch++ algorithm.
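For reference, a minimal sketch of the plain Hutchinson estimator that Hutch++ builds on (Hutch++ adds a low-rank sketching step to reduce variance): the Jacobian trace is $\mathbb{E}_v[v^T J v]$ for random $v$ with identity covariance, and each $v^T J v$ needs only one vector-Jacobian product.

```python
import torch

def hutchinson_trace(f, z, n_samples=10):
    # Estimate tr(df/dz) at z without materializing the Jacobian.
    z  = z.requires_grad_(True)
    fz = f(z)
    est = 0.0
    for _ in range(n_samples):
        v = torch.randn_like(z)  # Rademacher vectors also work and have lower variance
        (vjp,) = torch.autograd.grad(fz, z, v, retain_graph=True)  # v^T J
        est = est + (vjp * v).sum()                                # v^T J v
    return est / n_samples

# Sanity check on a linear map f(z) = A z, whose Jacobian trace is tr(A).
A = torch.randn(5, 5)
f = lambda z: z @ A.T
print(hutchinson_trace(f, torch.randn(5), n_samples=2000), torch.trace(A))
```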
[4] Some people use splines for their deformations, so more accurately I should say: a well-defined moment generating function almost everywhere.
