
Classification problems have the benefit of being discrete, so it's easy to calculate how much information your model has. For example, in LLMs your training data has inputs $\mathbf{x}_i$ (a list of previous tokens) and some distribution $\mathbf{p}_i$ over the next token. The entropy is $$H = -\sum_i \mathbf{p}_i^T\ln \mathbf{p}_i.$$ If you want to compress text using some next-token predictor (say a model $f$), then you need to minimize $$\mathbb{E}_p[-\ln f(x)] = -\sum_i \mathbf{p}_i^T\ln f(\mathbf{x}_i),$$ which is just the cross-entropy loss. Thus, minimizing cross-entropy loss theoretically leads to an optimal compression scheme[1]. However, it's less straightforward when the output is continuous.
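To make the compression reading concrete, here is a minimal numpy sketch (the five-token vocabulary and both distributions are made up purely for illustration): the cross-entropy is the expected code length per token an entropy coder built from $f$ would pay, and the gap to the entropy is the wasted code length.

```python
import numpy as np

# Toy five-token vocabulary: p is the "true" next-token distribution, f is a
# model's prediction (both made up for illustration).
p = np.array([0.50, 0.20, 0.15, 0.10, 0.05])
f = np.array([0.40, 0.25, 0.15, 0.12, 0.08])

entropy = -np.sum(p * np.log(p))         # H(p): best achievable nats per token
cross_entropy = -np.sum(p * np.log(f))   # E_p[-ln f]: nats per token the model pays

print(f"entropy       = {entropy:.4f} nats/token")
print(f"cross-entropy = {cross_entropy:.4f} nats/token")
print(f"wasted length = {cross_entropy - entropy:.4f} nats/token (the KL divergence)")
```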

People commonly use the mean-squared error, as it maximizes the likelihood if you assume the error is normally distributed (see here). This assumption is almost never true; for example, here is the reconstruction-error distribution of a quick autoencoder I spun up on MNIST:

[Figure: MNIST autoencoder error distribution]
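A rough sketch of how one might reproduce this kind of check, with `autoencoder` and `images` as placeholders for whatever model and data you have (not the exact code I used): it makes explicit that the Gaussian negative log-likelihood is just a rescaled squared error, and then compares the empirical error histogram to the best-fit Gaussian.

```python
import numpy as np
import matplotlib.pyplot as plt

# `autoencoder(x)` and `images` are placeholders for your reconstruction model
# and dataset; this is a sketch of the check, not a specific implementation.
def reconstruction_errors(autoencoder, images):
    return (autoencoder(images) - images).ravel()

# Under a Gaussian error model with fixed sigma, the negative log-likelihood is
#   -ln N(r; 0, sigma^2) = r^2 / (2 sigma^2) + const,
# so minimizing it is exactly minimizing mean-squared error.
def gaussian_nll(r, sigma):
    return 0.5 * (r / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2)

def plot_error_distribution(errors, bins=200):
    plt.hist(errors, bins=bins, density=True, label="empirical errors")
    xs = np.linspace(errors.min(), errors.max(), 500)
    plt.plot(xs, np.exp(-gaussian_nll(xs, errors.std())), label="best-fit Gaussian")
    plt.legend()
    plt.show()
```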

This mismatch leads to issues like blurry images. A hacky solution is to add an adversarial loss term (e.g. AEGAN), but that isn't very theoretically grounded.

Another approach is to assume the empirical distribution belongs to a nice class of distributions where it's easy to calculate the log-likelihood. This is the idea of normalizing flows: you deform a normal distribution, and use the Jacobian to calculate the change in probability density at your data points[3].
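A minimal 1-D sketch of that change-of-variables bookkeeping, using only an affine deformation for illustration (real flows use invertible networks, where the log-determinant of the Jacobian is the expensive part):

```python
import numpy as np

# Deform a standard normal z ~ N(0,1) through x = g(z) = a*z + b. By the change
# of variables formula,
#   log p_x(x) = log p_z(g^{-1}(x)) + log |d g^{-1}/dx| = log N((x-b)/a) - log|a|.
def flow_log_likelihood(x, a, b):
    z = (x - b) / a                                   # invert the deformation
    log_pz = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)    # standard-normal log density
    log_det = -np.log(np.abs(a))                      # log |Jacobian| of the inverse
    return log_pz + log_det

# Training a flow means maximizing this over the deformation's parameters.
data = np.random.default_rng(0).normal(3.0, 2.0, size=1000)
print(flow_log_likelihood(data, a=2.0, b=3.0).mean())  # average log-likelihood at the true fit
```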

However, most distributions are not deformations of normal ones! If one were, every moment would have to be well-defined[4], but that's rarely the case, e.g. the Cauchy distribution only has a well-defined zeroth moment.
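A quick numerical illustration of the Cauchy example (toy numpy demo, nothing model-specific): the running mean of i.i.d. Cauchy samples never settles down, while the normal running mean converges by the law of large numbers.

```python
import numpy as np

# The running average of Cauchy samples is itself Cauchy, so it never converges;
# compare with normal samples, whose running average does.
rng = np.random.default_rng(0)
n = 10**6
cauchy_running_mean = np.cumsum(rng.standard_cauchy(n)) / np.arange(1, n + 1)
normal_running_mean = np.cumsum(rng.standard_normal(n)) / np.arange(1, n + 1)

for k in (10**2, 10**4, 10**6):
    print(f"n={k:>7}: cauchy mean = {cauchy_running_mean[k-1]:+8.3f}, "
          f"normal mean = {normal_running_mean[k-1]:+8.3f}")
```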

A third approach is to model the gradient of the log-likelihood (the score), as denoising models do, or, very similarly, to model the direction toward maximum likelihood, as consistency models do (a minimal sketch of the denoising objective follows the list below). Two issues with this:

  • It uses mean-squared error. Using an autoencoder can make images appear sharp, but their underlying latents will not be; look at the fur in Google's video diffusion model.
  • Like with normalizing flows, denoisers assume a gradient exists. Worse, because neural networks are smooth, every term in the moment generating function needs to be well-defined almost everywhere. But images often have discrete pieces like the number of fingers! This problem gets worse when using autoencoders, as hands are not very prominent in most images so they get little weight in the latent space.
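Here is the minimal denoising sketch referenced above, with a made-up two-layer MLP, an arbitrary noise level, and a toy 2-D batch (real diffusion models condition on the noise level and are far larger); note that the objective is still a squared error.

```python
import torch
import torch.nn as nn

# Hedged sketch of denoising score matching; the network, sigma, and data
# batch below are illustrative placeholders, not any particular model.
score_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)
sigma = 0.1

def dsm_loss(x):
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise
    # For Gaussian corruption, the regression target is the score of the
    # corruption kernel: grad log q(x_noisy | x) = -(x_noisy - x) / sigma^2.
    target = -noise / sigma
    return ((score_net(x_noisy) - target) ** 2).mean()  # note: still a squared error

x = torch.randn(256, 2)            # placeholder batch; substitute real data
loss = dsm_loss(x)
opt.zero_grad(); loss.backward(); opt.step()
```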

All of these models have the same problem: they're making assumptions about the training distribution that simply aren't true. The error will almost never be normally distributed, so mean-squared error will almost always fail, and most distributions are more of a mess than some nice "analytic" function.

On the other hand, neural networks and cross-entropy loss work so well because they make no assumptions! Neural networks are universal function approximators, and cross-entropy loss is a universal measure of information.

What is a universal loss for continuous, potentially "messy" distributions?


[1] If you care about the size of the model, you have to use the Akaike information criterion. You can mostly ignore this if you:

  • Fix the number of parameters.
  • Have enough degrees of freedom, or train the model stochastically and long enough: the former keeps it from getting trapped in local minima, while the latter makes lower losses exponentially more likely[2], so minimizing loss is equivalent to minimizing negative log-likelihood.

[2] I'm unaware of a proof of this, but in game theory there is the little-known Ellison's lemma, which captures the essential idea.

[3] If you use Neural ODEs, i.e. a continuous deformation, this actually becomes pretty easy to calculate: it involves integrating a trace (see here), which can be quickly estimated with the Hutch++ algorithm.
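For concreteness, a sketch of the plain Hutchinson estimator that Hutch++ refines (the linear map below is just a toy for checking the result; this is not the Hutch++ algorithm itself):

```python
import torch

# Plain Hutchinson trace estimator: tr(J) = E_v[v^T J v] for Rademacher v, and
# v^T J is a single vector-Jacobian product, so the Jacobian is never formed.
def hutchinson_trace(f, x, n_samples=100):
    est = torch.zeros(())
    for _ in range(n_samples):
        v = torch.randint(0, 2, x.shape).to(x.dtype) * 2 - 1   # Rademacher noise
        x_req = x.detach().requires_grad_(True)
        (vjp,) = torch.autograd.grad(f(x_req), x_req, grad_outputs=v)
        est = est + (vjp * v).sum() / n_samples
    return est

# Toy check: for a linear map f(x) = A x the Jacobian is A itself.
A = torch.randn(5, 5)
estimate = hutchinson_trace(lambda x: x @ A.T, torch.randn(5), n_samples=1000)
print(estimate.item(), torch.trace(A).item())   # should roughly agree
```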

[4] Some people use splines for their deformations, so more accurately I should say a well-defined moment generating function almost everywhere.

  • The Cauchy loss might work better than MSE, but it still assumes a Cauchy distribution for the error. If you try to automatically match the loss function to a distribution, you're actually just doing normalizing flows (though perhaps towards some other class of distributions). – programjames Dec 31 '23 at 18:52
  • I think shortening and adding precise definitions, e.g. for 'universal loss' and 'compression scheme', would improve the question a lot. – picky_porpoise Jan 01 '24 at 13:34
  • Minimizing cross-entropy loss for continuous functions can be considered as minimizing log likelihood with binned data, making the bins so small that each bin contains only a single case. For all $i$ you get $p_i = 1/n$ and the sum becomes $$ -\frac{1}{n} \sum \ln f(\mathbf{x}_i) \cdot w$$ where $w$ is the binsize and can be ignored as it is just a scale factor. – Sextus Empiricus Jan 01 '24 at 18:02
  • Could this question be summarized as “how do we do maximum likelihood estimation when, unlike with a categorical outcome, we don’t know the likelihood?” – Dave Jan 01 '24 at 18:06
  • Hmm, that does seem to be the problem. In which case this is an impossible task. – programjames Jan 01 '24 at 18:46
  • You can still be smart about how you estimate when you don’t know the likelihood. An econometrician explained generalized method of moments estimation as being the next-best option, only lacking the efficiency of MLE if you knew the likelihood (which you typically don’t). Frank Harrell would argue for proportional odds modeling when you don’t know the likelihood. You can still be smart with the statistics when you don’t know the likelihood. – Dave Jan 01 '24 at 20:26
  • I think this question could be reopened if rephrased as something like what I wrote above: How do we do maximum likelihood estimation when, unlike with a categorical outcome, we don’t know the likelihood? Perhaps tack on something like: If this is impossible/ridiculous, what are viable approaches to estimation? This could lead to an answer that expands on my comment above about generalized method of moments estimation and proportional odds modeling. $//$ I'm voting to reopen so I can post my comment as an answer. Perhaps someone can post a better answer that elaborates. – Dave Jan 05 '24 at 14:12

0 Answers