12

While reading about likelihood, I have heard that "the exact value of any likelihood is meaningless." Why?

Because of that, apparently, we may use the likelihood ratio instead.

So my question is: why is the exact value of the likelihood meaningless? And what is the benefit of the likelihood ratio over the likelihood itself?

Alice
  • 640

5 Answers

20

It's “meaningless” in the sense that it's very hard to interpret; it's just “the bigger, the better”. That is because the likelihood is not a probability and it is calculated without the normalizing constant, so its numerical value can be any non-negative number. You can still maximize the likelihood, because the lack of normalization doesn't matter for optimization. See other related questions for more details, such as What is the reason that a likelihood function is not a pdf? or What is the difference between "likelihood" and "probability"?.
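
As a quick numeric illustration (the model and numbers below are my own, picked only for demonstration), a likelihood value from a continuous model is a density and can easily exceed 1, so it cannot be read as a probability:

```python
from scipy import stats

# "likelihood" of mu = 0 for a single observation y = 0 under a Normal(mu, sd = 0.01) model;
# the value is about 39.9, which clearly cannot be a probability
print(stats.norm(loc=0, scale=0.01).pdf(0.0))
```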

Tim
  • 138,066
  • 3
    The likelihood is a probability (or probability density) in the data space (not the model-parameter space). Specifically, it is the probability of the data given the model parameters. The integral of the likelihood over all possible data comes to 1. In MCMC, one often works with an un-normalized version of the likelihood, and that un-normalized version is, of course, not a proper probability. – apdnu Jun 12 '22 at 12:45
  • 5
    @apdnu it's not, it's more complicated, see https://stats.stackexchange.com/q/2641/35989 Also, you seem to assume a Bayesian perspective, whereas the term also exists outside it. – Tim Jun 12 '22 at 13:20
  • 1
    The likelihood is literally defined as the probability of the data, given a model. It's a probability in data-space, integrates to one in that space, is non-negative, etc. – apdnu Jun 13 '22 at 11:46
  • @apdnu it doesn't have to be normalized, nor does it have to integrate to one; in a non-Bayesian setting it is not a conditional probability, since the parameters are not random variables. You are assuming a Bayesian likelihood, not the general concept. – Tim Jun 13 '22 at 11:53
  • 1
    It sounds like your objections are more philosophical (Frequentist vs. Bayesian) than practical. Likelihood functions exactly like a probability (or probability density) in the data-space, and that is a perfectly valid interpretation of what it is. Indeed, that is exactly how it is generally defined, and any other definition is equivalent to this definition under a shift of your philosophical outlook (e.g., from Frequentist to Bayesian). – apdnu Jun 13 '22 at 11:59
  • 1
    @apdnu Likelihood functions have the parameter(s) of interest as the x-axis scale. How is that "in the data-space"? – Michael Lew Jun 13 '22 at 21:18
  • @MichaelLew There are a number of ways of seeing that the likelihood is a probability density in the data space (I'm assuming the data variables are continuous, not discrete). First, the units of the likelihood are 1/(units of data). In other words, it has the units of density in the data-space. Second, if you integrate the likelihood over the data domain, you get one. It behaves exactly like a probability density in the data-space (which is not surprising, given that it's defined as the conditional probability of the data on the model). – apdnu Jun 14 '22 at 12:35
  • @apdnu but the likelihood is used and defined as a function of parameters with fixed data, so "in data space" is an irrelevant angle to consider it. – Tim Jun 14 '22 at 12:52
  • @Tim The likelihood can also be considered as a function of the data, holding the model parameters fixed. Either way of viewing the likelihood is valid. Viewing the likelihood as a function of the data gives extra intuition about what it means (i.e., it's a probability density in data space) that one wouldn't get by considering it as a function of the model parameters alone. I don't understand the resistance to viewing it both ways, especially since it's pedagogically useful and would clear up the questioner's original confusion. – apdnu Jun 14 '22 at 15:08
  • 1
    @apdnu If you are restricting your considerations to continuous data then you are omitting the most common examples used to demonstrate likelihood! And, for likelihoods it is worth bearing in mind that all data are discrete because they cannot be specified with infinite precision. That leads to the numerical value of likelihood being dependent on the number of decimals attached to the data. – Michael Lew Jun 15 '22 at 00:21
  • @apdnu Likelihood has no units. If trying to express it as something within "data space" gives it units then you have found a good reason to avoid such a way of expressing and understanding it! – Michael Lew Jun 15 '22 at 00:22
  • 1
    @MichaelLew If the data consists of continuous variables with units, then the likelihood does indeed have units of 1/(units of the data). If the data is discrete and unitless, then the likelihood is also unitless. The likelihood can be understood as the probability of the data given the model, and I think it's confusing to students to deny that that's the case, or to claim that the likelihood isn't a probability. You should explain this interpretation of the likelihood first, and after that, explain that some people have philosophical objections to it. – apdnu Jun 15 '22 at 06:50
  • @apdnu Probability is dimensionless. – Michael Lew Jun 15 '22 at 07:26
  • 1
    @MichaelLew Probabilities of discrete variables are dimensionless, but probabilities of continuous variables have dimensions of 1/(units of variables). Without this, the normalization condition for probability densities would make no sense, because integrals of the probability density function over the variable's domain would end up with units of (units of variable). In that case, how do you normalize to one? One in which units? – apdnu Jun 15 '22 at 09:00
  • @apdnu Seems unlikely to me. I'll compose a question about it and you can write an extended answer. – Michael Lew Jun 15 '22 at 21:36
  • 1
    I now understand that I was not correct. Question and answer here: https://stats.stackexchange.com/questions/578944/units-for-likelihoods-and-probabilities – Michael Lew Jun 28 '22 at 21:10
9

When we use the likelihood, we are comparing the probability (density) of the data given a certain hypothesis/theory.

The actual probability is not important, and it can become extremely small. Imagine that you are testing whether a coin is fair and you observe 1,000,000 flips with 500,000 heads and 500,000 tails. If the coin is fair (equal probability of heads and tails), then the probability of this particular observation is only 0.0007978844.
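
A rough numeric check of these numbers (assuming scipy; the alternative value p = 0.501 below is an arbitrary comparison point I picked, not part of the original example):

```python
from scipy import stats

n, k = 1_000_000, 500_000
p_fair = stats.binom.pmf(k, n, 0.5)    # ~ 0.0007978844: tiny and hard to interpret on its own
p_alt = stats.binom.pmf(k, n, 0.501)   # likelihood of an arbitrary alternative, p = 0.501
print(p_fair, p_fair / p_alt)          # the ratio (~ 7.4) is what carries the comparison
```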

7

The likelihood function is usually taken to be the PDF viewed as a function of the parameters for known data.

For example, if I have a coin with Heads probability $\theta$ and toss it $n = 10$ times, getting $x = 3$ heads, then I can take the likelihood function to be

${n\choose x}\theta^x(1-\theta)^{n-x},$ considered as a function of $\theta.$

If I want the MLE $\hat \theta$ of $\theta,$ then I might write the likelihood function as $$f(\theta \mid x = 3)\propto \theta^3(1-\theta)^7,$$ where the symbol $\propto$ (read as "proportional to") indicates that the constant ${n\choose x} = {10\choose 3} = 120$ is omitted. The maximum of the likelihood function is at $\hat \theta = x/n = 0.3,$ whether I use or ignore the constant ${10\choose 3}.$

So the values of the likelihood function might be considered less important than its shape, which leads to the MLE $\hat\theta.$

[Figure: plot of the likelihood function $\theta^3(1-\theta)^7$, with its maximum at $\hat\theta = 0.3$.]
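
A small numeric sketch of this point (my own illustration, assuming scipy): maximizing the log-likelihood with or without the constant $\binom{10}{3} = 120$ gives the same $\hat\theta \approx 0.3$.

```python
import numpy as np
from scipy import optimize, special

n, x = 10, 3

def neg_loglik(theta, keep_constant):
    ll = x * np.log(theta) + (n - x) * np.log(1 - theta)
    if keep_constant:
        ll += np.log(special.comb(n, x))  # adding log(120) shifts the curve but not its peak
    return -ll

for keep in (True, False):
    res = optimize.minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6),
                                   args=(keep,), method="bounded")
    print(keep, round(res.x, 4))  # both report theta_hat ~ 0.3
```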

Henry
  • 39,459
BruceET
  • 56,185
  • 2
    It is more than that: If you saw THTTTTTHTH then the probability of that exact result would have been $\theta^3(1-\theta)^7$ while the probability of $3$ heads and $7$ tails in any order would have been ${10 \choose 3}\theta^3(1-\theta)^7$. The likelihood of $\theta$ being say $0.3$ rather than some other value does not change between these two ways of finding the probability so it is reasonable to say each is only proportional to the likelihood. – Henry Jun 12 '22 at 23:48
  • @Henry. Thanks for edit fixing typo. – BruceET Jun 13 '22 at 00:16
4

[Context]

@Henry and @MichaelLew firmly pointed out errors in my original answer, which argued that the statement "the exact value of any likelihood is meaningless" is glib and that you can't prove a logical claim about all likelihoods by providing specific examples where it's safe to compute the likelihood up to a constant (which, admittedly, is often the case).

Since I first posted my answer I've learned that continuous likelihoods have (theoretical) units given by 1/(units of the data) from @apdnu's answer to Units for likelihoods and probabilities.

I've come across examples where the likelihood function should be computed exactly to get the correct answer. These examples teach me to be careful with my likelihood calculations and not to presume that I can safely ignore normalizing constants.

And I've discovered that (a version of) this question has been asked and answered before: What does "likelihood is only defined up to a multiplicative constant of proportionality" mean in practice?

Example #1: Comparing a model with normal errors to a model with Cauchy errors

This example is from Chapter 6 of Y. Pawitan, In All Likelihood: Statistical Modelling and Inference Using Likelihood (2013).

We want to model Y in terms of X; there are a few unusual values in the data (outliers). We propose two models with the same mean structure $E(Y) = \beta_0 + \beta_1 X$ but different error structures: in one model the errors are iid Normal(0, $\sigma^2$); in the other the errors are iid Cauchy(0, $\gamma$). We fit the models by maximizing the likelihoods, and then we use $\text{AIC} = -2\log L + 2k$ (where $k$ is the number of model parameters) to choose the "better" model. Both the Normal density and the Cauchy density have constant terms that are usually safe to ignore: $(2\pi)^{-1/2}$ for the Normal and $\pi^{-1}$ for the Cauchy. These constants are not the same in both models, so no parts of the likelihood functions can be dropped.
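
A rough sketch of such a comparison (simulated data and variable names are my own, assuming numpy/scipy, not Pawitan's code): both fits use the full log-densities, constants included, before computing AIC.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.standard_t(df=1, size=50)  # heavy-tailed errors, a few outliers

def neg_loglik(params, dist):
    b0, b1, log_scale = params
    resid = y - (b0 + b1 * x)
    # full log-density, including the (2*pi)^(-1/2) or pi^(-1) constant
    return -dist(loc=0, scale=np.exp(log_scale)).logpdf(resid).sum()

aic = {}
for name, dist in [("normal", stats.norm), ("cauchy", stats.cauchy)]:
    fit = optimize.minimize(neg_loglik, x0=[0.0, 1.0, 0.0], args=(dist,))
    aic[name] = 2 * fit.fun + 2 * 3  # AIC = -2 log L + 2k, with k = 3 parameters
print(aic)  # the Cauchy-error model should win on these heavy-tailed data
```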

Example #2: Mixture of Bernoullis for latent class analysis

This example is from Chapter 9 of C. M. Bishop. Pattern Recognition and Machine Learning (2006).

We want to model a dataset of binary observations as a mixture of $K$ Bernoulli components with parameters $\{\mu_k\}$ and mixing proportions $\pi_k$. The log likelihood is:

$$ \ln p(\mathbf{X}\mid\boldsymbol{\mu},\boldsymbol{\pi}) = \sum_{n=1}^N\ln\left\{\sum_{k=1}^K\pi_k\, p(\mathbf{x}_n\mid\boldsymbol{\mu}_k)\right\} $$

Since there is a summation inside a logarithm, the math doesn't simplify but the maximum likelihood solution can be found with the EM algorithm.
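
Here is a rough sketch of those EM updates (my own implementation assuming numpy; variable names are illustrative, not Bishop's):

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iter=100, seed=0):
    """EM for a mixture of K multivariate Bernoulli components (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing proportions pi_k
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # Bernoulli parameters mu_k
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] proportional to pi_k * p(x_n | mu_k)
        log_joint = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        r = np.exp(log_joint)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates of mu_k and pi_k
        Nk = r.sum(axis=0)
        mu = np.clip((r.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
        pi = Nk / N
    # the log likelihood itself keeps the sum over k inside the log
    log_joint = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T + np.log(pi)
    loglik = np.logaddexp.reduce(log_joint, axis=1).sum()
    return pi, mu, loglik
```

Calling `em_bernoulli_mixture(X, K=3)` on an N×D binary array returns the fitted mixing proportions, the component parameters, and the full log likelihood (which can then be plugged into AIC/BIC-style comparisons).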

Example #3: Bayesian $t$-test

This example is from Chapter 4 of K. P. Murphy. Machine Learning: A Probabilistic Perspective (2012).

We want to test the hypothesis $\mu > \mu_0$ for some known value of $\mu_0$. The Bayesian analogue of a one-sided t-test computes the posterior probability of this hypothesis by integrating the posterior of $\mu$, which is proportional to the likelihood times the prior:

$$ \begin{aligned} p(\mu>\mu_0|\text{data}) = \int_{\mu_0}^\infty p(\mu|\text{data})d\mu \end{aligned} $$

We can't omit any terms inside the integral, or we won't compute this probability correctly.
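
A small numeric sketch (my own hypothetical data, a flat prior on $\mu$, and a known $\sigma$, assuming numpy/scipy rather than Murphy's exact setup): the unnormalized posterior has to be normalized before the tail integral means anything.

```python
import numpy as np
from scipy import integrate, stats

y = np.array([5.1, 4.8, 5.6, 5.3, 4.9, 5.4])  # hypothetical observations
sigma, mu0 = 0.5, 5.0

def unnorm_posterior(mu):
    # flat prior, so the posterior is proportional to the likelihood of mu
    return np.exp(stats.norm(loc=mu, scale=sigma).logpdf(y).sum())

grid = np.linspace(3.0, 7.0, 2001)
dens = np.array([unnorm_posterior(m) for m in grid])
dens /= integrate.trapezoid(dens, grid)            # normalize so it integrates to 1
p_greater = integrate.trapezoid(dens[grid >= mu0], grid[grid >= mu0])
print(p_greater)                                   # posterior probability that mu > mu0
```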

In summary, there are both theory and examples to illustrate that the exact value of the likelihood function can be meaningful.


[Original answer, with corrections following comments]

The statement "the exact value of any likelihood is meaningless" is abstract and imprecise at the same time. So let's start with the definition of likelihood. In the spirit of this question, the definition isn't mathematically rigorous.

We take a probabilistic model f(x,θ) for data x with parameter θ.

  • As a function of the data x, f(x,θ) is a probability density/mass function. [pdf if x is continuous; pmf if x is discrete.]
  • As a function of the parameter θ, f(x,θ) is the likelihood.

It's true that the likelihood doesn't integrate to 1. Many functions don't, yet we don't conclude that their exact value is meaningless.

  • $\int_x f(x,\theta)\,dx = 1$ [replace the integral with a summation if x is discrete]
  • $\int_\theta f(x,\theta)\,d\theta =$ a constant that depends on the model f and the data x

A common theme running through the answers is that likelihood computations often simplify. The logical argument seems to go something like this: in many computations a term in the likelihood is constant or behaves like a constant so we can simplify the math by dropping that term; ergo the exact value of a likelihood function is meaningless.

However, the likelihood has more uses than maximizing it to find the MLE or performing a likelihood ratio test. And a likelihood term that can be ignored in one computation is important to keep track of in another.

dipetkov
  • 9,805
  • 2
    But clearly you can rescale both likelihoods by the same amount without affecting the Likelihood Ratio or Bayes Factor. Suppose you toss a biased coin three times and want to consider the models of the probability of heads as $\theta_0=\frac15$ or $\theta_1=\frac45$. If you see $HHT$ it does not matter whether you say the likelihood of $\theta$ is then $\theta^2(1-\theta)$ or say ${3\choose 2}\theta^2(1-\theta)$ as you will get a ratio of $4$ in both cases. – Henry Jun 11 '22 at 21:44
  • In your $\theta_0 = 0.1$ and $\theta_1 = 0.12$ case and my $HHT$ example, it still does not matter whether you say the likelihood is $\theta^2(1-\theta)$ or ${3\choose 2}\theta^2(1-\theta)$ as you would get a ratio of $1.408$ either way. – Henry Jun 12 '22 at 00:17
  • You seem to be suggesting that if one model thinks order matters and the other that order does not matter then there is an issue over the likelihood of the latter. I am saying that since the calculations you do for the likelihoods are only meaningful up to proportionality, you can scale them to be on a comparable basis – Henry Jun 12 '22 at 00:17
  • $\theta^2(1-\theta)$ is the probability of Heads then Heads then Tails in that order. ${3\choose 2}\theta^2(1-\theta)$ is the probability of $2$ Heads and $1$ Tails in any order. It would be peculiar if your assessment of the likelihood of a particular value of $\theta$ having observed $HHT$ depends on whether you use the full observation or a sufficient statistic but, since likelihood is relative, you do need to be consistent in your choice – Henry Jun 12 '22 at 00:32
  • Not at all, and I did not say that - it is obviously important in finding a binomial probability and is not meaningless. I am saying it can be used or not in the calculation of a likelihood (so long as this is done consistently) and in that sense does not affect the likelihood of a parameter taking a particular value. Hence my original comment – Henry Jun 12 '22 at 01:08
  • Yes it does, but the same issue arises there. If the parameter there is say the probability $\theta$ that the next result is the same as the previous result, then obviously you have to look at the results in order. The probability of seeing those particular sequences is very small for any value of that parameter, simply because the sequences are very long. But even then, you would need to decide whether the likelihood with the first sequence is $\theta^{95}(1-\theta)^{204}$ or ${299 \choose 95}\theta^{95}(1-\theta)^{204}$ while I would say it is proportional to each of those – Henry Jun 12 '22 at 23:28
  • The first paragraph is wrong wrong wrong. Likelihood ratios can be used as measures of evidence but an individual likelihood cannot. The "hypotheses" that can be assessed by a likelihood ratio are parameter values within the statistical model, not the usual things brought to mind by the word 'hypothesis'. Likelihoods are entirely model-dependent, and so a likelihood of, say, 0.0023 from a coin tossing experiment with the usual statistical model is not the same as a likelihood of 0.0023 from an experiment where the data are on a continuous scale. That is not a difference between kg and lb. – Michael Lew Jun 13 '22 at 21:34
3

An important concept that should be mentioned in the context of this discussion is that of the Likelihood principle (See also Berger & Wolpert's book). The likelihood principle, which is one of the foundations of Bayesian statistics, states that all the evidence relevant to a model parameter $\theta$ is contained in the likelihood function $\mathcal L(\theta | x) = f_X(x|\theta)$.

The precise statement of the likelihood principle is that if the likelihood functions from two experiments about the same parameter $\theta$ are proportional to each other, namely if

$$f_X(x|\theta) = cf_Y(y|\theta)$$

with $c$ some positive constant, then the evidence on $\theta$ from the experiments is identical. In this sense, the likelihood is only meaningful (as evidence) up to a normalizing constant.

Indeed in Bayesian statistics those two likelihoods will lead to the same posterior probability of $\theta$ (given the same prior), hence Bayesian statistics 'automatically' respects the likelihood principle.
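
A standard illustration of this (my own sketch, assuming numpy/scipy): 9 successes and 3 failures observed under a binomial design ($n = 12$ fixed) or a negative binomial design (sample until 3 failures) give likelihoods that differ only by a constant factor, so the same prior yields the same posterior.

```python
import numpy as np
from scipy import integrate, special

theta = np.linspace(1e-6, 1 - 1e-6, 5001)
lik_binom = special.comb(12, 9) * theta**9 * (1 - theta)**3    # binomial design
lik_negbin = special.comb(11, 9) * theta**9 * (1 - theta)**3   # negative binomial design

prior = np.ones_like(theta)  # flat Beta(1, 1) prior
for lik in (lik_binom, lik_negbin):
    post = lik * prior
    post /= integrate.trapezoid(post, theta)          # normalization absorbs the constant
    print(integrate.trapezoid(theta * post, theta))   # same posterior mean, ~ 10/14
```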

Frequentist statistics in general violates the likelihood principle, so a discussion of the meaning of the likelihood from a frequentist perspective is somewhat meaningless by itself. But in practice, most if not all frequentist uses of the likelihood function are via quantities (such as likelihood ratios or the maximum likelihood estimator) that are also invariant under scaling.

J. Delaney
  • 5,380