Definition and Interpretation of Likelihood for non-PhD's

Question

Regarding "non-PhD" in the question title, please answer this question for the audience of people with a solid understanding of probability distributions but no knowledge of the nuances of different statistical paradigms (e.g., frequentism, Bayesianism, likelihoodism).

After reading through the existing Stack Exchange questions on this topic, I am confused: I have seen multiple top-voted answers that seem to contradict each other, and that also seem to contradict a highly regarded machine learning book.

Here is what I've gathered, please comment on what is correct / wrong. If something is both correct and wrong depending on the statistical paradigm one subscribes to, please say so instead of stating the belief dependent only on one paradigm. Please only comment if you are an expert on this topic, as this seems to be a contentious topic.

Firstly, I'll state that:

likelihood is not a PMF / PDF, as its sum / integral over $\theta$ need not equal 1
discrete / continuous random variables have probabilities / probability densities

So no need to expend energy there.

Secondly, both Wikipedia and a forum top comment (Macro) agree that probability (density) and likelihood produce the same value, given some data X and some parameters $\theta$:

Wikipedia:

$\mathcal{L}(\theta|x) = P(X=x|\theta)$

$\mathcal{L}(\theta|x) = f(x|\theta)$

Macro:

the likelihood is not the probability of the parameter value being correct or anything like that - it is the probability (density) of the data given the parameter value
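To make the "same value, different function" point concrete, here is a minimal sketch. It assumes a Binomial model with n = 10 trials and 7 observed successes; these numbers and the helper names (`binom_pmf`, `likelihood`) are illustrative choices, not taken from any of the quoted answers. Holding the parameter fixed and varying the data gives a PMF that sums to 1. Holding the data fixed and varying the parameter gives the likelihood, whose area over the parameter is not 1.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x | p) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, x_obs = 10, 7  # assumed example: 7 successes in 10 trials

# Parameter fixed at p = 0.5, data varies: this is a PMF and sums to 1.
pmf_total = sum(binom_pmf(x, n, 0.5) for x in range(n + 1))

# Data fixed at x_obs, parameter varies: this is the likelihood function.
def likelihood(p):
    return binom_pmf(x_obs, n, p)

# Midpoint-rule integral of the likelihood over p in [0, 1].
steps = 10_000
lik_area = sum(likelihood((i + 0.5) / steps) for i in range(steps)) / steps

print(pmf_total)  # 1 up to floating-point rounding
print(lik_area)   # close to 1/(n + 1) = 1/11, not 1
```

At any single pair (x, p) the two functions return the same number; they differ only in which argument is treated as the variable.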

These seem to contradict two other top forum comments, which say that likelihood is proportional, but not equal, to probability:

hello_there_andy + ars:

Although it seems like we have simply re-written the probability function, a key consequence of this is that the likelihood function does not obey the laws of probability (for example, it's not bound to the [0, 1] interval). However, the likelihood function is proportional to the probability of the observed data.

Sextus Empiricus:

In your class they introduced the likelihood as being equal to the conditional probability $\mathcal{L}(\theta;x) = f(x;\theta)$, but this was just a simplification. The likelihood does not need to be equal and it is proportional.
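The "proportional, not necessarily equal" claim can be checked numerically. The sketch below again assumes a Binomial model with 7 successes in 10 trials (an illustrative choice) and compares the full probability with a version that drops the constant factor C(n, x), which does not depend on the parameter. Under that assumption, the maximizing parameter value and every likelihood ratio are unchanged.

```python
import math

n, x_obs = 10, 7  # assumed example data

def full_likelihood(p):
    # Equal to the probability of the observed data given p.
    return math.comb(n, x_obs) * p**x_obs * (1 - p)**(n - x_obs)

def kernel_likelihood(p):
    # Same function with the parameter-free constant C(n, x) dropped.
    return p**x_obs * (1 - p)**(n - x_obs)

# Grid search for the maximizer of each version.
grid = [i / 1000 for i in range(1, 1000)]
mle_full = max(grid, key=full_likelihood)
mle_kernel = max(grid, key=kernel_likelihood)

# Ratio of likelihoods at two parameter values, for each version.
ratio_full = full_likelihood(0.7) / full_likelihood(0.5)
ratio_kernel = kernel_likelihood(0.7) / kernel_likelihood(0.5)

print(mle_full, mle_kernel)      # the same grid point for both
print(ratio_full, ratio_kernel)  # the same ratio for both
```

So "equal to" and "proportional to" lead to the same inferences whenever only the maximum or ratios within one model and dataset are used.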

I was tempted to just take the answers from hello_there_andy, ars, and Sextus Empiricus and move on, but I wrote this question so that we could have a clear comparison of these top answers.

Thirdly, Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" states:

Given a statistical model with some parameters θ, the word “probability” is used to describe how plausible a future outcome x is (knowing the parameter values θ), while the word “likelihood” is used to describe how plausible a particular set of parameter values θ are, after the outcome x is known.

This seems to contradict the current Wikipedia definition, which includes this statement:

The likelihood function does not specify the probability that $\theta$ is the truth, given the observed sample X=x.

As well as part of Macro's answer:

the likelihood is not the probability of the parameter value being correct or anything like that

Maybe I am wrong in equating "plausibility" with "probability".

If I am wrong here, is it correct to say that likelihood:

1. Is representative of the plausibility of $\theta$, given X, while not being the probability of $\theta$, given X.
2. Is proportional (or equal...) to the probability of data x occurring, given parameters $\theta$.

I do not have time for an answer now, but here are some earlier answers of mine which can help: https://stats.stackexchange.com/questions/345069/likelihood-comparable-across-different-distributions/355836#355836, https://stats.stackexchange.com/questions/97515/what-does-likelihood-is-only-defined-up-to-a-multiplicative-constant-of-proport/97522#97522, or @whuber answer at https://stats.stackexchange.com/a/397166/11887 — kjetil b halvorsen, Jan 25 '23 at 18:40
Bayesians knew all along you have to multiply the likelihood by a prior to get a posterior probability distribution. The problem is there is no universally agreed upon prior for any given data analysis; the differences are not even proportional to each other. For frequentists, the absolute value of a likelihood means nothing. L=50 may as well be L=0.0005. For nicely behaved families (concave, regular exponential families) indexed by $\theta$, the maximum is an interesting estimator, but much breaks down when you go beyond that. — AdamO, Jan 25 '23 at 19:05
I'm just here to say that it's a subtle distinction between probability and likelihood, so one should expect to struggle with it before understanding. — John Madden, Jan 25 '23 at 21:22
The illustration and explanation here: https://stats.stackexchange.com/a/284827/805 may resolve one of the issues you appear to have; note that while the likelihood (a red curve for some data, $x$) is equal to a joint density (black curve, for some parameter $\theta$) at every point, the functions operate in different directions and mean different things. The black curves are densities, the red curves join points from different densities and are not densities; they might not even be integrable. ... ctd — Glen_b, Jan 26 '23 at 01:14
ctd ... Indeed, it's perfectly possible for one to be discrete and the other continuous (either way around). Once this distinction between the two is clearly understood, a lot of common misconceptions fall away. — Glen_b, Jan 26 '23 at 01:17

score 1 · Answer 1 · answered Jan 25 '23 at 21:07

The likelihood does give us what can often be equated with 'plausibility', but it is important to say that it is the relative plausibility according to the statistical model. And it is probably useful to note that you are, in effect, using likelihood as a definition of (statistical?) plausibility. That is, in my opinion, a reasonable usage, but it will likely grate on people who view likelihood-based inference with mistrust. The problem is that 'plausible' refers to a state of mind as much as it refers to a statistically definable thing, and different minds are, well, different.

Your statement 2 seems to match Fisher's original definition of likelihood and so it must be at least approximately correct. I would say that it would be greatly improved by omitting the "(or equal...)" and by adding something like 'according to the statistical model being used.'

Because likelihoods always have arbitrary scaling, we really can only make inferences about the plausibility of parameter values with reference to the plausibility of at least one other parameter value. The likelihoods that give those plausibilities must be taken from the same statistical model and data, as a ratio of likelihoods across models is meaningless with respect to parameter values, even if it might be helpful when choosing between the statistical models themselves.

The sentence you've taken from Wikipedia ("The likelihood function does not specify the probability that $\theta$ is the truth, given the observed sample X=x.") is correct, but it is correct in a way that people seem to miss. Because the likelihoods only exist and are relevant to other likelihoods within a statistical model, their relevance to any real-world truth value is necessarily limited by the statistical model, and we respect the notion that all models are wrong!

score 0 · Answer 2 · answered Jan 25 '23 at 19:08


Your interpretations 1 and 2 are both wrong.

1. Bayesians knew all along you have to multiply the likelihood by a prior to get a posterior probability distribution. The problem is there is no universally agreed upon prior for any given data analysis; the differences are not even proportional to each other.
2. For frequentists, the absolute value of a likelihood means nothing. L=50 may as well be L=0.0005. The relative likelihood (such as a ratio) tells us more. When ratios are monotone, the Karlin-Rubin theorem tells us that the likelihood ratio test has optimal power. For nicely behaved families (concave, regular exponential families) indexed by $\theta$, the maximum is an interesting estimator, but much breaks down when you go beyond that.
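Both points can be illustrated with a small grid approximation. This is a sketch under assumed inputs (7 successes in 10 Bernoulli trials, a flat prior, and a Beta(2, 2) prior); none of these choices come from the thread. It shows that multiplying the likelihood by a prior and normalizing yields a proper density over the parameter, that rescaling the likelihood by a constant such as 50 leaves the posterior unchanged, and that a different prior gives a different posterior.

```python
n, x_obs = 10, 7  # assumed example data
steps = 1000
width = 1 / steps
grid = [(i + 0.5) / steps for i in range(steps)]  # midpoints of [0, 1]

def likelihood(p, scale=1.0):
    # Likelihood kernel times an arbitrary constant "scale".
    return scale * p**x_obs * (1 - p)**(n - x_obs)

def flat_prior(p):
    return 1.0

def beta22_prior(p):
    # Density of Beta(2, 2) on [0, 1].
    return 6 * p * (1 - p)

def posterior(prior, scale=1.0):
    # Likelihood times prior, normalized so the grid density integrates to 1.
    unnorm = [likelihood(p, scale) * prior(p) for p in grid]
    total = sum(unnorm) * width
    return [u / total for u in unnorm]

def mean(density):
    return sum(p * d for p, d in zip(grid, density)) * width

post_flat = posterior(flat_prior)
post_flat_rescaled = posterior(flat_prior, scale=50.0)
post_beta = posterior(beta22_prior)

print(sum(post_flat) * width)  # integrates to 1, unlike the raw likelihood
print(mean(post_flat))         # near 8/12, the mean of Beta(8, 4)
print(mean(post_beta))         # near 9/14, the mean of Beta(9, 5)
```

The rescaled and unrescaled posteriors agree at every grid point, which is one way to see why the absolute value of a likelihood carries no information by itself.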


AdamO


I disagree with your statement that the interpretations are both wrong. They may be limited, and they might be improved with more words, but they are not wrong. Your two numbered statements do not seem to say why the questioner's statements are wrong. – Michael Lew Jan 25 '23 at 20:44
@MichaelLew I believe that I hit the two key issues: if you want a probability for a parameter's unknown value, a Bayesian treatment is required. And regarding the interpretation of an absolute likelihood value, your answer and mine agree it's only useful to compare relative values in the same dataset and within the same probability model. You can't, for instance, compare likelihoods to decide if a probit or a logit model is "better". I think the real difference here is that I take the hard line and you are nudging in a more enlightened direction. – AdamO Jan 26 '23 at 06:35
Why can you not use likelihood to choose between logit or probit? Those are models defined with respect to the same underlying measure, so it should be fine? – kjetil b halvorsen Jan 27 '23 at 16:20
@kjetilbhalvorsen well, you could use the AIC, and, for models having the same number of parameters, ranking the models by AIC would be equivalent to ranking them by likelihood. The problem is that the exact and asymptotic properties of the AIC are not well understood. I often see "Model Selection and Multimodel Inference" by Burnham and Anderson cited for this, but I can't get a copy. This is why any "choice" along these lines made by AIC is generally regarded as pragmatic. – AdamO Jan 27 '23 at 16:34
Ok, thanks for the answer. I thought you had objections to the logic of comparing these likelihoods; it seems you do not. These doubts come up often here, so I will try to write an answer addressing them. But it must wait ... My computer is at a repair shop now – kjetil b halvorsen Jan 28 '23 at 00:51
