29

According to the Wikipedia article Likelihood function, the likelihood function is defined as:

$$ \mathcal{L}(\theta|x)=P(x|\theta), $$

with parameters $\theta$ and observed data $x$. This equals $p(x|\theta)$ or $p_\theta(x)$ depending on the notation and on whether $\theta$ is treated as a random variable or a fixed value.

The notation $\mathcal{L}(\theta|x)$ seems like an unnecessary abstraction to me. Is there any benefit to using $\mathcal{L}(\theta|x)$, or could one equivalently use $P(x|\theta)$? Why was $\mathcal{L}(\theta|x)$ introduced?

Taylor
  • 20,630
danijar
  • 990
  • 6
    In this context, it reminds us that the likelihood function is a function of $\theta$ with the data $x$ fixed. On the other hand, the joint distribution is a function of the data $x$ given $\theta$. – knrumsey Jun 11 '17 at 22:25
  • 1
    @BigAgnes Thanks. Aren't observed variables fixed by definition, though? I'm also confused why we can call $p(x|\theta)$ a joint distribution. Isn't it a scalar since both $x$ and $\theta$ are fixed (assuming a Frequentist approach where $\theta$ is not a random variable). – danijar Jun 11 '17 at 22:34
  • 1
    Closely related: https://stats.stackexchange.com/questions/224037/wikipedia-entry-on-likelihood-seems-ambiguous – Tim Jun 12 '17 at 07:50

6 Answers

27

Likelihood is a function of $\theta$, given $x$, while $P$ is a function of $x$, given $\theta$.

Roughly like so (excuse the quick effort in MS paint):

"3D" plot showing a set of densities running left to right and likelihoods running front to back

In this sketch we have a single $x$ as our observation. Densities (functions of $x$ at some $\theta$) are in black running left to right and the likelihood functions (functions of $\theta$ at some $x$) are in red, running front to back (or rather back to front, since the $\theta$ axis comes 'forward' and somewhat to the left). The red curves are what you get when you 'slice' across the set of black densities, evaluating each at a given $x$. When we have some observation, it will 'pick out' a single red curve at $x=x_\text{obs}$.

  • The likelihood function is not a density (or pmf). It is defined in terms of densities, but its value at each point comes from a different density. It doesn't integrate (or sum) to 1, and it needn't even be normalizable.

  • Indeed, $\mathcal L$ may be continuous while $P$ is discrete (e.g. the likelihood for a binomial parameter) or vice versa (e.g. the likelihood for an Erlang distribution with unit rate parameter but unspecified shape).

Imagine a bivariate function of a single potential observation $x$ (say a Poisson count) and a single parameter (e.g. $\lambda$) -- in this example discrete in $x$ and continuous in $\lambda$ -- then when you slice that bivariate function of $(x,\lambda)$ one way you get $p_\lambda(x)$ (each slice gives a different pmf) and when you slice it the other way you get $\mathcal L_x(\lambda)$ (each a different continuous likelihood function).

(That bivariate function simply expresses the way $x$ and $\lambda$ are related via your model)
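To make that slicing concrete, here is a small numerical sketch (a Python/numpy illustration under the same Poisson setup; the grids of $\lambda$ and $x$ values below are arbitrary choices):

```python
import numpy as np
from scipy.stats import poisson

lambdas = np.linspace(0.1, 10.0, 200)  # grid of parameter values (arbitrary choice)
xs = np.arange(0, 15)                  # possible Poisson counts (truncated for display)

# Bivariate array over (lambda, x): rows index lambda, columns index x.
grid = np.array([poisson.pmf(xs, lam) for lam in lambdas])

# Slice one way: fix lambda = 3 and get a pmf in x (discrete, sums to ~1 up to truncation).
pmf_slice = grid[np.argmin(np.abs(lambdas - 3.0)), :]
print(pmf_slice.sum())                       # ~1

# Slice the other way: fix the observation x = 4 and get L(lambda | x = 4),
# a continuous function of lambda (one value taken from each of the black densities).
likelihood_slice = grid[:, 4]
print(lambdas[np.argmax(likelihood_slice)])  # maximized near lambda = 4
```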

Conversely, with a discrete $\theta$ and a continuous $x$ the likelihood is discrete and the density continuous.

As soon as you specify $x$, you identify a particular $\mathcal L$, which we call the likelihood function of that sample. It tells you about $\theta$ for that sample -- in particular, which values had more or less likelihood of giving that sample.

Likelihood is a function that tells you about the relative chance that a given value of $\theta$ could have produced your data, when compared with other values of $\theta$ (in that ratios of likelihoods can be thought of as ratios of probabilities of the data falling in the interval from $x$ to $x+dx$).

Glen_b
  • 282,281
  • 1
    It's not a density. For any given $\theta$ its value is equal to that of a density evaluated at a specific $x$, but it's equal to a different density at every $\theta$. Imagine you took every possible value of $\theta$ (imagine for the moment a discrete $\theta$ but continuous $p$) and for each one, you drew the probability density $p$. Then at the specific sample value ($x$), you slice orthogonally across all those different densities. That slice is a likelihood function --- and it is not itself a density. – Glen_b Nov 12 '19 at 22:45
  • 1
    The second thing. For each specific value of $\theta$ and a given $x$, $L$ is equal to the value of the density evaluated at that $x$, given that $\theta$. But the density changes with $\theta$, so $L$ is equal to the value of a different density (each evaluated at $x$) at every point on $L$. – Glen_b Nov 12 '19 at 23:27
3

By Bayes' theorem, $f(\theta \mid x_1,\ldots,x_n) = \frac{f(x_1,\ldots,x_n|\theta) \cdot f(\theta)}{f(x_1,\ldots,x_n)}$, that is, $\text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{evidence}}$.

Notice that the maximum likelihood estimate omits the prior beliefs (or defaults the prior to a zero-mean Gaussian, which acts as L2 regularization or weight decay) and treats the evidence as a constant (when calculating the partial derivative with respect to $\theta$).

It tries to maximize the likelihood by adjusting $\theta$, in effect treating $f(\theta\mid x_1,\ldots ,x_n)$ as if it were $f(x_1,\ldots,x_n\mid \theta)$, which we can easily compute (usually via the loss), and we write that likelihood as $\mathcal{L}(\theta\mid \mathbf x)$. The true posterior $\frac{f(x_1,\ldots,x_n|\theta) \cdot f(\theta)}{f(x_1,\ldots,x_n)}$ can hardly be worked out because the evidence (the denominator), $\int_{\theta} f(x_1, \ldots,x_n, \theta)\,d\theta$, is intractable.
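As a rough sketch of that point (assuming a simple normal model $x_i \sim N(\theta, 1)$ with a zero-mean Gaussian prior $\theta \sim N(0, \tau^2)$; the data and $\tau$ below are made up):

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 2.1, 1.5])  # made-up observations, model: x_i ~ N(theta, 1)
tau = 1.0                            # prior standard deviation: theta ~ N(0, tau^2)

def log_likelihood(theta):
    # log f(x_1, ..., x_n | theta)
    return norm.logpdf(x, loc=theta, scale=1.0).sum()

def log_prior(theta):
    # log f(theta), zero-mean Gaussian: contributes -theta^2 / (2 tau^2), i.e. an L2 penalty
    return norm.logpdf(theta, loc=0.0, scale=tau)

thetas = np.linspace(-2.0, 4.0, 2001)
mle = thetas[np.argmax([log_likelihood(t) for t in thetas])]
map_est = thetas[np.argmax([log_likelihood(t) + log_prior(t) for t in thetas])]

print(mle)      # ~ sample mean (1.375): the MLE ignores the prior
print(map_est)  # shrunk toward 0: the Gaussian prior acts like weight decay
# Neither estimate uses the evidence f(x_1, ..., x_n); it only normalizes the posterior.
```

The grid search is only for illustration; in practice the maximization is done analytically or by gradient methods, but the relationship between the terms is the same.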

Hope this helps.

User1865345
  • 8,202
Lerner Zhang
  • 6,636
2

I agree with @Big Agnes. Here is what my professor taught in class: one way is to think of the likelihood function $L(\theta | \mathbf{x})$ as a random function which depends on the data. Different data give us different likelihood functions, so you may say we are conditioning on the data. Given a realization of the data, we want to find a $\hat{\theta}$ such that $L(\theta | \mathbf{x})$ is maximized, or you can say $\hat{\theta}$ is most consistent with the data. This is the same as saying we maximize the "observed probability" $P (\mathbf{x} | \theta)$. We use $P(\mathbf{x} | \theta)$ to do the calculation, but it is different from $P(\mathbf{X} | \theta)$: small $\mathbf{x}$ stands for the observed values, while $\mathbf{X}$ stands for the random variable. If you know $\theta$, then $P(\mathbf{x} | \theta)$ is the probability/density of observing $\mathbf{x}$.
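To make the "random function" idea concrete, here is a minimal sketch (assuming a normal model with known variance; the two datasets below are made up):

```python
import numpy as np
from scipy.stats import norm

thetas = np.linspace(-3.0, 3.0, 601)

def likelihood(theta, data):
    # L(theta | data): product of N(theta, 1) densities evaluated at the observed data
    return np.prod(norm.pdf(data, loc=theta, scale=1.0))

data_a = np.array([0.1, -0.4, 0.3])   # one realization of the data
data_b = np.array([1.8, 2.2, 1.5])    # a different realization

L_a = [likelihood(t, data_a) for t in thetas]
L_b = [likelihood(t, data_b) for t in thetas]

# Different data give different likelihood functions, hence different maximizers.
print(thetas[np.argmax(L_a)])  # near mean(data_a) = 0.0
print(thetas[np.argmax(L_b)])  # near mean(data_b) ~ 1.83
```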

jwyao
  • 236
  • Thanks. Could I equivalently use $P(x|\theta)$ (with lowercase x) instead of $\mathcal{L}(\theta|x)$? When we write $\max_\theta P(x|\theta)$ it should be clear that $x$ is fixed and we're trying to find the most consistent $\theta$. Or does $\mathcal{L}(\theta|x)$ refer to something more abstract that has a different implementation in some situations? – danijar Jun 11 '17 at 22:52
  • Also, could you elaborate why $\mathcal{L}(\theta|x)$ is a random function? It seems like it should be deterministic since both $x$ and $\theta$ are fixed (unless we give $\theta$ a Bayesian treatment and consider it a random variable). – danijar Jun 11 '17 at 22:54
  • It is better to use $L(\theta | \mathbf{x})$ (that is actually how likelihood is defined), because it is a function of $\theta$ rather than of $\mathbf{x}$. I don't know if $L(\theta | \mathbf{x})$ refers to something abstract. As for the random-function argument, it is just a way of thinking about the likelihood function. The true $\theta$ is fixed, but we don't know it; that's why we estimate it. You plug your observations into $L(\theta | \mathbf{x})$, and different data give you different functions. So the likelihood function depends on your observations, which makes it like a function of random variables. – jwyao Jun 11 '17 at 23:09
  • $L(\theta | \mathbf{x})$ looks like a posterior distribution but in fact, it isn't. There is no assumption on the (prior) distribution $\pi (\theta)$ of $\theta$. – jwyao Jun 11 '17 at 23:11
  • So one could write $\mathcal{L}_x(\theta)$ to express this more clearly? (I know we shouldn't write this in practice since it's not common notation.) – danijar Jun 11 '17 at 23:17
  • My guess is yes. Your notation states it's a function of $\theta$ clearly. But as you said, in practice $L(\theta | \mathbf{x})$ is what people use. I think it is a common notation. You may want to look at some standard statistics textbooks, like link. – jwyao Jun 11 '17 at 23:22
2

I think the other answers given by jwyao and Glen_b are quite good. I just wanted to add a very simple example which is too long for a comment.

Consider one observation $X$ from a Bernoulli distribution with probability of success $\theta$. With $\theta$ fixed (known or unknown), the distribution of $X$ is given by $P(x|\theta)$:

$$P(x|\theta) = \theta^x(1-\theta)^{1-x}$$

In other words, we know that $P(X=1) = 1 - P(X=0) = \theta$.

Alternatively, we could treat the observation as fixed and view this as a function of $\theta$:

$$L(\theta | x) = \theta^x(1-\theta)^{1-x}$$

In a maximum likelihood setting, for instance, we seek the $\theta$ which maximizes the likelihood as a function of $\theta$. If we observe $X = 1$, then the likelihood becomes

$$L(\theta | x) = \begin{cases} \theta, & 0 \leq \theta \leq 1 \\ 0, & \text{else} \end{cases}$$

and we see that the MLE would be $\hat\theta = 1$.
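As a quick numerical check of this example (a small Python sketch; the grid of $\theta$ values is an arbitrary choice):

```python
import numpy as np

def likelihood(theta, x):
    # Bernoulli likelihood: L(theta | x) = theta^x * (1 - theta)^(1 - x)
    return theta**x * (1 - theta)**(1 - x)

thetas = np.linspace(0.0, 1.0, 101)

# Observing x = 1 gives L(theta | 1) = theta, maximized at theta = 1.
print(thetas[np.argmax(likelihood(thetas, 1))])  # 1.0
# Observing x = 0 gives L(theta | 0) = 1 - theta, maximized at theta = 0.
print(thetas[np.argmax(likelihood(thetas, 0))])  # 0.0
```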

Not sure that I've really added any value to the discussion, but I just wanted to give a simple example of the different ways of viewing the same function.

knrumsey
  • 7,722
0

When we write $f(z|\omega)$, we mean a function of $z$ for given parameters $\omega$. The density $P(x|\theta)$ is a function of the data $x$ for given parameters $\theta$; for example, $\int_X P(x|\theta)\,dx=1$.

When we define the likelihood function $\mathcal{L}(\theta|x)$, we mean a function of the parameters $\theta$ given the dataset $x$. In general, $\int_\Theta\mathcal{L}(\theta|x)\,d\theta\ne 1$.

This is an important distinction because the optimization problem is over the parameters $\theta$ and not the dataset $x$: $$\max_{\theta\in\Theta} \mathcal{L}(\theta|x)$$ In fact, we have only one dataset in this case. Therefore, even though we define the likelihood through the equality $\mathcal{L}(\theta|x)=P(x|\theta)$, it is important to denote it as a function of the parameters $\theta$.
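A numerical illustration of that asymmetry, using an exponential model with rate $\theta$ as a stand-in (a sketch assuming scipy; the rate and the observed value are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

theta = 2.0    # a fixed rate parameter (arbitrary)
x_obs = 1.5    # a single observed value (arbitrary)

density = lambda x: theta * np.exp(-theta * x)   # P(x | theta), a function of x
likelihood = lambda t: t * np.exp(-t * x_obs)    # L(theta | x), a function of theta

print(quad(density, 0, np.inf)[0])      # 1.0: the density integrates to 1 over x
print(quad(likelihood, 0, np.inf)[0])   # 1 / x_obs^2 ~ 0.44: the likelihood need not integrate to 1
```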

Aksakal
  • 61,310
-1

$$L(\theta) = P(y \mid x; \theta)$$

where $L(\theta)$ is the likelihood of $\theta$ ($\theta$ is not a random variable; it is a parameter), and $P(y \mid x; \theta)$ is the probability of observing $y$ given data $x$ when the parameter $\theta$ is used.

Given data in the format $x = [x_1, x_2, x_3, \ldots, x_n] \Rightarrow y$ ($n$ input features and $y$ as our predefined output in the training data).

So, maximizing $L(\theta)$ helps us find the parameter $\theta$ which increases the chance of observing $y$ when $x$ is given (maximizing the probability of getting the result $h_\theta(x) = y$).
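A minimal sketch of what maximizing this conditional likelihood could look like, assuming a logistic model $h_\theta(x) = \sigma(\theta x)$ and made-up training data:

```python
import numpy as np

# Made-up training data: a single feature x and binary labels y.
x = np.array([-2.0, -1.0, 0.5, 1.5, 2.0])
y = np.array([0, 1, 0, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta):
    # log P(y | x; theta) for the logistic model h_theta(x) = sigmoid(theta * x)
    p = sigmoid(theta * x)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

thetas = np.linspace(-5.0, 5.0, 1001)
theta_hat = thetas[np.argmax([log_likelihood(t) for t in thetas])]
print(theta_hat)   # the theta under which the observed labels are most probable
```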