I understand the following:
- when we say probability, we mean the probability that a random variable $X$ takes a certain value $x_i$ given the parameter $\theta$ that defines the underlying probability distribution, that is $P(X=x_i|\theta)$.
- when we say likelihood, we mean how plausible it is that the parameter $\theta$ led to the given observation (value) of the random variable, written $\mathcal{L}(\theta|X=x_i)$; a tiny numeric example follows this list.
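As a concrete toy example (the numbers are my own, made up): for a single Bernoulli observation, the probability of the data and the likelihood of the parameter are the same number, just read as functions of different arguments:
$$P(X=1 \mid \theta=0.7) = 0.7, \qquad \mathcal{L}(\theta=0.7 \mid X=1) = 0.7.$$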
I was trying to understand the connection between cross entropy and likelihood from this answer. It says:
For labels $y_i\in \{0,1\}$, the likelihood of some binary data under the Bernoulli model with parameters $\theta$ is $$ \mathcal{L}(\theta) = \prod_{i=1}^n p(y_i=1|\theta)^{y_i}\,p(y_i=0|\theta)^{1-y_i} $$
I got confused, as this seems to define $\mathcal{L}(\theta|X)$ in terms of a product of $p(X|\theta)$! ($X$ is replaced by the variable $y_i$.) I googled a bit more to understand how likelihood is defined in terms of a product of probabilities. These Stanford course notes also state:
$$\mathcal{L}(\theta)=\prod_{i=1}^nf(X_i|\theta)$$
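To convince myself that the product formula and the cross-entropy line up, I ran a quick numeric sanity check (the labels and $\theta$ below are made-up toy values):

```python
import numpy as np

# Toy data: made-up labels y_i in {0, 1} and a made-up Bernoulli parameter theta.
y = np.array([1, 0, 1, 1, 0])
theta = 0.7

# Likelihood as a product over observations:
#   L(theta) = prod_i p(y_i=1|theta)^{y_i} * p(y_i=0|theta)^{1 - y_i}
likelihood = np.prod(theta**y * (1 - theta)**(1 - y))

# Summed binary cross-entropy, i.e. the negative log-likelihood:
#   -log L(theta) = -sum_i [ y_i * log(theta) + (1 - y_i) * log(1 - theta) ]
cross_entropy = -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

print(likelihood)              # ~0.03087
print(np.exp(-cross_entropy))  # same value, so cross-entropy = -log(likelihood)
```

Both prints give the same number (≈0.0309), which is what made me suspect the product-of-probabilities form is really just the joint probability of the data under independence.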
Neither source gave a reason for this. So I gave it some thought and felt that it is because the samples are independent and identically distributed and follow a Bernoulli distribution. That is, the very definition of the Bernoulli distribution together with the IID assumption leads to the above fact. In other words, the likelihood that a certain parameter $\theta$ led to the given observed values $X_i$ is simply the product of the probabilities of these values under the same parameter $\theta$.
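To make my reasoning concrete, this is the factorization I have in mind (assuming the observations are independent given $\theta$):
$$
\mathcal{L}(\theta \mid y_1,\dots,y_n)
= p(y_1,\dots,y_n \mid \theta)
\;\overset{\text{IID}}{=}\; \prod_{i=1}^n p(y_i \mid \theta)
= \prod_{i=1}^n p(y_i=1\mid\theta)^{y_i}\,p(y_i=0\mid\theta)^{1-y_i}.
$$
But I am not sure whether the first equality (taking the likelihood to be the joint probability of the data) is a definition or something that can be derived, which is what my questions below are about.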
However, I still feel I need more clarity behind this intuition.
Q1. Can someone please give a better / clearer intuitive reason why the likelihood is a product of probabilities?
Q2. Is there a formal proof of this?
Q3. The Stanford notes also say:
In the case of discrete distributions, likelihood is a synonym for the joint probability of your data.
Why is it so?