How is maximum likelihood estimation written in terms of expectation with respect to empirical distribution defined by training data?

Question

How is equation 5.58 the same as equation 5.59? What does equation 5.59 even mean?

As far as I know, an expected value is taken with respect to a random variable: it is the sum, over all values the random variable can take, of each value multiplied by its probability.

Benoit Sanchez · Accepted Answer · 2017-12-27T12:29:23.113


Write $f_\theta(x)=\log p_\text{model}(x;\theta)$ to simplify formulas. The idea holds for any function.

When you divide the expression $\displaystyle\sum_{i=1}^m f_\theta(x^{(i)})$ by $m$ you get the empirical mean of $f_\theta(x)$: $\frac{1}{m}\displaystyle\sum_{i=1}^m f_\theta(x^{(i)})$.

By definition, the empirical distribution is $\hat p(x)=\frac{\#x}{m}$, where $\#x$ is the number of times $x$ appears in the dataset. The only thing left to understand is that the empirical mean of a function is the same as its mean under the empirical distribution. To see this, just write:

$$\frac{1}{m}\displaystyle\sum_{i=1}^m f_\theta(x^{(i)})=\frac{1}{m}\sum_x\sum_{x^{(i)}=x}f_\theta(x)=\frac{1}{m}\sum_x\#xf_\theta(x)=\sum_x \hat p(x)f_\theta(x)$$

Note: the double summation simply groups the terms by value of $x$. For intuition, imagine you have to sum a lot of terms, each being either 2, 3 or 5. You can first sum the 2s, then the 3s, then the 5s, and add the three sums. That is all the formula does.
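The grouping argument can be checked numerically. The following is only a small sketch: the dataset, the function `square` (standing in for $f_\theta$) and the helper names are made up for illustration, and exact fractions are used so the two results can be compared without rounding issues.

```python
from collections import Counter
from fractions import Fraction


def square(x):
    # Stand-in for f_theta; any function of x would work here.
    return Fraction(x * x)


def empirical_mean(f, data):
    # (1/m) * sum_i f(x^(i)): one term per data point.
    return sum(f(x) for x in data) / len(data)


def expectation_under_empirical(f, data):
    # sum_x p_hat(x) * f(x): one term per distinct value,
    # with p_hat(x) = (#x) / m.
    m = len(data)
    counts = Counter(data)
    return sum(Fraction(c, m) * f(x) for x, c in counts.items())


# Hypothetical dataset whose points are each 2, 3 or 5.
data = [2, 3, 5, 2, 2, 5, 3, 2]

mean_by_points = empirical_mean(square, data)
mean_by_values = expectation_under_empirical(square, data)
print(mean_by_points, mean_by_values)
```

Both computations give the same number: the second one just adds up the 2s, the 3s and the 5s separately before combining them.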

The expression on the right is, by definition, the expected value of $f_\theta(x)$ under the empirical distribution, written $E_{x\sim\hat p}f_\theta(x)$. So the function in the second $\arg\max$ is just the function in the first $\arg\max$ divided by $m$. Since $m$ is a positive constant that does not depend on $\theta$, maximizing one or the other gives the same $\theta$.
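To see that dividing by $m$ does not move the maximizer, here is a rough sketch with an assumed Bernoulli model and a made-up sample of coin flips (neither comes from the book). It does a simple grid search over $\theta$ rather than any real optimizer.

```python
import math

# Hypothetical coin flips: six ones out of eight.
data = [1, 0, 1, 1, 0, 1, 1, 1]
m = len(data)


def log_p_model(x, theta):
    # Assumed Bernoulli model: p(1) = theta, p(0) = 1 - theta.
    return math.log(theta) if x == 1 else math.log(1.0 - theta)


def total_log_likelihood(theta):
    # Objective of the first arg max: sum over data points.
    return sum(log_p_model(x, theta) for x in data)


def mean_log_likelihood(theta):
    # Objective of the second arg max: the same sum divided by m,
    # i.e. the expectation under the empirical distribution.
    return total_log_likelihood(theta) / m


# Candidate values 0.01, 0.02, ..., 0.99.
grid = [k / 100 for k in range(1, 100)]
theta_from_sum = max(grid, key=total_log_likelihood)
theta_from_mean = max(grid, key=mean_log_likelihood)
print(theta_from_sum, theta_from_mean)
```

On this sample both searches pick $\theta = 0.75$, the fraction of ones in the data, since rescaling the objective by $1/m$ preserves the ordering of its values.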

edited Dec 27 '17 at 12:29

answered Dec 27 '17 at 11:30


The definition of the empirical distribution here (page 64, eq. 3.28) is different. Also, can you please explain how you got the double-summation form? – humble Dec 27 '17 at 11:46
On Wikipedia, the definition of the empirical distribution is different. – humble Dec 27 '17 at 11:58
Wikipedia defines the "empirical distribution function", which is the cumulative distribution function of the empirical distribution. Definition 3.28 in the book is equivalent to mine, since $\delta(x-x^{(i)})$ is 1 for $\#x$ of the indices $i$ and 0 for the others. – Benoit Sanchez Dec 27 '17 at 12:15
The double summation is grouping: group the terms by value of $x$. Imagine you have to sum a lot of terms, each being either 2, 3 or 5. You can first sum the 2s, then sum the 3s, then sum the 5s, and add the three sums. That is all the formula does. – Benoit Sanchez Dec 27 '17 at 12:21
Thanks, got it. Could you please add the above explanation of the double summation to your answer so that I can accept it? – humble Dec 27 '17 at 12:27
