I got the definition of the maximum-likelihood estimator from Goodfellow's Deep Learning book:
\begin{equation} \label{eq:loglikelihood} \theta_{ML} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p_{model}(x_i; \theta). \end{equation}
The same book defines the empirical distribution as
\begin{equation} \label{eq:empiricaldistribution} \hat{p}_{data}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta(x - x_{i}), \end{equation}
where the Dirac delta is taken to be $\delta(x - x_{i}) = 1$ when $x = x_{i}$ and $0$ otherwise.
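To spell out how I am using the empirical distribution (this is my own intermediate step, relying on the sifting property $\int \delta(x - x_{i}) f(x)\, dx = f(x_{i})$ rather than on the value of $\delta$ itself): the expectation of any function $f$ under $\hat{p}_{data}$ collapses to the sample average,
\begin{equation} \mathbb{E}_{x\sim \hat{p}_{data}}\left[f(x)\right] = \int \hat{p}_{data}(x)\, f(x)\, dx = \frac{1}{m} \sum_{i=1}^{m} \int \delta(x - x_{i})\, f(x)\, dx = \frac{1}{m} \sum_{i=1}^{m} f(x_{i}). \end{equation}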
From the definition of the expected value of a function $f(x)$, I can rewrite the log-likelihood as an expectation with respect to the empirical distribution by dividing the sum by $m$, which does not change the argmax:
\begin{equation}\label{eq:expectedloglikelihood1} \theta_{ML} = \operatorname*{argmax}_{\theta} \frac{1}{m} \sum_{i=1}^{m} \log p_{model}(x_i; \theta) = \operatorname*{argmax}_{\theta} \, \mathbb{E}_{x\sim \hat{p}_{data}} \log p_{model}(x; \theta). \end{equation}
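To convince myself that the $1/m$ factor changes nothing, I wrote a small numerical sketch (my own toy example, not from the book: a Gaussian with unknown mean, fixed unit variance, and a grid search over $\theta$):

```python
import numpy as np

# Toy check: fit the mean of a N(theta, 1) model by grid search and verify that
# dividing the log-likelihood by m does not move the argmax.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)   # hypothetical data set
thetas = np.linspace(-1.0, 5.0, 601)           # candidate parameter values

# log p_model(x_i; theta) for every (theta, x_i) pair
log_p = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - thetas[:, None]) ** 2

sum_ll = log_p.sum(axis=1)    # sum_i log p_model(x_i; theta)
mean_ll = log_p.mean(axis=1)  # (1/m) * sum_i ... = E_{x ~ p_hat_data}[log p_model(x; theta)]

print("argmax of the sum: ", thetas[sum_ll.argmax()])
print("argmax of the mean:", thetas[mean_ll.argmax()])   # identical grid point
print("sample mean of x:  ", x.mean())                   # MLE of the mean, for comparison
```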
Turning the maximization into a minimization of the negative expectation, I recover the cross-entropy $H$:
\begin{equation}\label{eq:expectedvalue} \theta_{ML} = \operatorname*{argmin}_{\theta} \left( -\,\mathbb{E}_{x\sim \hat{p}_{data}} \log p_{model}(x; \theta) \right) = \operatorname*{argmin}_{\theta} H(\hat{p}_{data}, p_{model}). \end{equation}
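For the binary case I also checked numerically that the negative mean log-likelihood of a Bernoulli model is exactly the binary cross-entropy (again a toy sketch of mine, with made-up data and a made-up $\theta$):

```python
import numpy as np

# Toy check: for a Bernoulli model p_model(x; theta) = theta^x * (1 - theta)^(1 - x),
# the negative mean log-likelihood equals the binary cross-entropy.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=1000)   # hypothetical binary data, x_i in {0, 1}
theta = 0.7                         # some candidate parameter (assumed, not fitted)

neg_mean_ll = -np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Binary cross-entropy written in terms of the empirical frequency of ones
p_hat_one = x.mean()
bce = -(p_hat_one * np.log(theta) + (1 - p_hat_one) * np.log(1 - theta))

print(neg_mean_ll, bce)   # the two values coincide (up to floating point)
```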
Reading more about this here, I found an answer relating this derivation to the Monte Carlo method here, and I reproduce the part that intrigues me:
Again, we don't know the true probability distribution $p_{data}(x)$, but we have samples. By the Monte Carlo method, we have $$\sum_{x} p_{data}(x)\log p_{model}(x;\theta) \approx \frac{1}{m}\sum_{i=1}^{m}\log p_{model}(x^{(i)};\theta)$$
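To see what this Monte Carlo step is claiming, I tried a toy check where $p_{data}$ is a known Bernoulli, so the expectation on the left can be computed exactly and compared with the sample average on the right (the parameters below are just assumptions for the experiment):

```python
import numpy as np

# Toy check of the Monte Carlo step: here p_data is a known Bernoulli(0.3), so
# E_{x ~ p_data}[log p_model(x; theta)] can be computed exactly and compared
# with the sample average over m draws from p_data.
rng = np.random.default_rng(2)
p_data = 0.3    # assumed "true" parameter (unknown in a real problem)
theta = 0.4     # some model parameter

def log_p_model(x, theta):
    return x * np.log(theta) + (1 - x) * np.log(1 - theta)

# Exact expectation: sum over the two outcomes weighted by p_data(x)
exact = p_data * np.log(theta) + (1 - p_data) * np.log(1 - theta)

for m in (10, 100, 10_000, 1_000_000):
    samples = rng.binomial(1, p_data, size=m)
    print(m, log_p_model(samples, theta).mean(), "exact:", exact)
# The sample averages fluctuate around the exact value and tighten as m grows.
```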
So how does the Monte Carlo method work under these circumstances? And to get the binary cross-entropy in the next step of the derivation, do I need to change the definition of the Dirac delta?