
I see this expectation in a lot of machine learning literature:

$$\mathbb{E}_{p(\mathbf{x};\mathbf{\theta})}[f(\mathbf{x};\mathbf{\phi})] = \int p(\mathbf{x};\mathbf{\theta}) f(\mathbf{x};\mathbf{\phi}) d\mathbf{x}$$

For example, in the context of neural networks, a slightly different version of this expectation is used as a cost function that is computed using Monte Carlo integration.
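To make the Monte Carlo integration concrete, here is a minimal sketch (the Gaussian $p$ and quadratic $f$ are placeholder choices of mine, not from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

theta, phi = 1.0, 0.5  # illustrative parameter values

def f(x, phi):
    # an arbitrary integrand f(x; phi), chosen only for this sketch
    return (x - phi) ** 2

# draw samples x_i ~ p(x; theta); p is taken to be N(theta, 1), again just for illustration
x = rng.normal(loc=theta, scale=1.0, size=100_000)

# Monte Carlo integration: E_p[f(x; phi)] is approximated by (1/N) * sum_i f(x_i; phi)
estimate = f(x, phi).mean()
print(estimate)  # close to the exact value 1 + (theta - phi)**2 = 1.25 for these choices
```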

However, I am a bit confused about the notation that is used, and would highly appreciate some clarity. In classical probability theory, the expectation:

$$\mathbb{E}[X] = \int_x x \cdot p(x) \ dx$$

indicates the "average" value of the random variable $X$. Taking it a step further, the expectation:

$$\mathbb{E}[g(X)]=\int_x g(x) \cdot p(x) \ dx$$

indicates the "average" value of the random variable $Y=g(X)$. From this, it seems that the expectation:

$$\mathbb{E}_{p(\mathbf{x};\mathbf{\theta})}[f(\mathbf{x};\mathbf{\phi})]$$

is shorthand for, and means the same as,

$$\mathbb{E}_{\mathbf{x}}[f(\mathbf{x};\mathbf{\phi})]$$

where

$$ \mathbf{x} \sim p(\mathbf{x};\mathbf{\theta})$$

and this indicates the average value of the random vector $\mathbf{y} = f(\mathbf{x};\mathbf{\phi})$. Is this correct?
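For instance, taking $p(x;\theta) = \mathcal{N}(x;\theta,1)$ and $f(x;\phi) = (x-\phi)^2$ (choices made up purely for illustration), both readings give

$$\mathbb{E}_{p(x;\theta)}[f(x;\phi)] = \mathbb{E}_{\mathbf{x}}\left[(x-\phi)^2\right] = \operatorname{Var}(x) + \left(\mathbb{E}[x]-\phi\right)^2 = 1 + (\theta-\phi)^2,$$

where $x \sim p(x;\theta)$, so the value of the expectation depends on $\theta$ even though $\theta$ does not appear inside the brackets.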

By this logic, would this statement be correct too?

$$\mathbb{E}[X] = \mathbb{E}_{p(X)}[X]$$

    Re "Is shorthand for and the same as": Not quite. Notice that the original expression explicitly mentions $\theta$ while the subsequent one does not. – whuber Sep 11 '20 at 18:35
  • You got it right! This is quite a confusing notation. I prefer to use the notation $$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}|\theta)}[X].$$ – MachineLearner Sep 11 '20 at 19:59
  • Hey @whuber, the parameter $\theta$ appears when I say that $\mathbf{x} \sim p(\mathbf{x};\theta)$. What I meant was that the original notation is shorthand for the expectation together with the statement $\mathbf{x} \sim p(\mathbf{x};\theta)$. What do you think? – mhdadk Sep 11 '20 at 20:09
  • I think you need to rely on the conventions and context established by the author. There is no universal notation. – whuber Sep 11 '20 at 20:12
  • That is actually my struggle. I rarely find authors who establish conventions in papers. – mhdadk Sep 11 '20 at 20:18
  • $\mathbb E[\mathbf X]$ is ambiguous, while $$\mathbb{E}_{\mathbf{X} \sim p(\mathbf{x}|\theta)}[X]$$ and $$\mathbb{E}_{p(\cdot|\theta)}[X]$$ and $$\mathbb{E}_{p(\mathbf{x}|\theta)}[X]$$ are not. This is particularly true when considering varying values of a parameter $\theta$, such as $$\mathbb{E}_{p(\cdot;\mathbf{\theta})}[\log p(\mathbf{X};\mathbf{\phi})]$$ found e.g. in the EM algorithm. – Xi'an Sep 12 '20 at 08:08
  • Thanks @Xi'an for the help! – mhdadk Sep 12 '20 at 15:22
  • Just out of curiosity, you mentioned neural networks and MC integration. I'm not familiar with such a model; could you elaborate? – jbuddy_13 Sep 13 '20 at 05:34
  • Hi @jbuddy_13, in a classical neural network architecture, the posterior probability of the classes $\mathbf{y}=[y_1,y_2,...,y_K]$ given an input feature vector $\mathbf{x}$ is $p(\mathbf{y}|\mathbf{x};\mathbf{w})$, where $\mathbf{w}$ are the parameters of the network. Note that $\mathbf{y}$ is one-hot encoded. This posterior probability is estimated using maximum likelihood estimation, and therefore the objective is to maximize $\mathbb{E}_{p(\mathbf{x},\mathbf{y})}[\log p(\mathbf{y}|\mathbf{x};\mathbf{w})]$. – mhdadk Sep 13 '20 at 12:09
  • ...This is done by sampling from $p(\mathbf{x},\mathbf{y})$ and then using MC integration to compute the expectation. See here for a similar derivation for logistic regression. – mhdadk Sep 13 '20 at 12:10
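To make the Monte Carlo point in the last two comments concrete, here is a minimal sketch (a single linear softmax layer on made-up data; every name and number is illustrative, not from a specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

# a minibatch (x_i, y_i) ~ p(x, y); drawn from an arbitrary joint for illustration
N, D, K = 256, 5, 3                  # batch size, feature dim, number of classes
X = rng.normal(size=(N, D))          # input feature vectors
y = rng.integers(0, K, size=N)       # integer-coded class labels
W = 0.1 * rng.normal(size=(D, K))    # network parameters w (one linear layer here)

# log p(y | x; w) via a numerically stable log-softmax
logits = X @ W
logits -= logits.max(axis=1, keepdims=True)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Monte Carlo estimate of E_{p(x,y)}[log p(y | x; w)]:
# the average per-example log-likelihood over the sampled minibatch
mc_objective = log_probs[np.arange(N), y].mean()
print(mc_objective)  # maximizing this is minimizing the usual cross-entropy loss
```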

1 Answer


The expression

$$\mathbb E[g(x;y;\theta;h(x,z),...)]$$

always means "the expected value with respect to the joint distribution of all things having a non-degenerate distribution inside the brackets."

Once you start putting subscripts on $\mathbb E$, you specify a possibly "narrower" joint distribution over which you want (for your own reasons) to average. For example, if you wrote $$\mathbb E_{\theta, z}[g(x;y;\theta;h(x,z),...)]$$ I would be inclined to believe that you mean only

$$\mathbb E_{\theta, z}[g(x;y;\theta;h(x,z),...)] = \int_{S_z}\int_{S_\theta}f_{\theta,z}(\theta, z)\,g(x;y;\theta;h(x,z),...)\, d\theta\, dz$$

and not $$\int_{S_z}\int_{S_\theta}\int_{S_x}\int_{S_y}f_{\theta,z,x,y}(\theta, z,x,y)\,g(x;y;\theta;h(x,z),...)\, d\theta\, dz\, dx\, dy.$$
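As a quick numerical illustration of that difference (all the distributions below are standard normals picked arbitrarily), the partial expectation is still a function of $(x, y)$ while the full expectation is a single number:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def g(x, y, theta, z):
    # an arbitrary stand-in for g(x; y; theta; h(x, z), ...)
    return x * theta + y * z**2

theta, z = rng.normal(size=N), rng.normal(size=N)

# E_{theta,z}[g]: average over theta and z only; the result depends on the fixed (x, y)
x0, y0 = 2.0, -1.0
partial = g(x0, y0, theta, z).mean()   # close to x0*E[theta] + y0*E[z^2] = -1

# E[g]: average over everything non-degenerate inside the brackets; a single number
x, y = rng.normal(size=N), rng.normal(size=N)
full = g(x, y, theta, z).mean()        # close to 0
print(partial, full)
```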

But it could also mean something else; on this matter, see also https://stats.stackexchange.com/a/72614/28746.