
I see this expectation in a lot of machine learning literature:

$$\mathbb{E}_{p(\mathbf{x};\mathbf{\theta})}[f(\mathbf{x};\mathbf{\phi})] = \int p(\mathbf{x};\mathbf{\theta}) f(\mathbf{x};\mathbf{\phi}) d\mathbf{x}$$

For example, in the context of neural networks, a slightly different version of this expectation is used as a cost function that is computed using Monte Carlo integration.
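To make the Monte Carlo integration concrete, here is a minimal sketch (the Gaussian $p$ and quadratic $f$ are placeholder choices of mine, not from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

theta, phi = 1.0, 0.5  # illustrative parameter values

def f(x, phi):
    # an arbitrary integrand f(x; phi), chosen only for this sketch
    return (x - phi) ** 2

# draw samples x_i ~ p(x; theta); p is taken to be N(theta, 1), again just for illustration
x = rng.normal(loc=theta, scale=1.0, size=100_000)

# Monte Carlo integration: E_p[f(x; phi)] is approximated by (1/N) * sum_i f(x_i; phi)
estimate = f(x, phi).mean()
print(estimate)  # close to the exact value 1 + (theta - phi)**2 = 1.25 for these choices
```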

However, I am a bit confused about the notation that is used, and would highly appreciate some clarity. In classical probability theory, the expectation:

$$\mathbb{E}[X] = \int_x x \cdot p(x) \ dx$$

indicates the "average" value of the random variable $X$. Taking it a step further, the expectation:

$$\mathbb{E}[g(X)]=\int_x g(x) \cdot p(x) \ dx$$

indicates the "average" value of the random variable $Y=g(X)$. From this, it seems that the expectation:

$$\mathbb{E}_{p(\mathbf{x};\mathbf{\theta})}[f(\mathbf{x};\mathbf{\phi})]$$

is shorthand for, and means the same as,

$$\mathbb{E}_{\mathbf{x}}[f(\mathbf{x};\mathbf{\phi})]$$

where

$$ \mathbf{x} \sim p(\mathbf{x};\mathbf{\theta})$$

and this indicates the average value of the random vector $\mathbf{y} = f(\mathbf{x};\mathbf{\phi})$. Is this correct?
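For instance, taking $p(x;\theta) = \mathcal{N}(x;\theta,1)$ and $f(x;\phi) = (x-\phi)^2$ (choices made up purely for illustration), both readings give

$$\mathbb{E}_{p(x;\theta)}[f(x;\phi)] = \mathbb{E}_{\mathbf{x}}\left[(x-\phi)^2\right] = \operatorname{Var}(x) + \left(\mathbb{E}[x]-\phi\right)^2 = 1 + (\theta-\phi)^2,$$

where $x \sim p(x;\theta)$, so the value of the expectation depends on $\theta$ even though $\theta$ does not appear inside the brackets.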

By this logic, would this statement be correct too?

$$\mathbb{E}[X] = \mathbb{E}_{p(X)}[X]$$

    Re "Is shorthand for and the same as": Not quite. Notice that the original expression explicitly mentions $\theta$ while the subsequent one does not. – whuber Sep 11 '20 at 18:35
  • You got it right! This is quite a confusing notation. I prefer to use the notation $$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}|\theta)}[X].$$ – MachineLearner Sep 11 '20 at 19:59
  • Hey @whuber, the parameter $\theta$ appears when I say that $\mathbf{x} \sim p(\mathbf{x};\theta)$. What I meant was that the original notation is shorthand for the expectation together with the statement $\mathbf{x} \sim p(\mathbf{x};\theta)$. What do you think? – mhdadk Sep 11 '20 at 20:09
  • I think you need to rely on the conventions and context established by the author. There is no universal notation. – whuber Sep 11 '20 at 20:12
  • That is actually my struggle. I rarely find authors who establish conventions in papers. – mhdadk Sep 11 '20 at 20:18
  • $\mathbb E[\mathbf X]$ is ambiguous, while $$\mathbb{E}_{\mathbf{X} \sim p(\mathbf{x}|\theta)}[X]$$ and $$\mathbb{E}_{p(\cdot|\theta)}[X]$$ and $$\mathbb{E}_{p(\mathbf{x}|\theta)}[X]$$ are not. This is particularly true when considering varying values of a parameter $\theta$, such as $$\mathbb{E}_{p(\cdot;\mathbf{\theta})}[\log p(\mathbf{X};\mathbf{\phi})]$$ found e.g. in the EM algorithm. – Xi'an Sep 12 '20 at 08:08
  • Thanks @Xi'an for the help! – mhdadk Sep 12 '20 at 15:22
  • Just out of curiosity, you mentioned neural networks and MC integration. I'm not familiar with such a model; could you elaborate? – jbuddy_13 Sep 13 '20 at 05:34
  • Hi @jbuddy_13, in a classical neural network architecture, the posterior probability of the classes $\mathbf{y}=[y_1,y_2,...,y_K]$ given an input feature vector $\mathbf{x}$ is $p(\mathbf{y}|\mathbf{x};\mathbf{w})$, where $\mathbf{w}$ are the parameters of the network. Note that $\mathbf{y}$ is one-hot encoded. This posterior probability is estimated using maximum likelihood estimation, and therefore the objective is to maximize $\mathbb{E}_{p(\mathbf{x},\mathbf{y})}[\log p(\mathbf{y}|\mathbf{x};\mathbf{w})]$. – mhdadk Sep 13 '20 at 12:09
  • ...This is done by sampling from $p(\mathbf{x},\mathbf{y})$ and then using MC integration to compute the expectation. See here for a similar derivation for logistic regression. – mhdadk Sep 13 '20 at 12:10
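To make the Monte Carlo point in the last two comments concrete, here is a minimal sketch (a single linear softmax layer on made-up data; every name and number is illustrative, not from a specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

# a minibatch (x_i, y_i) ~ p(x, y); drawn from an arbitrary joint for illustration
N, D, K = 256, 5, 3                  # batch size, feature dim, number of classes
X = rng.normal(size=(N, D))          # input feature vectors
y = rng.integers(0, K, size=N)       # integer-coded class labels
W = 0.1 * rng.normal(size=(D, K))    # network parameters w (one linear layer here)

# log p(y | x; w) via a numerically stable log-softmax
logits = X @ W
logits -= logits.max(axis=1, keepdims=True)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Monte Carlo estimate of E_{p(x,y)}[log p(y | x; w)]:
# the average per-example log-likelihood over the sampled minibatch
mc_objective = log_probs[np.arange(N), y].mean()
print(mc_objective)  # maximizing this is minimizing the usual cross-entropy loss
```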

1 Answer


The expression

$$\mathbb E[g(x;y;\theta;h(x,z),...)]$$

always means "the expected value with respect to the joint distribution of all things having a non-degenerate distribution inside the brackets."

Once you start putting subscripts on $\mathbb E$, you specify a possibly "narrower" joint distribution over which you want (for your own reasons) to average. For example, if you wrote $$\mathbb E_{\theta, z}[g(x;y;\theta;h(x,z),...)]$$ I would be inclined to believe that you mean only

$$\mathbb E_{\theta, z}[g(x;y;\theta;h(x,z),...)] = \int_{S_z}\int_{S_\theta}f_{\theta,z}(\theta, z)\,g(x;y;\theta;h(x,z),...)\, d\theta\, dz$$

and not $$\int_{S_z}\int_{S_\theta}\int_{S_x}\int_{S_y}f_{\theta,z,x,y}(\theta, z,x,y)\,g(x;y;\theta;h(x,z),...)\, d\theta\, dz\, dx\, dy.$$
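As a quick numerical illustration of that difference (all the distributions below are standard normals picked arbitrarily), the partial expectation is still a function of $(x, y)$ while the full expectation is a single number:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def g(x, y, theta, z):
    # an arbitrary stand-in for g(x; y; theta; h(x, z), ...)
    return x * theta + y * z**2

theta, z = rng.normal(size=N), rng.normal(size=N)

# E_{theta,z}[g]: average over theta and z only; the result depends on the fixed (x, y)
x0, y0 = 2.0, -1.0
partial = g(x0, y0, theta, z).mean()   # close to x0*E[theta] + y0*E[z^2] = -1

# E[g]: average over everything non-degenerate inside the brackets; a single number
x, y = rng.normal(size=N), rng.normal(size=N)
full = g(x, y, theta, z).mean()        # close to 0
print(partial, full)
```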

But it could also mean something else; on this matter, see also https://stats.stackexchange.com/a/72614/28746.