
I have come across this conditional expansion a few times, and I can't seem to make sense of it.

$$p(z|y) = \int{p(z|f)p(f|y)df}$$

I would go about it like this:

\begin{align} \require{cancel} p(z|y) & = \frac{p(z,y)}{p(y)} \\
& = \frac{\int{p(z,f,y)df}}{p(y)} \\
& = \frac{\int{\cancel{p(y)}p(f|y)p(z|f,y)df}}{\cancel{p(y)}} \\
& = \int{p(z|f,\color{red}{y})p(f|y)df} \neq \int{p(z|f)p(f|y)df} \end{align}

How is it that during the expansion we can drop the conditioning on $y$ from $p(z|f,y)$? I've seen this in a lot of papers on variational inference, and it's on the Wikipedia page for Bayesian inference: $\hspace{1em} p(\tilde{x}|X,\alpha) = \int{p(\tilde{x}|\theta)p(\theta|X,\alpha)d\theta}.\hspace{1em}$ Why isn't the first factor in the integral $p(\tilde{x}|X,\alpha,\theta)$? I feel like I am missing something fundamental about conditioning which allows shuffling the conditionals around like this.

In this similar question, the conditioned-on variable is present in all subsequent factors; why doesn't that happen in the above cases?

logan
  • Yes, it is confusing. My take on it is that since $z$ and $y$ are data, conditional upon $f$, $z$ and $y$ are assumed independent - in which case, $p(z|f,y) = p(z|f)$. You don't want to build models $f$ such that there is still information in the observed data $y$ that is useful for helping to predict $z$ even knowing $f$, writing a little loosely, so we assume you don't. – jbowman Jun 04 '22 at 22:05
  • @jbowman maybe you can convert your comment into an answer? – gunes Jun 05 '22 at 08:36
  • @gunes - have done, thanks! – jbowman Jun 05 '22 at 13:46

1 Answer


Yes, it is confusing. However, there is some logic behind it.

Since $z$ and $y$ are data, they are assumed to be independent conditional upon $f$, in which case $p(z|f,y)=p(z|f)$. As a concrete example, if you know that $z$ and $y$ are both distributed according to $f = \mathrm{Negative\ Binomial}(3, 0.7)$, then $y$ contains no information about $z$ that you don't already have from knowing the distribution $f$; therefore, $p(z|f,y) = p(z|f)$.
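
Plugging this conditional independence back into your derivation recovers exactly the expression you kept seeing:

$$p(z|y) = \int{p(z|f,y)p(f|y)df} = \int{p(z|f)p(f|y)df},$$

where the second equality uses $z \perp y \mid f$, i.e. $p(z|f,y)=p(z|f)$.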

As to whether you can achieve this happy ideal in practice: you don't want to build models $f$ such that there is still information in the observed data $y$ that is useful for helping to predict $z$ even knowing $f$ (writing a little loosely), so we assume you don't.
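
If it helps, here is a minimal simulation sketch of the Negative Binomial example above (the particular values checked, $z = 2$ and $y = 0$, are arbitrary choices): because $z$ and $y$ are drawn independently once $f$ is known, conditioning on the observed $y$ does not change the distribution of $z$.

```python
import numpy as np

rng = np.random.default_rng(0)

# f is known: z and y are each Negative Binomial(3, 0.7), independent given f.
n = 500_000
z = rng.negative_binomial(3, 0.7, size=n)
y = rng.negative_binomial(3, 0.7, size=n)

# Estimate p(z = 2 | f) from the marginal frequency of z = 2 ...
p_z = np.mean(z == 2)

# ... and p(z = 2 | f, y = 0) from the frequency of z = 2 among draws with y = 0.
p_z_given_y = np.mean(z[y == 0] == 2)

print(p_z, p_z_given_y)  # the two estimates agree up to Monte Carlo error
```

The two estimates match up to simulation noise, which is just $p(z|f,y)=p(z|f)$ in empirical form.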

jbowman
  • This particular example comes from sparse Gaussian Processes, where $y$ is a noisy observation of the underlying $f$, and we are trying to learn the best inducing points $z$. Is this independence saying that $z$ is independent of $y$, as long as I condition on $f$, because the information of $y$ is contained in/mediated through $f$? Will most latent variable models break down like this? – logan Jun 05 '22 at 17:59
  • There's nothing special, in this regard at any rate, about latent variable models. The model is conditional upon the observed data, and the unobserved data distribution is conditional upon the model; integrating out the model makes the unobserved data distribution conditional upon the observed data. – jbowman Jun 05 '22 at 18:11
  • Thanks, there is some serious probability-fu that I need to wrap my head around here! – logan Jun 05 '22 at 18:19