
In these course notes, we are given the posterior distribution: https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf


This famous result can be found in many other places, such as https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf. However, I am confused about the coefficient (i.e., the $\dfrac{1}{\sqrt{2\pi}\sigma}$ term) in front of the Gaussian for the posterior distribution. Here is what I am having trouble with.


We know that given $\mathcal{D} = (x_1, \ldots, x_n), x_i$ iid,

$$p(\mu|\mathcal{D}) \propto p(D|\mu)p(\mu)$$

Suppose that $$p(\mu) = \dfrac{1}{\sqrt{2\pi}\sigma_o} \exp\left[-\dfrac{(\mu - \mu_o)^2}{2\sigma_o^2}\right]$$ and $$p(D|\mu)= \dfrac{1}{(2\pi\sigma^2)^\frac{n}{2}} \exp\left[-\dfrac{1}{2\sigma^2}\sum\limits_{i = 1}^n (x_i - \mu)^2\right]$$

then multiplying the expressions together, we obtain:

$$\dfrac{1}{(2\pi\sigma^2)^\frac{n}{2}} \exp{\left[-\dfrac{1}{2\sigma^2}\sum\limits_{i = 1}^n (x_i - \mu)^2\right]} \dfrac{1}{\sqrt{2\pi}\sigma_o} \exp{\left[-\dfrac{(\mu - \mu_o)^2}{2\sigma_o^2}\right]}$$

While we can complete the square inside the exponential, what about the constants $\dfrac{1}{(2\pi\sigma^2)^\frac{n}{2}}$ and $\dfrac{1}{\sqrt{2\pi}\sigma_o}$?

I don't see how $\dfrac{1}{(2\pi\sigma^2)^\frac{n}{2}} \cdot \dfrac{1}{\sqrt{2\pi}\sigma_o} = \dfrac{1}{\sqrt{2\pi}\sigma_n^2}$

where $\sigma_n^2 = (1/\sigma_o^2 + n/\sigma^2)^{-1}$ as shown in Lemma 6.

I tried playing around with the terms but I couldn't make them equal, even when $n = 1$. Have I made a mistake or is this because $p(\mu|\mathcal{D})$ is not a "true" probability distribution (i.e., doesn't integrate to 1)?

If so, how would people deal with this leading coefficient during simulation?

Addendum (see comment)

Bishop, Pattern Recognition and Machine Learning (2006), p. 98


  • It's just a constant of integration. Note the symbol "$\propto$" in your initial expression; it means "proportional to", as in "don't worry about the constant of integration until you're done with everything else, then it's whatever is needed to make the posterior integrate to one". – jbowman Jan 26 '18 at 04:25
  • It is $1/\sqrt{2\pi}\sigma_n$. – jbowman Jan 26 '18 at 04:28
  • To expand on that... if the constant is not $1/\sqrt{2\pi}\sigma_n$, what it really means is that $p(\mu|D)$ doesn't integrate to 1, which tells you that you have the wrong constant. The kernel / functional form is still that of a Gaussian distribution, though. – jbowman Jan 26 '18 at 04:36
  • @StackexchangeHouseNinja Did you read jbowman's comments, especially the first one? What do you think "$\propto$" means in $p(\mu \mid D) \propto p(\mu) p(D \mid \mu)$? – Juho Kokkala Jan 26 '18 at 20:09
  • @JuhoKokkala A distribution satisfies a list of properties, one of which is that the integral from $-\infty$ to $\infty$ is 1. Hence $p(\mu|D)$ is either a distribution, or it is proportional to some distribution but is not a true distribution. The fundamental question is the equation after "I don't see how...". Does the equality hold true? I said I have tried to show that the LHS equals the RHS, but I cannot seem to prove they are equal. I don't think anyone has clearly addressed this problem. – Shamisen Expert Jan 26 '18 at 20:21
  • @JuhoKokkala In other sources, it is not written $p(\mu|D) \propto p(\mu)p(D|\mu)$, but $p(\mu|D) = \mathcal{N}(\mu_n, \sigma_n)$. So I am not going to take "proportional to" on faith. Plus, I need to simulate these functions, and I must understand what "proportional to" means. Proportional with what constant? I just want to show that $p(\mu|D)$ is indeed Gaussian (or not). – Shamisen Expert Jan 26 '18 at 20:23
  • Does this help: https://stats.stackexchange.com/questions/64364/ – Juho Kokkala Jan 26 '18 at 20:51
  • You ask "does the equality hold true", but there was never any reason to think it did, since the original expression doesn't have an $=$ sign in it but instead has a $\propto$ sign in it. The LHS doesn't equal the RHS, and no-one has claimed it did. The LHS is proportional to the RHS. – jbowman Jan 26 '18 at 23:51
  • @jbowman The thing is that in other references, such as the widely used textbook Bishop's pattern recognition (2006), on page 98, equation (2.140) is written with an equality, i.e., $p(\mu|D) = \mathcal{N}(\mu|\mu_N, \sigma^2_N)$. This expression could only mean one thing, which is that the coefficient must be written as $1/(\sqrt{2\pi}\sigma_N^2)$ – Shamisen Expert Jan 27 '18 at 02:18
  • @jbowman See addendum. I have uploaded that portion of the text – Shamisen Expert Jan 27 '18 at 02:22
  • I suspect I see your problem. It appears to me that you think $p(D|\mu)p(\mu)$ is a Normal distribution. It isn't. It's proportional to a Normal distribution. That is why we use the $\propto$ symbol instead of the $=$ symbol in the expression $p(\mu|D) \propto p(D|\mu)p(\mu)$. More specifically, $p(D|\mu)p(\mu) \propto N(\mu_n, \sigma^2_n) = p(\mu|D)$. – jbowman Jan 27 '18 at 02:25
  • You won't be able to transform $p(D|\mu)p(\mu)$ into a $N(\mu_n, \sigma^2_n)$ no matter how hard you try, because they are not equal. They are proportional to each other. – jbowman Jan 27 '18 at 02:37
  • @jbowman It makes sense, but I will still search for an additional reference. The thing is I am trying to simulate/plot $p(\mu|D)$. I am currently using $p(\mu|D) = \mathcal{N}(\mu, \sigma^2)*\mathcal{N}(\mu_o, \sigma_o^2)$. However, I also suspect I might need to simulate $p(\mu|D) = \mathcal{N}(\mu_n, \sigma^2_n)$ instead. – Shamisen Expert Jan 27 '18 at 02:37
  • @jbowman What do you think of the second answer at this link: https://stats.stackexchange.com/questions/64364/why-is-posterior-density-proportional-to-prior-density-times-likelihood-function? It says that $p(\mu|D)$ is not a distribution, so $p(\mu|D) \neq \mathcal{N}(\mu_n,\sigma_n^2)$, specifically "The consequence of discarding $P(y)$ is that now the density $P(\theta | y)$ has lost some properties like integration to 1 over the domain of $\theta$" – Shamisen Expert Jan 27 '18 at 02:37
  • The second answer is wrong. $p(\mu|D)$ is a distribution, otherwise we'd denote it by $g(\mu|D)$ or something. Its wrongness is due to mere sloppiness, though, because reading through it's clear the author understood the fundamental point, which is the difference between "proportional to" and "equal to". By discarding $p(y)$ we've lost the $=$ sign and must replace it with a $\propto$ symbol, but the l.h.s. is unchanged and still a distribution. In your case (to compress two comments) you should definitely simulate $p(\mu|D)$ by $N(\mu_n, \sigma^2_n)$, because that's the real distribution. – jbowman Jan 27 '18 at 02:41
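
The point made in the comments above can be checked numerically. The following is only a minimal sketch, assuming NumPy is available; the data, the parameter values, and the helper names (`posterior_params`, `normal_pdf`, `unnormalized_posterior`) are made up for illustration and do not come from the lecture notes. The formula for $\mu_n$ is the standard conjugate-update result, and $\sigma_n^2$ is the one quoted in the question. The sketch evaluates $p(D|\mu)p(\mu)$ on a grid, normalizes it by a numerical approximation of $p(D)$, and compares the result with $\mathcal{N}(\mu_n, \sigma_n^2)$.

```python
import numpy as np


def posterior_params(x, sigma, mu0, sigma0):
    """Closed-form posterior mean and variance for a Gaussian mean
    with known noise standard deviation sigma and prior N(mu0, sigma0^2)."""
    n = len(x)
    var_n = 1.0 / (1.0 / sigma0**2 + n / sigma**2)
    mu_n = var_n * (mu0 / sigma0**2 + np.sum(x) / sigma**2)
    return mu_n, var_n


def normal_pdf(z, mean, var):
    """Density of N(mean, var) evaluated at z (broadcasts over arrays)."""
    return np.exp(-((z - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def unnormalized_posterior(mu_grid, x, sigma, mu0, sigma0):
    """Likelihood times prior, p(D|mu) p(mu), for every mu in mu_grid."""
    # Rows index data points, columns index candidate values of mu.
    likelihood = np.prod(normal_pdf(x[:, None], mu_grid[None, :], sigma**2), axis=0)
    return likelihood * normal_pdf(mu_grid, mu0, sigma0**2)


# Illustrative (made-up) setup.
rng = np.random.default_rng(0)
sigma, mu0, sigma0 = 2.0, 0.0, 3.0
x = rng.normal(1.5, sigma, size=5)

mu_n, var_n = posterior_params(x, sigma, mu0, sigma0)
sd_n = np.sqrt(var_n)
mu_grid = np.linspace(mu_n - 8.0 * sd_n, mu_n + 8.0 * sd_n, 4001)

unnorm = unnormalized_posterior(mu_grid, x, sigma, mu0, sigma0)

# Riemann-sum approximation of the evidence p(D), the missing constant.
evidence = np.sum(unnorm) * (mu_grid[1] - mu_grid[0])
posterior_numeric = unnorm / evidence
posterior_closed = normal_pdf(mu_grid, mu_n, var_n)

# The ratio is the same for every mu: that is what "proportional to" means.
ratio = unnorm / posterior_closed

# For simulation, draw directly from the normalized posterior.
samples = rng.normal(mu_n, sd_n, size=10_000)

print("max abs difference between densities:",
      np.max(np.abs(posterior_numeric - posterior_closed)))
print("relative spread of ratio:", np.ptp(ratio) / ratio[0])
```

Under these assumptions, the ratio `unnorm / posterior_closed` should be flat across the grid, and that common value is (up to numerical error) $p(D)$. For plotting or sampling, it is the normalized $\mathcal{N}(\mu_n, \sigma_n^2)$ that is used, so the leading constants of the prior and likelihood never need to be tracked by hand.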

1 Answer


So, just to note, $\Pr(\mu|D)$ is a real probability distribution or at least a distribution function. Read lines 8-12 in the commentary of

https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf

which you posted and you should be able to resolve your questions.
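
For readers who want the constant spelled out, here is a sketch of the standard completing-the-square computation, assuming known $\sigma$, $n$ observations $x_1, \ldots, x_n$, and the prior $\mathcal{N}(\mu_o, \sigma_o^2)$. The symbol $C(D)$ is introduced here only for illustration. Define

$$\dfrac{1}{\sigma_n^2} = \dfrac{1}{\sigma_o^2} + \dfrac{n}{\sigma^2}, \qquad \mu_n = \sigma_n^2\left(\dfrac{\mu_o}{\sigma_o^2} + \dfrac{1}{\sigma^2}\sum\limits_{i=1}^n x_i\right).$$

Collecting the terms in $\mu$ in the exponent of $p(D|\mu)p(\mu)$ gives

$$-\dfrac{1}{2\sigma^2}\sum\limits_{i=1}^n (x_i - \mu)^2 - \dfrac{(\mu - \mu_o)^2}{2\sigma_o^2} = -\dfrac{(\mu - \mu_n)^2}{2\sigma_n^2} + \dfrac{\mu_n^2}{2\sigma_n^2} - \dfrac{1}{2\sigma^2}\sum\limits_{i=1}^n x_i^2 - \dfrac{\mu_o^2}{2\sigma_o^2},$$

so that

$$p(D|\mu)\,p(\mu) = C(D)\cdot\dfrac{1}{\sqrt{2\pi}\sigma_n}\exp\left[-\dfrac{(\mu - \mu_n)^2}{2\sigma_n^2}\right], \qquad C(D) = \dfrac{\sigma_n}{\sigma_o}\,\dfrac{1}{(2\pi\sigma^2)^\frac{n}{2}}\exp\left[\dfrac{\mu_n^2}{2\sigma_n^2} - \dfrac{1}{2\sigma^2}\sum\limits_{i=1}^n x_i^2 - \dfrac{\mu_o^2}{2\sigma_o^2}\right].$$

The factor $C(D)$ does not depend on $\mu$. Since the Gaussian factor integrates to 1 over $\mu$, $C(D) = \int p(D|\mu)p(\mu)\,d\mu = p(D)$, and dividing by it in Bayes' rule leaves exactly $p(\mu|D) = \mathcal{N}(\mu|\mu_n, \sigma_n^2)$. The leading constants of the prior and the likelihood therefore do not need to multiply to $1/(\sqrt{2\pi}\sigma_n)$; the mismatch is absorbed into $p(D)$.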

Dave Harris
  • However, he or she posted the solution himself or herself. It was just overlooked, buried inside his or her own reference. – Dave Harris Jan 26 '18 at 20:55