
This question has been asked before, but I'd like to come back to it because I want to point out a precise issue.

Suppose we want to estimate a function $f(x)$ from data $D = ((x_1,y_1),\dots,(x_n,y_n))$ with $y_i = f(x_i) + \xi_i$, $\xi_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2)$, by Gaussian process functional regression.

Let $X = (x_1,\dots,x_n)$ and $Y = (y_1,\dots,y_n)$.

The likelihood is $p(Y \mid X, f, \sigma) \propto \sigma^{-n}\prod_{i=1}^n e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}$.

We have a $\mathrm{GP}(m(x), k(x,x'))$ prior on $f$ with hyperparameters $m$ and $k$.

Generally speaking, we can have hyper-hyperparameters $\mathrm{M}$ and $\mathrm{K}$ for $m(x)$ and $k(x,x')$.

Therefore Bayes' rule reads

$$ p\left( f, \sigma, m, k, \mathrm{M}, \mathrm{K} \mid X, Y \right) \propto \sigma^{-n}\prod_{i=1}^n e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}\, \mathrm{GP}\left( m(x), k(x,x') \right) p(m \mid \mathrm{M})\, p(k \mid \mathrm{K})\, p(\mathrm{M})\, p(\mathrm{K})\, p(\sigma) $$

and that's all we should need.

The problem is that there is something more that does not appear in Bayes' rule at this point: the $\mathrm{GP}(m(x), k(x,x'))$ prior is used to assign the probability distribution $\mathcal{N}(m(X), k(X,X))$ to the random variable $f(X) \mid X, m, k$, which is used in the update equations to obtain the posterior Gaussian process.
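
For reference, by the update equations I mean the standard ones for the i.i.d.-noise case: for new inputs $X_*$,

$$\mathbb{E}\left[ f(X_*) \mid X, Y \right] = m(X_*) + k(X_*, X)\left[ k(X,X) + \sigma^2 I_n \right]^{-1}\left( Y - m(X) \right),$$

$$\operatorname{Cov}\left[ f(X_*) \mid X, Y \right] = k(X_*, X_*) - k(X_*, X)\left[ k(X,X) + \sigma^2 I_n \right]^{-1} k(X, X_*),$$

in which $m(X)$ and $k(X,X)$, i.e. the parameters of the $\mathcal{N}(m(X), k(X,X))$ distribution, appear explicitly.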

It seems there is no place for $f(X) \mid X, m, k$ in Bayes' rule, because we can only add hyperparameters and $f(X)$ is not such a hyperparameter but a function of the piece of data $X$. $f(X) \mid X, m, k$ doesn't look like a prior, because it is conditional on $X$ and depends on $X$; nor a likelihood, because the likelihood is that of $X, Y \mid f, \sigma$; nor a posterior, because all posteriors are conditional on $X, Y$.

How can we plug $f(X) \mid X, m, k$ into Bayes' rule above, and where, given that we can only add hyperparameters? If we can't, where does $\mathcal{N}(m(X), k(X,X))$ come from?

So my question is: can we find a set/logical conjunction of hyperparameters that makes the $\mathcal{N}(m(X), k(X,X))$ multivariate Gaussian appear somewhere in Bayes' rule?

Student
  • 39
  • @whuber Is "Should we better use usual but incorrect notations or unusual but correct notations?" a question I could ask for instance on META please? – Student Apr 06 '23 at 13:38
  • That's not a bad question for Meta: I don't recall that it has been addressed there before. There are, however, many discussions about what makes a good question and they include some thoughts about the problems with using purely mathematical notation to try to express statistical questions. – whuber Apr 06 '23 at 14:08
  • @whuber If mathematics causes trouble, we are in trouble! You know, EVERYTHING is a matter of (e.g. logical) invariance under some group action, and I just destroyed the invariance under the action of the permutation group $S_n$ by replacing the logical notations with tuple ones. That's extremely painful. Moreover, proper notations/data types are crucial for computer programming, and now we must cast those tuples of tuples in many ways, instead of just casting between Boolean propositions and vectors when we enter the linear algebra stage. – Student Apr 06 '23 at 14:44
  • @SextusEmpiricus Typo corrected, thanks. This something is the $N(m(X),k(X,X))$ multivariate Gaussian distribution. – Student Apr 06 '23 at 15:39
  • 1
    Mathematics causes no trouble here on CV. We are discussing how to communicate with people. Doing that well requires a combination of suitable notation, language, and an understanding of intellectual communities, their modes of thought, and terminologies. I believe that very few people will understand what you are trying to ask, even if you were to copy this post verbatim to [math.se]. It does you no good to protest and insinuate that we are ignorant or stupid, because (even supposing that were true) it doesn't solve the problem. – whuber Apr 06 '23 at 15:40
  • 1
    BTW, I (and evidently others) appreciate your efforts in editing this post to help us understand it, as evidenced by the removal of downvotes. That's good progress and suggests you're almost there. – whuber Apr 06 '23 at 15:44
  • @SextusEmpiricus $m(X)$ and $k(X,X)$ occur in the GP update equations and are the expectation and covariance matrix of the $N(m(X),k(X,X))$ distribution. Since Bayes rule deals with probability distributions only, if we introduce an expectation and covariance, we necessarily introduce the corresponding distribution. Thanks for the other typo. – Student Apr 06 '23 at 15:47
  • @SextusEmpiricus By definition, $N(m(X),k(X,X))$ is the probability distribution of $f(X)|X,m,k$. – Student Apr 07 '23 at 07:15
  • 1
    @SextusEmpiricus No, the likelihood is $p(X,Y \mid f,\sigma) = p(Y \mid X,f,\sigma)\, p(X \mid f,\sigma) = p(Y \mid X,f,\sigma)\, p(X) \propto p(Y \mid X,f,\sigma) \propto \sigma^{-n}\prod_{i=1}^n e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}$ – Student Apr 07 '23 at 09:12

2 Answers

5

The answer to the question:

where does $\mathcal{N}(m(X),k(X,X))$ come from ?

relies on two elementary pieces of information:

(1) Understanding how marginal posterior probabilities can be derived from Bayes' rule. Consider for example a model with two sets of unknown parameters $\theta$ and $\phi$, but with data $y$ that depends only on $\theta$, that is $P(y|\theta,\phi) = P(y|\theta)$. Applying Bayes' rule we have

$$ P(\theta,\phi | y ) \propto P(y|\theta) \pi(\theta,\phi)$$

where $\pi(\theta,\phi)$ denotes the joint prior of $\theta$ and $\phi$. Now we can integrate both sides with respect to $\phi$ to obtain the marginal posterior probability of $\theta$: $$ P(\theta| y ) \equiv \int d\phi\, P(\theta,\phi | y ) \propto P(y|\theta) \int d\phi\,\pi(\theta,\phi) \equiv P(y|\theta)\, \pi(\theta)$$

where $\pi(\theta) \equiv \int d\phi\,\pi(\theta,\phi)$ is the marginal prior of $\theta$.

In other words, the marginal posterior can simply be obtained by applying Bayes' rule to the marginal prior, when the likelihood depends only on a subset of the parameters:

$$ P(\theta| y ) \propto P(y|\theta) \pi(\theta).$$

(2) Understanding the marginal distributions of a Gaussian process: this in fact follows directly from the very definition of a Gaussian process: we say that a stochastic process $\{f(x) \mid x \in \mathcal{X}\}$ is Gaussian if and only if every finite subset $(f(x_1),f(x_2),\dots,f(x_n))$ has a (marginal) multivariate normal distribution. The notation $f \sim \mathcal{GP}(m,k)$ is therefore equivalent to saying that, for every finite set of points $x_1,\dots,x_n$, $(f(x_1),f(x_2),\dots,f(x_n)) \sim \mathcal N(\mathbf{m}, \mathbf{K})$, where $\mathbf{m} = (m(x_1),m(x_2),\dots,m(x_n))$ and $\mathbf{K}_{ij} = k(x_i,x_j)$.
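
As a minimal numerical sketch of this equivalence (the zero mean function, squared-exponential kernel and input grid below are illustrative choices, not part of the argument), drawing from the GP prior at a finite set of points is simply drawing from the corresponding multivariate normal:

```python
import numpy as np

# Illustrative choices: zero mean function and squared-exponential kernel
def m(x):
    return np.zeros_like(x)

def k(x, xp, lengthscale=1.0):
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 50)        # a finite set of input points x_1, ..., x_n
mean = m(X)                          # m = (m(x_1), ..., m(x_n))
K = k(X, X) + 1e-9 * np.eye(len(X))  # K_ij = k(x_i, x_j), with a small jitter for stability

# A draw of (f(x_1), ..., f(x_n)) from the GP prior is a draw from N(m, K)
f_sample = rng.multivariate_normal(mean, K)
```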

Combining the above two pieces of information leads us to the answer in a straightforward manner. In GP Regression, we are estimating an unknown function $f$, that is, a quantity that has an infinite number of degrees of freedom. (If you don't like thinking about infinite quantities, you can instead just think about a very large number $N$, e.g. by restricting $f$ to some grid points. This doesn't change anything about the logic).

The observed data $y_i$ only depend on the finite subset $\{f(x_i) \mid i=1,\dots,n\}$. If we want to find the marginal posterior of some other subset, e.g. $\{f(x_i) \mid i=n+1,\dots,n+m\}$, then we only need to know the marginal prior of $\mathbf{f}=(f(x_1),f(x_2),\dots,f(x_{n+m}))$, which is by definition multivariate normal. This is how we arrive at

$$ P(\mathbf{f} \mid \{y_i\} ) \propto \prod_{i=1}^n P(y_i \mid f(x_i)) \times \mathcal N(\mathbf f \mid \mathbf{m} , \mathbf{K})$$

where $\mathbf{m}$ and $\mathbf K$ are derived from the mean and covariance functions $m(x)$ and $k(x,x')$ as explained above.

Notice that if $P(y_i \mid f(x_i))$ is normal, then the marginal posterior probability of $\mathbf{f}$ is also multivariate normal, and this is true for any subset $\mathbf{f}$. But this is exactly the requirement for the posterior distribution of the entire function $f$ to be a Gaussian process itself! We have established, therefore, that in GP regression the posterior probability is also a GP.
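
A minimal sketch of the resulting computation, assuming a Gaussian likelihood $y_i \sim \mathcal N(f(x_i),\sigma^2)$, a zero mean function and a squared-exponential kernel (all of these, as well as the grids and variable names, are illustrative assumptions):

```python
import numpy as np

def k(x, xp, lengthscale=1.0):
    # Squared-exponential kernel, an arbitrary illustrative choice
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(1)
sigma = 0.1
X = np.linspace(0.0, 5.0, 8)                    # observed inputs x_1, ..., x_n
Y = np.sin(X) + sigma * rng.standard_normal(8)  # noisy observations y_i = f(x_i) + xi_i
Xs = np.linspace(0.0, 5.0, 100)                 # further inputs x_{n+1}, ..., x_{n+m}

# The marginal prior of (f(X), f(Xs)) is multivariate normal, so conditioning on Y
# is ordinary Gaussian conditioning (with the noise variance added on the diagonal).
K = k(X, X) + sigma**2 * np.eye(len(X))  # covariance of Y under the prior
Ks = k(Xs, X)                            # cross-covariance between f(Xs) and Y
Kss = k(Xs, Xs)

post_mean = Ks @ np.linalg.solve(K, Y)          # E[f(Xs) | Y], mean of the posterior normal
post_cov = Kss - Ks @ np.linalg.solve(K, Ks.T)  # Cov[f(Xs) | Y]
```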

J. Delaney
  • 5,380
1

where does ${\mathcal{N}}\left( {m\left( X \right),k\left( {X,X} \right)} \right)$ come from?

This occurs as a prior:

$$f(X) \mid X, m, k \sim \mathcal{N}\left( m(X), k(X,X) \right)$$

describes a distribution over the function values $f(X)$.

It does occur in your Bayes' rule:

$$ p\left( f, \sigma, m, k, \mathrm{M}, \mathrm{K} \mid X, Y \right) \propto \sigma^{-n}\prod_{i=1}^n e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}} \underbrace{\mathrm{GP}\left( m(x), k(x,x') \right)}_{\mathcal{N}\left( m(X), k(X,X) \right)} p(m \mid \mathrm{M})\, p(k \mid \mathrm{K})\, p(\mathrm{M})\, p(\mathrm{K})\, p(\sigma) $$
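
A minimal numerical sketch of how the underbraced factor can be evaluated, together with the likelihood term (the kernel, mean function, data and candidate $f(X)$ below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def k(x, xp, lengthscale=1.0):
    # Illustrative squared-exponential kernel
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / lengthscale**2)

sigma = 0.1
X = np.linspace(0.0, 5.0, 8)  # observed inputs
Y = np.sin(X)                 # observed outputs
f_X = np.cos(X)               # a candidate vector of function values f(X)

m_X = np.zeros_like(X)                  # m(X), here a zero mean function
K_XX = k(X, X) + 1e-9 * np.eye(len(X))  # k(X, X), jittered for numerical stability

# Log of the unnormalized posterior density at this f(X), for fixed m, k, sigma:
# Gaussian likelihood term plus the underbraced N(m(X), k(X,X)) prior term.
log_lik = norm.logpdf(Y, loc=f_X, scale=sigma).sum()
log_prior = multivariate_normal.logpdf(f_X, mean=m_X, cov=K_XX)
log_unnormalized_posterior = log_lik + log_prior
```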

  • 1
    No, the likelihood is $p(X,Y \mid f,\sigma) = p(Y \mid X,f,\sigma)\, p(X \mid f,\sigma) = p(Y \mid X,f,\sigma)\, p(X) \propto p(Y \mid X,f,\sigma) \propto \sigma^{-n}\prod_{i=1}^n e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}$ and has nothing to do with the $GP(m,k)$ functional prior. The kernel $k(x,x')$ is arbitrary; it could be an SE or a Matérn kernel, for instance. – Student Apr 07 '23 at 09:18
  • 1
    No, the likelihood deals with the measurement noise; the GP prior and its kernel deal with the function $f$ we want to model and estimate. – Student Apr 07 '23 at 09:25
  • 1
    No. Check the GP update equations in the i.i.d. noisy case, e.g. https://arxiv.org/pdf/1807.02582.pdf, p. 16. You'll find a $\sigma^2I_n$ diagonal matrix: that's the likelihood. And you'll find a $k(X,X)$ matrix and an $m(X)$ vector: that's the GP prior, with arbitrary $m$ and $k$. Two completely different and independent things; there is no relationship between the properties of the measurement noise and the properties of the function $f$. – Student Apr 07 '23 at 09:40
  • 2
    @SextusEmpiricus the model is $y_i \sim \mathcal N (f(x_i),\sigma^2)$ and $f(x) \sim \mathcal{GP}(m,k)$. Student is correct in that the prior is not related to the measurement error. $(f(x_1),f(x_2),...,f(x_n)) \sim \mathcal N (\mathbf{m}, \mathbf{K} )$ is the marginal prior, by definition. – J. Delaney Apr 07 '23 at 10:27
  • Ah, that's another interpretation. You have $f+\epsilon$ where $f$ is considered to follow a prior defined by a Gaussian process and $\epsilon$ is additional noise. – Sextus Empiricus Apr 07 '23 at 10:28
  • @J.Delaney Thanks. I'm gonna write a sequel to my previous question. The conclusion was that GP regression returns the same posterior process for any partition $D = \bigcup\limits_{k = 1}^p {{D_k}}$ of the data but with completely different algorithms, e.g. in $O(n^3)$ with a full update and exponential in $n$ with sequential updates. But if the likelihood factorizes: $p\left( {\left. D \right|\Theta } \right) = \prod\limits_{k = 1}^p {p\left( {\left. {{D_k}} \right|\Theta } \right)} $ with e.g. i.i.d Gaussian noise, then we have... – Student Apr 07 '23 at 12:34
  • @... $p\left( \Theta \mid D \right) \propto p\left( D_p \mid \Theta \right)\cdots\underbrace{p\left( D_2 \mid \Theta \right)\underbrace{p\left( D_1 \mid \Theta \right)p\left( \Theta \right)}_{p\left( \Theta \mid D_1 \right)}}_{p\left( \Theta \mid D_2 \right)}$ Therefore, we should not only have the same posterior process but also the very same calculations (modulo different useless parentheses in the product over the data). – Student Apr 07 '23 at 12:37
  • @J.Delaney Oops:

    $p\left( \Theta \mid D \right) \propto p\left( D_p \mid \Theta \right)\cdots\underbrace{p\left( D_2 \mid \Theta \right)\underbrace{p\left( D_1 \mid \Theta \right)p\left( \Theta \right)}_{p\left( \Theta \mid D_1 \right)}}_{p\left( \Theta \mid D_2, D_1 \right)}$

    – Student Apr 07 '23 at 12:46