This question is a sequel to my previous one, *Does Gaussian process functional regression fulfill the consistency condition?*
The conclusion was that:
- Gaussian process regression with i.i.d. Gaussian noise returns the same posterior Gaussian process for any partition of the data;
- ... but with completely different calculations/algorithms. In particular, GP regression with a full $n$-update (i.e. the trivial partition) has $O(n^3)$ generic computational complexity, while GP regression with $n$ sequential $1$-updates (i.e. the atomic partition) has computational complexity that is exponential in $n$. That is why, in sequential/online learning, we never do $n$ sequential $1$-updates but an $(n-1)$-update followed by a $1$-update; see e.g. *Using Gaussian Processes to learn a function online*. A numerical check of these two points is sketched right after this list.
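Here is a minimal numpy sketch of those two points (my own toy example with a zero-mean prior, an RBF kernel and hypothetical helper names, not the code from the linked threads). It checks that a full $n$-update and an $(n-1)$-update followed by a $1$-update return the same posterior mean and covariance at some test points, even though the intermediate calculations differ:

```python
# Toy check: a full 8-update and a 7-update followed by a 1-update give the same GP posterior.
import numpy as np

def rbf(A, B, ell=1.0):
    # squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 ell^2)) for 1-d inputs
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

def gp_posterior(X, y, Xs, sigma, mean=None, kernel=rbf):
    """Posterior mean/covariance at test points Xs given data (X, y) with i.i.d. Gaussian
    noise of sd sigma, starting from the prior GP described by `mean` and `kernel`."""
    if mean is None:
        mean = lambda Z: np.zeros(len(Z))              # zero-mean prior by default
    K = kernel(X, X) + sigma ** 2 * np.eye(len(X))
    Ks = kernel(Xs, X)
    mu = mean(Xs) + Ks @ np.linalg.solve(K, y - mean(X))
    cov = kernel(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov

rng = np.random.default_rng(0)
X, sigma = rng.uniform(-3, 3, 8), 0.1
y = np.sin(X) + sigma * rng.standard_normal(8)
Xs = np.linspace(-3, 3, 5)

# trivial partition: one full 8-update
mu_full, cov_full = gp_posterior(X, y, Xs, sigma)

# 7-update followed by a 1-update: condition on the first 7 points,
# then use the resulting posterior GP as the prior for the last point
X1, y1, X2, y2 = X[:7], y[:7], X[7:], y[7:]
K11 = rbf(X1, X1) + sigma ** 2 * np.eye(7)
m1 = lambda Z: rbf(Z, X1) @ np.linalg.solve(K11, y1)
k1 = lambda A, B: rbf(A, B) - rbf(A, X1) @ np.linalg.solve(K11, rbf(X1, B))
mu_seq, cov_seq = gp_posterior(X2, y2, Xs, sigma, mean=m1, kernel=k1)

print(np.allclose(mu_full, mu_seq), np.allclose(cov_full, cov_seq))  # True True
```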
Now, consider a Bayesian problem with data $D = \left( {{d_1},...,{d_n}} \right)$ and parameters $\Theta $:
$p\left( {\left. \Theta \right|D} \right) \propto p\left( {\left. D \right|\Theta } \right)p\left( \Theta \right)$
Proposition $1$: if the likelihood factorizes $p\left( {\left. D \right|\Theta } \right) = \prod\limits_{i = 1}^n {p\left( {\left. {{d_i}} \right|\Theta } \right)} $ and $\Theta$ is fixed once and for all
then the posterior calculations are exactly the same for any partition $D = \bigcup\limits_{j = 1}^p {{D_j}} $ of the data and any of the $p!$ orderings of its blocks.
Proof: we have
$p\left( {\left. \Theta \right|D} \right) \propto p\left( {\left. D \right|\Theta } \right)p\left( \Theta \right) = \left( {p\left( {\left. {{D_p}} \right|\Theta } \right)...\underbrace {\left( {p\left( {\left. {{D_2}} \right|\Theta } \right)\underbrace {\left( {p\left( {\left. {{D_1}} \right|\Theta } \right)p\left( \Theta \right)} \right)}_{ \propto p\left( {\left. \Theta \right|{D_1}} \right)}} \right)}_{ \propto p\left( {\left. \Theta \right|{D_1},{D_2}} \right)}...} \right)$
Therefore, the only differences from one partition to another, and from one ordering to another, are the parentheses and the order of the factors, which are immaterial by the associativity and commutativity of the product. QED.
Proposition 1 just says that the likelihood $\prod\limits_{i = 1}^n {p\left( {\left. {{d_i}} \right|\Theta } \right)} $ and the full posterior remain the same regardless of how the data are grouped together and of their order of arrival.
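As a numerical sanity check of Proposition 1, here is a minimal sketch (my own toy example: a conjugate Normal model for a scalar $\Theta$ with known noise standard deviation, so the posterior is available in closed form) showing that the posterior is identical for the trivial partition, a two-block partition, and the reversed order of its blocks:

```python
# Toy check of Proposition 1 with a conjugate Normal model: same posterior for any partition/ordering.
import numpy as np

def update(mu0, tau0, d, sigma):
    """Conjugate update of a N(mu0, tau0^2) prior on Theta given data d_i ~ N(Theta, sigma^2)."""
    prec = 1.0 / tau0 ** 2 + len(d) / sigma ** 2            # posterior precision
    mu = (mu0 / tau0 ** 2 + d.sum() / sigma ** 2) / prec    # posterior mean
    return mu, prec ** -0.5                                 # posterior mean and sd

rng = np.random.default_rng(1)
d, sigma = rng.normal(2.0, 1.0, 10), 1.0

print(update(0.0, 1.0, d, sigma))            # trivial partition: one full 10-update
mu, tau = update(0.0, 1.0, d[:3], sigma)     # partition {d_1,...,d_3}, {d_4,...,d_10} ...
print(update(mu, tau, d[3:], sigma))
mu, tau = update(0.0, 1.0, d[3:], sigma)     # ... and the same blocks in the reverse order
print(update(mu, tau, d[:3], sigma))         # all three printed posteriors coincide
```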
Corollary $1$: GP regression with i.i.d. Gaussian noise is not a Bayesian method.
Proof: We have ${d_i} = \left( {{x_i},{y_i}} \right)$ and for i.i.d. Gaussian noise the likelihood factorizes
$p\left( {\left. D \right|\Theta } \right) = \prod\limits_{i = 1}^n {p\left( {\left. {{x_i},{y_i}} \right|f,\sigma } \right) = } \prod\limits_{i = 1}^n {p\left( {\left. {{y_i}} \right|{x_i},f,\sigma } \right)p\left( {\left. {{x_i}} \right|f,\sigma } \right)} \propto \prod\limits_{i = 1}^n {p\left( {\left. {{y_i}} \right|{x_i},f,\sigma } \right)} \propto {\sigma ^{ - n}}\prod\limits_{i = 1}^n {{e^{ - \frac{{{{\left( {{y_i} - f\left( {{x_i}} \right)} \right)}^2}}}{{2{\sigma ^2}}}}}} $
Moreover, $\Theta$ is fixed once and for all: $\Theta = \left( {f,\sigma ,m,k,{\rm M},{\rm K}} \right)$; see *Is Gaussian process functional regression a truly Bayesian method (again)?*
But the posterior calculations are not the same from one partition/update scheme to another, as recalled at the top of this question, which contradicts Proposition 1: the GP regression calculations cannot be those of Bayesian updating with a factorized likelihood and fixed $\Theta$. QED.
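The factorization used in this proof is easy to check numerically. Here is a minimal sketch (toy data, with a fixed sample path $f$ and fixed $\sigma$ treated as given) showing that the product of the $n$ Gaussian factors is the same number for any ordering or grouping of the data:

```python
# Toy check: the factorized Gaussian likelihood is invariant to ordering/grouping of the data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 6)
f, sigma = np.sin, 0.2                       # a fixed f (standing in for one sample path) and fixed sigma
y = f(x) + sigma * rng.standard_normal(6)

factors = norm.pdf(y, f(x), sigma)           # p(y_i | x_i, f, sigma), i = 1..n
perm = rng.permutation(6)
print(np.prod(factors))                                # trivial partition
print(np.prod(factors[perm]))                          # any permutation of the data
print(np.prod(factors[:2]) * np.prod(factors[2:]))     # any grouping into blocks
```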
In the same way, we have
Proposition $2$: if the likelihood factorizes and $\Theta$ is fixed once and for all, then Bayesian inference has $O(n)$ computational complexity.
Proof: Computing the prior $p\left( \Theta \right)$ has $O(1)$ computational complexity because it does not depend on $n$. Computing the likelihood $p\left( {\left. D \right|\Theta } \right) = \prod\limits_{i = 1}^n {p\left( {\left. {{d_i}} \right|\Theta } \right)} $ has $O(n)$ computational complexity. Computing the normalization constant $p\left( D \right) = \int {p\left( {\left. D \right|\Theta } \right)p\left( \Theta \right){\text{d}}\Theta } $ has $O(1)$ complexity because it is a $\left| \Theta \right|$-dimensional integral whose dimension has nothing to do with $n$ (moreover, we don't need to compute it: it cancels out, e.g. via the Leibniz rule/Feynman trick). Therefore, computing the full posterior $p\left( {\left. \Theta \right|D} \right)$ has $O(n)$ computational complexity. Finally, drawing posterior inferences, i.e. taking Bayes estimators and computing credible intervals, has $O(1)$ computational complexity because it only involves $\left| \Theta \right|$-dimensional integrals: the integrands depend on $n$, but the cost of integrating them essentially does not. All in all, Bayesian inference has $O(n)$ computational complexity. QED.
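To illustrate Proposition 2 on a toy model (my own example: Normal likelihood with known $\sigma$, Normal prior on a scalar $\Theta$, and a fixed quadrature grid standing in for the $\left| \Theta \right|$-dimensional integrals), the whole posterior computation below is a single pass over the data with an $O(1)$ update per datum, hence $O(n)$ overall; the prior, the normalization and the posterior mean only involve the fixed grid over $\Theta$:

```python
# Toy illustration of O(n) Bayesian inference: one O(1) likelihood update per datum on a fixed grid.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
d, sigma = rng.normal(2.0, 1.0, 10_000), 1.0

theta = np.linspace(-5.0, 5.0, 501)          # fixed grid over Theta, independent of n
dtheta = theta[1] - theta[0]
log_prior = norm.logpdf(theta, 0.0, 3.0)     # prior: O(1) in n

log_lik = np.zeros_like(theta)
for di in d:                                 # one O(1) update per datum -> O(n) in total
    log_lik += norm.logpdf(di, theta, sigma)

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())     # unnormalized posterior on the grid
post /= post.sum() * dtheta                  # normalization: cost depends on the grid, not on n
print((theta * post).sum() * dtheta)         # posterior mean, close to the true value 2
```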
For an example of such a truly Bayesian $O(n)$ functional regression algorithm, see *Bayesian interpolation and deconvolution*.
Corollary $2$: again, GP regression with i.i.d. Gaussian noise is not a Bayesian method.
Proof: GP regression does not have $O(n)$ computational complexity: as recalled above, the full $n$-update already has $O(n^3)$ generic computational complexity. QED.
Is that correct please?