This question is a sequel to my previous one, *Does Gaussian process functional regression fulfill the consistency condition?*
The conclusion was that:
- Gaussian process regression with i.i.d. Gaussian noise returns the same posterior Gaussian process for any partition of the data;
- ... but with completely different calculations/algorithms. In particular, GP regression with a full $n$-update (i.e. the trivial partition) has $O(n^3)$ generic computational complexity, while GP regression with $n$ sequential $1$-updates (i.e. the atomic partition) has computational complexity that is exponential in $n$. That is why, in sequential/online learning, we never do $n$ sequential $1$-updates but an $(n-1)$-update followed by a $1$-update; see e.g. *Using Gaussian Processes to learn a function online*. A numerical check of these two points is sketched right after this list.
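Here is a minimal numpy sketch of those two points (my own toy example with a zero-mean prior, an RBF kernel and hypothetical helper names, not the code from the linked threads). It checks that a full $n$-update and an $(n-1)$-update followed by a $1$-update return the same posterior mean and covariance at some test points, even though the intermediate calculations differ:

```python
# Toy check: a full 8-update and a 7-update followed by a 1-update give the same GP posterior.
import numpy as np

def rbf(A, B, ell=1.0):
    # squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 ell^2)) for 1-d inputs
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

def gp_posterior(X, y, Xs, sigma, mean=None, kernel=rbf):
    """Posterior mean/covariance at test points Xs given data (X, y) with i.i.d. Gaussian
    noise of sd sigma, starting from the prior GP described by `mean` and `kernel`."""
    if mean is None:
        mean = lambda Z: np.zeros(len(Z))              # zero-mean prior by default
    K = kernel(X, X) + sigma ** 2 * np.eye(len(X))
    Ks = kernel(Xs, X)
    mu = mean(Xs) + Ks @ np.linalg.solve(K, y - mean(X))
    cov = kernel(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, cov

rng = np.random.default_rng(0)
X, sigma = rng.uniform(-3, 3, 8), 0.1
y = np.sin(X) + sigma * rng.standard_normal(8)
Xs = np.linspace(-3, 3, 5)

# trivial partition: one full 8-update
mu_full, cov_full = gp_posterior(X, y, Xs, sigma)

# 7-update followed by a 1-update: condition on the first 7 points,
# then use the resulting posterior GP as the prior for the last point
X1, y1, X2, y2 = X[:7], y[:7], X[7:], y[7:]
K11 = rbf(X1, X1) + sigma ** 2 * np.eye(7)
m1 = lambda Z: rbf(Z, X1) @ np.linalg.solve(K11, y1)
k1 = lambda A, B: rbf(A, B) - rbf(A, X1) @ np.linalg.solve(K11, rbf(X1, B))
mu_seq, cov_seq = gp_posterior(X2, y2, Xs, sigma, mean=m1, kernel=k1)

print(np.allclose(mu_full, mu_seq), np.allclose(cov_full, cov_seq))  # True True
```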
Now, consider a Bayesian problem with data $D = \left( {{d_1},...,{d_n}} \right)$ and parameters $\Theta $:
$p\left( {\left. \Theta \right|D} \right) \propto p\left( {\left. D \right|\Theta } \right)p\left( \Theta \right)$
Proposition $1$: if the likelihood factorizes $p\left( {\left. D \right|\Theta } \right) = \prod\limits_{i = 1}^n {p\left( {\left. {{d_i}} \right|\Theta } \right)} $ and $\Theta$ is fixed once and for all
then the posterior calculations are exactly the same for any partition $D = \bigcup\limits_{j = 1}^p {{D_j}} $ of the data and any of the $p!$ orderings of its blocks.
Proof: we have
$p\left( {\left. \Theta \right|D} \right) \propto p\left( {\left. D \right|\Theta } \right)p\left( \Theta \right) = \left( {p\left( {\left. {{D_p}} \right|\Theta } \right)...\underbrace {\left( {p\left( {\left. {{D_2}} \right|\Theta } \right)\underbrace {\left( {p\left( {\left. {{D_1}} \right|\Theta } \right)p\left( \Theta \right)} \right)}_{ \propto p\left( {\left. \Theta \right|{D_1}} \right)}} \right)}_{ \propto p\left( {\left. \Theta \right|{D_1},{D_2}} \right)}...} \right)$
Therefore, the only differences from one partition to another, and from one ordering to another, are the parentheses and the order of the factors, which are immaterial by the associativity and commutativity of the product. QED.
Proposition 1 just says that the likelihood $\prod\limits_{i = 1}^n {p\left( {\left. {{d_i}} \right|\Theta } \right)} $ and the full posterior remain the same regardless of how the data are grouped together and of their order of arrival.
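As a numerical sanity check of Proposition 1, here is a minimal sketch (my own toy example: a conjugate Normal model for a scalar $\Theta$ with known noise standard deviation, so the posterior is available in closed form) showing that the posterior is identical for the trivial partition, a two-block partition, and the reversed order of its blocks:

```python
# Toy check of Proposition 1 with a conjugate Normal model: same posterior for any partition/ordering.
import numpy as np

def update(mu0, tau0, d, sigma):
    """Conjugate update of a N(mu0, tau0^2) prior on Theta given data d_i ~ N(Theta, sigma^2)."""
    prec = 1.0 / tau0 ** 2 + len(d) / sigma ** 2            # posterior precision
    mu = (mu0 / tau0 ** 2 + d.sum() / sigma ** 2) / prec    # posterior mean
    return mu, prec ** -0.5                                 # posterior mean and sd

rng = np.random.default_rng(1)
d, sigma = rng.normal(2.0, 1.0, 10), 1.0

print(update(0.0, 1.0, d, sigma))            # trivial partition: one full 10-update
mu, tau = update(0.0, 1.0, d[:3], sigma)     # partition {d_1,...,d_3}, {d_4,...,d_10} ...
print(update(mu, tau, d[3:], sigma))
mu, tau = update(0.0, 1.0, d[3:], sigma)     # ... and the same blocks in the reverse order
print(update(mu, tau, d[:3], sigma))         # all three printed posteriors coincide
```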
Corollary $1$: GP regression with i.i.d. Gaussian noise is not a Bayesian method.
Proof: We have ${d_i} = \left( {{x_i},{y_i}} \right)$ and for i.i.d. Gaussian noise the likelihood factorizes
$p\left( {\left. D \right|\Theta } \right) = \prod\limits_{i = 1}^n {p\left( {\left. {{x_i},{y_i}} \right|f,\sigma } \right) = } \prod\limits_{i = 1}^n {p\left( {\left. {{y_i}} \right|{x_i},f,\sigma } \right)p\left( {\left. {{x_i}} \right|f,\sigma } \right)} \propto \prod\limits_{i = 1}^n {p\left( {\left. {{y_i}} \right|{x_i},f,\sigma } \right)} \propto {\sigma ^{ - n}}\prod\limits_{i = 1}^n {{e^{ - \frac{{{{\left( {{y_i} - f\left( {{x_i}} \right)} \right)}^2}}}{{2{\sigma ^2}}}}}} $
Moreover, $\Theta$ is fixed once and for all: $\Theta = \left( {f,\sigma ,m,k,{\rm M},{\rm K}} \right)$; see *Is Gaussian process functional regression a truly Bayesian method (again)?*
But the posterior calculations are not the same from one partition/update scheme to another, as recalled at the top of this question, which contradicts Proposition 1: the GP regression calculations cannot be those of Bayesian updating with a factorized likelihood and fixed $\Theta$. QED.
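The factorization used in this proof is easy to check numerically. Here is a minimal sketch (toy data, with a fixed sample path $f$ and fixed $\sigma$ treated as given) showing that the product of the $n$ Gaussian factors is the same number for any ordering or grouping of the data:

```python
# Toy check: the factorized Gaussian likelihood is invariant to ordering/grouping of the data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 6)
f, sigma = np.sin, 0.2                       # a fixed f (standing in for one sample path) and fixed sigma
y = f(x) + sigma * rng.standard_normal(6)

factors = norm.pdf(y, f(x), sigma)           # p(y_i | x_i, f, sigma), i = 1..n
perm = rng.permutation(6)
print(np.prod(factors))                                # trivial partition
print(np.prod(factors[perm]))                          # any permutation of the data
print(np.prod(factors[:2]) * np.prod(factors[2:]))     # any grouping into blocks
```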
In the same way, we have
Proposition $2$: if the likelihood factorizes and $\Theta$ is fixed once and for all, then Bayesian inference has $O(n)$ computational complexity.
Proof: Computing the prior $p\left( \Theta \right)$ has $O(1)$ computational complexity because it does not depend on $n$. Computing the likelihood $p\left( {\left. D \right|\Theta } \right) = \prod\limits_{i = 1}^n {p\left( {\left. {{d_i}} \right|\Theta } \right)} $ has $O(n)$ computational complexity. Computing the normalization constant $p\left( D \right) = \int {p\left( {\left. D \right|\Theta } \right)p\left( \Theta \right){\text{d}}\Theta } $ has $O(1)$ complexity because it is a $\left| \Theta \right|$-dimensional integral whose dimension has nothing to do with $n$ (moreover, we don't need to compute it: it cancels out, e.g. via the Leibniz rule/Feynman trick). Therefore, computing the full posterior $p\left( {\left. \Theta \right|D} \right)$ has $O(n)$ computational complexity. Finally, drawing posterior inferences, i.e. taking Bayes estimators and computing credible intervals, has $O(1)$ computational complexity because it only involves $\left| \Theta \right|$-dimensional integrals: the integrands depend on $n$, but the cost of integrating them essentially does not. All in all, Bayesian inference has $O(n)$ computational complexity. QED.
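To illustrate Proposition 2 on a toy model (my own example: Normal likelihood with known $\sigma$, Normal prior on a scalar $\Theta$, and a fixed quadrature grid standing in for the $\left| \Theta \right|$-dimensional integrals), the whole posterior computation below is a single pass over the data with an $O(1)$ update per datum, hence $O(n)$ overall; the prior, the normalization and the posterior mean only involve the fixed grid over $\Theta$:

```python
# Toy illustration of O(n) Bayesian inference: one O(1) likelihood update per datum on a fixed grid.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
d, sigma = rng.normal(2.0, 1.0, 10_000), 1.0

theta = np.linspace(-5.0, 5.0, 501)          # fixed grid over Theta, independent of n
dtheta = theta[1] - theta[0]
log_prior = norm.logpdf(theta, 0.0, 3.0)     # prior: O(1) in n

log_lik = np.zeros_like(theta)
for di in d:                                 # one O(1) update per datum -> O(n) in total
    log_lik += norm.logpdf(di, theta, sigma)

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())     # unnormalized posterior on the grid
post /= post.sum() * dtheta                  # normalization: cost depends on the grid, not on n
print((theta * post).sum() * dtheta)         # posterior mean, close to the true value 2
```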
For an example of such a truly Bayesian $O(n)$ functional regression algorithm, see *Bayesian interpolation and deconvolution*.
Corollary $2$: again, GP regression with i.i.d. Gaussian noise is not a Bayesian method.
Proof: GP regression does not have $O(n)$ computational complexity: as recalled above, the full $n$-update already has $O(n^3)$ generic computational complexity. QED.
Is that correct please?