
I'm learning about Gaussian process functional regression but the more I learn, the more questions arise.

So we want to estimate a function $f\left( x \right)$ from data $D = \left( {\left( {{x_1},{y_1}} \right),...,\left( {{x_n},{y_n}} \right)} \right)$ by updating a ${\text{GP}}\left( {m\left( x \right),k\left( {x,x'} \right)} \right)$ prior.

It seems to me that GP regression should satisfy a "consistency condition" (I don't know what it is called in the literature): if we partition the data set $D$, update the ${\text{GP}}\left( {m\left( x \right),k\left( {x,x'} \right)} \right)$ prior with the first part, then update the resulting posterior process with the second part, and so forth until the last part, we should obtain the same final posterior process as if we had updated ${\text{GP}}\left( {m\left( x \right),k\left( {x,x'} \right)} \right)$ directly with the whole data set $D$. Otherwise, we would get different estimates/inferences depending on the partition.

On the one hand, it is well known that GP regression has generic computational complexity $O(n^3)$.

On the other hand, if we update successively with each datum $\left( {x_i},{y_i} \right)$, $i = 1, \dots, n$, one by one, the update equations are

$$\begin{cases} m^{i+1}(x) = m^{i}(x) + k^{i}\left(x, x_i\right)\left(k^{i}\left(x_i, x_i\right) + \sigma^2\right)^{-1}\left(y_i - m^{i}\left(x_i\right)\right) \\[4pt] k^{i+1}(x, x') = k^{i}(x, x') - k^{i}\left(x, x_i\right)\left(k^{i}\left(x_i, x_i\right) + \sigma^2\right)^{-1} k^{i}\left(x_i, x'\right) \end{cases}$$

Each recursion step requires the evaluation of 4 terms from the previous step.

Hence, applying the recursion again to those 4 terms requires the evaluation of 16 terms, 8 of which are distinct for $m(x)$: $m^{i-1}(x)$, $m^{i-1}(x_{i-1})$, $m^{i-1}(x_i)$, $k^{i-1}(x, x_{i-1})$, $k^{i-1}(x_{i-1}, x_{i-1})$, $k^{i-1}(x, x_i)$, $k^{i-1}(x_{i-1}, x_i)$ and $k^{i-1}(x_i, x_i)$.

And so forth: it appears that naively updating the posterior sequentially, one datum at a time, has computational complexity exponential in $n$!
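To make this concrete, here is a minimal Python sketch of the naive sequential implementation I have in mind (the zero prior mean and the squared-exponential kernel are just illustrative assumptions): each update returns new mean and covariance *functions* defined in terms of the previous ones, so evaluating the final posterior at a single point unwinds the whole recursion tree.

```python
import numpy as np

def rbf(x, xp, ell=1.0):
    # Illustrative squared-exponential prior kernel
    return np.exp(-0.5 * (x - xp) ** 2 / ell ** 2)

def update(m, k, xi, yi, sigma2):
    """One-point update: returns the new posterior mean/covariance functions.

    The returned functions call the previous m and k, so each evaluation
    of the final posterior recurses through every earlier update.
    """
    c = 1.0 / (k(xi, xi) + sigma2)

    def m_new(x):
        return m(x) + k(x, xi) * c * (yi - m(xi))

    def k_new(x, xp):
        return k(x, xp) - k(x, xi) * c * k(xi, xp)

    return m_new, k_new

sigma2 = 0.1
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 8)
Y = np.sin(X) + np.sqrt(sigma2) * rng.normal(size=8)

m, k = (lambda x: 0.0), rbf           # GP prior
for xi, yi in zip(X, Y):              # sequential one-point updates
    m, k = update(m, k, xi, yi, sigma2)

# A single query unwinds the whole recursion tree of earlier updates
print(m(0.0), k(0.0, 0.0))
```

Even with only $n = 8$ data points, a single query of the posterior already triggers on the order of $4^n$ (tens of thousands of) evaluations of the prior mean and kernel.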

Therefore, GP regression doesn't fulfill the consistency condition: it gives completely different results (or no result at all) depending on how the data is partitioned.

Correct?

Student
  • How does the chain rule relate to computational complexity? – Tim Mar 31 '23 at 09:29
  • Are you asking if sequentially (online) updating of the GP should get the same posterior in the end as if using the whole sample at once? See this question – Firebug Mar 31 '23 at 09:40
  • @Tim I'm not sure I understand your question. But if we update with each datum (xi,yi) one by one, what's the computational complexity please? The complexity of each step doesn't depend on n and seems to be constant, right? Therefore, the complexity of n steps should be O(n). Conversely, if it's still O(n^3), then each step should be O(n^2). How is it possible? – Student Mar 31 '23 at 09:42
  • @Firebug Thanks for the link. Yes, I'm asking something like that but I'm not yet satisfied by this question and its answer. If we update sequentially with each datum, we have 1x1 matrices, that is no matrix at all. It seems to me that updating m(x) and k(x,x') in each step has constant complexity, not depending on n. Therefore updating sequentially one by one should have O(n) complexity. Conversely, I don't see how each step could be O(n^2) to make O(n^3) after n steps. – Student Mar 31 '23 at 09:58
  • @J.Delaney Would you be kind enough to provide some references about the O(n) complexity of GP regression, please? – Student Mar 31 '23 at 10:00
  • @J.Delaney If you know the right terminology for this property, please tell me. – Student Mar 31 '23 at 10:01
  • The chain rule here is about Bayes' theorem; to calculate it you need math, and it has nothing to do with computational complexity because you don't need a computer for that. As discussed in the link given by @Firebug, the chain rule applies to GPs. You are mixing two unrelated things. – Tim Mar 31 '23 at 10:05
  • @Tim Chain rule also is about calculus/derivatives, you can replace this label by anything you want. Please just answer my question: if we update sequentially with each datum one by one, what's the computational complexity? O(n) or something else? – Student Mar 31 '23 at 10:14
  • What you are missing is that the posterior after each step depends on the previous $n$ data points, so when you perform the updates sequentially they do not all have the same complexity. – J. Delaney Mar 31 '23 at 10:28
  • @Tim I want to implement GP regression by myself. I ask myself: what's the most efficient implementation? If I do a full n-update, I need to invert the n×n matrix K_XX + sigma^2 I in the m(x) and k(x,x') update equations, which basically has O(n^3) computational complexity. But if I do n 1-updates, I just need to invert n 1x1 matrices. Seems like the second implementation is much more friendly. – Student Mar 31 '23 at 10:31
  • @Student implementing it efficiently and the fact that chain rule applies are completely different things. Even for much simpler problems, the fact that chain rule applies has nothing to do with implementation. For example, chain rule could apply as a mathematical concept, but using it as a part of implementation could lead to numerical errors making the result useless, so the fact would not be helpful for guiding the implementation. If efficient implementation is your aim, it'd probably be better to re-phrase your question as "how to implement GP efficiently". – Tim Mar 31 '23 at 10:37
  • @Tim If the n-update is O(n^3), but n 1-updates are O(n), I do not hesitate: I choose O(n). Moreover, at first glance, inverting 1x1 matrices is expected to be numerically more stable than inverting an n×n matrix (depending on its condition number). I just would like to check that both implementations return the same results in theory. That is not obvious. – Student Mar 31 '23 at 10:47
  • @J.Delaney I don't get that. The m(x) and k(x,x') update equations remain the same in every step, whatever the m(x) and k(x,x') from the previous step are. How could the complexity not be constant? Can you elaborate on this please? I'm surprised by your claim that GP is O(n). I used to believe it was O(n^3) generically, which is why we find plenty of papers targeting sub-cubic complexity in special cases. – Student Mar 31 '23 at 11:38
  • You are not accounting for the matrix products at every iteration – Firebug Mar 31 '23 at 13:57
  • Again, pertaining to your update, the iteration is not $O(1)$: you are dismissing the cost of the matrix product, which grows every iteration – Firebug Apr 01 '23 at 10:44
  • @Firebug Yes, that was completely wrong, sorry. J. Delaney and I now agree that the computational complexity is exponential in $n$. Please see your answer, his, and the updated question. – Student Apr 01 '23 at 14:59

3 Answers


In computer science, the idea you describe is called divide and conquer.

While often very successful, here it fails. This is because you have to take into account all cross-correlations between the previous $n-1$ points and the new data point. So nothing is gained in general by doing it sequentially or by subdividing the problem in any other way.

g g
  • Very, very interesting. Assuming that GP regression is Bayesian, and regarding Bayesian probability theory as the extension of Boolean logic under uncertainty, doesn't it mean that GP regression violates the associative property of the logical conjunction, e.g. ( ((x1,y1)^...^(xm,ym) ) ^ ( (xm+1,ym+1)^...^(xn,yn) ) <> ( (x1,y1)^...^(xn,yn) )? – Student Mar 31 '23 at 14:36
  • I am not sure I even understand what "( ((x1,y1)^...^(xm,ym) ) ^ ( (xm+1,ym+1)^...^(xn,yn) ) <> ( (x1,y1)^...^(xn,yn) )" means ??? – g g Mar 31 '23 at 18:44
  • That's the associative property: the logical conjunction is the same regardless of the parentheses. But with GP regression, we get different estimates depending on those parentheses/on the partition of the data set. – Student Apr 01 '23 at 08:23

Unlike parametric models, which are defined by a fixed number of parameters, a Gaussian process is defined by its mean and covariance functions. While it is common to use a prior where those functions have a simple closed form (such as a squared-exponential covariance function), once you update the model with arbitrary data those functions no longer have this simple form. The updated mean and covariance functions depend in some complicated way on all the previous data points, so when you perform sequential updating, each step has an increasing complexity.

Suppose for example that the covariance function after $n$ updates is $k^{(n)}(x,x')$, and you observe the next data point $x_{n+1}$. The updated covariance $k^{(n+1)}(x,x')$, as you pointed out, is

$$k^{(n+1)}(x,x')=k^{(n)}(x,x') -k^{(n)}(x,x_{n+1})(\sigma^2 + k^{(n)}(x_{n+1},x_{n+1}))^{-1} k^{(n)}(x',x_{n+1})$$

Now evaluating this function at arbitrary points $x$ and $x'$ requires four different evaluations of $k^{(n)}(\cdot,\cdot)$, namely $k^{(n)}(x,x_{n+1})$, $k^{(n)}(x',x_{n+1})$, $k^{(n)}(x_{n+1},x_{n+1})$ and $k^{(n)}(x',x)$. But each of those function evaluations will itself require that you calculate the previous covariance $k^{(n-1)}(\cdot,\cdot)$ at another set of points; for example, to evaluate $k^{(n)}(x',x_{n+1})$ you will need $k^{(n-1)}(x_n,x_{n+1})$ and so on. This naïve recursive calculation will therefore require a total of $4^n$ function evaluations, which is unlikely to be very efficient.

Of course the sequential calculation will eventually give you exactly the same result as updating once with all the data, as those are mathematically equivalent.
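For a quick numerical check, here is a sketch (assuming a squared-exponential prior kernel, and tracking the posterior only on a fixed finite set of query points, which avoids the exponential function-valued recursion described above): the sequential one-point updates reproduce the batch posterior, up to round-off.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # Illustrative squared-exponential kernel on 1-D inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(1)
sigma2 = 0.1
X = rng.uniform(-3, 3, 6)                          # training inputs
Y = np.sin(X) + np.sqrt(sigma2) * rng.normal(size=6)
Xs = np.linspace(-3, 3, 5)                         # query points

# Batch update with all the data at once (zero prior mean)
K = rbf(X, X) + sigma2 * np.eye(len(X))
m_batch = rbf(Xs, X) @ np.linalg.solve(K, Y)
v_batch = rbf(Xs, Xs) - rbf(Xs, X) @ np.linalg.solve(K, rbf(X, Xs))

# Sequential one-point updates of the joint Gaussian over query + training points
Z = np.concatenate([Xs, X])
m = np.zeros(len(Z))
C = rbf(Z, Z)
for i, yi in enumerate(Y):
    j = len(Xs) + i                                # index of the i-th training point in Z
    g = C[:, j] / (C[j, j] + sigma2)               # gain vector
    m = m + g * (yi - m[j])
    C = C - np.outer(g, C[j, :])

print(np.allclose(m[:len(Xs)], m_batch))                   # True
print(np.allclose(C[:len(Xs), :len(Xs)], v_batch))         # True
```

Restricting attention to a finite set of query points is also what makes the sequential scheme tractable in practice: each update then becomes a rank-one matrix operation instead of a function-valued recursion.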

J. Delaney
  • I'm not sure what you mean by weak/strong recursions. Regarding your update equations, again, recall that $k(x,x')$ is not a number but a function that you will have to evaluate in the next step at a different point. Therefore the recursion step is not a simple $O(1)$. (Trying to actually implement this in code might be the best way for you to see it.) – J. Delaney Apr 01 '23 at 10:16
  • Oops, forget the previous comment, it should have been deleted. You are right: it appears that updating sequentially has exponential computational complexity in n. I've updated my question accordingly. But I don't yet agree with your conclusion: how can an O(n^3) algorithm for the full n-update give the same result as an O(4^n) algorithm after n 1-updates? – Student Apr 01 '23 at 14:35
  • Thanks a lot for your input, we are on the right track. Why do you say that the O(n^3) and O(4^n) algorithms give the same results? At the very least, they are not computationally equivalent at all! – Student Apr 01 '23 at 14:43
  • You just forgot the fact that there are redundant terms: if we apply the recursion to the 4 terms in the $m(x)$ update equation, there are 16 terms, but only 8 of them are different. So the computational complexity is below O(4^n) but clearly exponential. So, my question boils down to: do an O(n^3) and an exponential algorithm give the same results or not? You claim that they give the same results but it must be proved. – Student Apr 01 '23 at 14:53
  • If you provide a proof that both algos return the same results, I'll validate your answer. – Student Apr 01 '23 at 15:33
  • The same calculation can be done by different algorithms with different complexities; the argument above just shows that the recursive calculation is less efficient than the direct one. The equivalence of the two simply follows from the fact that they both implement Bayes' theorem and that the observations are conditionally independent, i.e. $P(y_i,y_j|m,k)=P(y_i|m,k)P(y_j|m,k)$ – J. Delaney Apr 01 '23 at 17:03
  • That's the point: the whole problem comes from the fact that GP regression is NOT a straightforward application of Bayes' rule. Bayes' rule writes $p(f,\sigma,\Theta \mid \mathbf{X},\mathbf{Y}) \propto p(\mathbf{Y} \mid \mathbf{X},f,\sigma)\,p(f \mid \Theta)\,p(\Theta)\,p(\sigma)$ with $p(\mathbf{Y} \mid \mathbf{X},f,\sigma) \sim \mathrm{N}(0,\sigma^2)^n$ and... – Student Apr 01 '23 at 17:24
  • e.g. $p(f \mid \Theta) \sim \mathrm{GP}(m(x),k(x,x'))$. There is nothing like $p(f(\mathbf{X}) \mid \mathbf{X}) \sim \mathrm{N}(m(\mathbf{X}),k(\mathbf{X},\mathbf{X}))$ in Bayes' rule. Bayes' rule doesn't know about GP conditioning. Should GP be truly Bayesian, this discussion would not exist: by the associative property of the logical conjunction, Bayes' rule doesn't care at all how the data are partitioned, we get the very same algo in any case... – Student Apr 01 '23 at 17:29
  • Gaussian process regression is a straightforward application of Bayes rule. If you don't see why, it might be best to post a new question and explain your arguments more clearly, as this seems to be a slightly different question. – J. Delaney Apr 01 '23 at 18:02
  • I see why GP regression is not Bayesian: because we find truly Bayesian functional regression/estimation/interpolation/extrapolation methods and algorithms. See e.g. https://bayes.wustl.edu/glb/deconvolution.pdf Truly Bayesian algos necessarily have $O(n)$ complexity because computing the likelihood $p(\mathbf{Y} \mid \mathbf{X},f,\sigma) \sim \mathrm{N}(0,\sigma^2)^n$ is $O(n)$ and the rest of the calculations do not depend on $n$. – Student Apr 01 '23 at 18:15
  • This argument is incorrect (e.g. besides the likelihood you also need to find $p(Y)$ which depends on $n$). As I said, you can post a new question on this issue. This is not a good place to have this discussion so I am not going to continue it here. – J. Delaney Apr 01 '23 at 18:31
  • Check Bretthorst's paper please. There is no place for $p(f(\mathbf{X}) \mid \mathbf{X}) \sim \mathrm{N}(m(\mathbf{X}),k(\mathbf{X},\mathbf{X}))$ because it is neither a prior, nor a likelihood, nor a posterior. There is no place for $Y$ too, only for $\mathbf{Y} \mid \mathbf{X},f,\sigma$. Please trust me, the Bayes rule above is correct and is the basis for those $O(n)$ algorithms. Thanks for your input. – Student Apr 01 '23 at 18:52
  • I asked the "question" here: https://stats.stackexchange.com/questions/611582/is-gaussian-process-functional-regression-a-truly-bayesian-method-again Looking forward to your feedback... – Student Apr 02 '23 at 16:46

The updating equations (adapted from this answer [1]) are:

$$K_{i+1}^{-1} = \begin{bmatrix} K_i & k_* \\ k_*^T & k_{**}\end{bmatrix}^{-1} = \begin{bmatrix}K_i^{-1} + S_i v_iv_i^T & -S_i v_i \\ -S_iv_i^T & S_i\end{bmatrix}$$

where

$$S_i := (k_{**} - k_*^Tv_i)^{-1}, \qquad v_i := K_i^{-1}k_*$$

The complexity of computing $S_i$ given $v_i$ is $\propto i$, while the cost of computing $v_i = K_i^{-1}k_*$ as well as the $v_iv_i^T$ product is $\propto i^2$.

So the cost per iteration grows quadratically with $i$.

It's easy to show that

$$\sum_{i=1}^n i^2=\frac{n^3}{3}+\frac{n^2}{2}+\frac{n}{6},$$

so the total cost of the $n$ sequential updates is $O(n^3)$, the same as a single batch update.
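Here is a sketch of this incremental block-inverse construction (assuming, for illustration, a squared-exponential kernel, with the noise term $\sigma^2$ folded onto the diagonal since that is the matrix GP regression actually inverts), checked against the direct inverse:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    # Illustrative squared-exponential kernel on 1-D inputs
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(2)
sigma2 = 0.1
X = rng.uniform(-3, 3, 20)

# Build (K_n + sigma^2 I)^{-1} incrementally with the block-inverse formula above
Kinv = np.array([[1.0 / (rbf(X[0], X[0])[0, 0] + sigma2)]])
for i in range(1, len(X)):
    k_star = rbf(X[:i], X[i])                     # cross-covariances with the new point, shape (i, 1)
    k_ss = rbf(X[i], X[i])[0, 0] + sigma2         # new diagonal entry
    v = Kinv @ k_star                             # O(i^2)
    S = 1.0 / (k_ss - (k_star.T @ v).item())      # Schur complement, O(i)
    Kinv = np.block([
        [Kinv + S * (v @ v.T), -S * v],
        [-S * v.T,             np.array([[S]])],
    ])

K_full = rbf(X, X) + sigma2 * np.eye(len(X))
print(np.allclose(Kinv, np.linalg.inv(K_full)))   # True, up to round-off
```

Each pass through the loop costs $O(i^2)$ (dominated by $v = K_i^{-1}k_*$ and the outer product $v_iv_i^T$), so the whole construction is $O(n^3)$, while also giving access to every intermediate posterior along the way.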

[1] jld (https://stats.stackexchange.com/users/30005/jld), Using Gaussian Processes to learn a function online, URL (version: 2021-04-21): https://stats.stackexchange.com/q/520830

Firebug
  • My question was not useless: now, J. Delaney and I agree that updating sequentially with each datum one by one has exponential computational complexity in $n$. Please see his answer and my question, updated accordingly. Therefore, the question boils down to: do an O(n^3) algorithm (for the full n-update) and an exponential algorithm (for the sequential update with each datum one by one) give the same results or not? – Student Apr 01 '23 at 14:56
  • @Student don't change the question when it already has answers – Firebug Apr 01 '23 at 15:11
  • @Student other than that, with the exception of numerical error (which will be a problem for sure with all these subtractions and quotients), both solutions are mathematically equivalent – Firebug Apr 01 '23 at 15:14
  • Why? That is my question, and it has not changed: how can an O(n^3) algorithm for the full n-update give the same result as another algorithm with exponential computational complexity? I need a proof, please. The fact that they have completely different computational complexities makes me think that they return completely different results, but this must be proved as well. – Student Apr 01 '23 at 15:18
  • @Student What do you mean? Why would they necessarily return different results? It is often the case that, for a class of problems, multiple algorithms exist with different complexities. – Firebug Apr 01 '23 at 15:26
  • That's the question: do they return the same results or not? For the time being, from my point of view, everything is possible. But the exponential computational complexity for the 1-update is quite unexpected and does not help at all for proving that both algos return the same results. – Student Apr 01 '23 at 15:31
  • @Student because complexity and the result are absolutely not linked. Mathematically speaking, both results are equivalent. Just take the last iteration in the iterative approach and compare both: it's the same result. – Firebug Apr 01 '23 at 15:33
  • If polynomial algos could return the same results as exponential ones, everything would be simple! If you know where I can find a proof that both algos return the same results, that would be great. I'll investigate this carefully. First, I will compare 1-update and 2-update algorithms... – Student Apr 01 '23 at 15:40
  • @Student Are you implying that algorithms that perform the same computation necessarily need to have the same computational complexity? If that was true there would be no need to develop new algorithms for problems where an algorithm already exists. Which we obviously know is not true. – Firebug Apr 01 '23 at 15:41
  • No, I don't mean anything like that. My job is precisely to design and develop such computationally efficient algorithms. – Student Apr 01 '23 at 15:46
  • But the n-update is in P and the 1-update is in EXP, which are completely different complexity classes. – Student Apr 01 '23 at 15:47
  • @Student I am sorry, but I don't see how this relates with anything we discussed, my answer, or the question. – Firebug Apr 01 '23 at 15:54
  • Have you read J. Delaney's answer and the updated question please? Do you agree that the 1-update algorithm is in EXP? – Student Apr 01 '23 at 16:08
  • @Student no (see the answer I linked), but it's not important for the question. The fact is that both solutions are mathematically equivalent. – Firebug Apr 01 '23 at 16:14
  • This is shown in the first equation in my answer – Firebug Apr 01 '23 at 16:19
  • Your answer and the link deal with yet another implementation: an i-update followed by a 1-update. I trust you: it certainly has O(n^3) computational complexity like the full i+1 update. The situation is completely different with 1-updates only: the m(x) and k(x,x') values that we need to evaluate recursively make a tree with an exponential number of branches. – Student Apr 01 '23 at 16:34
  • @Student the formula is recursive, the previous $i$ updates can be obtained with the same formula, leading to the same answer. – Firebug Apr 01 '23 at 16:38
  • Let me take a closer look to that... – Student Apr 01 '23 at 17:13
  • Got it. Very funny. It reminds me of the Wick-Isserlis recursion for computing the $k$-th order moments of multivariate Gaussians. Unfortunately, it involves $k!!$ terms. Computing the moments is conjectured to be NP-hard in the order and the dimension. I struggled hard to approximate high-order moments of high-dimensional Gaussians in polynomial time in some special cases of interest. – Student Apr 03 '23 at 17:26