I will consider non-noisy observations, i.e. $y = f(x)$. Let's say we have the following data set of 5 training examples, with one example duplicated: inputs $(1, 2, 3, 4, 4)$ map to targets $(2, 4, 6, 8, 8)$. Since GPR requires inverting a kernel matrix, and a kernel matrix built from duplicate inputs is not invertible, we should remove duplicate training examples when doing GPR with non-noisy observations. Am I right in my reasoning? Kindly comment.
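For concreteness, here is a minimal sketch (assuming an RBF kernel with unit lengthscale, purely for illustration, since no kernel is specified above) showing that the duplicated input makes the kernel matrix exactly rank-deficient:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 4.0])  # training inputs, last one duplicated
lengthscale = 1.0                          # assumed value, not from the question

# Squared-exponential (RBF) kernel: K[i, j] = exp(-(x_i - x_j)^2 / (2 l^2))
sq_dists = (X[:, None] - X[None, :]) ** 2
K = np.exp(-sq_dists / (2.0 * lengthscale ** 2))

# Two rows of K are identical, so K has rank 4 (not 5) and determinant ~0.
print(np.linalg.matrix_rank(K))
print(np.linalg.det(K))
```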
1 Answer
The duplicate data add no additional information, and the rank deficiency they induce in the kernel matrix is fatal to the procedure, so removing them has literally no inferential consequence.
That said, the kernel matrix $K$ can also become numerically singular when some points are merely very close together (not necessarily identical). In that scenario, you can either identify and deal with the problem points (delete them, merge them, whatever), or you can add some (small) noise: $\hat{K} = K + \epsilon I$. Usually $\epsilon = 10^{-6}$ is sufficient for me. Alternatively, you can perform a spectral decomposition of $K$ and, for each eigenvalue $\lambda_i$, replace it with $\hat{\lambda}_i = \max\{\lambda_i, \epsilon \lambda_{\max}\}$ for some small $\epsilon$. The idea here is that you've effectively pinned the smallest eigenvalue of the matrix relative to the largest, which may be a more "minimal" intervention into the matrix. This is an area where I'm not sure there are any good solutions.
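As a rough illustration, here is a minimal NumPy sketch of the two remedies described above; the function names and the default $\epsilon$ are placeholders chosen for the example, not a recommended implementation:

```python
import numpy as np

def add_jitter(K, eps=1e-6):
    """Nugget/jitter remedy: K_hat = K + eps * I."""
    return K + eps * np.eye(K.shape[0])

def clip_eigenvalues(K, eps=1e-6):
    """Spectral remedy: floor each eigenvalue at eps * lambda_max."""
    eigvals, eigvecs = np.linalg.eigh(K)      # K is symmetric
    floor = eps * eigvals.max()
    eigvals_clipped = np.maximum(eigvals, floor)
    # Reconstruct K_hat = V diag(clipped eigenvalues) V^T
    return (eigvecs * eigvals_clipped) @ eigvecs.T

# Either repaired matrix should now admit a stable Cholesky factorization, e.g.:
# L = np.linalg.cholesky(add_jitter(K))
# L = np.linalg.cholesky(clip_eigenvalues(K))
```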
The numerical component of the problem is considered in more detail on this thread:
Ill-conditioned covariance matrix in GP regression for Bayesian optimization
Anyway, thanks for the confirmation that removing duplicate data is necessary...
– Tomas May 14 '20 at 19:58