
When performing k-nearest neighbor analysis on a large dataset, using a kd-tree algorithm can greatly speed up the search. I've tried researching this question but have not found an answer: can the use of a kd-tree introduce bias into the nearest neighbor search? Does the splitting that a kd-tree performs mean that certain groups are excluded from ever being returned as nearest neighbors?

Follow-up questions: if there is potential to introduce bias, how can you test whether it is actually occurring?

Are there other algorithms that are better than kd-tree for large datasets?

Many thanks

rw2

1 Answer


Standard search procedures using kd-tree structures compute the exact list of k nearest neighbors (NN). In other words, you get the same result as a (time-consuming) exhaustive search would give. So, in principle, there is no bias introduced by using a kd-tree to solve the NN problem.
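
Since exactness is easy to check empirically, a direct way to address your follow-up question is to run the same queries once with a kd-tree and once with brute-force search and compare the returned neighbors. Below is a minimal sketch assuming scikit-learn and NumPy, on synthetic random data; the dataset size, dimensionality, and k are arbitrary illustrative choices.

```python
# Sketch: compare kd-tree k-NN results against an exhaustive brute-force
# search on the same data. If the kd-tree were biased, the two result sets
# would differ; for an exact kd-tree they should match (up to distance ties).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 5))   # 10k points in 5 dimensions (illustrative)
k = 10

kd = NearestNeighbors(n_neighbors=k, algorithm="kd_tree").fit(X)
brute = NearestNeighbors(n_neighbors=k, algorithm="brute").fit(X)

_, idx_kd = kd.kneighbors(X)
_, idx_brute = brute.kneighbors(X)

# Compare as sets per query point so that equidistant ties, which the two
# backends may order differently, do not trigger false alarms.
same = all(set(a) == set(b) for a, b in zip(idx_kd, idx_brute))
print("kd-tree matches brute force:", same)   # expected: True
```

The same comparison (on a subsample, if the full brute-force pass is too slow) is a reasonable sanity check on whatever implementation you are actually using.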

Nevertheless, approximate algorithms relying on kd-tree-like structures also exist (e.g. best-bin-first), but to my knowledge they are not the standard implementation.
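
As an illustration of where an approximation (and hence a possible discrepancy with exhaustive search) can creep in, SciPy's cKDTree exposes an eps parameter that relaxes the query guarantee: the k-th reported neighbor is only guaranteed to be within a factor (1 + eps) of the true k-th neighbor distance. The sketch below, again on synthetic data with arbitrary sizes, contrasts an exact query with a relaxed one.

```python
# Sketch: exact vs. approximate kd-tree queries with SciPy's cKDTree.
# eps=0 gives exact results; eps>0 allows the search to prune more
# aggressively, which is faster but may return slightly different neighbors.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 5))
tree = cKDTree(X)

d_exact, i_exact = tree.query(X[:100], k=10)              # exact search
d_approx, i_approx = tree.query(X[:100], k=10, eps=0.5)   # relaxed search

frac_identical = np.mean([set(a) == set(b) for a, b in zip(i_exact, i_approx)])
print("fraction of identical neighbour lists:", frac_identical)
```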

Under particular conditions (searching for the NN of many similar query points sequentially), the kd-tree structure can be extended with a caching structure to speed up the process while remaining exact (e.g. the cached kd-tree).

beuhbbb