Comparing two spatial distributions without computing distance matrices

Asked Nov 08 '16 at 13:15

Active Nov 09 '16 at 11:53

Viewed 139 times

We have around N=1000 entities; for the example, consider just seven, labelled 1-7 below. We can lay them out in 2D in D different ways, two of which are shown below. I would like to be able to compare these distributions, cluster them, and so on. The orientation of a distribution doesn't matter, so the two shown below count as mostly similar. The obvious way to do this is to calculate an N×N distance matrix for each distribution and cluster those matrices. However, with this approach we run into memory problems for N=1000 and D>5000. Is there a more efficient way? We are currently thinking of sampling a smaller subset of the N entities, which would have the advantage that we can repeat (bootstrap) the sampling to get some idea of robustness. Is there another way?
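To make the sampling idea concrete, here is a minimal sketch using only the Python standard library. The function names, the choice of root-mean-square difference as the dissimilarity, and the representation of a layout as a list of (x, y) tuples are all assumptions for illustration, not something we have settled on. Because only pairwise distances are compared, rotations and reflections of a layout do not change the result. With a sample of 100 entities, each layout is summarised by 100*99/2 = 4950 distances instead of the 499500 needed for all 1000.

```python
import math
import random


def sampled_distance_vector(layout, sample_idx):
    """Pairwise distances among the sampled entities, in a fixed pair order.

    `layout` is assumed to be a list of (x, y) tuples indexed by entity.
    """
    pts = [layout[i] for i in sample_idx]
    return [
        math.dist(pts[a], pts[b])
        for a in range(len(pts))
        for b in range(a + 1, len(pts))
    ]


def layout_dissimilarity(vec_a, vec_b):
    """Root-mean-square difference between two distance vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)) / len(vec_a))


def bootstrap_dissimilarity(layout_a, layout_b, n_sample, n_rounds, seed=0):
    """Repeat the subsampling several times to see how stable the comparison is."""
    rng = random.Random(seed)
    n_entities = len(layout_a)
    scores = []
    for _ in range(n_rounds):
        # The same sampled entities are used for both layouts in each round.
        idx = rng.sample(range(n_entities), n_sample)
        scores.append(
            layout_dissimilarity(
                sampled_distance_vector(layout_a, idx),
                sampled_distance_vector(layout_b, idx),
            )
        )
    return scores
```

The spread of the returned scores across rounds gives a rough idea of robustness. For real data an array library would be faster, but the structure would be the same.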

edited Nov 09 '16 at 11:53

asked Nov 08 '16 at 13:15

uncoolbob

Sounds like a job for locality sensitive hashing. – Sycorax Nov 08 '16 at 16:35
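The comment does not say which hashing scheme is meant. One possible reading, sketched below under that assumption, is random-hyperplane hashing (SimHash) applied to a per-layout feature vector such as a vector of pairwise distances. Layouts whose vectors point in similar directions tend to receive the same bit signature, so candidates for comparison can be found by bucket lookup rather than by comparing all pairs of layouts. The function names are hypothetical, and centering each vector before hashing is a design choice made here because distance vectors are all positive.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(vector, hyperplanes):
    """Bit signature: which side of each hyperplane the centered vector falls on."""
    mean = sum(vector) / len(vector)
    centered = [v - mean for v in vector]
    return tuple(
        int(sum(h * c for h, c in zip(plane, centered)) >= 0)
        for plane in hyperplanes
    )


def bucket_layouts(vectors, hyperplanes):
    """Group layout indices by signature; layouts in one bucket are candidates for being similar."""
    buckets = {}
    for i, vec in enumerate(vectors):
        buckets.setdefault(simhash(vec, hyperplanes), []).append(i)
    return buckets
```

More bits give smaller, purer buckets at the cost of splitting some similar layouts apart. Using several independent hash tables is the usual way to trade that off.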


0 Answers