We have around N = 1000 entities, but for this example consider just seven, labelled 1-7 below. We can lay them out in 2D in D different ways; just two are shown below. I would like to be able to compare these distributions, cluster them, and so on. The orientation of a distribution doesn't matter, so the two shown below count as mostly similar. The obvious way to do this is to calculate an N×N distance matrix for each layout and cluster those. However, we are running into memory problems for N = 1000 and D > 5000 with this approach. Is there a more efficient way? We are currently thinking of sampling a smaller N out of the dataset. This would have the advantage that we can repeat (bootstrap) the sampling to get some idea of robustness. Is there another way?
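To make the subsampling idea concrete, here is a minimal sketch rather than our actual code. It assumes the layouts are held in a NumPy array of shape `(D, N, 2)`; the names `distance_features`, `layouts`, `idx`, and `replicates`, the random toy data, and the sizes are all made up for illustration. Pairwise distances are unchanged by rotation, translation, and reflection, so each layout is reduced to the upper triangle of its distance matrix over the sampled entities, stored as `float32`.

```python
import numpy as np


def distance_features(layouts, idx):
    """Return one row of pairwise distances per layout.

    layouts: array of shape (D, N, 2), one 2D layout per row (assumed format).
    idx:     indices of the n sampled entities.
    Output:  float32 array of shape (D, n * (n - 1) / 2).
    """
    sub = layouts[:, idx, :]                    # (D, n, 2)
    n = len(idx)
    i, j = np.triu_indices(n, k=1)              # upper triangle, no diagonal
    out = np.empty((sub.shape[0], i.size), dtype=np.float32)
    for d, pts in enumerate(sub):               # one layout at a time
        diff = pts[i] - pts[j]
        out[d] = np.sqrt((diff ** 2).sum(axis=1))
    return out


# Toy sizes for illustration only; the real problem has N = 1000, D > 5000.
rng = np.random.default_rng(0)
D, N, n = 20, 1000, 100
layouts = rng.normal(size=(D, N, 2))            # placeholder data

# Repeat the sampling to get bootstrap-style replicates.
replicates = []
for _ in range(3):
    idx = rng.choice(N, size=n, replace=False)  # same entities in every layout
    replicates.append(distance_features(layouts, idx))
```

Each row of a replicate is a fixed-length vector describing one layout, so the rows can be compared or clustered directly. With n = 100 there are 4950 distances per layout, which for D = 5000 at 4 bytes each is roughly 99 MB, against roughly 40 GB for D full 1000×1000 matrices in `float64`.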
