
Suppose I have a very large binary matrix representing $n$ customers and the $m$ products they bought, with $n$ and $m$ both rather large (in the order of millions). The matrix is also pretty sparse.

As computing all (or even just the $k$) exact nearest neighbors for every customer is clearly an intractable problem, I'm looking for a way to approximate it.

One idea I had is to use minhashing/LSH to first segment the customers into approximately $\frac{n}{k}$ buckets (this can be done by tweaking the similarity threshold, I assume), and then simply take the $k$ neighbors of any given point to be the other members of its bucket. Another, slightly less efficient, variant that might produce better results would be to segment into $\frac{n}{2k}$ (i.e. bigger) buckets, and then search for the closest $k$ neighbors by brute force within each bucket.
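To make the idea concrete, here is a minimal pure-Python sketch of MinHash plus banded LSH along the lines of the second variant: bucket customers by signature bands, then rank bucket co-members by exact Jaccard similarity. The hash counts, band sizes, and toy data are illustrative assumptions, not tuned values.

```python
import random
from collections import defaultdict

random.seed(0)

def minhash_signature(item_set, hash_params, prime=2147483647):
    # One min-hash per (a, b) pair: h(x) = (a*x + b) % prime
    return [min((a * x + b) % prime for x in item_set) for a, b in hash_params]

# Toy data: customer -> set of product ids (the sparse rows of the binary matrix)
customers = {
    "c1": {1, 2, 3},
    "c2": {1, 2, 4},
    "c3": {7, 8, 9},
    "c4": {7, 8, 10},
}

num_hashes, bands = 8, 4  # rows per band = num_hashes // bands
hash_params = [(random.randrange(1, 2**31), random.randrange(0, 2**31))
               for _ in range(num_hashes)]

sigs = {c: minhash_signature(s, hash_params) for c, s in customers.items()}

# LSH: customers agreeing on any whole band land in the same bucket
buckets = defaultdict(set)
rows = num_hashes // bands
for c, sig in sigs.items():
    for b in range(bands):
        key = (b, tuple(sig[b * rows:(b + 1) * rows]))
        buckets[key].add(c)

def jaccard(a, b):
    return len(a & b) / len(a | b)

def approx_knn(c, k=1):
    # Candidates = bucket co-members; rank them by exact Jaccard similarity
    cands = set().union(*(m for m in buckets.values() if c in m)) - {c}
    return sorted(cands, key=lambda o: -jaccard(customers[c], customers[o]))[:k]
```

Raising `bands` (shorter bands) makes buckets more permissive and larger; lowering it makes them stricter, which is the knob corresponding to "tweaking the similarity threshold" above.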

These schemes would obviously not yield exact results, but I wonder if they might be acceptable nonetheless. I'd also be interested in any other method or idea to tackle this problem.

cjauvin

1 Answer


The segmented approaches you propose appear reasonable, as do your concerns about their limitations. I have not worked in this field, but given your problem formulation, have you considered K-SVD? I suggest it because you can take advantage both of the numerical tools available for sparse SVD decompositions and of the conceptual framework of using SVD information to construct a $k$-means clustering. There are quite a few papers where you can check whether the assumptions behind this approach generalize to your situation's characteristics (e.g. here and here); in addition, MATLAB code is available on the original authors' website. Finally, check this CV thread on [how to explain the connection between SVD and clustering](https://stats.stackexchange.com/questions/43099/how-to-explain-the-connection-between-svd-and-clustering); it should give you a first helpful nudge.
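The SVD-then-cluster idea can be sketched with SciPy: compute a truncated SVD of the sparse binary matrix, embed each customer in the resulting latent space, and cluster the embeddings so that cluster co-members serve as candidate neighbors. The matrix sizes, rank, and cluster count below are toy placeholders, and this uses plain truncated SVD plus $k$-means rather than K-SVD proper.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds
from scipy.cluster.vq import kmeans2

# Toy stand-in for the n x m sparse binary purchase matrix
X = sparse_random(200, 500, density=0.02, random_state=0, format="csr")
X.data[:] = 1.0  # binarize the nonzero entries

# Truncated SVD: embed each customer (row) in an r-dimensional latent space
r = 10
U, s, Vt = svds(X, k=r)
embedding = U * s  # row i = latent representation of customer i

# Cluster the embeddings; members of a cluster are candidate nearest neighbors
centroids, labels = kmeans2(embedding, 20, minit="++", seed=0)
```

Exact $k$-NN can then be run by brute force within each cluster, which mirrors the bucket-based scheme in the question but uses the latent space instead of raw minhash buckets.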

usεr11852