
Are there any open-source implementations that can efficiently solve the following problem?

  1. I'm given a fixed set $S$ of $N$ vectors, each $n$-dimensional, where $N$ is on the order of a million.

  2. Given an $n$-dimensional query vector $v$, I want to find the $K$ vectors $w_1,\ldots,w_K$ in $S$ with the highest cosine similarity to $v$.

Here $K$ should be a parameter that I can choose at query time. I know there are various metric data structures that can be used for queries like these. There is also a paper from Google at last year's NIPS that does exactly this, using an internal library:

http://papers.nips.cc/paper/7157-multiscale-quantization-for-fast-similarity-search
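For concreteness, here is a brute-force baseline of the query I want to speed up (a NumPy sketch; at $N \approx 10^6$ this exact scan is what I'm hoping a library can beat):

```python
import numpy as np

def topk_cosine(S, v, K):
    """Exact top-K cosine similarity by brute force.

    S: (N, n) matrix of database vectors, v: (n,) query, K: result count.
    """
    # Normalize rows of S and the query so dot products equal cosine similarities.
    S_norm = S / np.linalg.norm(S, axis=1, keepdims=True)
    v_norm = v / np.linalg.norm(v)
    sims = S_norm @ v_norm                      # (N,) cosine similarities
    # argpartition selects the K largest in O(N); then sort just those K.
    idx = np.argpartition(-sims, K)[:K]
    return idx[np.argsort(-sims[idx])]          # indices, highest similarity first
```

This is $O(Nn)$ per query, which is the cost any approximate index is trying to undercut.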

dst

3 Answers


If your vectors are dense, you can use map-reduce code for matrix multiplication (that's the best that can be said with the limited information you have provided).
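Concretely, the dense case is one large matrix-vector multiply that parallelizes trivially: each worker scores a chunk of rows (map) and the per-chunk top-$K$ lists are merged (reduce). A minimal single-machine sketch of that pattern, assuming rows of $S$ and the query $v$ are already L2-normalized so that dot product equals cosine similarity:

```python
import numpy as np

def topk_chunk(chunk, start, v, K):
    """Map step: exact top-K within one chunk of row-normalized vectors."""
    sims = chunk @ v
    k = min(K, len(sims))
    idx = np.argpartition(-sims, k - 1)[:k]
    return [(start + i, sims[i]) for i in idx]

def topk_mapreduce(S, v, K, chunk_size=2):
    """Dense top-K cosine search as a map-reduce over row chunks of S."""
    candidates = []
    for start in range(0, len(S), chunk_size):           # "map" over chunks
        candidates += topk_chunk(S[start:start + chunk_size], start, v, K)
    candidates.sort(key=lambda t: -t[1])                 # "reduce": merge top-K lists
    return [i for i, _ in candidates[:K]]
```

In an actual map-reduce job, each `topk_chunk` call runs on a different worker and only $O(K)$ candidates per chunk cross the network.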

If the vectors are sparse, you can do much better. See

http://www.jmlr.org/papers/volume14/bosagh-zadeh13a/bosagh-zadeh13a.pdf

The authors provide code for their work.

Sid

What you're describing seems like an ideal application for locality-sensitive hashing (LSH). Several libraries implement nearest-neighbor search using LSH.

Some (Python) examples:
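For intuition, here is a minimal sketch of the random-hyperplane construction (Charikar's SimHash) that such libraries build on; this toy version scans all signatures rather than bucketing them, which real implementations do:

```python
import numpy as np

class CosineLSH:
    """Minimal random-hyperplane LSH index for approximate cosine search.

    A sketch, not a production index: real libraries bucket vectors by
    signature so most of the database is never touched at query time.
    """
    def __init__(self, dim, n_bits=64, seed=0):
        rng = np.random.default_rng(seed)
        # One random hyperplane per bit; the sign of the projection is the bit.
        self.planes = rng.standard_normal((dim, n_bits))

    def signature(self, X):
        return np.atleast_2d(X) @ self.planes > 0

    def fit(self, S):
        self.sigs = self.signature(S)
        return self

    def query(self, v, K):
        # Two vectors agree on a bit with probability 1 - theta/pi, so more
        # agreeing bits means a smaller angle, i.e. higher cosine similarity.
        agree = (self.sigs == self.signature(v)).sum(axis=1)
        return np.argsort(-agree)[:K]
```

The Hamming comparisons on bit signatures are far cheaper than full dot products in high dimensions, which is where the speedup comes from.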

Jakub Bartczuk

Facebook’s FAISS is very good at this kind of thing, and it also scales well:

https://code.fb.com/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
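Note that FAISS indexes either minimize L2 distance or maximize inner product, so cosine search is typically done by L2-normalizing every vector first: on unit vectors the inner product equals the cosine similarity, and an inner-product index such as `IndexFlatIP` (or one of the approximate variants) then returns the cosine top-$K$ directly. A NumPy sketch of that equivalence:

```python
import numpy as np

def normalize(X):
    """L2-normalize rows so that inner product equals cosine similarity."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# On unit vectors, <a, b> = cos(angle between a and b), so an inner-product
# index (e.g. FAISS's IndexFlatIP) searched with normalized vectors returns
# exactly the cosine-similarity top K.
S = normalize(np.random.default_rng(0).standard_normal((1000, 32)))
v = normalize(np.random.default_rng(1).standard_normal((1, 32)))[0]

inner = S @ v
cosine = (S @ v) / (np.linalg.norm(S, axis=1) * np.linalg.norm(v))
topK = np.argsort(-inner)[:10]        # same ranking either way
```

The same normalization trick applies to any library that only exposes dot-product or Euclidean search.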

Alex R.
  • Thanks. May I know the difference between Faiss index and database index, please? – Avv Mar 14 '23 at 02:42