
Are there any open-source implementations that can efficiently solve the following problem?

  1. I'm given a fixed set $S$ of $N$ vectors, each $n$-dimensional, where $N$ is on the order of a million.

  2. Given an $n$-dimensional query vector $v$, I want to find the $K$ vectors $w_1,\ldots,w_K$ in $S$ with the highest cosine similarity to $v$.

Here $K$ should be a parameter that I can choose at query time. I know there are various metric data structures that can be used for queries like these. There is also a paper from Google at last year's NIPS that does exactly this, using an internal library:

http://papers.nips.cc/paper/7157-multiscale-quantization-for-fast-similarity-search
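For concreteness, here is a brute-force baseline of the query I want to speed up (a NumPy sketch; at $N \approx 10^6$ this exact scan is what I'm hoping a library can beat):

```python
import numpy as np

def topk_cosine(S, v, K):
    """Exact top-K cosine similarity by brute force.

    S: (N, n) matrix of database vectors, v: (n,) query, K: result count.
    """
    # Normalize rows of S and the query so dot products equal cosine similarities.
    S_norm = S / np.linalg.norm(S, axis=1, keepdims=True)
    v_norm = v / np.linalg.norm(v)
    sims = S_norm @ v_norm                      # (N,) cosine similarities
    # argpartition selects the K largest in O(N); then sort just those K.
    idx = np.argpartition(-sims, K)[:K]
    return idx[np.argsort(-sims[idx])]          # indices, highest similarity first
```

This is $O(Nn)$ per query, which is the cost any approximate index is trying to undercut.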

dst

3 Answers


If your vectors are dense, you can use map-reduce code for matrix multiplication (that's the best that can be said with the limited information you have provided).
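Concretely, the dense case is one large matrix-vector multiply that parallelizes trivially: each worker scores a chunk of rows (map) and the per-chunk top-$K$ lists are merged (reduce). A minimal single-machine sketch of that pattern, assuming rows of $S$ and the query $v$ are already L2-normalized so that dot product equals cosine similarity:

```python
import numpy as np

def topk_chunk(chunk, start, v, K):
    """Map step: exact top-K within one chunk of row-normalized vectors."""
    sims = chunk @ v
    k = min(K, len(sims))
    idx = np.argpartition(-sims, k - 1)[:k]
    return [(start + i, sims[i]) for i in idx]

def topk_mapreduce(S, v, K, chunk_size=2):
    """Dense top-K cosine search as a map-reduce over row chunks of S."""
    candidates = []
    for start in range(0, len(S), chunk_size):           # "map" over chunks
        candidates += topk_chunk(S[start:start + chunk_size], start, v, K)
    candidates.sort(key=lambda t: -t[1])                 # "reduce": merge top-K lists
    return [i for i, _ in candidates[:K]]
```

In an actual map-reduce job, each `topk_chunk` call runs on a different worker and only $O(K)$ candidates per chunk cross the network.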

If the vectors are sparse, you can do much better. See

http://www.jmlr.org/papers/volume14/bosagh-zadeh13a/bosagh-zadeh13a.pdf

The authors provide code for their work.

Sid

What you're describing seems like an ideal application for locality-sensitive hashing (LSH). Several libraries implement nearest-neighbor search using LSH.

Some (Python) examples:
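For intuition, here is a minimal sketch of the random-hyperplane construction (Charikar's SimHash) that such libraries build on; this toy version scans all signatures rather than bucketing them, which real implementations do:

```python
import numpy as np

class CosineLSH:
    """Minimal random-hyperplane LSH index for approximate cosine search.

    A sketch, not a production index: real libraries bucket vectors by
    signature so most of the database is never touched at query time.
    """
    def __init__(self, dim, n_bits=64, seed=0):
        rng = np.random.default_rng(seed)
        # One random hyperplane per bit; the sign of the projection is the bit.
        self.planes = rng.standard_normal((dim, n_bits))

    def signature(self, X):
        return np.atleast_2d(X) @ self.planes > 0

    def fit(self, S):
        self.sigs = self.signature(S)
        return self

    def query(self, v, K):
        # Two vectors agree on a bit with probability 1 - theta/pi, so more
        # agreeing bits means a smaller angle, i.e. higher cosine similarity.
        agree = (self.sigs == self.signature(v)).sum(axis=1)
        return np.argsort(-agree)[:K]
```

The Hamming comparisons on bit signatures are far cheaper than full dot products in high dimensions, which is where the speedup comes from.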

Jakub Bartczuk

Facebook’s FAISS is very good at this kind of thing, and it also scales well:

https://code.fb.com/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
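Note that FAISS indexes either minimize L2 distance or maximize inner product, so cosine search is typically done by L2-normalizing every vector first: on unit vectors the inner product equals the cosine similarity, and an inner-product index such as `IndexFlatIP` (or one of the approximate variants) then returns the cosine top-$K$ directly. A NumPy sketch of that equivalence:

```python
import numpy as np

def normalize(X):
    """L2-normalize rows so that inner product equals cosine similarity."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# On unit vectors, <a, b> = cos(angle between a and b), so an inner-product
# index (e.g. FAISS's IndexFlatIP) searched with normalized vectors returns
# exactly the cosine-similarity top K.
S = normalize(np.random.default_rng(0).standard_normal((1000, 32)))
v = normalize(np.random.default_rng(1).standard_normal((1, 32)))[0]

inner = S @ v
cosine = (S @ v) / (np.linalg.norm(S, axis=1) * np.linalg.norm(v))
topK = np.argsort(-inner)[:10]        # same ranking either way
```

The same normalization trick applies to any library that only exposes dot-product or Euclidean search.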

Alex R.
  • Thanks. May I know the difference between Faiss index and database index, please? – Avv Mar 14 '23 at 02:42