I have been exploring ways to efficiently find the sentence most similar to a given sentence in a collection of, say, 1 million sentences.
During my search I came across gensim and its doc2vec model, but my vocabulary is far too small for that approach to work well.
My sentences are basically names of food ingredients, like:
milk, banana, apple ...
My aim is as follows:
I have a collection of, say, 1 million sentences, like
[[milk spinach], [apple banana] ...]
You can assume each sentence is the ingredients of a recipe.
Given a user's ingredients, I want to retrieve the most similar ingredient list from the collection above.
Similarity should be higher for sentences that share more ingredients with the query, irrespective of the order of the ingredients within a sentence.
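To make the notion of similarity concrete, here is a minimal sketch of one measure that behaves this way. It assumes Jaccard similarity over sets of ingredients, which is only one possible choice; the function name `jaccard` is made up for illustration.

```python
def jaccard(a, b):
    """Order-insensitive overlap between two ingredient lists.

    Assumption: similarity = |shared ingredients| / |all distinct ingredients|.
    """
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty lists
    return len(set_a & set_b) / len(set_a | set_b)


# Same ingredients in a different order are treated as identical.
same = jaccard(["milk", "spinach"], ["spinach", "milk"])

# One shared ingredient out of three distinct ones gives a lower score.
partial = jaccard(["milk", "spinach"], ["milk", "banana"])
```

Because both lists are converted to sets first, reordering a sentence never changes its score, and sharing more ingredients raises it.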
I am looking for an approach that queries the records efficiently, rather than computing the similarity between the new sentence and every single record.
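For reference, the naive baseline I would like to avoid looks roughly like the sketch below. It assumes similarity is simply the number of shared ingredients; the names `shared_count` and `most_similar_brute_force` are made up for illustration.

```python
def shared_count(query, record):
    """Number of ingredients the two lists have in common (order ignored)."""
    return len(set(query) & set(record))


def most_similar_brute_force(query, records):
    """Linear scan: scores the query against every record on each call."""
    return max(records, key=lambda record: shared_count(query, record))


# Tiny stand-in for the real collection of ~1 million ingredient lists.
records = [
    ["milk", "spinach"],
    ["apple", "banana"],
    ["milk", "banana", "apple"],
]

best = most_similar_brute_force(["apple", "banana", "milk"], records)
```

Every query touches all records, so the cost per query grows linearly with the size of the collection, which is what I am hoping to improve on.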