So I have been exploring ways to efficiently find the sentence most similar to a given sentence in a record of, say, 1 million sentences.

During my search, I stumbled upon gensim and doc2vec, but my vocabulary is way too small for them to work well.

My sentences are basically names of food ingredients, like:

milk, banana, apple ...

So my aim is as follows:

I have a record of, say, 1 million sentences, like

[[milk spinach], [apple banana] ...]

You can assume each sentence is the ingredient list of a recipe.

Given the user's ingredients, I want to get the most similar ingredient list from the records above.

So the similarity should be higher for sentences that share more ingredients (irrespective of their order within the sentence).

I am looking for an approach to efficiently query the records, rather than computing the similarity of the new sentence against every single record.
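
For reference, here is a minimal sketch of the brute-force approach I want to avoid (plain Python, with a hypothetical `records` list and Jaccard overlap as just one possible similarity measure):

```python
# Brute-force baseline: score every record against the query.
# `records` is a hypothetical list of ingredient lists; similarity here is
# Jaccard overlap, i.e. shared ingredients / total distinct ingredients.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def most_similar_bruteforce(query, records):
    # One similarity computation per record, so O(N) work per query.
    return max(records, key=lambda rec: jaccard(query, rec))

records = [["milk", "spinach"], ["apple", "banana"], ["milk", "banana", "apple"]]
print(most_similar_bruteforce(["banana", "milk"], records))
# -> ['milk', 'banana', 'apple']
```

With 1 million records that means 1 million set comparisons per user query, which is exactly the cost I am hoping to avoid.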

Lonely
  • Does this answer your question? [Find the similarity metric between two strings](https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings) – bad_coder Dec 07 '21 at 18:50
  • Since order does not matter - at first glance I suggest a table with a column for each word that exists. Each "sentence" becomes a row in the table with boolean values at a few locations. You can do simple Minkowski distances from here between rows, or you can be more creative later and make a similarity matrix out of any metric you want. – D Adams Dec 07 '21 at 18:51
  • Is it possible to create a query index for this which makes the query part faster? Currently (as per the comments) it's more like calculating similarity scores for each record and then taking the top n ones. – Apoorv Jain Dec 07 '21 at 20:32
  • A simple bag-of-words representation of each 'sentence', compared by cosine-similarity against other sentences, might work OK. Or other 'pointwise mutual information' (PMI)-like calculations, just on raw counts of same-vs-different ingredients. If using word-vectors, the "word mover's distance" calculation might also work - though it quickly becomes expensive on large texts. – gojomo Dec 08 '21 at 03:39
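
Following the comment suggestions about a bag-of-words table and a query index, this is a rough sketch (plain Python, hypothetical helper names) of the kind of index-based querying I have in mind: an inverted index from ingredient to record ids, so that only records sharing at least one ingredient with the query get scored:

```python
from collections import defaultdict

# Rough sketch: an inverted index mapping each ingredient to the ids of the
# records that contain it, so a query only scores the candidate records that
# share at least one ingredient instead of all 1 million.

def build_index(records):
    index = defaultdict(set)
    for rec_id, rec in enumerate(records):
        for ingredient in set(rec):
            index[ingredient].add(rec_id)
    return index

def query(ingredients, records, index, top_n=3):
    # Count shared ingredients per candidate record.
    shared = defaultdict(int)
    for ingredient in set(ingredients):
        for rec_id in index.get(ingredient, ()):
            shared[rec_id] += 1
    # Rank candidates by Jaccard overlap with the query.
    q = set(ingredients)
    scored = ((count / len(q | set(records[rec_id])), rec_id)
              for rec_id, count in shared.items())
    return [records[rec_id] for _, rec_id in sorted(scored, reverse=True)[:top_n]]

records = [["milk", "spinach"], ["apple", "banana"], ["milk", "banana", "apple"]]
index = build_index(records)
print(query(["banana", "milk"], records, index))
# -> [['milk', 'banana', 'apple'], ['apple', 'banana'], ['milk', 'spinach']]
```

Is this roughly the right direction for making the query part fast, or is there a better-suited indexing structure for this kind of set-overlap query?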
