Jaccard distance vs Levenshtein distance for fuzzy matching

Asked Oct 14 '16 at 15:52

Active Feb 11 '17 at 18:13

My data resembles the following example, but is far larger and more complex.

Apple
Banana
Those fruits
Tomato 
Cocumber
These vegetables

I would like to get the following result:

Those fruits
These vegetables

Using the agrep/agrepl functions in R, I obtained a first result. However, agrep and agrepl use the Levenshtein distance by default. An alternative would be the Jaccard distance.
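
To make the difference between the two distances concrete, here is a minimal Python sketch (not R, and not what agrep does internally). It assumes the Jaccard distance is computed on sets of character bigrams, which is one common convention but not the only one. The helper names `levenshtein`, `qgrams` and `jaccard_distance` are made up for illustration, and "Cucumber" is assumed to be the intended spelling of the "Cocumber" entry above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams of s (assumption: q = 2)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 - |A & B| / |A | B| over the q-gram sets of a and b."""
    set_a, set_b = qgrams(a, q), qgrams(b, q)
    if not set_a and not set_b:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


# One substituted character: a single edit, but two of the seven bigrams
# on each side no longer match.
print(levenshtein("Cocumber", "Cucumber"))
print(jaccard_distance("Cocumber", "Cucumber"))
```

Levenshtein counts edits in sequence, so it is sensitive to character order and string length. Jaccard on q-grams only compares which q-grams occur, so reordered words are penalised far less.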

Jaccard distance vs Levenshtein distance: Which distance is better for fuzzy matching?

There is already a similar question: Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients - in sentence matching. However, I would like to know which distance works best for fuzzy matching.

Extra credit: Are other distance measures (e.g. N-Gram, Cosine, Geometric, Manhattan) also useful for fuzzy matching? Implementations in R are also welcome.
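
For reference, two of the other measures mentioned can be sketched in the same style. This is a Python illustration under stated assumptions, not an R implementation: strings are represented by their character bigram counts, "cosine" is one minus the cosine similarity of those count vectors, and "Manhattan" is the L1 distance between them. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def qgram_counts(s: str, q: int = 2) -> Counter:
    """Multiset of overlapping character q-grams of s (assumption: q = 2)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def cosine_distance(a: str, b: str, q: int = 2) -> float:
    """1 - cosine similarity between the q-gram count vectors of a and b."""
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    if norm == 0:
        return 1.0  # convention chosen here: no q-grams means no similarity
    return 1.0 - dot / norm


def manhattan_qgram_distance(a: str, b: str, q: int = 2) -> int:
    """L1 distance between the q-gram count vectors of a and b."""
    ca, cb = qgram_counts(a, q), qgram_counts(b, q)
    # A Counter returns 0 for q-grams it has not seen.
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


print(cosine_distance("Cocumber", "Cucumber"))
print(manhattan_qgram_distance("Cocumber", "Cucumber"))
```

Both measures, like Jaccard, ignore where in the string a q-gram occurs. Unlike Jaccard, they also take repeated q-grams into account.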

edited Apr 13 '17 at 12:44

Community

asked Oct 14 '16 at 15:52

Ferdi

If your data are far bigger, something like locality-sensitive hashing (LSH) might be faster while still preserving the "fuzzy" property in a very specific sense: some LSH schemes, such as MinHash, can easily be shown to approximate Jaccard similarity probabilistically. – Sycorax Oct 14 '16 at 16:12
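
A rough sketch of the idea in this comment, assuming MinHash is the LSH scheme meant: for a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity, so the fraction of matching minima over many hash functions estimates it. The code is Python with only the standard library. The names `minhash_signature` and `estimated_jaccard`, the use of seeded SHA-1 as a stand-in for random hash functions, and the choice of 256 hashes are all illustrative assumptions.

```python
import hashlib


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams of s (assumption: q = 2)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _seeded_hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of token; each seed acts as a different hash function."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: set, num_hashes: int = 256) -> list:
    """Minimum hash value of the (non-empty) token set under each seeded hash."""
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_a = minhash_signature(qgrams("Cocumber"))
sig_b = minhash_signature(qgrams("Cucumber"))
# The exact bigram Jaccard similarity of these two strings is 5/9;
# the estimate should land near that, with some sampling noise.
print(estimated_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of string size. In a full LSH setup they would additionally be split into bands and bucketed so that only likely matches are compared, which is not shown here.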

In R you have the stringdist package. You might want to check that one out. – phiver Oct 15 '16 at 06:47
