My data is similar to the following data, but far bigger and more complex.

Apple
Banana
Those fruits
Tomato 
Cocumber
These vegetables

I would like to get the following result:

Those fruits
These vegetables

Using the agrep/agrepl functions in R, I got a first result. However, agrep and agrepl are based on the (generalized) Levenshtein edit distance. An alternative would be the Jaccard distance.
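To make the two candidates concrete, here is a minimal sketch in plain Python (standard library only, not R). The helper names `levenshtein`, `qgrams` and `jaccard_distance` are my own, and computing Jaccard on character bigrams is an assumption on my part; it is only one of several ways to turn a string into a set, and it is not what agrep does internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (plain, unweighted)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping character q-grams (q=2 is an assumed default)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """1 - |A & B| / |A | B| on the q-gram sets of the two strings."""
    set_a, set_b = qgrams(a, q), qgrams(b, q)
    if not set_a and not set_b:
        return 0.0  # both strings shorter than q: treat as identical
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(levenshtein("Cocumber", "Cucumber"))                 # 1
    print(round(jaccard_distance("Cocumber", "Cucumber"), 3))  # 0.444
    # Word order changes: the q-gram sets overlap heavily, while the
    # edit distance has to pay for moving a whole word.
    print(levenshtein("those fruits", "fruits those"))
    print(round(jaccard_distance("those fruits", "fruits those"), 3))
```

The single typo costs one edit under Levenshtein but changes four of the nine distinct bigrams under Jaccard. Swapped word order shows the opposite behaviour: the bigram sets stay mostly the same, whereas the edit distance grows with the length of the moved word. This difference in sensitivity is what my question is about.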

Jaccard distance vs Levenshtein distance: Which distance is better for fuzzy matching?

There is already a similar question: Properties of Levenshtein, N-Gram, cosine and Jaccard distance coefficients - in sentence matching. However, I would like to know which distance works best for fuzzy matching.

Extra credit: Are other distance measures (e.g. N-Gram, Cosine, Geometric, Manhattan) also useful for fuzzy matching? Implementations in R are also welcome.

Ferdi
  • 2
    If your data are far bigger, something like LSH might be faster and also preserve the "fuzzy" property in a very specific sense. Some LSH schemes are easily demonstrated to be probabilistic Jaccard similarity. – Sycorax Oct 14 '16 at 16:12
  • 2
    In R you have the stringdist package. You might want to check that one out. – phiver Oct 15 '16 at 06:47