31

Is there a package that contains a Levenshtein distance function implemented in C or Fortran? I have many strings to compare, and stringMatch from MiscPsycho is too slow for this.

smci
mbq

4 Answers

20

The stringdist function in the stringdist package does this too, and it is even faster than levenshteinDist under certain conditions (1).
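As a minimal sketch of what this looks like, assuming the stringdist package is installed from CRAN: `method = "lv"` selects plain Levenshtein distance. The `words` vector is made-up sample data, not something from the question.

```r
# Assumes the stringdist package is installed:
# install.packages("stringdist")
library(stringdist)

# Hypothetical sample data for illustration
words <- c("kitten", "sitting", "mitten")

# Element-wise Levenshtein distance between two strings (or two vectors)
d_pair <- stringdist("kitten", "sitting", method = "lv")
d_pair  # 3

# All pairwise distances between two character vectors, as a numeric matrix
d_mat <- stringdistmatrix(words, words, method = "lv")
d_mat
```

For large inputs, `stringdistmatrix()` also takes an `nthread` argument that controls how many threads are used.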

Ben
    stringdist has sped up significantly since that blog you link to: it now uses multiple cores. –  Feb 26 '16 at 17:02
17

levenshteinDist (from the RecordLinkage package) calls compiled C code. Give it a try.

MichaelChirico
George Dontas
    Just noting the RecordLinkage package is apparently no longer maintained and has been pulled from CRAN. The `stringdist` package is the solution now. – Brian Stamper Feb 27 '20 at 17:42
6

You could also try stringDist from the Biostrings package.
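A rough sketch of how that might look, assuming Biostrings is installed from Bioconductor. The `words` vector is made-up sample data.

```r
# Assumes Biostrings is installed from Bioconductor:
# BiocManager::install("Biostrings")
library(Biostrings)

# Hypothetical sample data for illustration
words <- c("kitten", "sitting", "mitten")

# Pairwise Levenshtein distances, returned as a "dist" object
d <- stringDist(words, method = "levenshtein")

# Convert to a full symmetric matrix for easier indexing
m <- as.matrix(d)
m
```

Because the result is a `dist` object, it can be passed directly to functions such as `hclust()`.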

MichaelChirico
Aaron Statham
1

You could also use levenshtein_distance() from the textTinyR package. With larger character vectors of around 30k characters, every other package gave me 'calloc' memory errors. Only textTinyR worked for me!

interrobang