What are some algorithms for comparing how similar two strings are?

Question

I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

As opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. The current approach I have taken is to normalize the strings by lowercasing all letters and removing all punctuation and spaces giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

And:

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";

Comparing in this case, one is a sub-sequence of the other, but you can imagine other more complex variations where that does not necessarily occur, yet they have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and spelling errors.

Perhaps some kind of character diff program could help? I've seen good line diff programs for comparing differences in code to be checked in, is there something like that on a character basis, maybe in boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification as to how similar the two strings are to each other which I can then convert into a yes/no answer by way of some heuristic?

I've used the Levenshtein distance before. Easy to implement... http://en.wikipedia.org/wiki/Levenshtein_distance — souldzin, Mar 08 '13 at 21:32
Sorry, not constructive... Here is the [wiki page you were looking for](http://en.wikipedia.org/wiki/String_metric). — djechlin, Mar 08 '13 at 21:41
@WhozCraig: Thanks, but that would not be fair, make that your answer and collect the rep. :) — Daniel Frey, Mar 08 '13 at 21:57
@bamboon There is no c++ or boost in the question. There isn't a problem the OP has, other than not having looked for long enough to find what algorithms are available. Not meaning to be rude, but there are probably better places to ask this. Yes it's interesting to me too. — Peter Wood, Mar 08 '13 at 21:59
@MarkRansom: That's the point, WhozCraig did that, so I won't take the credit for it. — Daniel Frey, Mar 08 '13 at 22:00
@DanielFrey, sorry I didn't see that. I wouldn't worry about who gets credit, unless it makes you uncomfortable to have your name associated with something that isn't really yours. From the site's perspective, the most important point is to have answers. — Mark Ransom, Mar 08 '13 at 22:05
tons of implementations for edit distance: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance — zzk, Mar 08 '13 at 22:08
Concerning your request for a 'boolean' decision the best algorithms I believe only return a distance between two strings. Given your case, you might want to consider a method that compares the words of one string to the words of another. — souldzin, Mar 08 '13 at 22:11
@MarkRansom: I don't have a problem with my name being associated, I just think it's unfair to take credit for other people's work. But with all those comments around, it should be clear by now :) — Daniel Frey, Mar 08 '13 at 22:16
possible duplicate of [Getting the closest string match](http://stackoverflow.com/questions/5859561/getting-the-closest-string-match) — Alain, Mar 09 '13 at 03:38
@bamboon because OP is asking for a list and there are multiple correct answers because there aren't enough criteria. It's a discussion on string algorithms. Good SO questions have a unique best answer. — djechlin, Mar 09 '13 at 04:52
Another very useful article for those interested: http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/ — aug, Mar 05 '15 at 23:01

Daniel Frey · Accepted Answer · 2013-03-08T21:58:43.260

What you're looking for are called String Metric algorithms. There a significant number of them, many with similar characteristics. Among the more popular:

Levenshtein Distance : The minimum number of single-character edits required to change one word into the other. Strings do not have to be the same length
Hamming Distance : The number of characters that are different in two equal length strings.
Smith–Waterman : A family of algorithms for computing variable sub-sequence similarities.
Sørensen–Dice Coefficient : A similarity algorithm that computes difference coefficients of adjacent character pairs.

Have a look at these as well as others on the wiki page on the topic.

score 15 · Answer 2 · edited May 23 '17 at 22:47

15

Damerau Levenshtein distance is another algorithm for comparing two strings and it is similar to the Levenshtein distance algorithm. The difference between the two is that it can also check transpositions between characters and hence may give a better result for error correction.

For example: The Levenshtein distance between night and nigth is 2 but Damerau Levenshtein distance between night and nigth will be 1 because it is just a swap of a pair of characters.

edited May 23 '17 at 22:47

Captain Man

6,270
4
44
68

answered Nov 10 '13 at 13:53

Ankit Chaurasia

585
5
7

3

Please add references (web, books, papers...) – Max Leske Nov 10 '13 at 14:11

score 7 · Answer 3 · answered Feb 03 '15 at 19:06

You could use ngrams for that. For example, transform the two strings in word trigrams (usually lowercase) and compare the percentage of them that are equal to one another.

Your challenge is to define a minimum percentage for similarity.

http://en.wikipedia.org/wiki/N-gram

score 3 · Answer 4 · answered May 08 '19 at 17:30

Another algorithm that you can consider is the Simon White Similarity:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    if string is None:
        return ""

    s = string.lower()
    return [s[i : i + 2] for i in list(range(len(s) - 1))]

def simon_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)

    if union == 0 or union is None:
        return 0

    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

score 0 · Answer 5 · answered Jul 27 '19 at 14:03

You may use the algorithm of computing the length of the longest common sub-sequence to solve the problem. If the length of the longest common sub-sequence for both the input strings is less than the length of either of the strings, they are unequal.

You may use the approach of dynamic programming to solve the problem and optimize the space complexity as well in case you don't wish to figure out the longest common sub-sequence.

What are some algorithms for comparing how similar two strings are?

5 Answers5

Linked