77

I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

As opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. The current approach I have taken is to normalize the strings by lowercasing all letters and removing all punctuation and spaces giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

And:

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";

Comparing in this case, one is a sub-sequence of the other, but you can imagine other more complex variations where that does not necessarily occur, yet they have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and spelling errors.

Perhaps some kind of character diff program could help? I've seen good line diff programs for comparing differences in code to be checked in, is there something like that on a character basis, maybe in boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification as to how similar the two strings are to each other which I can then convert into a yes/no answer by way of some heuristic?

Bhargav Rao
  • 45,811
  • 27
  • 120
  • 136
WilliamKF
  • 38,693
  • 65
  • 183
  • 286
  • 7
    I've used the Levenshtein distance before. Easy to implement... http://en.wikipedia.org/wiki/Levenshtein_distance – souldzin Mar 08 '13 at 21:32
  • Is there a Levenshtein distance in Boost? – WilliamKF Mar 08 '13 at 21:36
  • 1
    Sorry, not constructive... Here is the [wiki page you were looking for](http://en.wikipedia.org/wiki/String_metric). – djechlin Mar 08 '13 at 21:41
  • @djechlin Why? This is an interesting question. – Stephan Dollberg Mar 08 '13 at 21:50
  • @WhozCraig: Thanks, but that would not be fair, make that your answer and collect the rep. :) – Daniel Frey Mar 08 '13 at 21:57
  • @bamboon There is no c++ or boost in the question. There isn't a problem the OP has, other than not having looked for long enough to find what algorithms are available. Not meaning to be rude, but there are probably better places to ask this. Yes it's interesting to me too. – Peter Wood Mar 08 '13 at 21:59
  • @MarkRansom: That's the point, WhozCraig did that, so I won't take the credit for it. – Daniel Frey Mar 08 '13 at 22:00
  • @DanielFrey, sorry I didn't see that. I wouldn't worry about who gets credit, unless it makes you uncomfortable to have your name associated with something that isn't really yours. From the site's perspective, the most important point is to have answers. – Mark Ransom Mar 08 '13 at 22:05
  • tons of implementations for edit distance: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance – zzk Mar 08 '13 at 22:08
  • Concerning your request for a 'boolean' decision the best algorithms I believe only return a distance between two strings. Given your case, you might want to consider a method that compares the words of one string to the words of another. – souldzin Mar 08 '13 at 22:11
  • @MarkRansom: I don't have a problem with my name being associated, I just think it's unfair to take credit for other people's work. But with all those comments around, it should be clear by now :) – Daniel Frey Mar 08 '13 at 22:16
  • possible duplicate of [Getting the closest string match](http://stackoverflow.com/questions/5859561/getting-the-closest-string-match) – Alain Mar 09 '13 at 03:38
  • @bamboon because OP is asking for a list and there are multiple correct answers because there aren't enough criteria. It's a discussion on string algorithms. Good SO questions have a unique best answer. – djechlin Mar 09 '13 at 04:52
  • Another very useful article for those interested: http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/ – aug Mar 05 '15 at 23:01

5 Answers5

101

What you're looking for are called String Metric algorithms. There a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance : The minimum number of single-character edits required to change one word into the other. Strings do not have to be the same length
  • Hamming Distance : The number of characters that are different in two equal length strings.
  • Smith–Waterman : A family of algorithms for computing variable sub-sequence similarities.
  • Sørensen–Dice Coefficient : A similarity algorithm that computes difference coefficients of adjacent character pairs.

Have a look at these as well as others on the wiki page on the topic.

Daniel Frey
  • 54,116
  • 13
  • 115
  • 176
15

Damerau Levenshtein distance is another algorithm for comparing two strings and it is similar to the Levenshtein distance algorithm. The difference between the two is that it can also check transpositions between characters and hence may give a better result for error correction.

For example: The Levenshtein distance between night and nigth is 2 but Damerau Levenshtein distance between night and nigth will be 1 because it is just a swap of a pair of characters.

Captain Man
  • 6,270
  • 4
  • 44
  • 68
Ankit Chaurasia
  • 585
  • 5
  • 7
7

You could use ngrams for that. For example, transform the two strings in word trigrams (usually lowercase) and compare the percentage of them that are equal to one another.

Your challenge is to define a minimum percentage for similarity.

http://en.wikipedia.org/wiki/N-gram

noderman
  • 1,884
  • 18
  • 34
3

Another algorithm that you can consider is the Simon White Similarity:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    if string is None:
        return ""

    s = string.lower()
    return [s[i : i + 2] for i in list(range(len(s) - 1))]
def simon_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)

    if union == 0 or union is None:
        return 0

    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union
0

You may use the algorithm of computing the length of the longest common sub-sequence to solve the problem. If the length of the longest common sub-sequence for both the input strings is less than the length of either of the strings, they are unequal.

You may use the approach of dynamic programming to solve the problem and optimize the space complexity as well in case you don't wish to figure out the longest common sub-sequence.

nmg_vikas
  • 1
  • 1