Find most common substring in a list of strings?

Question

I have a Python list of string names where I would like to remove a common substring from all of the names.

And after reading this similar answer I could almost achieve the desired result using SequenceMatcher.

But only when all items have a common substring:

From List:
string 1 = myKey_apples
string 2 = myKey_appleses
string 3 = myKey_oranges

common substring = "myKey_"

To List:
string 1 = apples
string 2 = appleses
string 3 = oranges

However I have a slightly noisy list that contains a few scattered items that don't fit the same naming convention.

I would like to remove the "most common" substring from the majority:

From List:
string 1 = myKey_apples
string 2 = myKey_appleses
string 3 = myKey_oranges
string 4 = foo
string 5 = myKey_Banannas

common substring = ""

To List:
string 1 = apples
string 2 = appleses
string 3 = oranges
string 4 = foo
string 5 = Banannas

I need a way to match the "myKey_" substring so I can remove it from all names.

But when I use the SequenceMatcher the item "foo" causes the "longest match" to be equal to blank "".

I think the only way to solve this is to find the "most common substring". But how could that be accomplished?

Basic example code:

from difflib import SequenceMatcher

names = ["myKey_apples",
"myKey_appleses",
"myKey_oranges",
#"foo",
"myKey_Banannas"]

string2 = names[0]
for i in range(1, len(names)):
    string1 = string2
    string2 = names[i]
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

print(string1[match.a: match.a + match.size]) # -> myKey_

I think that you want to not actually remove the most common substring. It will usually be only one letter. For example, all of your strings contain the letter `s` (e.g. `apples` ends in `s`). There are two numbers of interest: length of the sub-string and percentage of the search space containing the sub-string. As you increase one, you decrease the other. — Toothpick Anemone, Oct 28 '19 at 05:54
I think this would help: https://en.wikipedia.org/wiki/Longest_common_substring_problem — Jethro Cao, Oct 28 '19 at 05:56
If I understand this question correctly, the goal is to find the [longest common substring of several strings](https://stackoverflow.com/questions/40556491/how-to-find-the-longest-common-substring-of-multiple-strings). — Anderson Green, Apr 10 '22 at 19:20

score 11 · Accepted Answer · answered Oct 28 '19 at 05:56

Given names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]

An O(n^2) solution I can think of is to find all possible substrings and storing them in a dictionary with the number of times they occur :

substring_counts={}

for i in range(0, len(names)):
    for j in range(i+1,len(names)):
        string1 = names[i]
        string2 = names[j]
        match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
        matching_substring=string1[match.a:match.a+match.size]
        if(matching_substring not in substring_counts):
            substring_counts[matching_substring]=1
        else:
            substring_counts[matching_substring]+=1

print(substring_counts) #{'myKey_': 5, 'myKey_apples': 1, 'o': 1, '': 3}

And then picking the maximum occurring substring

import operator
max_occurring_substring=max(substring_counts.iteritems(), key=operator.itemgetter(1))[0]
print(max_occurring_substring) #myKey_

Now that's heavenly stuff! Even worked with this crazy list: `["sti_myKey_123","sti_myKey_233","stimm_myKey_676","sti_myKey_879","foo","sti_myKey_2345","sti_myKey_test3","ti_myKey_123"]` Result="sti_myKey_". Many thanks :) — Logic1, Oct 28 '19 at 07:10
Others may also want to look at SequenceMatcher.get_matching_blocks https://www.kite.com/python/docs/difflib.SequenceMatcher.get_matching_blocks — Under-qualified NASA Intern, Feb 24 '22 at 21:00

score 1 · Answer 2 · answered Oct 28 '19 at 05:54

Here's a overly verbose solution to your problem:

def find_matching_key(list_in, max_key_only = True):
  """
  returns the longest matching key in the list * with the highest frequency
  """
  keys = {}
  curr_key = ''

  # If n does not exceed max_n, don't bother adding
  max_n = 0

  for word in list(set(list_in)): #get unique values to speed up
    for i in range(len(word)):
      # Look up the whole word, then one less letter, sequentially
      curr_key = word[0:len(word)-i]
      # if not in, count occurance
      if curr_key not in keys.keys() and curr_key!='':
        n = 0
        for word2 in list_in:
          if curr_key in word2:
            n+=1
        # if large n, Add to dictionary
        if n > max_n:
          max_n = n
          keys[curr_key] = n
    # Finish the word
  # Finish for loop  
  if max_key_only:
    return max(keys, key=keys.get)
  else:
    return keys    

# Create your "from list"
From_List = [
             "myKey_apples",
             "myKey_appleses",
             "myKey_oranges",
             "foo",
             "myKey_Banannas"
]

# Use the function
key = find_matching_key(From_List, True)

# Iterate over your list, replacing values
new_From_List = [x.replace(key,'') for x in From_List]

print(new_From_List)
['apples', 'appleses', 'oranges', 'foo', 'Banannas']

Needless to say, this solution would look a lot neater with recursion. Thought I'd sketch out a rough dynamic programming solution for you though.

RoyM · Answer 3 · 2019-10-28T06:27:05.523

I would first find the starting letter with the most occurrences. Then I would take each word having that starting letter, and take while all these words have matching letters. Then in the end I would remove the prefix that was found from each starting word:

from collections import Counter
from itertools import takewhile

strings = ["myKey_apples", "myKey_appleses", "myKey_oranges", "berries"]

def remove_mc_prefix(words):
    cnt = Counter()
    for word in words:
        cnt[word[0]] += 1
    first_letter = list(cnt)[0]

    filter_list = [word for word in words if word[0] == first_letter]
    filter_list.sort(key = lambda s: len(s)) # To avoid iob

    prefix = ""
    length = len(filter_list[0])
    for i in range(length):
        test = filter_list[0][i]
        if all([word[i] == test for word in filter_list]):
            prefix += test
        else: break
    return [word[len(prefix):] if word.startswith(prefix) else word for word in words]

print(remove_mc_prefix(strings))

Out: ['apples', 'appleses', 'oranges', 'berries']

Find most common substring in a list of strings?

3 Answers3

Linked