
I want to generate a wildcard string from a pair of file names. Kind of an inverse-glob. Example:

file1 = 'some foo file.txt'
file2 = 'some bar file.txt'
assert 'some * file.txt' == inverse_glob(file1, file2)

Use difflib perhaps? Has this been solved already?

Application is a large set of data files with similar names. I want to compare each pair of file names and then present a comparison of pairs of files with "similar" names. I figure if I can do a reverse-glob on each pair, then those pairs with "good" wildcards (e.g. not lots*of*stars*.txt nor *) are good candidates for comparison. So I might take the output of this putative inverse_glob() and reject wildcards that have more than one * or for which glob() doesn't produce exactly two files.
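The filtering step described above could be sketched roughly as follows. This is only an illustration: `is_good_wildcard` and its `max_stars` parameter are hypothetical names, and it uses `fnmatch.fnmatchcase` against an in-memory list of names instead of calling `glob()` on the real directory.

```python
import fnmatch

def is_good_wildcard(pattern, all_names, max_stars=1):
    """Hypothetical helper: accept a pattern only if it has at most
    max_stars asterisks and matches exactly two of the known names."""
    if pattern.count('*') > max_stars:
        return False
    # fnmatchcase compares case-sensitively on every platform.
    matches = [n for n in all_names if fnmatch.fnmatchcase(n, pattern)]
    return len(matches) == 2

names = ['some foo file.txt', 'some bar file.txt', 'other.txt']
for pattern in ('some * file.txt', '*', 'lots*of*stars*.txt'):
    print(pattern, is_good_wildcard(pattern, names))
```

Note that `fnmatch` also treats `?` and `[...]` as special, so names containing brackets would need escaping first.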

Bob Stein
  • The general solution to this problem is probably not very easily found. You talk about the filenames being similar. It is most likely possible to find a simpler solution, taking advantage of this. Perhaps you can give some examples of typical filenames? – JohanL May 05 '17 at 18:26
  • @JohanL thanks for thinking outside the box. I'd like to focus this article on the glob-inverse strategy so the question is more useful for posterity. I did simplify the question to 2 files, which you're right is much less general and much simpler. To answer, my files in the wild differ in different ways in different places, e.g. "filename.txt" and "filename2.txt" or "the 24MHz run new.sr" and "the 16MHz run old.sr" – Bob Stein May 05 '17 at 18:45

1 Answer


For instance:

Filenames:

names = [('some foo file.txt','some bar file.txt', 'some * file.txt'),
         ("filename.txt", "filename2.txt", "filenam*.txt"),
         ("1filename.txt", "filename2.txt", "*.txt"),
         ("inverse_glob", "inverse_glob2", "inverse_glo*"),
         ("the 24MHz run new.sr", "the 16MHz run old.sr", "the *MHz run *.sr")]

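Function:

import re

def inverse_glob(f1, f2, force_single_asterisk=None):
    def adjust_name(name, diff):
        # Pad the shorter name to the length of the longer one: drop the last
        # character of the stem and append diff + 1 '?' placeholders, keeping
        # the extension (the text after the last dot) in place.
        stem, dot, ext = name.rpartition('.')
        if not stem:  # no extension, or a leading-dot name
            stem, dot, ext = name, '', ''
        return stem[:-1] + '?' * (diff + 1) + dot + ext

    l1, l2 = len(f1), len(f2)
    if l1 > l2:
        f2 = adjust_name(f2, l1 - l2)
    elif l2 > l1:
        f1 = adjust_name(f1, l2 - l1)

    # Keep characters that agree at the same position; mark the rest with '?'.
    result = ''.join(c1 if c1 == c2 else '?' for c1, c2 in zip(f1, f2))
    # Collapse each run of two or more '?' into a single '*'.
    result = re.sub(r'\?{2,}', '*', result)
    if force_single_asterisk:
        # Merge everything from the first '*' to the last into one '*'.
        result = re.sub(r'\*.+\*', '*', result)
    return result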

Usage:

for name in names:
    result = inverse_glob(name[0], name[1])
    print('{:20} <=> {:20} = {}'.format(name[0], name[1], result))
    assert name[2] == result

Output:

some foo file.txt    <=> some bar file.txt    = some * file.txt  
filename.txt         <=> filename2.txt        = filenam*.txt  
1filename.txt        <=> filename2.txt        = *.txt  
inverse_glob         <=> inverse_glob2        = inverse_glo*
the 24MHz run new.sr <=> the 16MHz run old.sr = the *MHz run *.sr

Tested with Python 3.4.2.
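Since the question mentions difflib, here is a minimal alternative sketch built on `difflib.SequenceMatcher`. It is not the approach used above: it keeps the blocks the two names have in common and writes a `*` wherever they differ. The function name `inverse_glob_difflib` is made up for this example, and the results can differ slightly from the position-based version (for instance `filename*.txt` rather than `filenam*.txt`).

```python
import difflib

def inverse_glob_difflib(f1, f2):
    """Sketch: keep text common to both names, put '*' where they differ."""
    # autojunk=False disables the popularity heuristic for long inputs.
    matcher = difflib.SequenceMatcher(None, f1, f2, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            parts.append(f1[i1:i2])   # identical in both names
        else:
            parts.append('*')         # replace, insert or delete
    return ''.join(parts)

pairs = [('some foo file.txt', 'some bar file.txt'),
         ('filename.txt', 'filename2.txt'),
         ('the 24MHz run new.sr', 'the 16MHz run old.sr')]
for a, b in pairs:
    print('{:20} <=> {:20} = {}'.format(a, b, inverse_glob_difflib(a, b)))
```

Because `*` also matches the empty string, a pattern built this way still matches both input names. Names that themselves contain `*`, `?` or `[` would need extra escaping, which this sketch does not attempt.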

stovfl