How can I do multiple substitutions using regex?

Question

I can use the code below to create a new file in which a is replaced with aa, using regular expressions.

import re

with open("notes.txt") as text:
    new_text = re.sub("a", "aa", text.read())
    with open("notes2.txt", "w") as result:
        result.write(new_text)

I was wondering: to change more than one letter in my text, do I have to repeat this line, new_text = re.sub("a", "aa", text.read()), once for each letter I want to change?

That is, a --> aa, b --> bb and c --> cc.

So do I have to write that line for every letter I want to change, or is there an easier way, perhaps by creating a "dictionary" of translations? Should I put those letters into an array? I'm not sure how to look them up if I do.

Emmett Butler · Accepted Answer · 2013-12-30T17:09:23.543

76

The answer proposed by @nhahtdh is valid, but I would argue it is less Pythonic than the canonical example, which is less opaque than his regex manipulations and takes advantage of Python's built-in data structures and anonymous functions.

A dictionary of translations makes sense in this context. In fact, that's how the Python Cookbook does it, as shown in this example (copied from ActiveState http://code.activestate.com/recipes/81330-single-pass-multiple-replace/ )

import re

def multiple_replace(replacements, text):
  # Create a regular expression from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))

  # For each match, look up the corresponding value in the dictionary
  return regex.sub(lambda mo: replacements[mo.string[mo.start():mo.end()]], text)

if __name__ == "__main__":

  text = "Larry Wall is the creator of Perl"

  replacements = {
    "Larry Wall" : "Guido van Rossum",
    "creator" : "Benevolent Dictator for Life",
    "Perl" : "Python",
  }

  # Python 3 print call (the original recipe used the Python 2 print statement)
  print(multiple_replace(replacements, text))

So in your case, you could make a dict trans = {"a": "aa", "b": "bb"} and then pass it into multiple_replace along with the text you want translated. Basically, all that function does is build one big regex that matches any of the (escaped) keys you want to translate; regex.sub then calls the lambda for each match to look up the replacement in the dictionary.
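
Applied to the letters from the question, a minimal sketch might look like the following. The name trans comes from the paragraph above; double_letters is a hypothetical helper introduced only for illustration, and m.group() is used as a shorter way to get the matched text.

```python
import re

# Translation table from the question: a -> aa, b -> bb, c -> cc.
trans = {"a": "aa", "b": "bb", "c": "cc"}

# One alternation built from the escaped keys, equivalent to "a|b|c".
pattern = re.compile("|".join(map(re.escape, trans)))

def double_letters(text):
    # Hypothetical helper: each match is looked up in the table once,
    # and the replacement text is not scanned again.
    return pattern.sub(lambda m: trans[m.group()], text)

print(double_letters("abcd"))  # aabbccd
```

Because the substitution happens in a single pass, the "aa" produced for "a" is not doubled again, which is what repeated re.sub calls on overlapping letters could do.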

You could use this function while reading from your file, for example:

with open("notes.txt") as text:
    new_text = multiple_replace(replacements, text.read())
with open("notes2.txt", "w") as result:
    result.write(new_text)

I've actually used this exact method in production, in a case where I needed to translate the months of the year from Czech into English for a web scraping task.

As @nhahtdh pointed out, one downside to this approach is that it is not safe when one dictionary key is a prefix of another: the alternation matches whichever of the two keys appears first in the regex, so the longer key may never be matched.
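
One possible workaround, sketched here under the assumption that the longer key should win whenever two keys share a prefix, is to sort the keys by length before building the alternation. The function name multiple_replace_longest_first and the sample table are hypothetical.

```python
import re

def multiple_replace_longest_first(replacements, text):
    # Assumption: when one key is a prefix of another, the longer key wins.
    # Alternation tries its branches left to right, so longer keys go first.
    keys = sorted(replacements, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    return regex.sub(lambda mo: replacements[mo.group()], text)

# "cat" is a prefix of "category"; without sorting, "cat" would match first.
table = {"cat": "dog", "category": "class"}
print(multiple_replace_longest_first(table, "category of cat"))  # class of dog
```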

edited Dec 30 '13 at 17:09

answered Mar 02 '13 at 13:53

Emmett Butler


Wow, thanks, this is pretty much what I was looking for. I have one more basic question: how do I ignore uppercase letters? So if I had A and I wanted to also translate that to aa without adding it to the dictionary. – Euridice01 Mar 02 '13 at 14:26
@Euridice01: If you want to ignore case, specify `re.I` flag in `re.compile`. – nhahtdh Mar 02 '13 at 14:47
Your current solution is not yet configured for the use case where there exists a pair of words, one of which is a prefix of the other. The order of appearance in the alternation matters. I think you should at least state this assumption. – nhahtdh Mar 02 '13 at 14:49
Your dict may not contain regex, so dict = {"\s":""} does not strip spaces (though dict = { " ":""} will ) – Alexx Roche May 27 '15 at 18:41
Thanks for the codes! It works great! But we can't use Regex in the `dict` right? Like in the dict, writing "w.*?d":"you" ? – Penny Aug 12 '17 at 14:04
I can't get re.I to work in this case (as per the suggestion by @nhahtdh). Penny: I can't see how it's possible to use wildcards in this case. I've tried, but without success. – thescoop Mar 16 '19 at 09:58
@thescoop: Ask a new question with your code. And if you want to use regex in the map, you would need to rewrite the function to remove the re.escape in the compile and change the custom replacement function to look for which group is responsible for the match and look up the corresponding replacement (in which case the input should be an array of tuples rather than dict). – nhahtdh Mar 18 '19 at 02:53
Would someone please explain what this lambda does exactly in the return statement? Is there a simpler way to do it? I am not that familiar with the lambda expressions yet. – ex1led Feb 27 '20 at 13:10
Awesome answer! I used this approach in a script ("hn.py", available on GitHub) where I use a dictionary to invoke multiple changes to lines in a file, and save those results (in situ). https://github.com/victoriastuart/hacker_news_scraper – Victoria Stuart May 05 '20 at 17:52
if I use OrderedDict will the replacements happen in ORDER ?? – sten May 28 '21 at 01:53
This can be simplified by replacing `mo.string[mo.start():mo.end()]` with `mo.group()`, since they are equivalent, as mentioned in the docs [here](https://docs.python.org/3/library/re.html#re.Match.end) – Johnny Mayhew Aug 05 '21 at 16:14
What if I want the pattern replacement to be done with a function? – oeter Apr 20 '22 at 15:02
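
On the case-insensitivity question raised in the comments: passing re.I makes the pattern match "A", but the dictionary lookup then uses the matched text "A", which is not a key, so it raises KeyError. That may be why the flag alone did not appear to work. A minimal sketch, assuming all dictionary keys are lowercase (the function name is hypothetical):

```python
import re

def multiple_replace_ignore_case(replacements, text):
    # Assumption: every key in `replacements` is already lowercase.
    regex = re.compile("|".join(map(re.escape, replacements)), re.I)
    # The matched text may be "A" or "a"; lowercase it before the lookup
    # so that it is found in the dictionary.
    return regex.sub(lambda mo: replacements[mo.group().lower()], text)

print(multiple_replace_ignore_case({"a": "aa", "b": "bb"}, "Abc aB"))  # aabbc aabb
```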

nhahtdh · Answer 2 · 2013-03-02T14:03:37.770

You can use capturing group and backreference:

re.sub(r"([characters])", r"\1\1", text.read())

Put characters that you want to double up in between []. For the case of lower case a, b, c:

re.sub(r"([abc])", r"\1\1", text.read())

In the replacement string, you can refer to whatever matched by a capturing group () with \n notation where n is some positive integer (0 excluded). \1 refers to the first capturing group. There is another notation \g<n> where n can be any non-negative integer (0 allowed); \g<0> will refer to the whole text matched by the expression.
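
As a small illustration of the two notations, the three calls below should all produce the same result; the sample string is made up for the example.

```python
import re

text = "abc cab"

# \1 refers to the first capturing group.
with_backref = re.sub(r"([abc])", r"\1\1", text)
# \g<1> is the same reference written in the \g<n> form.
with_g1 = re.sub(r"([abc])", r"\g<1>\g<1>", text)
# \g<0> is the whole match, so no capturing group is needed.
with_g0 = re.sub(r"[abc]", r"\g<0>\g<0>", text)

print(with_backref)  # aabbcc ccaabb
print(with_g1 == with_backref, with_g0 == with_backref)
```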

If you want to double up all characters except new line:

re.sub(r"(.)", r"\1\1", text.read())

If you want to double up all characters (new line included):

re.sub(r"(.)", r"\1\1", text.read(), flags=re.S)

score 9 · Answer 3 · answered Aug 27 '19 at 15:58

You can use the pandas library and its replace function. Here is an example with five replacements:

import pandas as pd

df = pd.DataFrame({'text': ['Billy is going to visit Rome in November', 'I was born in 10/10/2010', 'I will be there at 20:00']})

# Raw strings keep the backslashes in the regex patterns intact
to_replace = ['Billy', 'Rome', 'January|February|March|April|May|June|July|August|September|October|November|December', r'\d{2}:\d{2}', r'\d{2}/\d{2}/\d{4}']
replace_with = ['name', 'city', 'month', 'time', 'date']

print(df.text.replace(to_replace, replace_with, regex=True))

And the modified text is:

0    name is going to visit city in month
1                      I was born in date
2                 I will be there at time

You can find the example here

In my case this was less efficient than using regular expressions directly, maybe there are some contexts where this is not the case? — Pablo, Nov 14 '19 at 08:36
In case you want to apply multiple replacements at once in a pandas data frame using vectorization — George Pipis, Nov 20 '19 at 12:01

score 5 · Answer 4 · edited May 23 '17 at 12:32

Using tips from how to make a 'stringy' class, we can make an object identical to a string but for an extra sub method:

import re
class Substitutable(str):
  def __new__(cls, *args, **kwargs):
    newobj = str.__new__(cls, *args, **kwargs)
    newobj.sub = lambda fro,to: Substitutable(re.sub(fro, to, newobj))
    return newobj

This allows you to use the builder pattern, which looks nicer but works only for a predetermined number of substitutions. If you apply the substitutions in a loop, there is no point in creating an extra class. E.g.

>>> h = Substitutable('horse')
>>> h
'horse'
>>> h.sub('h', 'f')
'forse'
>>> h.sub('h', 'f').sub('f','h')
'horse'
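
For the loop case mentioned above, a plain function is enough. The sketch below is one way to write it; sub_all and the pairs list are hypothetical names. Note that the substitutions are applied one after another, each on the result of the previous one, which is different from the single-pass dictionary approach.

```python
import re
from functools import reduce

# Hypothetical list of (pattern, replacement) pairs, applied in order.
pairs = [("h", "f"), ("f", "h")]

def sub_all(pairs, text):
    # Each substitution runs on the result of the previous one.
    return reduce(lambda s, pair: re.sub(pair[0], pair[1], s), pairs, text)

print(sub_all(pairs, "horse"))  # horse
```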

score 5 · Answer 5 · answered May 22 '20 at 10:02

None of the other solutions work if your patterns are themselves regexes.

For that, you need:

import re

def multi_sub(pairs, s):
    def repl_func(m):
        # only one group will be present, use the corresponding match
        return next(
            repl
            for (patt, repl), group in zip(pairs, m.groups())
            if group is not None
        )
    pattern = '|'.join("({})".format(patt) for patt, _ in pairs)
    return re.sub(pattern, repl_func, s)

Which can be used as:

>>> multi_sub([
...     ('a+b', 'Ab'),
...     ('b', 'B'),
...     ('a+', 'A.'),
... ], "aabbaa")  # matches as (aab)(b)(aa)
'AbBA.'

Note that this solution does not allow you to put capturing groups in your regexes, or use them in replacements.
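
Under the same restriction (no capturing groups inside the individual patterns), the lookup can also be written with the match object's lastindex attribute, which holds the number of the group that matched. This is only a sketch of an alternative, and multi_sub_lastindex is a hypothetical name.

```python
import re

def multi_sub_lastindex(pairs, s):
    # Assumption: no pattern in `pairs` contains its own capturing group,
    # so group number i + 1 corresponds to pairs[i].
    pattern = "|".join("({})".format(patt) for patt, _ in pairs)
    # m.lastindex is the index of the one top-level group that matched.
    return re.sub(pattern, lambda m: pairs[m.lastindex - 1][1], s)

result = multi_sub_lastindex([("a+b", "Ab"), ("b", "B"), ("a+", "A.")], "aabbaa")
print(result)  # AbBA.
```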

score 2 · Answer 6 · answered Dec 20 '17 at 12:51

I found I had to modify Emmett J. Butler's code by changing the lambda function to use myDict.get(mo.group(1),mo.group(1)). The original code wasn't working for me; using myDict.get() also provides the benefit of a default value if a key is not found.

import re

OIDNameContraction = {
    'Fucntion': 'Func',
    'operated': 'Operated',
    'Asist': 'Assist',
    'Detection': 'Det',
    'Control': 'Ctrl',
    'Function': 'Func'
}

replacementDictRegex = re.compile("(%s)" % "|".join(map(re.escape, OIDNameContraction.keys())))

# Fall back to the matched text itself if the key is not in the dictionary
oidDescriptionStr = replacementDictRegex.sub(lambda mo: OIDNameContraction.get(mo.group(1), mo.group(1)), oidDescriptionStr)

score 1 · Answer 7 · edited Dec 25 '19 at 02:24

If you are dealing with files, here is some simple Python code for this problem. More info here.

import re

def multiple_replace(dictionary, text):
  # Create a regular expression from the dictionary keys
  regex = re.compile("(%s)" % "|".join(map(re.escape, dictionary.keys())))

  # For each match, look up the corresponding value in the dictionary
  lookup = lambda mo: dictionary[mo.string[mo.start():mo.end()]]
  return regex.sub(lookup, text)


if __name__ == "__main__":

  dictionary = {
    "Wiley Online Library" : "Wiley",
    "Chemical Society Reviews" : "Chem. Soc. Rev.",
  }

  with open('LightBib.bib', 'r') as Bib_read:
    with open('Abbreviated.bib', 'w') as Bib_write:
      # Translate the input line by line and write each result out
      for row in Bib_read:
        new_text = multiple_replace(dictionary, row)
        Bib_write.write(new_text)
