396

I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?

z.write(article.replace('</html>.+', '</html>'))
Vukašin Manojlović
user1442957

4 Answers

655

No. str.replace() only matches literal text. Regular expressions in Python are handled by the re module.

article = re.sub(r'(?is)</html>.+', '</html>', article)

In general:

text_after = re.sub(regex_search_term, regex_replacement, text_before)
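As a minimal sketch of the difference (the sample `article` string below is made up for illustration): `str.replace()` treats its first argument as literal text, while `re.sub()` interprets it as a pattern. The `flags` keyword argument should behave the same as the inline `(?is)` prefix.

```python
import re

# Hypothetical sample input: uppercase closing tag followed by trailing lines.
article = "<html>\nbody\n</HTML>\ntrailing junk\nmore junk\n"

# str.replace() looks for the literal text '</html>.+', which never occurs,
# so the string comes back unchanged.
literal = article.replace('</html>.+', '</html>')

# Inline flags: (?i) ignores case, (?s) lets '.' match newlines.
inline = re.sub(r'(?is)</html>.+', '</html>', article)

# The same flags passed as a keyword argument instead.
keyword = re.sub(r'</html>.+', '</html>', article,
                 flags=re.IGNORECASE | re.DOTALL)

print(literal == article)
print(repr(inline))
print(inline == keyword)
```

Without the `s` flag, `.` stops at the first newline, so only the rest of the line containing the tag would be removed.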
Thomas Weller
Ignacio Vazquez-Abrams
  • 1
    How would I apply the re module to my 'article' variable? – user1442957 Jul 13 '12 at 18:05
  • I tried the following to no avail `z.write(re.sub(r' – user1442957 Jul 13 '12 at 18:17
  • 3
    Is the tag not lowercase, or is it followed by a `'\n'`? You can make it case-insensitive (`(?i)` flag) and make `.` match newlines (`(?s)` flag) with `r'(?is) – MRAB Jul 13 '12 at 18:32
  • 2
    Using flags would be more readable, i.e. adding flags=re.DOTALL | re.IGNORECASE as the last argument instead of the (?is) in the pattern. – parvus Jul 08 '21 at 05:14
85

To replace text using a regular expression, use the re.sub function:

sub(pattern, repl, string[, count, flags])

It replaces non-overlapping instances of pattern with the text passed as repl. If you need to analyze the match, for instance to extract information about specific group captures, you can pass a function as the repl argument. More information is available in the documentation for the re module.

Examples

>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'

>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
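A minimal sketch of the two optional features mentioned above: limiting the number of replacements with `count`, and passing a function as `repl`. The helper name `double` is hypothetical, chosen only for this illustration.

```python
import re

# count=1 stops after the first match.
first_only = re.sub(r'a', 'b', 'banana', count=1)

def double(match):
    # match.group(1) is the run of digits captured by (\d+).
    return '/' + str(int(match.group(1)) * 2)

# The function is called once per match; its return value is the replacement.
doubled = re.sub(r'/(\d+)', double, '/andre/23/abobora/43435')

print(first_only)
print(doubled)
```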
Andre Pena
7

You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like

z.write(article[:article.index("</html>") + 7])

This is much cleaner, and should be much faster than a regex-based solution.
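One possible variation, as a sketch: the `7` above is the length of `"</html>"`, and `str.index` raises `ValueError` when the tag is missing. The helper below (the name `truncate_after` is hypothetical) computes the length from the tag itself and leaves the text unchanged if the tag is not found.

```python
def truncate_after(text, tag="</html>"):
    """Return text up to and including the first occurrence of tag."""
    pos = text.find(tag)  # find() returns -1 instead of raising
    if pos == -1:
        return text
    return text[:pos + len(tag)]

print(truncate_after("<html>x</html>junk"))
print(truncate_after("no closing tag"))
```

It could then be used as `z.write(truncate_after(article))`.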

Julian
  • 11
    Not so clean; you have to hard-code the length of " – Daniel Griscom Feb 28 '16 at 20:44
  • @DanielGriscom : what about `len(str(' – Ole Aldric Mar 03 '18 at 13:35
  • @OleAnders Better, but then you're duplicating that string, which opens another possibility for error. – Daniel Griscom Mar 03 '18 at 14:30
  • @OleAnders ... and just realized; no need for the `str()`; just use `len(' – Daniel Griscom Mar 03 '18 at 16:00
  • 2
    I was pretty much assuming this was a throwaway script - both the regex approach and the string search approach have all sorts of inputs they'll fail on. For anything in production, I would want to be doing some sort of more sophisticated parsing than either regex or simple string search can accomplish. – Julian Mar 03 '18 at 18:42
4

For this particular case, if using the re module is overkill, how about using the split (or rsplit) method, as in

se='</html>'
z.write(article.split(se)[0]+se)

For example,

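#!/usr/bin/python

article = '''<html>Larala
Ponta Monta 
</html>Kurimon
Waff Moff
'''
se = '</html>'

# Keep everything before the first closing tag, then re-append the tag.
# The with block closes the file so the output is flushed to disk.
with open('out.txt', 'w') as z:
    z.write(article.split(se)[0] + se)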

outputs out.txt as

<html>Larala
Ponta Monta 
</html>
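A small sketch of how split and rsplit differ when the tag appears more than once, and of what happens when it is absent. The sample strings are made up for illustration, and `str.partition` is offered as one possible alternative rather than as part of the approach above.

```python
se = '</html>'
text = 'a</html>b</html>c'  # hypothetical input with two closing tags

# split cuts at the first tag, rsplit(se, 1) at the last one.
first = text.split(se, 1)[0] + se
last = text.rsplit(se, 1)[0] + se

# If the tag is absent, split returns the whole string,
# so the tag gets appended to text that never contained it.
missing = 'no closing tag'.split(se)[0] + se

# partition returns an empty separator when the tag is absent,
# so head + sep leaves such text unchanged.
head, sep, _ = 'no closing tag'.partition(se)
safe = head + sep

print(first, last, missing, safe, sep='\n')
```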
norio