python .replace() regex

Question

I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?

z.write(article.replace('</html>.+', '</html>'))

Warning: parsing HTML with regular expressions [leads to madness](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — Adam Rosenfield, Jul 13 '12 at 18:08
I have a bunch of garbage after my closing html tag and I just want to remove it. — user1442957, Jul 13 '12 at 18:11
But what if your HTML has a quoted string, comment, JavaScript, or CDATA containing ` — Adam Rosenfield, Jul 13 '12 at 18:16

score 655 · Accepted Answer · edited Dec 02 '20 at 11:56

No. Regular expressions in Python are handled by the re module.

import re
article = re.sub(r'(?is)</html>.+', '</html>', article)

In general:

text_after = re.sub(regex_search_term, regex_replacement, text_before)
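As a minimal sketch of the difference, the snippet below runs both calls on a made-up `article` string (the sample text is an assumption for illustration, not the asker's data). `str.replace()` searches for the literal characters `</html>.+`, while `re.sub()` treats the first argument as a pattern.

```python
import re

# Hypothetical sample input: a document followed by unwanted trailing lines.
article = "<html><body>Hi</body></html>\ngarbage line 1\ngarbage line 2\n"

# str.replace() looks for the literal text "</html>.+", which never occurs,
# so the string comes back unchanged.
unchanged = article.replace('</html>.+', '</html>')

# re.sub() interprets the pattern as a regex. (?s) lets "." match newlines
# and (?i) ignores case, so everything after the closing tag is dropped.
cleaned = re.sub(r'(?is)</html>.+', '</html>', article)

print(unchanged == article)  # True
print(repr(cleaned))         # '<html><body>Hi</body></html>'
```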

edited Dec 02 '20 at 11:56

Thomas Weller

answered Jul 13 '12 at 18:05

Ignacio Vazquez-Abrams

How would I apply the re model to my 'article' variable? – user1442957 Jul 13 '12 at 18:05
I tried the following to no avail `z.write(re.sub(r' – user1442957 Jul 13 '12 at 18:17
Is the tag not lowercase, or is it followed by a `'\n'`? You can make it case-insensitive (`(?i)` flag) and make `.` match newlines (`(?s)` flag) with `r'(?is) – MRAB Jul 13 '12 at 18:32
Using flags would be more readable, i.e. adding flags=re.DOTALL | re.IGNORECASE as the last argument instead of the (?is) in the pattern. – parvus Jul 08 '21 at 05:14
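
A small sketch of the `flags=` variant suggested in the last comment, using a hypothetical `article` string with an uppercase closing tag. The flags are passed by keyword because the fourth positional parameter of `re.sub` is `count`, not `flags`.

```python
import re

# Hypothetical sample input with an uppercase closing tag and trailing text.
article = "<HTML><body>Hi</body></HTML>\ntrailing junk\n"

# Same pattern as the accepted answer, with the inline (?is) moved into flags=.
cleaned = re.sub(
    r'</html>.+',
    '</html>',
    article,
    flags=re.DOTALL | re.IGNORECASE,
)

print(repr(cleaned))  # '<HTML><body>Hi</body></html>'
```

Note that the replacement text is inserted literally, so the matched `</HTML>` comes back as lowercase `</html>` here.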

Andre Pena · Answer 2 · 2020-02-06T04:40:36.830

To replace text using a regular expression, use the re.sub function:

sub(pattern, repl, string[, count, flags])

It replaces non-overlapping occurrences of pattern with the text passed as repl. If you need to inspect each match, for instance to extract specific captured groups, you can pass a function as the repl argument instead. See the re.sub documentation for more information.

Examples

>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'

>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
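
Passing a function as `repl` can be sketched as follows. The helper name `double_id` and the doubling rule are made up for illustration; the point is only that the function receives a match object and returns the replacement string.

```python
import re

def double_id(match):
    # match.group(1) holds the digits captured by (\d+).
    return '/' + str(int(match.group(1)) * 2)

# re.sub calls double_id once for every non-overlapping match.
result = re.sub(r'/(\d+)', double_id, '/andre/23/abobora/43435')

print(result)  # /andre/46/abobora/86870
```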

score 7 · Answer 3 · answered Jul 13 '12 at 19:01

You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like

z.write(article[:article.index("</html>") + 7])

This is much cleaner, and should be much faster than a regex-based solution.

answered Jul 13 '12 at 19:01

Julian

Not so clean; you have to hard-code the length of " – Daniel Griscom Feb 28 '16 at 20:44
@DanielGriscom : what about `len(str(' – Ole Aldric Mar 03 '18 at 13:35
@OleAnders Better, but then you're duplicating that string, which opens another possibility for error. – Daniel Griscom Mar 03 '18 at 14:30
@OleAnders ... and just realized; no need for the `str()`; just use `len(' – Daniel Griscom Mar 03 '18 at 16:00
I was pretty much assuming this was a throwaway script - both the regex approach and the string search approach have all sorts of inputs they'll fail on. For anything in production, I would want to be doing some sort of more sophisticated parsing than either regex or simple string search can accomplish. – Julian Mar 03 '18 at 18:42
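
One way to sidestep the hard-coded length discussed in these comments is `str.partition`, which returns the text before the marker, the marker itself, and the rest. This is a sketch of an alternative, not the answerer's code; the helper name `truncate_after` is hypothetical. Unlike `index()`, which raises `ValueError` when the tag is missing, it returns the text unchanged in that case.

```python
def truncate_after(text, marker='</html>'):
    """Return text up to and including the first marker; unchanged if absent."""
    # partition() yields (text, '', '') when the marker is not found,
    # so head + sep is then just the original text.
    head, sep, _tail = text.partition(marker)
    return head + sep

print(truncate_after('<html>ok</html>junk'))  # <html>ok</html>
print(truncate_after('no closing tag here'))  # no closing tag here
```

It shares the limitation noted above: it cuts at the first occurrence of the marker, wherever that appears.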

score 4 · Answer 4 · answered Jun 24 '17 at 20:08

For this particular case, if using the re module is overkill, how about using the split (or rsplit) method, as in

se='</html>'
z.write(article.split(se)[0]+se)

For example,

#!/usr/bin/python

article='''<html>Larala
Ponta Monta 
</html>Kurimon
Waff Moff
'''
se='</html>'
# the with block closes out.txt once the write is done
with open('out.txt','w') as z:
    z.write(article.split(se)[0]+se)

outputs out.txt as

<html>Larala
Ponta Monta 
</html>
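
The difference between `split` and `rsplit` matters when the marker appears more than once, for example inside a script string. The sketch below uses a made-up `article` value to show both: `split(se)[0]` cuts at the first occurrence, while `rsplit(se, 1)[0]` cuts at the last.

```python
se = '</html>'

# Hypothetical input where the marker also appears inside a JavaScript string.
article = '<html><script>var s = "</html>";</script>body</html>junk'

# split() breaks on every occurrence; element 0 ends at the FIRST marker.
first_cut = article.split(se)[0] + se

# rsplit() with maxsplit=1 breaks only at the LAST marker.
last_cut = article.rsplit(se, 1)[0] + se

print(first_cut)  # <html><script>var s = "</html>
print(last_cut)   # <html><script>var s = "</html>";</script>body</html>
```

If the marker is absent, both expressions append `se` to the whole text, so a check such as `se in article` may be worth adding first.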
