Python - Using regex to find multiple matches and print them out

Question

I need to find the contents of forms in an HTML source file. I did some searching and found a good method for that, but it prints only the first match. How can I loop through the matches and output the contents of every form, not just the first one?

import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matchObj = re.search('<form>(.*?)</form>', line, re.S)
print(matchObj.group(1))
# Output: Form 1
# I need it to output the content of every form it finds, not just the first one...

You really don't want to parse HTML with regular expressions. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Wooble, Oct 11 '11 at 11:06
Please refer to this: http://stackoverflow.com/questions/3873361/finding-multiple-occurrences-of-a-string-within-a-string-in-python – avasal, Oct 11 '11 at 11:07

score 95 · Accepted Answer · edited May 25 '19 at 23:26

Do not use regular expressions to parse HTML.

But if you ever need to find all regexp matches in a string, use the findall function.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matches = re.findall('<form>(.*?)</form>', line, re.DOTALL)
print(matches)

# Output: ['Form 1', 'Form 2']
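
Note that the pattern above only matches a bare `<form>` tag. Forms in real pages usually carry attributes and may span several lines, so a slightly looser pattern is sketched below. The sample string and the `<form\b[^>]*>` pattern are illustrative assumptions, not part of the original answer, and this is still a regex workaround rather than real HTML parsing.

```python
import re

# Hypothetical sample: one form with attributes, one upper-case form spanning lines.
html = '<form action="/a" method="post">Form 1</form> text <FORM>\nForm 2\n</FORM>'

# <form\b[^>]*> allows attributes inside the opening tag.
# re.DOTALL lets '.' cross newlines; re.IGNORECASE accepts <FORM> as well.
pattern = re.compile(r'<form\b[^>]*>(.*?)</form>', re.DOTALL | re.IGNORECASE)

matches = pattern.findall(html)
print(matches)

# Output: ['Form 1', '\nForm 2\n']
```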

edited May 25 '19 at 23:26

Stan James

answered Oct 11 '11 at 11:09

Petr Viktorin

what does the re.S do? – Charlie Parker Feb 21 '14 at 03:09
Makes the `'.'` special character match any character at all, including a newline; without this flag, `'.'` will match anything *except* a newline. ( http://docs.python.org/2/library/re.html#re.S ) – Petr Viktorin Feb 21 '14 at 09:03
Oh, I see. I did go to the webpage but didn't understand the documentation, because nothing was listed underneath re.S. Now I see how to read it: re.S and re.DOTALL are the same. Thanks! – Charlie Parker Feb 21 '14 at 16:55
You're welcome! `re.DOTALL` is more clear, I've updated the answer. – Petr Viktorin Feb 22 '14 at 23:06
This is the best method. Just to confirm: since findall returns a regular list, access the results with matches[0], matches[1], etc. – moyo Nov 25 '21 at 12:00
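
To illustrate the `re.DOTALL` point from the comments above, here is a minimal sketch. The multi-line sample string is an assumption made up for the demonstration.

```python
import re

# Hypothetical form content that contains a newline.
text = '<form>line one\nline two</form>'
pattern = '<form>(.*?)</form>'

# Without the flag, '.' stops at the newline, so nothing matches.
without_flag = re.findall(pattern, text)

# With re.DOTALL (alias re.S), '.' also matches the newline.
with_flag = re.findall(pattern, text, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```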

score 33 · Answer 2 · answered Oct 11 '11 at 12:34

Instead of using re.search, use re.findall; it returns all matches in a list. Alternatively, you could use re.finditer (which I prefer); it returns an iterator that you can loop over to get a match object for each match.

import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
# finditer yields one match object per non-overlapping match
for match in re.finditer('<form>(.*?)</form>', line, re.S):
    print(match.group(1))
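
One reason to prefer `re.finditer` over `re.findall` is that each match object also carries position information. A minimal sketch, where the `results` list is just an illustrative name:

```python
import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'

results = []
for match in re.finditer('<form>(.*?)</form>', line, re.S):
    # start(1) is the offset in `line` where the captured group begins.
    results.append((match.start(1), match.group(1)))

for offset, content in results:
    print(offset, content)
```

With `re.findall` only the captured strings would be available; the offsets would have to be recomputed separately.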

answered Oct 11 '11 at 12:34

Aamir Rind

what does the re.S do? – Charlie Parker Feb 21 '14 at 03:16
`re.finditer` is exactly what I needed! Thanks! – shellbye Apr 25 '16 at 07:06
@Pinocchio docs say: re.S is the same as re.DOTALL ``Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.`` (posted this because I believe people like me often come to stackoverflow.com to find answers quickly) – Anton Jun 08 '17 at 11:25

score 6 · Answer 3 · answered Oct 11 '11 at 11:06

Using regexes for this purpose is the wrong approach. Since you are using Python, you have a really awesome library available for extracting parts of HTML documents: BeautifulSoup.
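
BeautifulSoup is a third-party package, so as a dependency-free sketch of the same parser-based idea, the example below uses the standard library's `html.parser` instead. This is a swapped-in technique, not the library this answer recommends. The class name `FormTextCollector` is hypothetical, and the sketch collects only the text inside each form, not nested markup.

```python
from html.parser import HTMLParser


class FormTextCollector(HTMLParser):
    """Collects the text found inside each <form> element."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one string per completed form
        self._in_form = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes are ignored here.
        if tag == 'form':
            self._in_form = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == 'form' and self._in_form:
            self.forms.append(''.join(self._buffer))
            self._in_form = False

    def handle_data(self, data):
        # Keep only text that appears between <form> and </form>.
        if self._in_form:
            self._buffer.append(data)


line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
collector = FormTextCollector()
collector.feed(line)
collector.close()
print(collector.forms)

# Output: ['Form 1', 'Form 2']
```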

answered Oct 11 '11 at 11:06

ThiefMaster

Oh, I didn't know; I just discovered Python yesterday. :) – Stan Oct 11 '11 at 11:12
