51

I need to extract the contents of forms from an HTML source file. After some searching I found a good method for this, but it only prints the first match. How can I loop through the source and output the contents of every form, not just the first one?

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matchObj = re.search('<form>(.*?)</form>', line, re.S)
print(matchObj.group(1))
# Output: Form 1
# I need it to output the content of every form it finds, not just the first one...
Stan
  • You really don't want to parse HTML with regular expressions. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Wooble Oct 11 '11 at 11:06
  • Please refer to this: http://stackoverflow.com/questions/3873361/finding-multiple-occurrences-of-a-string-within-a-string-in-python – avasal Oct 11 '11 at 11:07

3 Answers

95

Do not use regular expressions to parse HTML.

But if you ever need to find all regexp matches in a string, use the findall function.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matches = re.findall('<form>(.*?)</form>', line, re.DOTALL)
print(matches)

# Output: ['Form 1', 'Form 2']
Stan James
Petr Viktorin
  • what does the re.S do? – Charlie Parker Feb 21 '14 at 03:09
  • Makes the `'.'` special character match any character at all, including a newline; without this flag, `'.'` will match anything *except* a newline. ( http://docs.python.org/2/library/re.html#re.S ) – Petr Viktorin Feb 21 '14 at 09:03
  • Oh, I see. I did go to the webpage but didn't understand the documentation because nothing was underneath re.S. Now I see how to read it: re.S and re.DOTALL are the same... thanks! – Charlie Parker Feb 21 '14 at 16:55
  • You're welcome! `re.DOTALL` is clearer, so I've updated the answer. – Petr Viktorin Feb 22 '14 at 23:06
  • This is the best method. Just to confirm: as findall returns a normal list, access results with matches[0], matches[1], etc. – moyo Nov 25 '21 at 12:00
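
To make the effect of the flag discussed in these comments concrete, here is a small sketch that runs the same pattern with and without `re.DOTALL`. The multi-line `html` string is made up for illustration and is not from the question.

```python
import re

# Hypothetical input: the first form spans several lines, the second does not.
html = '<form>\n  Form 1\n</form>\n<form>Form 2</form>'

# Without the flag, '.' cannot cross a newline, so the first form is missed.
without_flag = re.findall(r'<form>(.*?)</form>', html)

# With re.DOTALL (alias re.S), '.' also matches '\n'.
with_flag = re.findall(r'<form>(.*?)</form>', html, re.DOTALL)

print(without_flag)  # ['Form 2']
print(with_flag)     # ['\n  Form 1\n', 'Form 2']
```

Since real HTML forms almost always span multiple lines, the flag matters in practice even though the one-line sample in the question works without it.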
33

Instead of re.search, use re.findall; it returns all matches in a list. Alternatively, use re.finditer (which I prefer): it returns an iterator of match objects, so you can loop over every match found.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
for match in re.finditer('<form>(.*?)</form>', line, re.S):
    print(match.group(1))
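
One practical difference from findall is that finditer yields match objects, so the position of each match is available as well as its text. A minimal sketch, where the `results` list is just an illustrative name:

```python
import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'

# Collect the captured text together with where the whole match sits in the string.
results = [
    (m.group(1), m.start(), m.end())
    for m in re.finditer(r'<form>(.*?)</form>', line, re.DOTALL)
]

for text, start, end in results:
    # line[start:end] is the full '<form>...</form>' span for this match.
    print(text, start, end, line[start:end])
```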
Aamir Rind
  • what does the re.S do? – Charlie Parker Feb 21 '14 at 03:16
  • `re.finditer` is exactly what I needed! Thanks! – shellbye Apr 25 '16 at 07:06
  • @Pinocchio docs say: re.S is the same as re.DOTALL ``Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.`` (posted this because I believe people like me often come to stackoverflow.com to find answers quickly) – Anton Jun 08 '17 at 11:25
6

Using regexes for this purpose is the wrong approach. Since you are using Python, you have a really good library available for extracting parts of HTML documents: BeautifulSoup.
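
As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's `html.parser` as a stand-in, so it runs without installing anything. It is not BeautifulSoup itself, and the `FormTextCollector` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class FormTextCollector(HTMLParser):
    """Collect the text found inside each <form> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.forms = []    # text of every completed form
        self._depth = 0    # greater than 0 while inside a <form>
        self._chunks = []  # text pieces of the form currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.forms.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a form.
        if self._depth > 0:
            self._chunks.append(data)


line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
collector = FormTextCollector()
collector.feed(line)
collector.close()
print(collector.forms)  # ['Form 1', 'Form 2']
```

Unlike the regex, a parser also recognizes forms whose opening tag carries attributes, such as `<form action="/submit">`. With BeautifulSoup installed, the equivalent lookup would go through its `find_all('form')` method.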

ThiefMaster