0

I want to remove all the tags from an HTML file, and I am using Python's re module for that. For example, given the line <h1>Hello World!</h1>, I want to keep only "Hello World!". To remove the tags, I used re.sub('<.*>', '', string). The result is an empty string, because the regexp matches from the first opening angle bracket to the last closing one and removes everything in between. How can I get around this?

PaulDaviesC

5 Answers

1

Parse the HTML using BeautifulSoup, then only retrieve the text.
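A minimal sketch of that suggestion, assuming the bs4 package is installed and using the parser bundled with Python ("html.parser"); the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Sample input taken from the question
html = "<h1>Hello World!</h1>"

# Parse the markup, then ask for the text nodes only
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()

print(text)
```

get_text() concatenates every text node in the document, so it works the same way on a whole page as on a single line.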

Sunjay Varma
1

You can make the match non-greedy: '<.*?>'

You also need to be careful: HTML is a crafty beast and can thwart your regexes.
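To make the difference concrete, here is a small sketch comparing the greedy pattern from the question with the non-greedy one. The third pattern, a negated character class, is not from the answer; it is a common alternative added here for comparison:

```python
import re

# Sample input taken from the question
html = "<h1>Hello World!</h1>"

# Greedy: .* runs to the LAST '>' in the string, so everything is removed
greedy = re.sub(r"<.*>", "", html)

# Non-greedy: .*? stops at the FIRST '>' after each '<'
non_greedy = re.sub(r"<.*?>", "", html)

# Alternative (an assumption, not from the answer): match anything except '>'
negated = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))
print(repr(non_greedy))
print(repr(negated))
```

Note that '.' does not match a newline by default, so a tag split across lines is missed by '<.*?>' unless re.DOTALL is passed; the negated class does not have that limitation.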

Ned Batchelder
1

Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

Off-topic: the regular-expression approach is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
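One case where an angle bracket is not a tag delimiter is a '>' inside a quoted attribute value. The sketch below shows the non-greedy regex tripping over it. For contrast it uses the standard library's html.parser rather than lxml, only to keep the example dependency-free; the TextExtractor class and strip_tags_with_parser function are hypothetical helpers written for this illustration:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_with_parser(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# The '>' inside the attribute value is not the end of the tag
sample = '<a title="a > b">link</a>'

regex_result = re.sub(r"<.*?>", "", sample)
parser_result = strip_tags_with_parser(sample)

print(repr(regex_result))   # attribute debris is left behind
print(repr(parser_result))  # only the link text remains
```

The regex stops at the first '>' it sees, while a parser knows it is still inside a quoted attribute.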

akonsu
1

Use a parser, either lxml or BeautifulSoup:

import lxml.html

# mystring is the HTML source to strip; text_content() returns its text without tags
print(lxml.html.fromstring(mystring).text_content())

Related questions:

Using regular expressions to parse HTML: why not?

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

Marco Mariani
0

Beautiful Soup is great for parsing HTML!

You might not need it right now, but it is worth learning. It will help you in the future too.

varunl