I need to extract all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.

P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?

John Y
rarora7777
    Check out http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string for removing unicode in Python. – silent1mezzo Apr 11 '12 at 21:00

2 Answers

soup.findAll('p')

here is a reference
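
The call above only returns the <p> elements themselves; to get their visible text you still have to strip the nested tags. A minimal sketch of one way to do that, assuming BeautifulSoup 4 (where findAll is spelled find_all) and using a hypothetical helper name, visible_paragraph_text:

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (pip install beautifulsoup4)


def visible_paragraph_text(html):
    """Return the text of each <p> element with nested tags stripped."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates every string inside the element,
    # including the text of nested tags such as <a>.
    return [p.get_text() for p in soup.find_all("p")]


# Sample input taken from the question.
html = ('<p>Many hundreds of named mango '
        '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>')

for text in visible_paragraph_text(html):
    print(text)  # Many hundreds of named mango cultivars exist.
```

The helper returns one string per paragraph, so you can join them or process them separately as needed.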

0x90

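Here's how you can do it with BeautifulSoup. This removes any tag not in VALID_TAGS but keeps the content of the removed tags. The loop has to visit every tag, not just the <p> tags; otherwise the check against VALID_TAGS never matches anything.

# BeautifulSoup 4 import; the older "from BeautifulSoup import BeautifulSoup" only works on Python 2.
from bs4 import BeautifulSoup

VALID_TAGS = ['div', 'p']

# Sample input from the question; replace it with your own HTML string.
value = '<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
soup = BeautifulSoup(value, 'html.parser')

# find_all(True) matches every tag in the document.
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        tag.unwrap()  # remove the tag itself but keep its children in place

print(soup.decode())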

Reference
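
For the files that contain Hindi text, nothing special should be needed as long as the file is decoded correctly before parsing. A rough sketch, assuming BeautifulSoup 4, Python 3, and files saved as UTF-8; the helper name paragraph_text_from_file and the sample sentence are made up for illustration:

```python
import os
import tempfile

from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4


def paragraph_text_from_file(path, encoding="utf-8"):
    """Read an HTML file and return the visible text of each <p> element."""
    # Assumption: the file is UTF-8; pass another encoding if yours differs.
    with open(path, encoding=encoding) as handle:
        soup = BeautifulSoup(handle.read(), "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


# Write a small sample file with Hindi text, then read it back.
sample = "<p>आम एक <b>फल</b> है।</p>"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".html",
                                 delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

texts = paragraph_text_from_file(sample_path)  # ['आम एक फल है।']
os.remove(sample_path)
```

The strings come back as ordinary Python 3 str objects, so the Devanagari characters are preserved as-is rather than stripped or transliterated.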

silent1mezzo