2

I am trying to use BeautifulSoup to get people's birthdays from Wikipedia. For example, the birthday on http://en.wikipedia.org/wiki/Ezra_Taft_Benson is August 4, 1899. To get the bday, I am using the following code:

bday = url.find("span", class_="bday")

However, it picks up an earlier tag in the HTML where bday is only one of several classes, i.e. <span class="bday dtstart published updated">1985-11-10 </span>.

Is there a way to match only the tag whose class is exactly bday?

I hope the question is clear. Currently I am getting 1985-11-10 as the bday, which is not the correct date.

Pierre GM
user1496289

3 Answers

4

When all other matching methods of BeautifulSoup fail, you can use a function taking a single argument (tag):

>>> url.find(lambda tag: tag.name == 'span' and tag.get('class', []) == ['bday'])
<span class="bday">1899-08-04</span>

The above searches for a span tag whose class attribute is a list of a single element ('bday').
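To see the difference between the two searches without fetching the page, here is a minimal sketch. It assumes the `bs4` package with the built-in `html.parser`, and the `HTML` string is made-up markup that only imitates the two spans described in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the two spans from the question.
HTML = """
<span class="bday dtstart published updated">1985-11-10</span>
<span class="bday">1899-08-04</span>
"""

soup = BeautifulSoup(HTML, "html.parser")

# class_="bday" matches any tag that has "bday" among its classes,
# so the first (multi-class) span is returned.
loose = soup.find("span", class_="bday")

# The function form compares the whole class list, so only the span
# whose class attribute is exactly "bday" is returned.
exact = soup.find(lambda tag: tag.name == "span" and tag.get("class", []) == ["bday"])

print(loose.string)
print(exact.string)
```

The `tag.get('class', [])` default keeps the comparison safe for spans that have no class attribute at all.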

efotinis
  • This was a great, simple solution! Thanks. What is the lambda tag doing? – user1496289 Sep 23 '12 at 18:34
  • The `lambda` creates an anonymous function with a single argument (tag). You could define a separate, named function and pass its name to `find()` instead, but for short, one-off functions `lambda` is [preferable](http://stackoverflow.com/a/890188/12320). – efotinis Sep 23 '12 at 19:27
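For comparison, the named-function variant mentioned in the comment might look like the sketch below. The function name `is_exact_bday_span` and the sample markup are illustrative assumptions, and the code assumes the `bs4` package:

```python
from bs4 import BeautifulSoup


def is_exact_bday_span(tag):
    """Return True for a <span> whose class attribute is exactly 'bday'."""
    return tag.name == "span" and tag.get("class", []) == ["bday"]


# Hypothetical markup: a non-span tag, a multi-class span, and the exact match.
SAMPLE = (
    '<p class="bday">not a span</p>'
    '<span class="bday dtstart">1985-11-10</span>'
    '<span class="bday">1899-08-04</span>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# Pass the function itself (no parentheses); find() calls it on each tag.
match = soup.find(is_exact_bday_span)
print(match.string)
```

A named function is easier to reuse and test if the same filter is needed in several places.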
1

I would have gone about it this way:

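# Python 2, using BeautifulSoup 3 (the `BeautifulSoup` package, not `bs4`)
import urllib
from BeautifulSoup import BeautifulSoup

url = 'http://en.wikipedia.org/wiki/Ezra_Taft_Benson'
file_pointer = urllib.urlopen(url)
html_object = BeautifulSoup(file_pointer)

# Calling the soup object is shorthand for findAll(); take the first match's text
bday = html_object('span', {'class': 'bday'})[0].contents[0]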

This returns 1899-08-04 as the value of bday.

That1Guy
0

Try using lxml with the BeautifulSoup parser. The following finds <span> tags whose class attribute is exactly bday (this page has only one):

>>> from lxml.html.soupparser import fromstring
>>> root = fromstring(open('Ezra_Taft_Benson'))
>>> span_bday_nodes = root.findall('.//span[@class="bday"]')
>>> span_bday_nodes
[<Element span at 0x1be9290>]
>>> span_bday_nodes[0].text
'1899-08-04'
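
The same attribute predicate can be tried on an inline string. This is only a sketch: it swaps in `lxml.html.fromstring` (lxml's own HTML parser) rather than the soupparser used above, and the markup is a made-up stand-in for the saved page:

```python
from lxml import html

# Hypothetical markup imitating the two spans from the question.
SAMPLE = (
    '<div>'
    '<span class="bday dtstart published updated">1985-11-10</span>'
    '<span class="bday">1899-08-04</span>'
    '</div>'
)

root = html.fromstring(SAMPLE)

# [@class="bday"] compares the whole attribute string, so the
# multi-class span is not matched.
nodes = root.findall('.//span[@class="bday"]')
print([node.text for node in nodes])
```

Because the predicate is a plain string comparison, it would also miss a span written as class=" bday" with stray whitespace.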
Pedro Romano