multiple values for class attribute

Question

I am trying to use BeautifulSoup to get people's birthdays from Wikipedia. For example, the birthday listed at http://en.wikipedia.org/wiki/Ezra_Taft_Benson is August 4, 1899. To get the birthday, I am using the following code:

bday = url.find("span", class_="bday")

However, it picks up an instance where bday appears in the HTML as one of several classes on another tag, i.e. <span class="bday dtstart published updated">1985-11-10 </span>.

Is there a way to match only tags whose class attribute is exactly bday?

I hope the question is clear. Currently I am getting the bday as 1985-11-10, which is not the correct date.

score 4 · Accepted Answer · answered Sep 23 '12 at 13:45

When all other matching methods of BeautifulSoup fail, you can use a function taking a single argument (tag):

>>> url.find(lambda tag: tag.name == 'span' and tag.get('class', []) == ['bday'])
<span class="bday">1899-08-04</span>

The above searches for a span tag whose class attribute is a list of a single element ('bday').
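To see the difference on a small scale, here is a minimal sketch, assuming BeautifulSoup 4 (the `bs4` package) with the built-in `html.parser`. The two-span HTML snippet is made up for illustration and only mimics the markup on the Wikipedia page. The names `soup`, `loose` and `exact` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the two kinds of <span> on the page.
html = (
    '<div>'
    '<span class="bday dtstart published updated">1985-11-10</span>'
    '<span class="bday">1899-08-04</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# class_="bday" matches any tag that has bday among its classes,
# so the first hit is the multi-class span.
loose = soup.find("span", class_="bday")

# The function form compares the whole class list, so only a span
# whose class attribute is exactly "bday" is accepted.
exact = soup.find(
    lambda tag: tag.name == "span" and tag.get("class", []) == ["bday"]
)

print(loose.get_text())  # 1985-11-10
print(exact.get_text())  # 1899-08-04
```

The `tag.get('class', [])` default matters here: tags without a class attribute would otherwise return `None`, which still compares unequal to `['bday']`, but the empty-list default keeps the comparison between two lists.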

answered Sep 23 '12 at 13:45

efotinis

This was a great, simple solution! Thanks. What is the lambda tag doing? – user1496289 Sep 23 '12 at 18:34
The `lambda` creates an anonymous function with a single argument (tag). You could define a separate, named function and pass its name to `find()` instead, but for short, one-off functions `lambda` is [preferable](http://stackoverflow.com/a/890188/12320). – efotinis Sep 23 '12 at 19:27
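
For comparison, the named-function variant mentioned in the comment above might look like the sketch below. It assumes BeautifulSoup 4 (`bs4`); the function name `is_bday_only_span` and the inline HTML are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup


def is_bday_only_span(tag):
    """Return True for a <span> whose class attribute is exactly 'bday'."""
    return tag.name == "span" and tag.get("class", []) == ["bday"]


# Hypothetical snippet imitating the markup discussed in the question.
html = (
    '<p>'
    '<span class="bday dtstart published updated">1985-11-10</span>'
    '<span class="bday">1899-08-04</span>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Pass the function object itself (no parentheses) to find().
match = soup.find(is_bday_only_span)
print(match.string)
```

Both forms behave the same. The named function is mostly useful when the same predicate is reused in several searches or deserves a docstring.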

score 1 · Answer 2 · answered Sep 24 '12 at 15:46

I would have gone about it this way (Python 2 with BeautifulSoup 3):

import urllib
from BeautifulSoup import BeautifulSoup

url = 'http://en.wikipedia.org/wiki/Ezra_Taft_Benson'
file_pointer = urllib.urlopen(url)
html_object = BeautifulSoup(file_pointer)

bday = html_object('span',{'class':'bday'})[0].contents[0]

This returns 1899-08-04 as the value of bday.

score 0 · Answer 3 · answered Sep 23 '12 at 13:13

Try using lxml with the BeautifulSoup parser. The following finds <span> tags whose class attribute is exactly bday (on this page there is only one):

>>> from lxml.html.soupparser import fromstring
>>> root = fromstring(open('Ezra_Taft_Benson'))
>>> span_bday_nodes = root.findall('.//span[@class="bday"]')
>>> span_bday_nodes
[<Element span at 0x1be9290>]
>>> span_bday_nodes[0].text
'1899-08-04'
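
The same `findall` path syntax is also available in the standard library's `xml.etree.ElementTree`, so the exact-match behaviour can be tried without lxml. The sketch below swaps in that module (not lxml) and runs on a small, well-formed snippet invented for illustration. It is not a replacement for parsing the real page, which needs a forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet imitating the two spans on the page.
snippet = """<div>
  <span class="bday dtstart published updated">1985-11-10</span>
  <span class="bday">1899-08-04</span>
</div>"""

root = ET.fromstring(snippet)

# [@class='bday'] compares the whole attribute string, so the span
# with several space-separated classes is not selected.
exact_nodes = root.findall(".//span[@class='bday']")
birthdays = [node.text for node in exact_nodes]
print(birthdays)  # ['1899-08-04']
```

The attribute predicate works as an exact match because it treats `class` as a plain string rather than as a list of class names.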

edited Sep 23 '12 at 13:21

answered Sep 23 '12 at 13:13

Pedro Romano
