23

I am trying to extract the meta description from fetched web pages, but I am running into a problem with BeautifulSoup's case sensitivity.

Some of the pages have `<meta name="Description"` and some have `<meta name="description"`.

My problem is very similar to another question on Stack Overflow.

The only difference is that I can't use lxml; I have to stick with BeautifulSoup.

Nitin

6 Answers

18

You can give BeautifulSoup a regular expression to match attributes against. Something like

soup.findAll('meta', name=re.compile("^description$", re.I))

might do the trick. Cribbed from the BeautifulSoup docs.

Will McCutchen
17

A regular expression? Now we have another problem.

Instead, you can pass in a lambda:

soup.findAll(lambda tag: tag.name.lower()=='meta',
    name=lambda x: x and x.lower()=='description')

(The `x and` guard avoids an exception when the name attribute isn't defined for the tag.)

MikeyB
  • Using bs4, I'm getting "find_all() got multiple values for keyword argument 'name'" with that :/ – Joaolvcm Feb 20 '14 at 11:14
  • @Joaolvcm “You [can’t use](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments) a keyword argument to search for HTML’s ‘name’ element, because Beautiful Soup uses the name argument to contain the name of the tag itself. Instead, you can give a value to ‘name’ in the attrs argument.” TL;DR: `soup.find_all(lambda tag: ..., {"name": lambda x: ...})`. – Alex Shpilkin Sep 21 '18 at 14:28
10

With a minor change (passing the pattern through attrs instead of as a keyword argument), it works:

soup.findAll('meta', attrs={'name':re.compile("^description$", re.I)})
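For anyone who wants to try this end to end, here is a minimal sketch of the same regex-in-attrs approach. The sample HTML is made up for illustration, and I am assuming bs4 with the built-in "html.parser" (so no lxml is needed) and the newer find_all spelling of findAll.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page; note the capitalised "Description".
html = """
<html><head>
<meta name="Description" content="First page summary">
<meta name="keywords" content="a, b">
</head></html>
"""

# "html.parser" ships with Python, so lxml is not required.
soup = BeautifulSoup(html, "html.parser")

# re.I makes the match case-insensitive; ^...$ rules out partial matches
# such as name="og:description".
pattern = re.compile(r"^description$", re.I)

# The pattern goes in attrs because the name keyword is reserved
# for the tag name itself.
tags = soup.find_all("meta", attrs={"name": pattern})
descriptions = [tag.get("content") for tag in tags]
print(descriptions)
```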
Nitin
6

With bs4 use the following:

soup.find('meta', attrs={'name': lambda x: x and x.lower()=='description'})
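A small sketch of how that line might be wrapped up for reuse. The helper name meta_description and the sample snippets are my own, not part of bs4, and I am assuming the built-in "html.parser".

```python
from bs4 import BeautifulSoup


def meta_description(html):
    """Return the content of the description meta tag, or None if absent.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # The lambda receives the value of the name attribute, which is None
    # for meta tags without one, hence the "x and" guard.
    tag = soup.find(
        "meta",
        attrs={"name": lambda x: x and x.lower() == "description"},
    )
    return tag.get("content") if tag else None


upper = meta_description('<meta name="DESCRIPTION" content="Upper case name">')
lower = meta_description(
    '<meta charset="utf-8"><meta name="description" content="Lower case name">'
)
missing = meta_description('<meta charset="utf-8">')
print(upper, lower, missing)
```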
Emmanuel
2

Better still, use a CSS attribute = value selector with the `i` flag for case insensitivity:

soup.select('meta[name="description" i]')
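A quick sketch of the selector in use. This assumes a reasonably recent bs4 (4.7 or later, where select is backed by the soupsieve package); the sample HTML is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with a mixed-case attribute value.
html = """
<head>
<meta name="DeScRiPtIoN" content="Selector based lookup">
<meta name="keywords" content="a, b">
</head>
"""

soup = BeautifulSoup(html, "html.parser")

# The trailing "i" inside the brackets asks for a case-insensitive
# comparison of the attribute value.
tag = soup.select_one('meta[name="description" i]')
content = tag["content"] if tag else None
print(content)
```

select_one returns the first match or None, while select returns a list of every match.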
ashleedawg
QHarr
-5

Change the case of the HTML page source before parsing it, using string methods such as lower() or upper().
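A minimal sketch of what that looks like, with a made-up sample tag and the built-in "html.parser". Note that lowercasing the whole source also lowercases the description text itself, which may not be what you want.

```python
from bs4 import BeautifulSoup

# Hypothetical sample tag with upper-case markup and mixed-case content.
html = '<META NAME="Description" CONTENT="Fresh News From Paris">'

# Lowercase everything up front so a plain string match on the
# attribute value is enough.
soup = BeautifulSoup(html.lower(), "html.parser")
tag = soup.find("meta", attrs={"name": "description"})

# The content was lowercased along with the rest of the page.
content = tag["content"]
print(content)
```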

Lucifer