9

Can someone help me parse an HTML file in Python to get the links for all the images in the file?

Preferably without a third-party module...

Thanks!

user377419

3 Answers

11

You can use Beautiful Soup. I know you asked for a solution without a third-party module, but it is an ideal tool for parsing HTML.

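import urllib2
from BeautifulSoup import BeautifulSoup

# Python 2 with BeautifulSoup 3: fetch the page and parse it
page = BeautifulSoup(urllib2.urlopen("http://www.url.com"))
# Find every <img> tag, then read each tag's src attribute to get the link
images = page.findAll('img')
links = [img.get('src') for img in images]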
citruspi
Russell Dias
10

Using only the Python standard library:

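from html.parser import HTMLParser

class MyParse(HTMLParser):
    # Called by the parser for every opening tag it encounters
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; src may be missing
            src = dict(attrs).get("src")
            if src:
                print(src)

h = MyParse()
with open("index.html") as f:
    page = f.read()
h.feed(page)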
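
If you would rather get the links back as a list than print them, the same idea can be sketched as below. This is only an illustration and not part of the original answer: the names `ImageLinkParser` and `image_links`, the `base_url` parameter, and the sample markup are all made up for the example. It assumes you may want relative `src` values resolved against the page's URL, which `urllib.parse.urljoin` from the standard library can do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects the src of every <img> tag (hypothetical helper name)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url  # assumption: URL the page was loaded from
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")  # None when the tag has no src
        if src:
            # With an empty base_url, urljoin returns src unchanged
            self.links.append(urljoin(self.base_url, src))


def image_links(html, base_url=""):
    """Return the image links found in an HTML string."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup, only to show the call
sample = (
    '<p><img src="/a.png"><img alt="no source">'
    '<img src="http://cdn.example.com/b.jpg"/></p>'
)
print(image_links(sample, "http://example.com/page/"))
```

Tags without a `src` attribute are skipped rather than raising a `KeyError`, and self-closing `<img ... />` tags are handled too, because `HTMLParser` routes them through `handle_starttag` by default.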
Kabie
2

It's generally accepted that lxml is faster than Beautiful Soup (ref). Its tutorial can be found here: (link). You may also want to take a look at this old Stack Overflow post.

Community
Overmind Jiang