Parsing an HTML Page?

Question

Hi, I have a task to build an application that displays news from various websites (BBC News, CNN, etc.).

I came up with two ideas: either parse an RSS feed of the news site, or parse the HTML pages of each news article.

However, after researching RSS feeds a bit, I found that it is hard to get an image from them, mainly because not all RSS feeds include images.

Therefore, what do you recommend as a good HTML document parser that I can use to extract the title, description, date, and image of a news article?

There are many libraries available for various programming languages. What language will you be building the application in? — Mr Lister, Feb 29 '12 at 10:19
C#. I found Html Agility Pack just now :D I think it's a good choice; however, if the website changes its layout in the near future, I'll have to change the code – adi bon, Feb 29 '12 at 10:41

score 1 · Answer 1 · answered Feb 29 '12 at 10:20

See this article:

http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/

It will get you on your feet in no time if you are using PHP or are undecided.

rickyduck


score 0 · Answer 2 · answered Feb 29 '12 at 10:20

I suggest regular expressions, but you'll need to write an expression for each website.

Or you can use DOM.

Either way, you'll always need to keep up with changes on every website you want to parse, and you'll need a different set of rules for each website.
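One way to keep "a different set of rules for each website" from turning into a separate parsing routine per site is to store the rules as data and run a single extractor over them. The following is only a minimal sketch in Python using the standard library's `html.parser`. The host names, the class names in `SITE_RULES`, and the helpers `RuleExtractor` and `extract` are all hypothetical, and matching on a tag plus a CSS class is an assumption about how the pages are marked up.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: field -> (tag, class attribute value).
SITE_RULES = {
    "news.example.com": {"title": ("h1", "headline"), "description": ("p", "summary")},
    "other.example.org": {"title": ("h2", "story-title"), "description": ("div", "lead")},
}


class RuleExtractor(HTMLParser):
    """Captures the text of the first element matching each (tag, class) rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.result = {}
        self._field = None  # field currently being captured, if any
        self._tag = None    # tag that opened the current capture

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            return  # already inside a captured element
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.rules.items():
            if field not in self.result and tag == want_tag and want_class in classes:
                self._field, self._tag = field, tag
                self.result[field] = ""
                return

    def handle_endtag(self, tag):
        # Limitation of this sketch: a nested element with the same tag
        # name ends the capture early.
        if self._field is not None and tag == self._tag:
            self.result[self._field] = self.result[self._field].strip()
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.result[self._field] += data


def extract(site, html):
    """Run the one shared extractor with the rules registered for `site`."""
    parser = RuleExtractor(SITE_RULES[site])
    parser.feed(html)
    parser.close()
    return parser.result


page = '<h1 class="headline big">Hello</h1><p class="summary">Sum <b>bold</b> text</p>'
print(extract("news.example.com", page))
```

When a site changes its layout, only its entry in the rules table needs updating; the extraction code stays the same.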

Jiří Herník


That's the problem. I'm trying to create something similar to Flipboard (an iOS application), and I read somewhere that they parse the pages requested by the user. However, I don't think it's good programming practice to keep repeating the same parsing method for each different website – adi bon Feb 29 '12 at 10:44

score 0 · Answer 3 · edited May 23 '17 at 11:55

Use a DOM parser to get your content. Do NOT use regex. "RegEx match open tags except XHTML self-contained tags" explains it very well.

Find a DOM parser for the language of your choice, then use XPath or similar to query the DOM object. Another nice option for people with experience manipulating the DOM in JavaScript is PhantomJS; it's awesome, and it's what I use as the backend of all my content scrapers nowadays.
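To illustrate letting a real HTML parser do the work instead of regular expressions, here is a rough sketch in Python. Note the substitution: the standard library's `html.parser` is an event-based parser, not a DOM with XPath, so a DOM library in your language would be closer to the advice above. The class name `ArticleParser` and the sample markup are made up for illustration, and taking the first `<img>` as the article image is an assumption that will not hold on every site.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title> text, the meta description, and the first <img> src."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.image = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")
        elif tag == "img" and self.image is None:
            self.image = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


# Hypothetical, deliberately sloppy markup (the <p> is never closed).
SAMPLE_HTML = """
<html><head><title>Example headline</title>
<meta name="description" content="A short summary.">
</head><body><p>Body text
<img src="/images/lead.jpg" alt="lead">
</body></html>
"""

parser = ArticleParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.title, parser.description, parser.image)
```

The unclosed `<p>` does not stop the parser, which is the kind of broken markup that tends to trip up hand-written regular expressions.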

Cheers

score 0 · Answer 4 · answered Feb 29 '12 at 10:35

I'd recommend a DOM parser as well. I've used PHP Simple HTML DOM Parser in the past and would recommend it. It's fairly fast and handles broken HTML as well, something regular expressions would struggle with.

However, if you are willing to do away with images where RSS does not provide them, parsing an RSS feed should be a lot easier since it is a valid XML document.
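As a rough illustration of the RSS route, the sketch below parses a feed with Python's standard `xml.etree.ElementTree` and treats the image as optional. The feed content and the helper name `parse_items` are invented for the example. It assumes an RSS 2.0 feed that exposes images through an `<enclosure>` element; feeds that carry images some other way would need extra handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed: one item has an image enclosure, one does not.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Story with image</title>
      <description>First summary.</description>
      <pubDate>Wed, 29 Feb 2012 10:00:00 GMT</pubDate>
      <enclosure url="http://example.com/a.jpg" type="image/jpeg" length="1234"/>
    </item>
    <item>
      <title>Story without image</title>
      <description>Second summary.</description>
      <pubDate>Wed, 29 Feb 2012 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>; 'image' is None when the feed has none."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        image = None
        # Only accept enclosures whose MIME type says they are images.
        if enclosure is not None and enclosure.get("type", "").startswith("image/"):
            image = enclosure.get("url")
        items.append({
            "title": item.findtext("title"),
            "description": item.findtext("description"),
            "date": item.findtext("pubDate"),
            "image": image,
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry["title"], entry["image"])
```

Items without an image simply come back with `image` set to `None`, so the application can fall back to a placeholder or skip the picture.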
