Alternative to regular expressions for parsing an HTML page for links

Question

I am trying to build a web crawler. I have started by sending a request and getting the page's full HTML in the response.

Next I thought of using regular expressions to extract links from the HTML page. However, the more I try to learn them, the trickier they seem.

Are there any alternatives to regular expressions? (This may look like a discussion question, but it is not: I have searched the internet and haven't found a satisfactory answer.)

You want the HTML Agility Pack: http://htmlagilitypack.codeplex.com/ - Liam, Aug 06 '13 at 12:58

score 2 · Answer 1 · answered Aug 06 '13 at 12:57

HtmlAgilityPack is the best-known library for parsing HTML in .NET.

xanatos

score 1 · Answer 2 · answered Aug 06 '13 at 12:58

Regular expressions are the wrong tool for parsing HTML (see http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). Use a proper HTML parser such as HtmlAgilityPack:

http://www.nuget.org/packages/HtmlAgilityPack
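
As a rough illustration of what link extraction with a parser might look like, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is installed and that the page's HTML is already in a string; the class name `LinkExtractor` and method name `ExtractLinks` are made up for this example.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: installed from the NuGet package above

public static class LinkExtractor // hypothetical helper, not part of the library
{
    // Returns the href value of every <a> element that has a non-empty href.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrWhiteSpace(href))
                links.Add(href);
        }

        return links;
    }
}
```

The returned values are whatever appears in the markup, so a crawler would probably still need to resolve relative links against the page address, for example with `new Uri(baseUri, href)`.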

Antonio Bakula
