I am trying to build a web crawler. So far I send a request and get the full HTML of the page in the response.

Next I thought of using regular expressions to extract the links from the HTML. However, the more I try to learn them, the trickier they seem.

Are there any alternatives to regular expressions? (This may look like a discussion question, but it is not: I have searched the internet and haven't found a satisfactory answer.)

akuzma
Win Coder

2 Answers

2

HtmlAgilityPack is probably the best-known library for parsing HTML in .NET.

xanatos
1

Regular expressions can't reliably parse HTML (see http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). Use a proper HTML parser such as HtmlAgilityPack:

http://www.nuget.org/packages/HtmlAgilityPack
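
For the link-extraction step in the question, a minimal sketch with HtmlAgilityPack might look like the following. It assumes the page HTML is already in a string and the NuGet package is installed. The `LinkExtractor` class, the `ExtractLinks` method, and the sample HTML are illustrative names and data, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package: HtmlAgilityPack

static class LinkExtractor // hypothetical helper, named for illustration only
{
    // Returns the raw href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (var anchor in anchors)
        {
            // The second argument is the default used if the attribute is missing.
            string href = anchor.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrWhiteSpace(href))
                links.Add(href);
        }
        return links;
    }

    static void Main()
    {
        // Illustrative input; in the crawler this would be the response body.
        string html = "<html><body><a href=\"/about\">About</a> "
                    + "<a href='http://example.com/'>Example</a></body></html>";

        foreach (string link in ExtractLinks(html))
            Console.WriteLine(link);
    }
}
```

The values returned are whatever the page author wrote, so many will be relative. A crawler would typically resolve them against the page address, for example with `new Uri(baseUri, href)`, before queueing them.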

Antonio Bakula