-2

I have the HTML of a Bing results page and I want to parse the results from it with:

    string BingRegex = "<div class=\"sb_tlst\"><h3><a href=\"(.*?)\"";
    string[] results = Regex.Matches(responseStr, BingRegex).Cast<Match>().Select(m => m.Value).ToArray();

I get the results into the array, but each result includes the matched pattern as well, something like:

<div class=\"sb_tlst\"><h3><a href=\"www.cnn.com\"
<div class=\"sb_tlst\"><h3><a href=\"www.google.com\"
<div class=\"sb_tlst\"><h3><a href=\"www.gmail.com\"

Any idea how I can fix this and get only the URL?

John Saunders
YosiFZ

2 Answers

5

I would suggest not using regex to parse HTML. Use HtmlAgilityPack, as suggested here, and then use XPath to get the value of the attribute you need.

The XPath for your sample div

<div class="sb_tlst">
    <h3>
        <a href="www.gmail.com"/>
    </h3>
</div>

would be

/div[@class='sb_tlst']/h3/a/@href
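
As a rough illustration of that XPath, here is a minimal sketch that runs it against the sample div using the built-in `System.Xml` classes. It assumes the fragment is well-formed XML, which a real Bing page will not be; for real HTML you would load the page with HtmlAgilityPack instead, and would typically start the expression with `//div` so it matches at any depth. The class and method names (`XPathSample`, `ExtractHrefs`) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathSample
{
    // Hypothetical helper: returns every href selected by the XPath above.
    // Only works on well-formed XML such as the sample fragment.
    public static List<string> ExtractHrefs(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var hrefs = new List<string>();
        // The expression selects attribute nodes, so Value is the URL itself.
        foreach (XmlNode attr in doc.SelectNodes("/div[@class='sb_tlst']/h3/a/@href"))
        {
            hrefs.Add(attr.Value);
        }
        return hrefs;
    }
}
```

Calling `XPathSample.ExtractHrefs` on the sample div returns a list containing only `www.gmail.com`.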
carla
Pavel K
2

Aside from doing this with an HTML parser (which is a better idea), replace:

Select(m => m.Value)

with:

Select(m => m.Groups[1].Value)

Although you'll probably want to throw in a little error handling to check that the group is actually populated.
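
Put together, the change might look like the sketch below. It keeps the pattern from the question and reads capture group 1 from each `Match`, skipping matches where the group is missing or empty. The wrapper names (`BingLinkExtractor`, `ExtractUrls`) are invented for illustration, and the pattern is assumed to still match Bing's markup, which may have changed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BingLinkExtractor
{
    // Same pattern as in the question; group 1 captures the href value.
    private const string BingRegex = "<div class=\"sb_tlst\"><h3><a href=\"(.*?)\"";

    // Hypothetical helper: returns only the captured URLs, not the whole match.
    public static string[] ExtractUrls(string responseStr)
    {
        return Regex.Matches(responseStr, BingRegex)
            .Cast<Match>()
            // Basic error handling: skip matches with a missing or empty group.
            .Where(m => m.Groups[1].Success && m.Groups[1].Value.Length > 0)
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }
}
```

`m.Value` is the text of the entire match, which is why the original code returned the surrounding markup; `m.Groups[1].Value` is only the part inside the parentheses.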

But the best solution is not to use Regex or an HTML parser, but instead use the Bing search API because this is exactly what it's for.

Matt Burland