I would like some help from someone more experienced with regular expressions. I have an HTML page from which I want to extract hyperlink URLs. The full page source is available here:

http://dl.dropbox.com/u/4571235/example.html

I want to get the hyperlink that follows each 'Compare prices' button in the document.

Any advice is welcome. Thanks in advance, Laziale

Laziale
  • Maybe read this first: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Elias Van Ootegem Apr 24 '12 at 18:37

3 Answers

Check here, and try this code:

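using System.Text.RegularExpressions;

// Returns true when the whole string looks like an http, https, or ftp URL.
// This validates a single URL string; it does not extract links from HTML.
public static bool isValidUrl(string url)
{
    string pattern = @"^(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*[^\.\,\)\(\s]$";
    Regex reg = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    return reg.IsMatch(url);
}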
Mitja Bonca
  • I want to get only the links that belong to the 'Compare prices' buttons, not all the links on the form. Is that possible? Thanks – Laziale Apr 24 '12 at 18:40

I see that there are other URLs in the source code as well. I can suggest the following regex, but it will work correctly ONLY IF each 'Compare prices' text is followed directly by the URL you are interested in (i.e. no other URL appears between the button and the 'correct' one). If a 'Compare prices' text has no matching URL, the regex will need to be changed according to whatever rule applies in that case.

value="Compare prices"(?:.*?)<a\s+href="([^"]*?)"

The URL will be in capturing group 1.
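As a rough sketch of how the pattern above might be applied in C#: the class name, method name, and sample markup below are made up for illustration, and the real page layout is an assumption. One thing worth noting is that '.' does not match line breaks by default in .NET, so the sketch passes RegexOptions.Singleline in case the button and its link sit on different lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ComparePricesLinks
{
    // Pattern from the answer above. Singleline lets '.' cross line breaks,
    // which matters when the button and the link are on different lines.
    private static readonly Regex LinkAfterButton = new Regex(
        @"value=""Compare prices""(?:.*?)<a\s+href=""([^""]*?)""",
        RegexOptions.Singleline);

    // Returns the href that follows each 'Compare prices' button.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkAfterButton.Matches(html))
        {
            urls.Add(m.Groups[1].Value); // group 1 holds the href value
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical markup; the structure of the real page is an assumption.
        string html =
            "<input type=\"submit\" value=\"Compare prices\" />\n" +
            "<a href=\"http://example.com/item/1\">Item 1</a>\n" +
            "<a href=\"http://example.com/other\">Other</a>\n" +
            "<input type=\"submit\" value=\"Compare prices\" />\n" +
            "<a href=\"http://example.com/item/2\">Item 2</a>";

        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

With this sample input the unrelated link between the two buttons is skipped, because each match starts at a button and stops at the first link after it. The same caveat as above applies: if a button has no link of its own, the match will run on to the next link in the page.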

Joanna Derks

Usually a URL sits in an "a" tag, a "link" tag, or an img src="url" attribute.
If it is in an a href tag, you could start by finding the valid a href tags and then run the validation on just those.
0. First get all the inner HTML of the form that contains your buttons.
1. Then grab just the tags for further inspection: pattern="<a[^>]*>" or pattern="<link[^>]*>" or pattern="<img[^>]*>"
2. Then, for each of the tags, pull out the link, src and href attributes.
3. Then check whether each URL is valid.
Note: if you can do step 0, you can most likely just get all the attributes of a given type and run a regular expression on them as well.
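Steps 1 to 3 above could be sketched roughly like this in C#. It assumes step 0 is already done and the form's inner HTML is available as a string. The class and method names are hypothetical, the three tag patterns are folded into one expression, only double-quoted attribute values are handled, and Uri.TryCreate stands in for the validity check (a regex-based validator would fit in the same place).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagLinkScanner
{
    // Step 1: match whole opening tags of the kinds named above.
    private static readonly Regex Tag = new Regex(
        @"<(?:a|link|img)\b[^>]*>", RegexOptions.IgnoreCase);

    // Step 2: pull a double-quoted href or src value out of one tag.
    private static readonly Regex UrlAttribute = new Regex(
        @"\b(?:href|src)\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase);

    public static List<string> ValidUrls(string formHtml)
    {
        var result = new List<string>();
        foreach (Match tag in Tag.Matches(formHtml))
        {
            Match attr = UrlAttribute.Match(tag.Value);
            if (!attr.Success) continue; // tag carries no href/src attribute

            // Step 3: keep only absolute http/https URLs.
            Uri uri;
            if (Uri.TryCreate(attr.Groups[1].Value, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(attr.Groups[1].Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical form contents, for illustration only.
        string formHtml =
            "<a href=\"http://example.com/p/1\">one</a> " +
            "<img src=\"not a url\"> <a name=\"anchor\">";
        foreach (string url in ValidUrls(formHtml))
        {
            Console.WriteLine(url);
        }
    }
}
```

Note that this collects every link in the form. Narrowing it to the 'Compare prices' buttons would still need some rule that ties each button to its link.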

RetroCoder