
I am using this regex to get all image urls in an html file:

(?<=img\s*\S*src\=[\x27\x22])(?<Url>[^\x27\x22]*)(?=[\x27\x22])

Is there any way to modify this regex to exclude any img tags that are commented out with an HTML comment (<!-- ... -->)?

Andrey
  • Why not use a proper HTML parser instead? – Pekka Feb 24 '12 at 18:01
  • [The pony he comes...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Niet the Dark Absol Feb 24 '12 at 18:02
  • @Pekka: because I can't guarantee the html to be 100% "correct" - the app is getting it from non-IT personnel so there is a good chance of [badly] malformed html. – Andrey Feb 24 '12 at 18:06

2 Answers


If your regex already works for extracting images (which would be a miracle in itself), consider a regex to strip HTML comments, like so:

<!--.*?-->

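Replace every match with an empty string, and any images inside comments will no longer show up in your other regex. Enable your regex engine's "dot matches newline" (single-line) option so that comments spanning several lines are matched too.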
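
As a rough illustration of the strip-comments-first idea, here is a minimal sketch in Python (the question does not name a language, so this choice is an assumption). The function name extract_image_urls and the sample markup are hypothetical. The img pattern is a simplified stand-in for the one in the question, because Python's re module does not support variable-length lookbehind.

```python
import re

# Matches one HTML comment; DOTALL lets "." cross line breaks inside it.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Simplified stand-in for the question's pattern (an assumption, not the
# original regex): captures the quoted src value of an img tag.
IMG_SRC_RE = re.compile(
    r"<img\b[^>]*?\ssrc\s*=\s*[\x27\x22]([^\x27\x22]*)[\x27\x22]",
    re.IGNORECASE,
)


def extract_image_urls(html):
    """Return src values of img tags that are not inside HTML comments."""
    without_comments = COMMENT_RE.sub("", html)
    return IMG_SRC_RE.findall(without_comments)


sample = """<p><img src="a.png"></p>
<!-- <img src="hidden.png">
     still commented -->
<IMG alt='x' SRC='b.jpg'>"""

print(extract_image_urls(sample))  # ['a.png', 'b.jpg']
```

Stripping comments in a separate pass keeps the extraction pattern unchanged, which is usually easier than teaching a single regex to recognise comment boundaries.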

Alternatively, if you're using PHP (you didn't tag a programming language), you can use the strip_tags function with "<img>" as the "allowable tags" parameter. This will strip out HTML comments, as well as other tags that may interfere with your regex.

Niet the Dark Absol

It's actually also very simple with the HTML Agility Pack. It has a bunch of settings that help fix bad HTML if needed, for example:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionAutoCloseOnEnd = true;
doc.OptionCheckSyntax = false;
doc.OptionFixNestedTags = true;
// etc, just set them before calling Load or LoadHtml

http://htmlagilitypack.codeplex.com/

string textToExtractSrcFrom = "... your text here ...";

doc.LoadHtml(textToExtractSrcFrom);

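// SelectNodes returns null when nothing matches, so fall back to an empty collection
var nodes = doc.DocumentNode.SelectNodes("//img[@src]") ?? new HtmlNodeCollection(doc.DocumentNode);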
foreach (var node in nodes)
{
    string src = node.Attributes["src"].Value;
}

// or, with LINQ (requires "using System.Linq;")
var links = nodes.Select(node => node.Attributes["src"].Value);
jessehouwing