
I need to filter links and images from HTML pages with C++ and regex, and I came up with this pattern:

<\s*(a.*?href|img.*?src)\s*=\s*\"(.*?)\".*?\s*> 

Unfortunately, this also finds links and images inside HTML comments, which it shouldn't. I tried some negative lookaheads without success.

halfer
Doodle
  • please read this once: https://stackoverflow.com/a/1732454/2815219 – Raman Sahasi Jul 22 '17 at 18:17
  • I need to extract all links and images from websites for a webcrawler project for my university. The regex extracts all links and images, but we shouldn't get those within comments. For example, this regex will find the matches it should as well as ones it shouldn't – Doodle Jul 22 '17 at 18:18
  • Don't use regex for that. Use a proper HTML parser. – Jesper Juhl Jul 22 '17 at 18:22
  • Unfortunately, we are not allowed to use an HTML parser – Doodle Jul 22 '17 at 18:25
  • Why can you not use an HTML parser? – halfer Jul 22 '17 at 19:17
  • That's an insane requirement. Parsing general HTML is not a suitable job for a regex. My suggestion is to use a regex to remove HTML comments and CDATA sections and then search, but I'm sure that won't handle all the cases. Note that links can be surrounded by single quotes as well as double. I'm sure I've forgotten some other gotchas – Martin Bonner supports Monica Jul 22 '17 at 19:23
  • @Casimir: possibly, though academic institutions are rather known for placing entirely unrealistic or daft limitations on assignments, such that they become rather poor examples of how to best solve the problem `:o)`. – halfer Jul 22 '17 at 21:14
  • @halfer: there is indeed a lot of teaching material in books, tutorials and elsewhere that chooses HTML as a training ground (clearly due to a lack of imagination). It's sad, because there are plenty of *real life* and more useful examples available. But it isn't only a regex problem: think about OOP or database tutorials with unrealistic examples about cars with number of doors, colors, speed... Authors speak like lords addressing the peasants of the Middle Ages. – Casimir et Hippolyte Jul 22 '17 at 21:22

1 Answer


There's no reason to do everything at once. Also, you didn't say what environment/editor/programming language, so I picked my favorite, C#.

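  1. Remove all comments using Regex.Replace (string.Replace treats its argument as literal text, not as a pattern). RegexOptions.Singleline lets . match newlines, so multi-line comments are removed too:

var s1 = Regex.Replace(source, "<!--.*?-->", "", RegexOptions.Singleline);

  2. Extract links with your existing regex using Regex.Matches (the trailing space after > is dropped, since it would require a space after every tag):

var s2 = Regex.Matches(s1, "<\\s*(a.*?href|img.*?src)\\s*=\\s*\"(.*?)\".*?\\s*>");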
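
Since the question asks about C++, the same two steps could be sketched with std::regex roughly as follows. This is only a minimal sketch: the function names strip_comments and extract_urls are made up for illustration, the link pattern is the one from the question unchanged, and it still misses single-quoted attributes, CDATA sections and other cases mentioned in the comments.

```cpp
#include <regex>
#include <string>
#include <vector>

// Step 1 (hypothetical helper): drop HTML comments.
// [\s\S] is used instead of '.' because '.' does not match line breaks
// in std::regex's default ECMAScript grammar.
std::string strip_comments(const std::string& html) {
    static const std::regex comment(R"(<!--[\s\S]*?-->)");
    return std::regex_replace(html, comment, "");
}

// Step 2 (hypothetical helper): collect the href/src values using the
// pattern from the question. Capture group 2 holds the quoted URL.
std::vector<std::string> extract_urls(const std::string& html) {
    // The custom raw-string delimiter "rx" is needed because the pattern
    // itself contains the sequence )" which would end a plain R"(...)".
    static const std::regex link(
        R"rx(<\s*(a.*?href|img.*?src)\s*=\s*"(.*?)".*?\s*>)rx",
        std::regex::icase);

    const std::string cleaned = strip_comments(html);
    std::vector<std::string> urls;
    for (std::sregex_iterator it(cleaned.begin(), cleaned.end(), link), end;
         it != end; ++it) {
        urls.push_back((*it)[2].str());
    }
    return urls;
}
```

Calling extract_urls on the page source returns the attribute values in document order; anything that was inside a comment is gone before the link pattern ever runs, so no lookahead is needed.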
NetMage