Here is my regex to scrape image URLs from a page:

preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches);

But it fails when the image URL looks like this:

src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Adolescent_girl_sad_0001.jpg/200px-Adolescent_girl_sad_0001.jpg"

I think I need to add an OR operation to the regex above to allow image URLs that start with //.

The documentation says the | (pipe) performs an OR operation, but how do I add it to the regex above?

Programming_crazy
  • You already have used it successfully in the `(?:png|jpg)` part, so why not do it again? – Bergi Dec 14 '13 at 14:02
  • BTW, it would be easier to make `https?` [as a whole](http://www.regular-expressions.info/brackets.html) [optional](http://www.regular-expressions.info/optional.html) than to use some [alternatives (pipe)](http://www.regular-expressions.info/alternation.html). – Bergi Dec 14 '13 at 14:04
  • Are you looking for image links in Wikipedia pages? For those, there even is a special API: https://www.mediawiki.org/wiki/API:Properties#images_.2F_im – Bergi Dec 14 '13 at 14:06
  • @Bergi: I already tried this: `if(preg_match_all('/\bhttps?:|//\/\/\S+(?:png|jpg)\b/', $html, $matches))`, which gives the error `Warning: preg_match_all(): Unknown modifier '/' in F:\wamp\www\img.php on line 10` – Programming_crazy Dec 14 '13 at 14:08
  • Is it OK to just parse out the "src" value, using '/src=([\'|"])(.+?)\1/'? – Andrew Dec 14 '13 at 14:09
  • @Andrew: That works, but I only want to parse the src of images. – Programming_crazy Dec 14 '13 at 14:11
  • If you only want png|jpg, then '/<img[^>]+src=([\'"])([^>\'"]+?\.(?:png|jpg))\1/i' – Andrew Dec 14 '13 at 14:15
  • @Programming_crazy: What you were looking for is `'/\b(https?:\/\/|\/\/)\S+…` – Bergi Dec 14 '13 at 14:19
  • @Andrew: Is `$matches[0][0]` the right way to get the result? It gives nothing. – Programming_crazy Dec 14 '13 at 14:19
  • @Andrew [THE PONY HE COMES](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Niet the Dark Absol Dec 14 '13 at 14:19
  • Empty? See this: http://ideone.com/sHnRWx. -> string(6) "xx.jpg" – Andrew Dec 14 '13 at 14:24
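
For reference, the regex route discussed in the comments above might look something like the sketch below. The `Unknown modifier '/'` warning comes from an unescaped `/` inside a pattern that also uses `/` as its delimiter, so this sketch switches the delimiter to `~`. It also makes the `https?:` part optional as a group and drops the leading `\b`, because there is no word boundary between `"` and `/`. The sample `$html`, the `example.com` URLs, and the tighter `[^\s"'<>]+` character class (in place of `\S+`) are assumptions made for illustration, not part of the original question.

```php
<?php
// Hypothetical sample input; $html stands in for the fetched page.
$html = '<img src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Adolescent_girl_sad_0001.jpg/200px-Adolescent_girl_sad_0001.jpg">'
      . '<img src="http://example.com/a.png"> <a href="https://example.com/page.html">x</a>';

// "~" as the delimiter means "/" needs no escaping inside the pattern.
// (?:https?:)? makes the scheme optional, so "http://...", "https://..." and "//..." all match.
// [^\s"'<>]+ stops at quotes and angle brackets so the match stays inside the attribute value.
$pattern = '~(?:https?:)?//[^\s"\'<>]+\.(?:png|jpg)\b~i';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, one entry per URL found.
print_r($matches[0]);
```

With this sample input, `$matches[0]` should contain the protocol-relative Wikimedia URL and the `a.png` URL, while the `.html` link is skipped.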

1 Answer

You could just avoid the wrath of the pony instead...

$dom = new DOMDocument();
$dom->loadHTML($html);
// Collect the src attribute of every <img> element.
$images = $dom->getElementsByTagName('img');
$sources = array();
foreach ($images as $img) {
    $sources[] = $img->getAttribute("src");
}

Done!
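
If only .png and .jpg sources are wanted, as in the question, the loop can filter on the extension. The following is a minimal sketch, not a definitive implementation: the sample markup and the `example.com` URL are made up, and prefixing protocol-relative URLs with `https:` is an assumption about how you want them resolved.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<img src="//upload.wikimedia.org/a/b.jpg">'
      . '<img src="http://example.com/logo.png">'
      . '<img src="/static/spacer.gif">'
      . '</body></html>';

$dom = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$sources = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Check the extension on the path only, so a query string does not hide it.
    $path = (string) parse_url($src, PHP_URL_PATH);
    if (preg_match('~\.(?:png|jpg)$~i', $path)) {
        // Assumption: protocol-relative URLs ("//host/...") are resolved against https.
        $sources[] = (strpos($src, '//') === 0) ? 'https:' . $src : $src;
    }
}

print_r($sources);
```

Calling `libxml_use_internal_errors(true)` before `loadHTML()` keeps malformed markup from flooding the output with warnings, which is common when scraping pages you do not control.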

Niet the Dark Absol