
I have this code:

$tags = implode("|", array("a", "script", "link", "iframe", "img", "object"));
$attrs = implode("|", array("href", "src", "data"));
$any_tag = "\w+(?:\s*=\s*[\"'][^\"']*[\"'])?";
$replace = array(
    "/(<(?:$tags)(?:\s*$any_tag)*\s*(?:$attrs)=[\"'])(?![\"']?(?:data:|#))([^'\"]+)([\"'][^>]*>)/" => function($match) {
        return $match[1] . $match[2] . $match[3]; // return same data
    }
);
$page = preg_replace_callback_array($replace, $page);
echo $page;

and I'm running this code against https://duckduckgo.com/d2038.js, and $page is empty after executing the replace. Why? When I add print_r($match); in the callback I get:

Array
(
    [0] => <a href='/a'>
    [1] => <a href='
    [2] => /a
    [3] => '>
)

The same happens if I assign the return value of the replace to another variable. Why is the page empty?

If I run this in regex101 it matches more elements (https://regex101.com/r/CPGuKd/1) and it doesn't clear the output.

jcubic

1 Answer


The final regex that your code builds is this:

(<(?:a|script|link|iframe|img|object)(?:\s*\w+(?:\s*=\s*["'][^"']*["'])?)*\s*(?:href|src|data)=["'])(?!["']?(?:data:|#))([^'"]+)(["'][^>]*>)

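which is different from the one in your live demo and causes catastrophic backtracking. When PCRE gives up on such a match, preg_replace_callback_array returns NULL instead of a string, which is why $page ends up empty.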
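A minimal sketch of how this failure shows up in PHP. The pattern is built as in the question; the input string is a hypothetical stand-in (not the real d2038.js) chosen so that the nested quantifiers have many ways to split a run of word characters. The sketch checks for a NULL result and reads preg_last_error() immediately after the call.

```php
<?php
// Pattern built the same way as in the question.
$tags = implode("|", array("a", "script", "link", "iframe", "img", "object"));
$attrs = implode("|", array("href", "src", "data"));
$any_tag = "\w+(?:\s*=\s*[\"'][^\"']*[\"'])?";
$pattern = "/(<(?:$tags)(?:\s*$any_tag)*\s*(?:$attrs)=[\"'])(?![\"']?(?:data:|#))([^'\"]+)([\"'][^>]*>)/";

// Hypothetical input: a tag followed by a long run of word characters
// and no href/src/data attribute, so the match has to fail.
$page = "<a " . str_repeat("x", 100) . ">";

$result = preg_replace_callback_array(array(
    $pattern => function ($match) {
        return $match[1] . $match[2] . $match[3]; // return same data
    }
), $page);

// Read the error code right away, before any other preg_* call resets it.
$error = preg_last_error();

if ($result === null) {
    // NULL means the regex engine aborted; echoing NULL prints nothing,
    // which looks like an "empty page".
    echo "replace failed, preg_last_error() = $error\n";
}

// Fall back to the untouched input instead of losing the page.
$output = ($result === null) ? $page : $result;
```

The error code can be compared against constants such as PREG_BACKTRACK_LIMIT_ERROR or PREG_JIT_STACKLIMIT_ERROR to see which limit was hit.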

To match your live demo, the PHP code needs a small change:

"/(<(?:$tags)(?:\s*$$any_tag)*...
                   ^
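Another way to remove the ambiguity is to use possessive quantifiers such as `\s++`. The sketch below is not the pattern from the live demo. It assumes every attribute is preceded by at least one whitespace character, and the names `$any_attr`, `$html`, `$slow_input` and `$urls` are made up for illustration.

```php
<?php
$tags = implode("|", array("a", "script", "link", "iframe", "img", "object"));
$attrs = implode("|", array("href", "src", "data"));

// Assumption: each attribute starts with at least one whitespace character.
// The possessive quantifiers (++, *+, ?+) stop the engine from re-splitting
// a run of whitespace or word characters in many different ways.
$any_attr = "\s++\w++(?:\s*+=\s*+[\"'][^\"']*+[\"'])?+";
$pattern = "/(<(?:$tags)(?:$any_attr)*\s++(?:$attrs)=[\"'])(?![\"']?(?:data:|#))([^'\"]+)([\"'][^>]*>)/";

$urls = array(); // collects the URL captured in group 2
$replace = array(
    $pattern => function ($match) use (&$urls) {
        $urls[] = $match[2];
        return $match[1] . $match[2] . $match[3]; // return same data
    }
);

// Hypothetical well-formed input with an extra attribute before href.
$html = "<p><a class=\"x\" href='/a'>link</a></p>";
$rewritten = preg_replace_callback_array($replace, $html);

// Hypothetical input with a long run of word characters and no URL attribute.
$slow_input = "<a " . str_repeat("x", 100) . ">";
$slow_result = preg_replace_callback_array($replace, $slow_input);
$slow_error = preg_last_error();

var_dump($urls, $rewritten === $html, $slow_result === $slow_input);
```

The outer `*` is left greedy on purpose, so the engine can still give up a whole attribute (for example ` href='/a'`) and let the `(?:$attrs)` part match it instead.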
revo
  • With a `$` (end of input string / line) there, the regex engine fails sooner at each position it tries while iterating over the input string. Admittedly that doesn't make sense without analyzing the regex itself, so I'd go with a possessive quantifier, `\s++`, instead. Now it makes sense. – revo Dec 30 '16 at 10:49
  • I've tested again and it doesn't work; it doesn't match `` – jcubic Dec 30 '16 at 10:51