0

I have the following input:

<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>

I know title1 and title2, and I want to collect content1 and content2.

I would need something like this:

<div style="s1">title1</div>.*?<div style="s1">(.*?)</div>

But since the regexp is greedy, it matches through to the end, so it returns:

content1</div>
    <div style="s1">title2</div>
    <div style="s1">content2

I would like to add to the pattern a list of tags that must not appear inside the match.

Something like:

<div style="s1">title1</div>.*?<div style="s1">(.*?[^<div])</div>

where [^<div] is meant to say "does not contain <div". There should be several such options, probably combined with |.

How can I do it?

Pentium10

3 Answers

4

Obligatory link.

Now that that is out of the way, just do some DOM manipulation and XPath:

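    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ hides the warnings libxml raises on sloppy markup
    $x = new DOMXPath($dom);

    $content = array();
    foreach($x->query("//div") as $node)
    {
       if (trim($node->textContent) == 'title1')
       {
           // nextSibling can be a whitespace text node, so ask for the next <div> element instead
           $next = $x->query("following-sibling::div[1]", $node)->item(0);
           if ($next !== null)
           {
               $content['title1'] = trim($next->textContent);
           }
       }
    }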

Now wasn't that easy? So no more regexing HTML, okay?
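Since the question asks for both content1 and content2, here is a minimal sketch of the same DOM idea applied to every known title. The $titles array and the $html sample are assumptions made for illustration, and the XPath string is built naively, so it assumes the titles contain no double quotes.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>
HTML;

// Hypothetical list of the titles that are known in advance.
$titles = ['title1', 'title2'];

$dom = new DOMDocument();
@$dom->loadHTML($html);   // @ hides libxml warnings about the markup
$xpath = new DOMXPath($dom);

$content = [];
foreach ($titles as $title) {
    // First <div> sibling that follows the <div> whose text equals the title.
    $query = sprintf('//div[normalize-space(.)="%s"]/following-sibling::div[1]', $title);
    $node = $xpath->query($query)->item(0);
    if ($node !== null) {
        $content[$title] = trim($node->textContent);
    }
}

print_r($content);
```

Using following-sibling::div[1] rather than nextSibling skips the whitespace text nodes that sit between the divs.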

Byron Whitlock
0
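<div style="s1">title1</div>.*?<div style="s1">(([^<]|<[^\/])*)</div>

Try this: the capture group matches any run of characters other than <, or a < that is not followed by /, so it stops at the first closing tag. Use the s modifier if the divs are on separate lines, since . does not match a newline by default. If you want, I can add a condition for sub-divs etc.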
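For illustration, a rough sketch of how a pattern of this shape could be run with preg_match for each known title. The $titles array, the ~ delimiter, the lazy .*? before the second div, and the s modifier are assumptions added here rather than part of the answer as posted.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>
HTML;

// Hypothetical list of the titles that are known in advance.
$titles = ['title1', 'title2'];

$content = [];
foreach ($titles as $title) {
    // The group accepts any character except "<", or a "<" not followed by "/",
    // so it cannot run past the first closing tag.
    $pattern = '~<div style="s1">' . preg_quote($title, '~')
             . '</div>.*?<div style="s1">((?:[^<]|<[^/])*)</div>~s';
    if (preg_match($pattern, $html, $m)) {
        $content[$title] = $m[1];
    }
}

print_r($content);
```

If the content itself contains tags, the group stops at the first closing tag inside it and the match fails, which is where an extra condition would be needed.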

SergeS
0

Just use the U modifier (ungreedy): http://.php.net/manual/fr/reference.pcre.pattern.modifiers.php
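A small sketch of what the U modifier changes, using a single-line version of the question's input (an assumption made to keep the example short). Note that U inverts greediness, so a quantifier followed by ? becomes greedy again under U; the pattern below therefore uses a plain (.*).

```php
<?php
// Single-line variant of the question's input (assumed for brevity).
$html = '<div style="s1">title1</div><div style="s1">content1</div>'
      . '<div style="s1">title2</div><div style="s1">content2</div>';

$pattern = '<div style="s1">title1</div><div style="s1">(.*)</div>';

// Default behaviour: (.*) is greedy and runs up to the last </div>.
preg_match('~' . $pattern . '~s', $html, $greedy);

// With U, quantifiers are lazy by default, so (.*) stops at the first </div>.
preg_match('~' . $pattern . '~sU', $html, $ungreedy);

echo $greedy[1], "\n";
echo $ungreedy[1], "\n";
```

The first line printed runs through to content2, while the second is just content1.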

soju