How can I extract or cut the HTML contents inside ........? The HTML source is not correctly formatted

Question

<html>
    <head><title>bla bla</title></head>
    <body>
    <div id="mainContent" xmlns:h="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
        bla bla .....
    </div>
    </body>
</html>

I need to extract that division. How can I do it using PHP 5?

The HTML source is not correctly formatted. There are some undefined attributes.

You do NOT want to use a regex for this - http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — ThiefMaster, Apr 23 '12 at 08:01
possible duplicate of [How to parse and process HTML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php) — Peter Ajtai, Apr 23 '12 at 08:14

Ja͢ck · Accepted Answer · 2012-04-23T09:10:16.813

1

If your HTML is not well formed, you can still use stuff like DOMDocument, e.g.:

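// Collect parser warnings from malformed markup instead of printing them.
libxml_use_internal_errors(true);

$d = new DOMDocument;
$d->loadHTML($htmlstring);

$x = new DOMXPath($d);

// nodeValue holds the text content of the div, without its tags.
foreach ($x->query('//div[@id="mainContent"]') as $node) {
    echo $node->nodeValue;
}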
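
Since the question asks for the division itself rather than only its text, the same approach can return the element's markup. The following is a minimal sketch, not part of the original answer: `extractElementMarkup` is a hypothetical helper name, the sample input is made up, and the id is assumed to contain no double quotes.

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup of
// the first element with the given id, or null if no such element exists.
function extractElementMarkup($html, $id)
{
    // Collect parser warnings from malformed input instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no double quotes.
    $nodes = $xpath->query('//*[@id="' . $id . '"]');

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // saveXML() with a node argument serializes that node and its children.
    return $doc->saveXML($nodes->item(0));
}

// Made-up sample with an unknown attribute and an unclosed <b> tag.
$html = '<html><head><title>bla bla</title></head><body>'
      . '<div id="mainContent" foo="bar">bla <b>bla</div>'
      . '</body></html>';

echo extractElementMarkup($html, 'mainContent'), "\n";
```

The parser repairs the markup while loading, so the returned string may differ slightly from the raw source (for example, unclosed tags get closed).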

Alternatively, just prefix the HTML with <!DOCTYPE html> so that you can use getElementById as per normal.
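
A rough sketch of that alternative, assuming the page is already in `$htmlstring` (the sample value below is made up for illustration):

```php
<?php
// Made-up sample input standing in for the question's page.
$htmlstring = '<html><head><title>bla bla</title></head><body>'
            . '<div id="mainContent">bla bla</div>'
            . '</body></html>';

// Keep parser warnings out of the output.
libxml_use_internal_errors(true);

$d = new DOMDocument();
// Prefix a doctype, as suggested above, before parsing.
$d->loadHTML('<!DOCTYPE html>' . $htmlstring);
libxml_clear_errors();

$node = $d->getElementById('mainContent');
if ($node !== null) {
    // Serialize the div together with its children.
    echo $d->saveXML($node), "\n";
} else {
    echo "No element with that id was found\n";
}
```

Checking for `null` matters here because `getElementById` returns `null` when no element is registered under that id.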

edited Apr 23 '12 at 09:10

answered Apr 23 '12 at 08:41


Jack · Answer 2 · 2012-04-23T09:32:08.787

0

/<div id=\"mainContent\".*?</div>/gs

http://regexr.com?30o0l if you want to capture everything from the div opening tag to the closing tag.

edited Apr 23 '12 at 09:32

answered Apr 23 '12 at 08:44


This will match anything till the **last** closing tag. It will work only for this very simple example. – stema Apr 23 '12 at 08:54
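
To illustrate the limitation, here is a small sketch using a PHP translation of the pattern. The sample markup is made up, and the translation is an assumption about how the pattern would be written for `preg_match`: PCRE has no `g` flag, and the `/` in the closing tag is escaped because `/` is the delimiter. With the lazy `.*?` the match stops at the first `</div>`; a greedy `.*` runs to the last one, as the comment describes. Neither handles nested or sibling divs.

```php
<?php
// Made-up sample: a nested div inside the target, then a sibling div.
$html = '<div id="mainContent"><div class="inner">one</div> two</div>'
      . '<div id="footer">three</div>';

// Lazy version: stops at the FIRST closing tag, cutting the div short.
preg_match('/<div id="mainContent".*?<\/div>/s', $html, $lazy);

// Greedy version: runs to the LAST closing tag, swallowing the sibling div.
preg_match('/<div id="mainContent".*<\/div>/s', $html, $greedy);

echo $lazy[0], "\n";   // <div id="mainContent"><div class="inner">one</div>
echo $greedy[0], "\n"; // the whole sample string, footer div included
```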
