PHP Regular Expression Pains

Question

So I have written a small bit of code to convert tables to divs for our mobile sites.

Here is an excerpt of the code:

function replaceTables($table, $html) {

            $tempTable = preg_replace('/<table[^>]*>(.*?)<\/table>/is', '<div style="width: 90%; margin: auto;">$1</div><div style="clear: both;"></div>', $table);
            $html = str_replace($table, $tempTable, $html);

            preg_match_all('/(?!<table[^>]*>).*?<tr[^>]*>.*?<\/tr>.*?(?<!<\/table>)/is', $tempTable, $rows, PREG_OFFSET_CAPTURE);

            for ($i = 0; $i < count($rows[0]); $i++) {
                $tempRow = $rows[0][$i][0];

                preg_match_all('/(?!<table[^>]*>).*?<td[^>]*>.*?<\/td>.*?(?<!<\/table>)/is', $tempRow, $cols, PREG_OFFSET_CAPTURE);

                $numCols = count($cols[0]);
                $colWidth = 100/$numCols;

                for ($x = 0; $x < $numCols; $x++) {
                    $tempCol = $cols[0][$x][0];
                    $cols[0][$x][0] = preg_replace('/<td[^>]*>(.*?)<\/td>/is', '<div style="width: ' . $colWidth . '%; float: left;">$1</div>', $cols[0][$x][0]);
                    $tempRow = str_replace($tempCol, $cols[0][$x][0], $tempRow);
                }

                $tempRow = preg_replace('/<tr[^>]*>(.*?)<\/tr>/is', '<div style="clear: both;">$1</div>', $tempRow);
                $tempTable = str_replace($rows[0][$i][0], $tempRow, $tempTable);
            }

            $html = str_replace($table, $tempTable, $html);

            return $html;
        }

        if ($mobile && $page->type_id != 16) {
            // replace tables with divs for better mobile support

            preg_match_all('/<table[^>]*>.*?<\/table>/is', $this->html, $tables, PREG_OFFSET_CAPTURE);

            for ($y = 0; $y < count($tables[0]); $y++) {
                preg_match_all('/<table[^>]*>.*?<\/table>/is', $tables[0][$y][0], $nestedTables, PREG_OFFSET_CAPTURE);

                if (count($nestedTables[0]) > 0) {
                    //echo count($nestedTables[0]) . "<br />";
                    //print_r($nestedTables[0][0][0]);
                    // separate counter so the outer loop's $y is not overwritten
                    for ($n = 0; $n < count($nestedTables[0]); $n++) {
                        $this->html = replaceTables($nestedTables[0][$n][0], $this->html);
                    }
                }
                $this->html = replaceTables($tables[0][$y][0], $this->html);
            }
            //$this->html = preg_replace('/<table[^>]*>(.*?)<\/table>/is', '<div style="width: 90%; margin: auto;">$1<div style="clear: both;"></div></div>', $this->html);
        }
        return $this->html;

I am having issues with nested tables: the regular expression matches up to the first closing table tag instead of the one that actually closes the table I am processing.

If someone could lead me in the direction of a better regular expression or a different solution to replace tables with divs that would be great. The solution must be by manipulating a string, so that there will not have to be an overhaul of our templated system.

Thanks

You are trying to parse HTML with regex and you are calling this *a pain*? Seriously? Some other users of this site have a different definition of it: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. I am even wearing a T-Shirt with this poem that was inspired by people like you. — Darin Dimitrov, Sep 25 '12 at 20:08
Read this thread: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Tchoupi, Sep 25 '12 at 20:08
Please refrain from parsing HTML with RegEx as it will [drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an [HTML parser](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) instead. — Madara's Ghost, Sep 25 '12 at 20:09
Wouldn't it just be better to write the website in divs instead of tables so you don't have to have some function do this for you? — Catfish, Sep 25 '12 at 20:10
@Catfish: Apparently OP works with some sort of framework, or is converting old pages. (at least I hope so) — Madara's Ghost, Sep 25 '12 at 20:16
Yes, we have a proprietary system for hosting templated websites. These websites can have arbitrary code (html/javascript) added to individual pages or globally to the header and body sections. So I have added some code to swap out said templates with mobile templates that I have developed, but as you know tables look terrible on mobile devices. I need to be able to swap tables for divs, and I cannot rely on our developers to use only divs, or our clients for that matter, in place of tables. — noub, Sep 25 '12 at 20:29
but you don't use regex to parse html???? of course its giving you pains — geekman, Sep 25 '12 at 20:37
@noub: Make a golden rule. No tables allowed. For specific cases, talk to the head developer (in the rare case where someone actually needs a table). — Madara's Ghost, Sep 25 '12 at 20:41
Yes, I am in the process of moving from regex to find the tables to using the DOMDocument to first parse it, then using regex to replace the individual tables and their elements. Unfortunately I cannot make that rule to enforce the use of divs over tables, it simply cannot be done. — noub, Sep 25 '12 at 20:54

score 1 · Answer 1 · answered Sep 25 '12 at 20:34

As many have said, parsing HTML with regex is not likely to be the ideal method. Still, I've done a bit of research to try to help, under the assumption that you are bound to this approach for some reason.

It sounds like you may be running into issues related to how PHP is interpreting the greediness of your regex patterns. I see that you put a ? after many of your quantifiers (as in .*?), which makes those matches non-greedy, based on what I'm reading at http://php.net/manual/en/reference.pcre.pattern.modifiers.php at least. There is a chance that you can fix this by using the U modifier on some or all of your regex patterns. This inverts the greediness, so your .*? quantifiers would become greedy again.

That said, you have a complicated set of regex checks there, so there is definitely a chance of this causing some unintended behaviors as well. I suggest you test and see.

For reference, you would invoke the U modifier by placing it after the closing / of the regex, much like you have done with i and s in some places.
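To make the effect concrete, here is a small sketch using made-up sample strings ($html and $siblings are illustrative, not taken from the question). It compares the question's lazy pattern, the same pattern with U, and a plain greedy pattern. It only shows what the modifier changes and is not meant as a complete fix.

```php
<?php
// Hypothetical sample input: one table nested inside another.
$html = '<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>';

// Lazy .*? (the pattern from the question): stops at the FIRST </table>.
preg_match('/<table[^>]*>.*?<\/table>/is', $html, $lazy);

// Same pattern with U: greediness is inverted, so .*? runs to the LAST </table>.
preg_match('/<table[^>]*>.*?<\/table>/isU', $html, $withU);

// A plain greedy .* without U does the same for this input.
preg_match('/<table[^>]*>.*<\/table>/is', $html, $greedy);

echo "lazy:   ", $lazy[0], "\n";
echo "with U: ", $withU[0], "\n";
echo "greedy: ", $greedy[0], "\n";

// Caveat: a greedy match also swallows everything between two sibling tables.
$siblings = '<table><tr><td>a</td></tr></table><p>x</p><table><tr><td>b</td></tr></table>';
preg_match('/<table[^>]*>.*<\/table>/is', $siblings, $span);
echo "span:   ", $span[0], "\n";
```

With the nested sample, the lazy match is cut off at the inner closing tag, while the other two return the whole outer table. With the sibling sample, the greedy match spans both tables and the paragraph between them, so changing greediness alone cannot pair every opening tag with its own closing tag.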

The `U` modifier is _never_ a good idea! - it is unnecessary and its only effect is to confuse. — ridgerunner, Sep 26 '12 at 01:07

score 0 · Answer 2 · answered Sep 25 '12 at 20:39

What has worked the best for me is to process HTML content using the following steps:

1. Convert content to UTF-8, using utf8_encode($s), if not already UTF-8.
2. Convert to XHTML using $tidy->repairFile($file, array('output-xhtml'=>true), 'utf8');
3. Build DOM using $sx = simplexml_load_file($file, 'SimpleXMLElement', LIBXML_NOENT);
4. Parse DOM using $sx->xpath($xpath);

I hope that helps!
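As a rough sketch of the last two steps above (building the tree and querying it with XPath), the snippet below assumes the content is already UTF-8 and well-formed XHTML, that is, that the encoding and tidy steps have been done. The sample markup and the variable names are made up for illustration, and simplexml_load_string is used in place of simplexml_load_file so that no file is needed.

```php
<?php
// Assumption: $xhtml is already UTF-8, well-formed XHTML.
$xhtml = <<<'XML'
<div>
  <table>
    <tr>
      <td>A</td>
      <td>
        <table><tr><td>inner</td></tr></table>
      </td>
    </tr>
  </table>
</div>
XML;

// Build the tree from a string.
$sx = simplexml_load_string($xhtml);
if ($sx === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

// Query with XPath: every table, and only those nested inside another table.
$tables = $sx->xpath('//table');
$nested = $sx->xpath('//table//table');

// Count the columns of each row. ./tr and ./td select direct children only,
// so the cells of a nested table are not counted as part of the outer row.
$colCounts = array();
foreach ($tables as $table) {
    foreach ($table->xpath('./tr') as $row) {
        $colCounts[] = count($row->xpath('./td'));
    }
}

printf("%d table(s), %d nested\n", count($tables), count($nested));
foreach ($colCounts as $numCols) {
    printf("row with %d column(s), %s%% width each\n", $numCols, 100 / $numCols);
}
```

Because the tree already knows which closing tag belongs to which table, the nested table shows up as its own node and the outer row still reports two columns. The per-row column count is the same figure the question's code uses to compute $colWidth.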
