I want to selectively strip HTML from a string. I have used strip_tags to allow divs, but I don't want to keep the divs' style/class attributes. That is, I want:

<div class="something">text</div>
<div style="something">text</div>

to simply become:

<div>text</div>

in both cases.

Can anyone help? Thanks!

Bart Kiers
Alex
  • possible duplicate of [php regexp: remove all attributes from an html tag](http://stackoverflow.com/questions/3026096/php-regexp-remove-all-attributes-from-an-html-tag) – Gordon Nov 14 '10 at 22:28
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Nov 14 '10 at 22:28

3 Answers

2

Replace every match of the following regex with an empty string:

(?<=<div.*?)(?<!=\t*?"?\t*?)(class|style)=".*?"
J V
  • 1
    What if there is an attribute containing `class=` or `style=` like `
    `?
    – Gumbo Nov 14 '10 at 19:37
  • @J V: That won’t fix it, see for example `
    `.
    – Gumbo Nov 14 '10 at 19:42
  • Ok, it's getting complicated now, but I think I got it... Honestly, if the html is so screwed up regex is the last thing to worry about :) – J V Nov 14 '10 at 19:44
  • 1
    Never mind the whitespace, this regex won't work because it requires variable-length lookbehinds, and PHP (like most flavors) doesn't do that. Lookbehinds should never be your first resort anyway; there's almost always an easier way. – Alan Moore Nov 14 '10 at 21:10
  • Ah, in that case I cave to vincent :) – J V Nov 14 '10 at 21:41
1

Here is an example:

// Group 1 is the attribute and group 2 is the div's text, so the replacement uses $2.
$str = preg_replace('`<div (style="[^"]*"|class="[^"]*")>([^<]*)</div>`i', '<div>$2</div>', $str);

Basically, this matches a div whose only attribute is a single style or class attribute and whose content is plain text. The replacement then drops the attribute and keeps only <div>content</div>.

It's longer than J V's version, but it won't replace something like <div style="blablabla" color="blablabla">content</div>, for instance. May or may not be what you want.
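
If the goal is instead to drop every attribute from every div, whatever the attributes are, a shorter pattern may be enough. The following is only a sketch: the function name stripDivAttributes is made up for illustration, and it assumes that no attribute value contains a literal > character.

```php
<?php
// Hypothetical helper, not taken from the answers above: replaces every
// opening <div ...> tag with a bare <div>, whatever attributes it carries.
function stripDivAttributes(string $html): string
{
    // <div   literal tag name (case-insensitive because of the i flag)
    // \b     word boundary, so a tag such as <divider> is not matched
    // [^>]*  everything up to the closing > (assumes no > inside attribute values)
    return preg_replace('/<div\b[^>]*>/i', '<div>', $html) ?? $html;
}

echo stripDivAttributes('<div class="something">text</div>'), "\n"; // <div>text</div>
echo stripDivAttributes('<div style="something" color="red">text</div>'), "\n"; // <div>text</div>
```

Closing tags and the text between the tags are left untouched, so nested markup inside the div survives.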

Vincent Savard
  • I see a problem using the very example the OP gave :) (Hint, repeaters are greedy) – J V Nov 14 '10 at 19:37
  • Actually, the . class is greedy. [^"] is not, it stops after the first " encountered. No worries, I test my code before I post (usually at least!) – Vincent Savard Nov 14 '10 at 19:39
  • Think about it, it doesn't make sense. I have a class that matches every character but ". What happens when it encounters a "? It stops matching. This has nothing to do with * or any quantifier. As I said, I tested my code with OP's example, it works correctly. – Vincent Savard Nov 14 '10 at 19:43
  • Ah yes I see... Although mine only deletes the style/class attribute itself so any other attributes remain. – J V Nov 14 '10 at 19:45
0

As an alternative to regexp (which always freaks me out), I'd suggest using xml_parse_into_struct.

See its page at php.net, in particular its first example.
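
A rough sketch of how that could look for the markup in the question. It assumes the input is well-formed XML and that PHP's xml extension is enabled. The function name stripDivAttributesXml and the <root> wrapper element are made up for illustration.

```php
<?php
// Hypothetical helper: rebuilds markup from xml_parse_into_struct() output,
// dropping the attributes of <div> elements and keeping those of other tags.
function stripDivAttributesXml(string $fragment): string
{
    $parser = xml_parser_create();
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    // The <root> wrapper is an assumption: the parser needs a single root element.
    if (!xml_parse_into_struct($parser, "<root>$fragment</root>", $values)) {
        throw new InvalidArgumentException('Input is not well-formed XML');
    }

    $out = '';
    foreach ($values as $node) {
        $tag = $node['tag'];
        // Level 1 is the artificial wrapper; its tags are not emitted.
        $isWrapper = (int) $node['level'] === 1;
        // The parser decodes entities, so text is re-escaped on output.
        $text = htmlspecialchars($node['value'] ?? '', ENT_NOQUOTES);

        $attrs = '';
        if ($tag !== 'div') {
            foreach ($node['attributes'] ?? [] as $name => $value) {
                $attrs .= sprintf(' %s="%s"', $name, htmlspecialchars($value));
            }
        }

        $open  = $isWrapper ? '' : "<$tag$attrs>";
        $close = $isWrapper ? '' : "</$tag>";

        switch ($node['type']) {
            case 'open':     $out .= $open . $text;          break;
            case 'complete': $out .= $open . $text . $close; break;
            case 'cdata':    $out .= $text;                  break;
            case 'close':    $out .= $close;                 break;
        }
    }

    return $out;
}

echo stripDivAttributesXml('<div class="something">text</div><div style="something">text</div>'), "\n";
```

Note that real-world HTML is often not well-formed XML (unclosed tags, unquoted attributes), in which case this sketch throws instead of guessing. Empty elements are also written back with an explicit closing tag.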

Teson