I have the following regex to detect start and end script tags in an HTML file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
meaning in short: <script NOT</s > NOT</s </script>
it works, but it needs a really long time to detect <script>,
even minutes or hours for long strings.
The lite version works perfectly even for long strings:
<script[^<]*>[^<]*</script>
however, I also use the extended pattern for other tags like <a>, where < and > are possible as attribute values.
A Python test for you:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
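A minimal sketch of one possible fix, under my assumption that the slowdown is catastrophic backtracking: the nested quantifier `(?:[^<]+|...)*` lets the engine split a run of plain text in exponentially many ways whenever the overall match fails. Matching exactly one character per iteration keeps the alternatives disjoint, so there is only one way to walk the string:

```python
import re

# Sketch of a possible fix (assumption: the slowdown is catastrophic
# backtracking caused by the nested quantifier (?:[^<]+|...)* -- the
# inner + and the outer * can split the same run of text in
# exponentially many ways).  Matching exactly one character per
# iteration keeps the alternatives disjoint.
pattern = re.compile(
    r'<script'
    r'(?:[^<]|<(?:[^/]|/[^s]))*'  # anything that does not start "</s"
    r'>'
    r'(?:[^<]|<(?:[^/]|/[^s]))*'  # same rule for the element body
    r'</script>',
    re.I | re.DOTALL)

text = '<script type="text/javascript">' + 'hard example' * 50 + '</script>'
m = re.search(pattern, text)  # returns quickly instead of hanging
```

The semantics stay the same as the extended pattern (stop only at `</s`), so `>` and `<` inside attribute values are still allowed.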
How can I fix it? The inner part of the regex (after <script>) should be changed and simplified.
PS :) To anticipate your answers about the wrong approach of parsing HTML with regex: I know many HTML/XML parsers very well, and even better what I can expect in often-broken HTML code; regex is really useful here.
comment:
well, I need to handle each
<a < document like this.border="5px;">
and the approach is to use parsers and regex together.
BeautifulSoup is only ~2k lines; it does not handle every kind of HTML and just extends the regexes from sgmllib.
The main reason is that I must know the exact position where every tag starts and stops, and every piece of broken HTML must be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
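Since exact offsets are the hard requirement, here is a minimal sketch of position tracking with `re.finditer`, using the simple fast pattern (it inherits that pattern's limitation: tags whose attributes contain `<` are missed):

```python
import re

# Sketch: finditer reports the exact start/end offset of every match,
# which is the information a tree-building parser usually hides.
# Uses the "lite" pattern, so tags with '<' inside attribute values
# are missed.
script_re = re.compile(r'<script[^<]*>[^<]*</script>', re.I | re.DOTALL)

html = '11<script type="text/javascript">alert(1)</script>22'
spans = [(m.start(), m.end()) for m in re.finditer(script_re, html)]
# spans -> [(2, 50)]: the tag starts at index 2 and ends at index 50
```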
@Cylian:
Atomic grouping, as you know, is not available in Python's re module,
so non-greedy everything (.*?) until <\s*/\s*tag\s*> is the winner at this time.
I know that it is not perfect in this case:
re.search(r'<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group()
but I can handle the refused tail in the next parsing pass.
It's pretty obvious that parsing HTML with regex is not a one-battle fight.
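To illustrate that trade-off, a small sketch of the non-greedy fallback: it stops at the first closing tag, and the refused tail is deliberately left over for the next pass.

```python
import re

# Sketch of the non-greedy fallback: .*? stops at the FIRST closing
# tag, so on the broken input below the tail ' shit </script>' is
# left over for the next parsing pass.
loose = re.compile(r'<\s*script.*?<\s*/\s*script\s*>', re.I | re.DOTALL)

s = '< script </script> shit </script>'
m = re.search(loose, s)
matched = m.group()   # '< script </script>'
tail = s[m.end():]    # ' shit </script>' -- handled in the next pass
```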