I have the following regex to detect start and end script tags in an HTML file:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
meaning in short: <script NOT</s > NOT</s </script>
it works, but it needs a really long time to detect <script>,
even minutes or hours for long strings.
The lite version works perfectly even for long strings:
<script[^<]*>[^<]*</script>
however, I also use the extended pattern for other tags like <a>, where < and > are possible as attribute values.
A Python test for you:
import re
pattern = re.compile('<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
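A minimal sketch of one possible fix, under my assumption that the slowdown is catastrophic backtracking: the nested quantifier `(?:[^<]+|...)*` lets the engine split a run of plain text in exponentially many ways whenever the overall match fails. Matching exactly one character per iteration keeps the alternatives disjoint, so there is only one way to walk the string:

```python
import re

# Sketch of a possible fix (assumption: the slowdown is catastrophic
# backtracking caused by the nested quantifier (?:[^<]+|...)* -- the
# inner + and the outer * can split the same run of text in
# exponentially many ways).  Matching exactly one character per
# iteration keeps the alternatives disjoint.
pattern = re.compile(
    r'<script'
    r'(?:[^<]|<(?:[^/]|/[^s]))*'  # anything that does not start "</s"
    r'>'
    r'(?:[^<]|<(?:[^/]|/[^s]))*'  # same rule for the element body
    r'</script>',
    re.I | re.DOTALL)

text = '<script type="text/javascript">' + 'hard example' * 50 + '</script>'
m = re.search(pattern, text)  # returns quickly instead of hanging
```

The semantics stay the same as the extended pattern (stop only at `</s`), so `>` and `<` inside attribute values are still allowed.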
How can I fix it? The inner part of the regex (after <script>) should be changed and simplified.
PS :) To anticipate your answers about the wrong approach of parsing HTML with regex: I know many HTML/XML parsers very well, and even better what I can expect in often-broken HTML code; regex is really useful here.
comment:
well, I need to handle each
<a < document like this.border="5px;">
and the approach is to use parsers and regex together.
BeautifulSoup is only ~2k lines; it does not handle every kind of HTML and just extends the regexes from sgmllib.
The main reason is that I must know the exact position where every tag starts and stops, and every piece of broken HTML must be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []
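Since exact offsets are the hard requirement, here is a minimal sketch of position tracking with `re.finditer`, using the simple fast pattern (it inherits that pattern's limitation: tags whose attributes contain `<` are missed):

```python
import re

# Sketch: finditer reports the exact start/end offset of every match,
# which is the information a tree-building parser usually hides.
# Uses the "lite" pattern, so tags with '<' inside attribute values
# are missed.
script_re = re.compile(r'<script[^<]*>[^<]*</script>', re.I | re.DOTALL)

html = '11<script type="text/javascript">alert(1)</script>22'
spans = [(m.start(), m.end()) for m in re.finditer(script_re, html)]
# spans -> [(2, 50)]: the tag starts at index 2 and ends at index 50
```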
@Cylian:
Atomic grouping, as you know, is not available in Python's re module,
so non-greedy everything (.*?) until <\s*/\s*tag\s*> is the winner at this time.
I know that it is not perfect in this case:
re.search(r'<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group()
but I can handle the refused tail in the next parsing pass.
It's pretty obvious that parsing HTML with regex is not a one-battle fight.
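To illustrate that trade-off, a small sketch of the non-greedy fallback: it stops at the first closing tag, and the refused tail is deliberately left over for the next pass.

```python
import re

# Sketch of the non-greedy fallback: .*? stops at the FIRST closing
# tag, so on the broken input below the tail ' shit </script>' is
# left over for the next parsing pass.
loose = re.compile(r'<\s*script.*?<\s*/\s*script\s*>', re.I | re.DOTALL)

s = '< script </script> shit </script>'
m = re.search(loose, s)
matched = m.group()   # '< script </script>'
tail = s[m.end():]    # ' shit </script>' -- handled in the next pass
```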