I need to extract the content of an HTML tag using RegEx. The body of text I'm searching looks like this:

<div class="content">
    The Price is <script type="text/javascript">document.write(123())</script>
</div>

I tried to use this expression, but it fails to match. I need to extract the "document.write(123())" part.

(?s)<div class="content">[^<]*<script type="text/javascript">(.*?)</script></div>

How can I modify my expression to get what I'm after?

Blumer
Kathick

3 Answers

There are a couple of problems with your regular expression:

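  • (?s) may not be supported by your regex flavor (it is the inline "dot matches newline" flag), and it is not needed here because the script content sits on a single line
  • You are not accounting for the whitespace (a line break) between </script> and </div>
  • The forward slashes (/) may need to be escaped, i.e., \/, if your regex engine uses / as the pattern delimiter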

This seems to work (DEMO):

<div class="content">[^<]*<script type="text\/javascript">(.*?)<\/script>[^<]*<\/div>
mellamokb

You just forgot to account for the whitespace between </script> and </div>:

(?s)<div class="content">[^<]*<script type="text/javascript">(.*?)</script>\s*</div>
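To see the difference, here is a minimal sketch in Python (the question does not say which regex engine is in use, so Python's `re` module is an assumption). It runs the original and the corrected pattern against the sample markup from the question; the variable names are only illustrative.

```python
import re

# Sample markup copied from the question.
html = """<div class="content">
    The Price is <script type="text/javascript">document.write(123())</script>
</div>"""

# Original pattern: expects </div> immediately after </script>.
original = r'(?s)<div class="content">[^<]*<script type="text/javascript">(.*?)</script></div>'

# Corrected pattern: \s* absorbs the line break before </div>.
corrected = r'(?s)<div class="content">[^<]*<script type="text/javascript">(.*?)</script>\s*</div>'

no_match = re.search(original, html)   # None: the newline before </div> is not matched
match = re.search(corrected, html)
script_body = match.group(1) if match else None

print(no_match)
print(script_body)
```

The only change is the `\s*` before `</div>`; the capture group is untouched, so group 1 still holds the script body.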

nicopico

Extracting content from HTML using Regex is a sure road to madness. It's worse than the idea of validating email addresses with Regex.

If you are using C#/.NET, I can recommend the Html Agility Pack, which does an awesome job of extracting content from any HTML (there is a good answer here on Stack Overflow that shows how to use it).

If you are using some other technology, just look for alternative libraries that do the same thing; you are sure to find that somebody else has already solved this problem.
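As one illustration of the parser-based route, here is a rough sketch using `html.parser` from the Python standard library. Python is an assumption, since the question does not name a language, and the class name `ScriptInContentDiv` is made up for this example. It collects the text of every <script> element nested inside a <div class="content">, no matter how the whitespace around the tags is laid out.

```python
from html.parser import HTMLParser


class ScriptInContentDiv(HTMLParser):
    """Collect the text of <script> elements nested in <div class="content">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0      # > 0 while we are inside the target div
        self.in_script = False  # True while inside a <script> in that div
        self.scripts = []       # one string per matching <script>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside the target div
            elif "content" in (dict(attrs).get("class") or "").split():
                self.div_depth = 1   # entering the target div
        elif tag == "script" and self.div_depth > 0:
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self.in_script:
            self.scripts[-1] += data


html = """<div class="content">
    The Price is <script type="text/javascript">document.write(123())</script>
</div>"""

parser = ScriptInContentDiv()
parser.feed(html)
parser.close()
found = parser.scripts
print(found)
```

Because the parser tracks element nesting rather than literal character sequences, extra attributes, line breaks, or indentation changes in the markup do not require touching the extraction logic.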

Community
nikib3ro