0

I want to write a regular expression for web scraping.

How can I match a result that spans multiple lines?

For example, this is my HTML:

    <div id="cn-centre-col-inner">

    <p>sothing her</p>
     ...
    </div>

    <div id="ok"> ..</div>

I want a regular expression that gives me this result:

    <div id="cn-centre-col-inner">

    <p>sothing her</p>
     ...
    </div>
Andy Lester
saidmohamed11

2 Answers

2

Regex is not the best tool for this; you should use an HTML parser instead.

Suppose that you have this regex:

(?s)<div id="cn-centre-col-inner">.*?<\/div>

With it, you will be able to capture what you want:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
</div>
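
For reference, here is a minimal sketch of that regex in Python (my choice of language for illustration; the question does not name one). The variable names are made up for the example. The inline `(?s)` and the `re.DOTALL` flag are two ways of saying the same thing: let `.` match newlines too.

```python
import re

# Sample HTML from the question
html = """<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
</div>

<div id="ok"> ..</div>
"""

# Inline (?s) flag: "." also matches newlines; ".*?" is lazy,
# so the match stops at the first closing </div>.
pattern_inline = re.compile(r'(?s)<div id="cn-centre-col-inner">.*?</div>')

# Same pattern, with the flag passed as an argument instead.
pattern_flag = re.compile(r'<div id="cn-centre-col-inner">.*?</div>', re.DOTALL)

block = pattern_inline.search(html).group(0)
print(block)
```

Note that `/` does not need escaping in Python, so `<\/div>` can be written as `</div>`.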

However, you can't be sure that the first closing </div> is the right one. For instance, in this case:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>

you will lose content and capture only:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
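
The truncation is easy to reproduce. A small Python sketch (Python and the variable names are assumptions for illustration only):

```python
import re

# HTML with nested divs inside the target div
nested_html = """<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>
"""

pattern = re.compile(r'(?s)<div id="cn-centre-col-inner">.*?</div>')
truncated = pattern.search(nested_html).group(0)

# The lazy match ends at the first </div>, which closes the first inner div,
# so the second inner div and the real closing tag are lost.
print(truncated)
print("something inner 2" in truncated)  # False
```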


This is a good example of why regex shouldn't be used to parse complex HTML. I strongly recommend using an HTML parser.
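
To give an idea of what the parser route looks like, here is a rough sketch using Python's standard-library `html.parser`. The class name `DivById` is hypothetical, and this is only one way to do it; third-party parsers such as Beautiful Soup or lxml are the more usual choice in practice. The point is that the parser tracks nesting depth, so the inner divs do not end the match early.

```python
from html.parser import HTMLParser


class DivById(HTMLParser):
    """Collects the source of the first <div> with a given id, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # number of open <div>s, counting the target itself
        self.parts = []     # source fragments of the target element
        self.done = False   # set once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Not inside the target yet: only react to the div we want.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as-is.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


nested_html = """<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>
"""

parser = DivById("cn-centre-col-inner")
parser.feed(nested_html)
parser.close()
result = "".join(parser.parts)
print(result)
```

Unlike the regex, this keeps both inner divs and stops at the closing tag that really belongs to `cn-centre-col-inner`. One caveat of this sketch: character references such as `&amp;` come back decoded in the text rather than in their original form.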

If you are absolutely sure that your cn-centre-col-inner div contains no nested divs, you can go ahead with the regex above. You can also use a capturing group to get only the content inside the div:

(?s)<div id="cn-centre-col-inner">(.*?)<\/div>
                                  ^---^--- notice the parentheses
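
As a quick sketch of reading that group in Python (the language and variable names are assumptions for illustration), `group(1)` returns only what the parentheses matched, without the surrounding div tags:

```python
import re

html = """<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
</div>

<div id="ok"> ..</div>
"""

pattern = re.compile(r'(?s)<div id="cn-centre-col-inner">(.*?)</div>')
match = pattern.search(html)

# group(0) is the whole match; group(1) is only the capturing group.
inner = match.group(1) if match else ""
print(inner.strip())
```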


Federico Piazza
1

After reading the warnings about regex and HTML, and if it is only for a specific task, you can try something quick and dirty like this:

(?s)(<div[^>]*id="cn-centre-col-inner.*</div>)\s*<div id="ok"
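
A hedged sketch of this idea in Python (my choice of language; the variable names are made up). It makes two assumptions beyond the pattern as posted: `re.DOTALL` is enabled so that `.` can cross line breaks, and `\s*` is used between the two divs so blank lines are tolerated. Because the match is anchored on the following `<div id="ok"`, nested divs are kept, but it breaks as soon as the next sibling changes.

```python
import re

nested_html = """<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>
"""

# Greedy ".*" runs to the end, then backtracks to the last </div>
# that is followed (after optional whitespace) by <div id="ok".
pattern = re.compile(
    r'(<div[^>]*id="cn-centre-col-inner.*</div>)\s*<div id="ok"',
    re.DOTALL,
)

match = pattern.search(nested_html)
outer = match.group(1) if match else ""
print(outer)
```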
Gaël Barbin