How to match multiple lines in regular expressions

Question

I want to write a regular expression for web scraping.

How can I search for a result that spans multiple lines?

For example, this is my HTML:

    <div id="cn-centre-col-inner">

    <p>sothing her</p>
     ...
    </div>

    <div id="ok"> ..</div>

I want a regular expression that gives me this result:

    <div id="cn-centre-col-inner">

    <p>sothing her</p>
     ...
    </div>

[It's a trap](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454z) — MattSizzle, Feb 18 '15 at 20:10

Federico Piazza · Accepted Answer · 2015-02-18T20:51:01.470

Regex is not the best tool for this; you should use an HTML parser instead.

Suppose that you have this regex, where `(?s)` enables dot-all mode so that `.` also matches line breaks:

(?s)<div id="cn-centre-col-inner">.*?<\/div>

It captures what you want:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
</div>
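
For reference, here is a minimal sketch of running that regex. Python and the variable names are assumptions on my part, since the question does not name a language; the markup is the sample from the question.

```python
import re

# Sample markup mirroring the question.
html = """<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
</div>

<div id="ok"> ..</div>
"""

# (?s) turns on dot-all mode, so "." also matches newlines.
# ".*?" is lazy: it stops at the first "</div>" it can reach.
pattern = re.compile(r'(?s)<div id="cn-centre-col-inner">.*?<\/div>')

match = pattern.search(html)
block = match.group(0) if match else None
print(block)
```

Passing `re.DOTALL` as a flag to `re.compile` should behave the same as the inline `(?s)`.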

However, you can't guarantee that the first closing `</div>` is the right one. For instance, in this case:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>

You will lose content, because the match stops at the first closing `</div>` and only captures:

<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
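
The truncation can be reproduced with a short sketch (again assuming Python; the variable names are illustrative):

```python
import re

# Same target div, but now with two nested divs inside it.
html = """<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>
"""

pattern = re.compile(r'(?s)<div id="cn-centre-col-inner">.*?<\/div>')
truncated = pattern.search(html).group(0)

# The lazy match ends at the first closing tag, which belongs to an inner div,
# so the second inner div and the real closing tag are missing.
print(truncated)
print("something inner 2" in truncated)  # False
```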

This is a good example of why regex shouldn't be used to parse complex HTML. I strongly recommend that you use an HTML parser.
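
As a rough illustration of the parser route, here is a sketch that uses only Python's standard `html.parser` module. The class name `DivById` and its attributes are made up for this example, and a dedicated scraping library would likely be more convenient; the point is only that a parser can count nested divs, which the regex cannot.

```python
from html.parser import HTMLParser


class DivById(HTMLParser):
    """Collects the text inside the <div> with a given id, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target div
        self.chunks = []    # non-empty text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1     # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1      # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1     # back to 0 once the target div closes

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


html = """<div id="cn-centre-col-inner">

    <p>sothing her</p>
    ...
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>
"""

parser = DivById("cn-centre-col-inner")
parser.feed(html)
parser.close()
print(parser.chunks)
```

Both inner divs are collected here, and the content of the `ok` div is left out, because the depth counter only returns to zero at the matching closing tag.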

If you are absolutely sure that your div `cn-centre-col-inner` contains no nested divs, then you can go ahead with the regex above. You can also use a capturing group to get only the content inside the div:

(?s)<div id="cn-centre-col-inner">(.*?)<\/div>
                                  ^---^--- notice the parentheses
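
A small sketch of reading that group, assuming Python's `re` module (the variable names are illustrative):

```python
import re

html = '<div id="cn-centre-col-inner">\n\n    <p>sothing her</p>\n    ...\n</div>\n'

match = re.search(r'(?s)<div id="cn-centre-col-inner">(.*?)<\/div>', html)

# group(0) is the whole match, tags included; group(1) is only what the
# parentheses captured, i.e. the text between the opening and closing tags.
inner = match.group(1)
print(inner.strip())
```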

score 1 · Answer 2 · answered Feb 18 '15 at 20:20

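If you have read the warnings about regex and HTML and only need this for one specific task, you can try something quick and dirty like the following. It relies on `<div id="ok"` starting on the line right after the target div closes, and it needs dot-all mode (`(?s)`) so that the greedy `.*` can cross line breaks:

(?s)(<div[^>]*id="cn-centre-col-inner.*</div>)\n<div id="ok"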
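
A minimal Python sketch of this approach, with dot-all mode switched on via `(?s)` so that `.*` can span lines. The sample markup is an assumption: it has exactly one newline between the two divs, which is what the `\n` in the pattern expects.

```python
import re

html = """<div id="cn-centre-col-inner">
    <p>sothing her</p>
    <div>something inner 1</div>
    <div>something inner 2</div>
</div>
<div id="ok"> ..</div>
"""

pattern = re.compile(r'(?s)(<div[^>]*id="cn-centre-col-inner.*</div>)\n<div id="ok"')
match = pattern.search(html)

# The greedy ".*" backtracks to the last "</div>" that is directly followed
# by a newline and the opening of the "ok" div, so nested divs are kept.
outer = match.group(1) if match else None
print(outer)
```

This only holds while the page keeps `<div id="ok"` immediately after the target div; if the layout changes, the pattern stops matching.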

Gaël Barbin
