1

I'm trying to scrape a whole div element in C#...

I've tried div class="txt-block"\s*(.+?)(\r\n?|\n)\s*" but it doesn't capture the whole element :( Any ideas? Here is the div. Thanks!

    <div class="txt-block" itemprop="creator" itemscope itemtype="http://schema.org/Person"> 
    <h4 class="inline">Writers:</h4>
    <a href="/name/nm1318843/?ref_=tt_ov_wr" itemprop='url'><span class="itemprop"    itemprop="name">Mark Fergus</span></a>               (screenplay), 
    <a href="/name/nm1319757/?ref_=tt_ov_wr" itemprop='url'><span class="itemprop"         
    itemprop="name">Hawk Ostby</span></a>               (screenplay), <a href="fullcredits?ref_=tt_ov_wr#writers" >6 more credits</a>&nbsp;&raquo;
    </div>
Habib
  • You really need to read [this](http://stackoverflow.com/a/1732454/1583) to understand why regex and HTML parsing are not a good idea in conjunction. – Oded May 08 '13 at 12:37
  • Try an [HTML Parser](http://htmlagilitypack.codeplex.com/) instead. – Dustin Kingen May 08 '13 at 12:38
  • You can't parse HTML with regex. There are HTML parsers for most languages; look online for an HTML parser. If you wish to do it yourself you will need to do a lot more work. – Peter Wooster May 08 '13 at 12:39

2 Answers

6

Why so many down-votes? Because YOU wouldn't parse HTML with regex, he's not allowed to? That's very narrow-minded.

I've often seen HtmlAgilityPack fail to properly parse a horribly malformed HTML document, or fail on concatenated or nested HTML documents from mass captures. I've also seen cases where XPath in any form won't work because an HTML doc is dynamically created, not consistent, and doesn't necessarily contain identifying properties. Why pull in extra dependencies and work around sloppy markup when a very simple regex can be more reliable anyway?

What if you have a large project where a single method in your project just has to pull out the contents of a DIV of an input HTML document? It isn't an entire HTML parsing project, just a single regex is necessary. Your answer is to include more imports and build a whole new framework for that? I do hundreds of projects a year. Half use DOM/XPath, the other half simply can't, and require Regex.

In short, don't be so narrow-minded. Reference XPath/DOM tools, but help to answer the question. Don't just down-vote. We aren't Neanderthals who need to constantly laugh about an ancient "Don't Parse HTML with Regex" post made forever ago.

The answer(s) follow:

First, the simplest one:

    (?s)<div.*?>(.*?)</div>

Require a particularly named div?

    (?s)<div[^>]*?class="txt-block"[^>]*?>(.*?)</div>
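As a rough sketch of how the pattern above might be used from C#: the class name `TxtBlockRegex` and the method `ExtractInner` are made up for illustration, and the code assumes the target div contains no nested div elements.

```csharp
using System.Text.RegularExpressions;

public static class TxtBlockRegex
{
    // Pattern from the answer. (?s) lets '.' match newlines, and the lazy
    // (.*?) stops at the first </div> it finds.
    private static readonly Regex DivPattern = new Regex(
        @"(?s)<div[^>]*?class=""txt-block""[^>]*?>(.*?)</div>");

    // Returns the inner HTML of the first matching div,
    // or an empty string when nothing matches.
    public static string ExtractInner(string html)
    {
        Match m = DivPattern.Match(html);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Calling `TxtBlockRegex.ExtractInner(html)` on the page source should return everything between the opening tag and the first `</div>`; use `m.Value` instead of `m.Groups[1].Value` if you want the surrounding div tags as well.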

Want to save CPU and avoid unnecessary backtracking?

    <div[^>]*?class="txt-block"[^>]*?>(([^<]*(?(?!</div>)<))*)</div>

The patterns above assume the div contains no nested div elements. Nesting is where the advice against regex really applies, unless you are using .NET, whose regex engine supports balancing groups. In that case you can just do this:

    (?xm)
    (?>
        <(?<Tagname>div)[^>]*?class="txt-block"[^>]*>
    )
    (?(Tagname)
        (
            </(?(?!\k'Tagname')(?<-Tagname>))*\k'Tagname'>(?<-Tagname>)
        |
            (?>
                <(?<Tagname>[a-z][^\s>]*)[^>]*>
            )
        |
            [^<]+
        )+?
        (?(Tagname)(?!))
    )

Or, the single line version:

    (?m)(?><(?<Tagname>div)[^>]*?class="txt-block"[^>]*>)(?(Tagname)(</(?(?!\k'Tagname')(?<-Tagname>))*\k'Tagname'>(?<-Tagname>)|(?><(?<Tagname>[a-z][^\s>]*)[^>]*>)|[^<]+)+?(?(Tagname)(?!)))

Pick your poison. Regex is more powerful and reliable than people think. The most complex example I posted won't work in RegexBuddy, but it will work in any version of the .NET Framework: RegexBuddy doesn't support balancing groups, which are specific to the .NET regex flavor.
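For anyone who wants to try the balancing-group version, here is a minimal sketch of wiring it into C#. The pattern is the one above, with only the quotes doubled for a verbatim string; `NestedDivRegex` and `ExtractOuter` are hypothetical names. Note that `[a-z]` is case-sensitive, so this assumes lowercase tag names, and I would only expect it to hold for well-formed markup.

```csharp
using System.Text.RegularExpressions;

public static class NestedDivRegex
{
    // IgnorePatternWhitespace plays the role of the inline (?x) flag, so the
    // pattern can keep its multi-line layout. Each opening tag pushes its name
    // onto the Tagname stack and each closing tag pops it; the final
    // (?(Tagname)(?!)) only lets the match end once the stack is empty.
    private static readonly Regex Pattern = new Regex(@"
        (?>
            <(?<Tagname>div)[^>]*?class=""txt-block""[^>]*>
        )
        (?(Tagname)
            (
                </(?(?!\k'Tagname')(?<-Tagname>))*\k'Tagname'>(?<-Tagname>)
            |
                (?>
                    <(?<Tagname>[a-z][^\s>]*)[^>]*>
                )
            |
                [^<]+
            )+?
            (?(Tagname)(?!))
        )",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the whole div including its tags and any nested divs,
    // or an empty string when nothing matches.
    public static string ExtractOuter(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Unlike the simpler patterns, the match here should run to the closing tag that balances the opening `txt-block` div rather than to the first `</div>` encountered.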

Suamere
  • +1 Fair answer, I feel that Regex is a last resort when doing HTML, but I still give +1 for pointing out some options that can work in limited situations. – Scott Chamberlain May 22 '13 at 01:58
0

Parsing HTML with regex is not a good idea. Try finding a library for parsing HTML in C#.

After a quick search I found this library: http://htmlagilitypack.codeplex.com/ It seems to have all the functionality you need.
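A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced in the project. The class and method names (`TxtBlockParser`, `ExtractOuter`) are made up for illustration, and the XPath assumes the `class` attribute is exactly `txt-block`, as in the question's markup.

```csharp
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

public static class TxtBlockParser
{
    // Returns the outer HTML of the first div whose class attribute is exactly
    // "txt-block", or an empty string when no such div exists.
    public static string ExtractOuter(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='txt-block']");
        return node != null ? node.OuterHtml : string.Empty;
    }
}
```

If only the text is needed (for example the writer names), `node.InnerText` could be used instead of `node.OuterHtml`.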

Community
Jim