6

I need to extract all the text inside the <body> of an HTML document. Sample HTML input:

<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The output should be:

This is a big title. How are doing you? I am fine

I want to use only the HTML Agility Pack for this. No regular expressions, please.

I know how to load an HtmlDocument and that an XPath query like '//body' returns the body contents. But how do I strip the HTML tags to get the output shown above?

Thanks in advance :)

TCM
    See [this question](http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack) for some HTML Agility Pack links. I would guess you have to call something like `InnerText` property on the `HtmlNode`. – Uwe Keim May 01 '11 at 09:49

4 Answers

5

You can use the body's InnerText:

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;

Next, you may want to collapse spaces and new lines:

text = Regex.Replace(text, @"\s+", " ").Trim();
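Since the question asks to avoid regular expressions, the same collapsing step can be sketched with plain string operations instead. This is only one possible approach; the class and method names (`TextCleanup`, `CollapseWhitespace`) are hypothetical and not part of HtmlAgilityPack.

```csharp
using System;

static class TextCleanup
{
    // Hypothetical helper: collapses every run of whitespace
    // (spaces, tabs, newlines) into a single space and trims the ends.
    public static string CollapseWhitespace(string text)
    {
        // A null separator array makes Split break on any whitespace character;
        // RemoveEmptyEntries drops the empty strings produced by consecutive whitespace.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

With that helper in place, `text = TextCleanup.CollapseWhitespace(text);` could stand in for the `Regex.Replace` call.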

Note, however, that while this works here, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld: the tags are removed without any whitespace taking their place. That is difficult to solve in general, as display is often determined by the CSS, not just by the markup.

Kobi
3

How about using the XPath expression '//body//text()' to select all text nodes?
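A minimal sketch of that idea, assuming the HtmlAgilityPack package is referenced. The class and method names (`BodyText`, `ExtractBodyText`) are made up for illustration; `SelectNodes` and `HtmlEntity.DeEntitize` are the library calls relied on here.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class BodyText
{
    // Hypothetical helper: returns the text of all text nodes under <body>,
    // separated by single spaces, without using regular expressions.
    public static string ExtractBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null)
        {
            return string.Empty;
        }

        var words = new List<string>();
        foreach (HtmlNode node in textNodes)
        {
            // Decode entities such as &amp; before splitting into words.
            string decoded = HtmlEntity.DeEntitize(node.InnerText);
            // A null separator array splits on any whitespace character.
            words.AddRange(decoded.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
        }
        return string.Join(" ", words);
    }
}
```

Because every text node is joined with a space, hello<br>world comes out as "hello world", but so does hello<i>world</i>, so this trades one whitespace problem for another.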

Oleks
chiborg
1

You can use NUglify, which supports text extraction from HTML:

var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast: no regex is involved, just a pure recursive descent parser, which is faster than HtmlAgilityPack and more GC-friendly.

xoofx
1

Normally I would recommend an HTML parser for parsing HTML; however, since you only want to remove all the HTML tags, a simple regex should work.

TheLukeMcCarthy