How can I extract just text from the html

Question

I need to extract all the text inside the `<body>` of an HTML document. Sample HTML input:

<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The output should be:

This is a big title. How are doing you? I am fine

I want to use only the HTML Agility Pack for this. No regular expressions, please.

I know how to load an `HtmlDocument` and then get the body contents with an XPath query like `//body`. But how do I strip the HTML tags to get the output shown above?

Thanks in advance :)

See [this question](http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack) for some HTML Agility Pack links. I would guess you have to call something like `InnerText` property on the `HtmlNode`. — Uwe Keim, May 01 '11 at 09:49

Kobi · Accepted Answer · 2011-05-01T09:59:22.427

5

You can use the body's InnerText:

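using HtmlAgilityPack;

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

// Parse the markup into a DOM.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// InnerText concatenates every text node under <body>, dropping the tags.
// SelectSingleNode returns null if the document has no <body>.
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;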

Next, you may want to collapse spaces and new lines:

// Requires: using System.Text.RegularExpressions;
text = Regex.Replace(text, @"\s+", " ").Trim();

Note, however, that while this works in this case, markup such as `hello<br>world` or `hello<i>world</i>` will be converted by `InnerText` to `helloworld`, with the tags simply removed. That issue is difficult to solve, because how text is displayed is often determined by the CSS, not just by the markup.
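The caveat above can be reproduced with a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the `Program` class and the one-line input are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<body>hello<br>world</body>");

        // InnerText concatenates the text nodes on either side of <br>
        // with nothing in between, so the visual line break is lost.
        string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
        Console.WriteLine(text);   // helloworld, per the behavior described above
    }
}
```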

edited May 01 '11 at 09:59

answered May 01 '11 at 09:54

Kobi

Note that "/html/body" for xpath is much faster. – Richard Schneider May 01 '11 at 11:08
It's giving an error: unable to find the namespace for HtmlDocument. – ShaileshDev Mar 14 '17 at 13:41
@Er.ShaileshS.Bankar - Do you have the [Html Agility Pack](https://htmlagilitypack.codeplex.com/) library? – Kobi Mar 14 '17 at 13:51
No, do I have to add it first? – ShaileshDev Mar 14 '17 at 13:52

score 3 · Answer 2 · edited May 06 '11 at 13:50

How about using the XPath expression '//body//text()' to select all text nodes?
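A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced. The `Program` class, the single-line sample string, and the choice to join the pieces with a single space are illustrative assumptions, not part of the suggestion itself.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string html = "<html><title>title</title><body>"
                    + "<h1> This is a big title.</h1> How are doing you? "
                    + "<h3> I am fine </h3><img src=\"abc.jpg\"/></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every text node below <body>.
        // SelectNodes returns null (not an empty list) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//body//text()");

        var parts = textNodes == null
            ? Enumerable.Empty<string>()
            : textNodes
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // decode &amp; etc.
                .Where(s => s.Length > 0);                              // skip whitespace-only nodes

        // Joining with a space keeps text from adjacent elements apart.
        string text = string.Join(" ", parts);
        Console.WriteLine(text);
    }
}
```

Because each text node is handled separately, text on either side of a tag such as `<br>` ends up separated by a space instead of being run together.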

edited May 06 '11 at 13:50

Oleks

answered May 01 '11 at 09:49

chiborg

score 1 · Answer 3 · answered May 17 '16 at 02:13

You can use NUglify, which supports text extraction from HTML:

using NUglify;   // NuGet package: NUglify
var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

Because it uses its own HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast: no regular expressions are involved, just a pure recursive descent parser that is faster than HtmlAgilityPack and more GC-friendly.

answered May 17 '16 at 02:13

xoofx

It seems to be using `HtmlAgilityPack` under the hood, as suggested by the accepted answer. – Xavier Poinas Nov 30 '17 at 13:57
@XavierPoinas no, NUglify is not using `HtmlAgilityPack`, it has its own HTML5 custom parser. – xoofx Dec 12 '17 at 11:02
Sorry, you're right. I saw it in the project but it's only there for benchmarking purposes. – Xavier Poinas Dec 12 '17 at 16:03

score 1 · Answer 4 · answered May 01 '11 at 09:53

Normally I would recommend an HTML parser for parsing HTML; however, since you only want to remove all the HTML tags, a simple regex should work.
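A rough sketch of what such a regex could look like, using only the .NET base class library. The pattern `<[^>]+>`, the `Program` class, and the body-only input fragment are assumptions made for illustration; the pattern will misbehave on input with `>` inside attribute values, and it does not skip comments or the contents of `<script>` and `<style>` elements.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Only the contents of <body>; the regex itself has no notion of document structure.
        string fragment = "<h1> This is a big title.</h1> How are doing you? "
                        + "<h3> I am fine </h3><img src=\"abc.jpg\"/>";

        // Replace every tag with a space so neighbouring words do not fuse.
        string noTags = Regex.Replace(fragment, "<[^>]+>", " ");

        // Decode entities such as &amp;, then collapse runs of whitespace.
        string text = Regex.Replace(WebUtility.HtmlDecode(noTags), @"\s+", " ").Trim();

        Console.WriteLine(text);   // This is a big title. How are doing you? I am fine
    }
}
```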

answered May 01 '11 at 09:53

TheLukeMcCarthy
