29

I have a CouchDB view map function that generates an abstract of a stored HTML document (first x characters of text). Unfortunately I have no browser environment to convert HTML to plain text.

Currently I use this multi-stage regexp:

html.replace(/<style([\s\S]*?)<\/style>/gi, ' ')
    .replace(/<script([\s\S]*?)<\/script>/gi, ' ')
    .replace(/(<(?:.|\n)*?>)/gm, ' ')
    .replace(/\s+/gm, ' ');

While it's a very good filter, it's obviously not perfect, and some leftovers occasionally slip through. Is there a better way to convert HTML to plain text without a browser environment?

Era
  • it may come down to using regex as you have listed for the bulk of replaces and then using a specified list replaces, such as :active; to complete the cleanse. – Valamas Mar 03 '13 at 04:28
  • http://stackoverflow.com/a/29706729/3338098 preserves new-lines and strips html tags – user3338098 Apr 17 '15 at 18:38

7 Answers

28

This simple regular expression works:

text.replace(/<[^>]*>/g, '');

It removes all tags, not just anchors.

Entities such as &lt; do not contain a literal <, so they cause no issue for this regex. They are left in the output as-is.
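Since the regex leaves entities untouched, the plain text may still contain things like &amp;lt;. Below is a minimal sketch of one way to decode them after stripping tags. The helper names (`decodeEntities`, `stripTags`) and the tiny entity table are assumptions for illustration, not part of the answer, and the table covers only a handful of named entities.

```javascript
// A deliberately tiny entity table; real HTML defines many more named entities.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

// Decode numeric entities and the few named ones listed above in a single pass,
// so that "&amp;lt;" becomes the text "&lt;" rather than "<".
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Strip tags with the regex from the answer, then decode entities.
function stripTags(html) {
  return decodeEntities(html.replace(/<[^>]*>/g, ''));
}

console.log(stripTags('<p>1 &lt; 2 &amp;&amp; 3 &gt; 2</p>')); // 1 < 2 && 3 > 2
```

Decoding after stripping matters: doing it the other way round would turn an escaped &amp;lt; into a real <, which the tag regex could then mistake for the start of a tag.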

Gaël Barbin
14

Convert HTML to plain text the way Gmail does:

html = html.replace(/<style([\s\S]*?)<\/style>/gi, '');
html = html.replace(/<script([\s\S]*?)<\/script>/gi, '');
html = html.replace(/<\/div>/ig, '\n');
html = html.replace(/<\/li>/ig, '\n');
html = html.replace(/<li>/ig, '  *  ');
html = html.replace(/<\/ul>/ig, '\n');
html = html.replace(/<\/p>/ig, '\n');
html = html.replace(/<br\s*[\/]?>/gi, "\n");
html = html.replace(/<[^>]+>/ig, '');

If you can use jQuery:

var html = jQuery('<div>').html(html).text();
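For the use case in the question (an abstract of the first x characters), the regex steps at the top of this answer could be wrapped roughly as follows. This is only a sketch: the names `htmlToPlainText` and `abstractOf` are hypothetical, the closing-tag replacements are merged into one pattern for brevity, and it is written in plain ES5 style on the assumption that the view server's JavaScript engine may be old.

```javascript
// Hypothetical helper; the replacements mirror the regex steps in this answer.
function htmlToPlainText(html) {
  return html
    .replace(/<style([\s\S]*?)<\/style>/gi, '')
    .replace(/<script([\s\S]*?)<\/script>/gi, '')
    .replace(/<\/(div|li|ul|p)>/gi, '\n')
    .replace(/<li>/gi, '  *  ')
    .replace(/<br\s*\/?>/gi, '\n')
    .replace(/<[^>]+>/g, '');
}

// Collapse whitespace and keep at most maxLength characters of text,
// cutting back to the last space and appending '...' when truncated
// (so the result can be up to three characters longer than maxLength).
function abstractOf(html, maxLength) {
  var text = htmlToPlainText(html).replace(/\s+/g, ' ').trim();
  if (text.length <= maxLength) {
    return text;
  }
  var cut = text.slice(0, maxLength);
  var lastSpace = cut.lastIndexOf(' ');
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + '...';
}

var sample = '<style>p{color:red}</style><p>Hello <b>world</b></p>' +
             '<ul><li>one</li><li>two</li></ul>';
console.log(abstractOf(sample, 100)); // Hello world * one * two
console.log(abstractOf(sample, 12));  // Hello world...
```

Inside a CouchDB map function the helpers would have to be defined within the function body or shared some other way; a call might then look like `emit(doc._id, abstractOf(doc.html, 200))`, where `doc.html` is an assumed field name.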
EpokK
10

With TextVersionJS (http://textversionjs.com) you can convert your HTML to plain text. It's pure JavaScript (with tons of RegExps), so you can use it in the browser and in Node.js as well.

In Node.js it looks like this:

var createTextVersion = require("textversionjs");
var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

(I copied the example from the page; you will have to npm install the module first.)

gyula.nemeth
6

You can try it this way. Neither textContent nor innerText is supported by every browser, so fall back from one to the other (note that this needs a DOM):

function htmlToText(html) {
    var temp = document.createElement("div");
    temp.innerHTML = html;
    return temp.textContent || temp.innerText || "";
}
Stephen Rauch
Dostonbek Oripjonov
3

An updated version of @EpokK's answer, for the use case of generating the plain-text version of an HTML email:

const htmltoText = (html: string) => {
  let text = html;
  text = text.replace(/\n/gi, "");
  text = text.replace(/<style([\s\S]*?)<\/style>/gi, "");
  text = text.replace(/<script([\s\S]*?)<\/script>/gi, "");
  text = text.replace(/<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi, " $2 $1 ");
  text = text.replace(/<\/div>/gi, "\n\n");
  text = text.replace(/<\/li>/gi, "\n");
  text = text.replace(/<li.*?>/gi, "  *  ");
  text = text.replace(/<\/ul>/gi, "\n\n");
  text = text.replace(/<\/p>/gi, "\n\n");
  text = text.replace(/<br\s*[\/]?>/gi, "\n");
  text = text.replace(/<[^>]+>/gi, "");
  text = text.replace(/^\s*/gim, "");
  text = text.replace(/ ,/gi, ",");
  text = text.replace(/ +/gi, " ");
  text = text.replace(/\n+/gi, "\n\n");
  return text;
};
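The anchor line is the least obvious step in the function above, so here is that one pattern applied on its own (plain JavaScript; `rewriteAnchors` is a hypothetical name used only for this illustration). It emits the link text followed by the URL and, because the first group stops at the first `?` or `"`, it drops any query string.

```javascript
// The anchor-rewriting pattern from the function above, in isolation.
var ANCHOR = /<a.*?href="(.*?)[\?\"].*?>(.*?)<\/a.*?>/gi;

function rewriteAnchors(html) {
  // $2 is the link text, $1 is the href up to the first '?' or '"'.
  return html.replace(ANCHOR, ' $2 $1 ');
}

console.log(JSON.stringify(rewriteAnchors('<a href="https://example.com/page?x=1">Docs</a>')));
// " Docs https://example.com/page "
console.log(JSON.stringify(rewriteAnchors('<a class="x" href="https://example.com">Home</a>')));
// " Home https://example.com "
```

If the query string matters for your links, the character class would need to be narrowed to just the closing quote.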

Melounek
0

If you want something accurate and can use npm packages, I would use html-to-text.

From the README:

const { htmlToText } = require('html-to-text');

const html = '<h1>Hello World</h1>';
const text = htmlToText(html, {
  wordwrap: 130
});
console.log(text); // Hello World

FYI, I found this on npm trends; html-to-text seemed like the best option for my use case but you can check out others here.

Killian Huyghe
-4

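It's pretty simple; you can also implement a "toText" method on the String prototype (this needs jQuery and a DOM):

String.prototype.toText = function(){
    // Use the string the method is called on, not a global variable
    return $('<div>').html(String(this)).text();
};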

//Let's test it out!
var html = "<a href=\"http://www.google.com\">link</a>&nbsp;<br /><b>TEXT</b>";
var text = html.toText();
console.log("Text: " + text); //Result will be "link TEXT"