Regexp that matches all the text content of a HTML input

Question

I have articles on my website that I would like to have corrected and translated automatically. To do that, I need the text content without the surrounding HTML tags.

The idea is to have a regex that retrieves all the content between the tags (and, if possible, also the text found in tag attributes, such as the alt text in <img alt='Little house'>). The problem is that I don't really know how to write such a regex. Any ideas?

:P http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Mottie, Dec 06 '09 at 15:15

score 2 · Accepted Answer · edited May 23 '17 at 12:01

I would recommend using an HTML parser rather than relying on a regex. Parsing HTML with regex is generally a no-no, and it is nearly impossible to get right for all cases. There are many questions here on SO that arrive at the same conclusion.

EDIT: It looks like a couple of us had the same idea... Also, here is a question that discusses more parsers.

answered Dec 06 '09 at 15:08

jheddings

score 1 · Answer 2 · answered Dec 06 '09 at 15:06

Perhaps a regular expression is not the best choice for this job (I will spare you the obligatory tirade).

I would recommend that you look into an HTML parsing library to help you here, something like Html Agility Pack.
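
As a rough illustration of the parser route, here is a minimal sketch using Html Agility Pack (a third-party NuGet package). The class name HtmlTextExtractor and the method ExtractText are hypothetical names chosen for this example, and skipping script and style bodies is an assumption about what counts as article content. It collects the text nodes first and then the alt text of images, so the alt strings come at the end rather than in document order.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class HtmlTextExtractor
{
    // Hypothetical helper: returns the text fragments of an HTML string,
    // followed by the alt text of any <img> elements.
    public static List<string> ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fragments = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes != null)
        {
            foreach (var node in textNodes)
            {
                // Assumption: script and style bodies are not article content.
                string parent = node.ParentNode.Name;
                if (parent == "script" || parent == "style") continue;

                // Decode entities such as &amp; and drop whitespace-only nodes.
                string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0) fragments.Add(text);
            }
        }

        // Attribute text, as asked for in the question (alt on images).
        var images = doc.DocumentNode.SelectNodes("//img[@alt]");
        if (images != null)
        {
            foreach (var img in images)
            {
                string alt = HtmlEntity.DeEntitize(img.GetAttributeValue("alt", "")).Trim();
                if (alt.Length > 0) fragments.Add(alt);
            }
        }

        return fragments;
    }
}
```

Calling HtmlTextExtractor.ExtractText(articleHtml) would then give a list of strings that can be passed to a spell checker or translator one fragment at a time.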

Andrew Hare

score 1 · Answer 3 · answered Dec 06 '09 at 15:12

As people said, regex is not the most recommended way, but if you decide that regex is the way to go, this should get you started:

// Requires: using System.Text.RegularExpressions;
string pattern = @"(<(/?[^>]+)>)"; // matches any opening or closing tag
string strippedString = Regex.Replace(str, pattern, string.Empty);
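
If you do go the regex route, the pattern above can be extended a little. The following is only a sketch under stated assumptions, not a robust solution: the class RegexTextStripper and its methods StripTags and AltTexts are hypothetical names, tags are replaced by a space so that neighbouring words do not fuse, and entities are decoded with WebUtility.HtmlDecode from System.Net. It still leaks script and style bodies, and it breaks on comments or attribute values that contain a ">" character.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTextStripper
{
    // Hypothetical helper: removes tags, decodes entities, collapses whitespace.
    public static string StripTags(string html)
    {
        // Replace each tag with a space so adjacent words do not run together.
        string noTags = Regex.Replace(html, @"</?[^>]+>", " ");

        // Turn entities such as &amp; back into plain characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse the runs of whitespace left behind by the removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    // Hypothetical helper: collects the quoted alt attribute values of <img> tags.
    public static List<string> AltTexts(string html)
    {
        var result = new List<string>();

        // Group 1 captures the opening quote, group 2 the attribute value.
        var matches = Regex.Matches(
            html,
            @"<img\b[^>]*?\balt\s*=\s*(['""])(.*?)\1",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in matches)
        {
            result.Add(WebUtility.HtmlDecode(m.Groups[2].Value));
        }

        return result;
    }
}
```

One trade-off of replacing tags with a space: inline markup in the middle of a word (for example a single bold letter) ends up splitting that word, which is one more reason a parser is the safer choice.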

Elad

score 0 · Answer 4 · answered Dec 06 '09 at 15:17

Not sure if this helps, but I can translate the articles on my site into a reader's preferred language. I did this using the Bing translation widget, so I don't do any parsing of HTML; it's all done for me.
