7

I have a string containing HTML code and I want to remove all HTML tags, that is, all characters between < and >.

This is my code snippet:

WebClient wClient = new WebClient();
SourceCode = wClient.DownloadString( txtSourceURL.Text );
txtSourceCode.Text = SourceCode;
//remove here all between "<" and ">"
txtSourceCodeFormatted.Text = SourceCode;

I hope somebody can help me.

Soner Gönül
taito
  • 1
    What if `<` or `>` characters occur inside comments, scripts, strings etc.? – Tim Pietzcker Dec 01 '13 at 14:47
  • 5
    No, do not use Regex to parse HTML strings. A real nightmare is waiting for you. This is one of the most upvoted answers on SO: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ The best approach is to use a specialized HTML parser like [HTML Agility Pack](http://htmlagilitypack.codeplex.com/) – Steve Dec 01 '13 at 14:47
  • @Steve My favourite SO answer ever :) – Rotem Dec 01 '13 at 15:05
  • Using the .NET XML-Parser might also work in this case? Or am I wrong here? – marsze Dec 01 '13 at 15:10
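
On the XML parser question, a minimal sketch, assuming `System.Xml.Linq` is available: `XDocument.Parse` only accepts well-formed XML, so it can extract text from XHTML-style markup but throws an `XmlException` on typical HTML such as an unclosed `<br>` or an undeclared entity like `&nbsp;`. The class and method names here (`XmlParserDemo`, `TryGetText`) are hypothetical and only for illustration.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlParserDemo
{
    // Hypothetical helper: returns true and the concatenated text content
    // when the markup is well-formed XML, false otherwise.
    public static bool TryGetText(string markup, out string text)
    {
        try
        {
            // Root.Value concatenates the text of all descendant text nodes.
            text = XDocument.Parse(markup).Root.Value;
            return true;
        }
        catch (XmlException)
        {
            // Typical real-world HTML ends up here.
            text = string.Empty;
            return false;
        }
    }

    public static void Main()
    {
        string text;
        Console.WriteLine(TryGetText("<p>Hello <b>world</b></p>", out text)); // well-formed
        Console.WriteLine(TryGetText("<p>Hello<br>world</p>", out text));     // unclosed <br>
    }
}
```

So the XML parser is only an option when the input is known to be well-formed; pages downloaded from arbitrary URLs usually are not.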

2 Answers

14

Try this:

txtSourceCodeFormatted.Text = Regex.Replace(SourceCode, "<.*?>", string.Empty);

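But, as others have mentioned, handle with care: a regex does not understand HTML, so `<` or `>` characters inside comments, scripts or attribute values will trip it up.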
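
A minimal sketch of where this pattern needs care, assuming only `System.Text.RegularExpressions`; the class name `TagStripDemo`, the helper `StripTags` and the sample strings are hypothetical. By default `.` does not match a newline, so a tag that spans several lines survives unless `RegexOptions.Singleline` is passed, and the text inside a `<script>` element is left in place either way.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Hypothetical helper: removes anything that looks like a tag.
    // Singleline lets '.' match '\n', so tags spanning lines are removed too.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        string multiLine = "<a\n href=\"x\">link</a>";

        // Without Singleline the multi-line opening tag is left untouched.
        Console.WriteLine(Regex.Replace(multiLine, "<.*?>", string.Empty));

        // With Singleline both tags are removed.
        Console.WriteLine(StripTags(multiLine));

        // Script bodies are not tags, so their text stays in the output.
        Console.WriteLine(StripTags("<script>var a = 1;</script>Hi"));
    }
}
```

For pages where scripts, comments or unusual attribute values matter, a real HTML parser is the safer choice.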

Aage
3

According to Ravi's answer, you can use

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

and then, to collapse the remaining runs of whitespace into single spaces:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
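
The two steps above can be combined into one helper. This is a rough sketch rather than a definitive implementation: the names `HtmlText` and `ToPlainText` are hypothetical, and two choices differ from the snippets above and are assumptions of mine. Tags are replaced with a space instead of an empty string so that adjacent block elements do not fuse into one word, and `WebUtility.HtmlDecode` is used to decode all entities instead of special-casing `&nbsp;`.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: strips tag-like text, decodes entities,
    // and normalises whitespace.
    public static string ToPlainText(string inputHTML)
    {
        // Step 1: replace anything that looks like a tag with a space
        // (assumption: a space, so "<p>a</p><p>b</p>" does not become "ab").
        string noHTML = Regex.Replace(inputHTML, @"<[^>]+>", " ");

        // Step 2: decode entities such as &amp; and &nbsp;,
        // then turn non-breaking spaces into ordinary spaces.
        noHTML = WebUtility.HtmlDecode(noHTML).Replace('\u00A0', ' ');

        // Step 3: collapse runs of whitespace and trim the ends.
        return Regex.Replace(noHTML, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainText("<p>Hello&nbsp;<b>world</b></p>\n<p>A &amp; B</p>"));
    }
}
```

The trade-off of inserting a space is that inline markup in the middle of a word, such as `wor<b>ld</b>`, ends up split; the same regex caveats about scripts and comments still apply.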
Community
Vignesh Kumar A