Regular Expression does not remove html comment?

Question

I have the following string:

<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>

I want to end up with:

<TD>6949</TD>

but instead I end up with just the tags and no information:

<TD></TD>

This is the regular expression I am using:

RegEx.Replace("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>","<!--.*-->","")

Can someone explain how to keep the numbers and remove just the comments? Also, if possible, can someone explain why this is happening?

score 3 · Accepted Answer · answered Jun 30 '11 at 14:34

.* is a greedy quantifier, which matches as much as possible.
It's matching everything up to the last -->.

Change it to .*?, which is a lazy quantifier.
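
To make the difference concrete, here is a minimal sketch of both patterns applied to the string from the question. It uses Python's `re.sub` as a stand-in for the .NET `Regex.Replace` call (an assumption made for illustration; greedy `.*` and lazy `.*?` behave the same way in both engines for these patterns).

```python
import re

html = "<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>"

# Greedy: .* runs on to the LAST "-->", so the number between
# the two comments is swallowed along with them.
greedy = re.sub(r"<!--.*-->", "", html)

# Lazy: .*? stops at the FIRST "-->", so each comment is
# matched and removed on its own.
lazy = re.sub(r"<!--.*?-->", "", html)

print(greedy)  # <TD></TD>
print(lazy)    # <TD>6949</TD>
```

If a comment can span several lines, `.` also has to match newlines: `re.DOTALL` in Python, `RegexOptions.Singleline` in .NET.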

SLaks

Great thanks. So when I use the .*, it doesn't care if anything is in the middle, it keeps on going until it finds the last --> and removes every character in between including the – Xaisoft Jun 30 '11 at 14:42

score 2 · Answer 2 · answered Jun 30 '11 at 14:35

.* is greedy, so it will match as many characters as possible: in this case, everything from the opening of the first comment to the end of the second. Changing it to .*? will fix it, as the ? makes the match lazy, which is to say it will match as few characters as possible. Using [^>]* instead also works here, because it cannot run past the > that closes the first comment.
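
As a rough comparison of the two fixes, the sketch below runs both on the question's string and on a second, made-up input whose comment contains a `>`. It uses Python's `re.sub` in place of the .NET call, and the second input is hypothetical, added only to show where the negated character class falls short.

```python
import re

sample = "<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>"

# Hypothetical input: the comment body itself contains a ">".
tricky = "<TD><!-- a > b -->6949</TD>"

negated = r"<!--[^>]*-->"   # cannot cross any ">" character
lazy = r"<!--.*?-->"        # stops at the first "-->"

sample_negated = re.sub(negated, "", sample)
sample_lazy = re.sub(lazy, "", sample)
tricky_negated = re.sub(negated, "", tricky)
tricky_lazy = re.sub(lazy, "", tricky)

print(sample_negated)  # <TD>6949</TD>
print(sample_lazy)     # <TD>6949</TD>
print(tricky_negated)  # <TD><!-- a > b -->6949</TD>  (comment not removed)
print(tricky_lazy)     # <TD>6949</TD>
```

Both patterns handle the question's input; the lazy form is the safer choice when comment text may contain `>`.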

score 2 · Answer 3 · answered Jun 30 '11 at 14:36

Parsing HTML with regex is always going to be tricky. Instead, use something like HTML Agility Pack, which lets you query and parse HTML in a structured manner.
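
To give a feel for the parser-based approach, here is a small sketch using Python's standard-library `html.parser` rather than HTML Agility Pack (a swapped-in library, chosen only to keep the example self-contained). The `CommentSplitter` class is a hypothetical helper, not part of any library. The parser reports comments and text through separate callbacks, so no pattern has to guess where a comment ends.

```python
from html.parser import HTMLParser


class CommentSplitter(HTMLParser):
    """Hypothetical helper: collects text and comments separately."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.comments = []

    def handle_data(self, data):
        # Called for text between tags, e.g. "6949".
        self.text_parts.append(data)

    def handle_comment(self, data):
        # Called with the body of each <!-- ... --> comment.
        self.comments.append(data)


splitter = CommentSplitter()
splitter.feed("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>")
splitter.close()

text = "".join(splitter.text_parts)
print(text)               # 6949
print(splitter.comments)  # [' 1.91 ', ' 9.11 ']
```

This sketch only extracts the cell text; rebuilding the markup without comments would take additional handlers for start and end tags.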

Mrchief
