I have the following string:

<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>

I want to end up with:

<TD>6949</TD>

but instead I end up with just the tags and no information:

<TD></TD>

This is the regular expression I am using:

RegEx.Replace("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>","<!--.*-->","")

Can someone explain how to keep the numbers and remove just the comments? Also, if possible, can someone explain why this is happening?

Xaisoft

3 Answers

.* is a greedy quantifier, which matches as much as possible.
Here it matches everything from the first <!-- up to the last -->.

Change it to .*?, which is a lazy quantifier that matches as little as possible.
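
To make the difference concrete, here is a minimal sketch of both patterns applied to the string from the question. It uses Python's re module rather than the .NET RegEx.Replace call from the question, on the assumption that greedy and lazy quantifiers behave the same way in both engines for this input; the variable names are illustrative only.

```python
import re

# The input string from the question.
html = "<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>"

# Greedy: .* runs to the end of the string, then backtracks to the LAST "-->".
greedy = re.sub(r"<!--.*-->", "", html)

# Lazy: .*? expands one character at a time and stops at the FIRST "-->".
lazy = re.sub(r"<!--.*?-->", "", html)

print(greedy)  # <TD></TD>
print(lazy)    # <TD>6949</TD>
```

The only change between the two calls is the ? after the *, so the same fix should carry over to the original pattern unchanged.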

SLaks
  • Great, thanks. So when I use .*, it doesn't care what is in the middle; it keeps going until it finds the last --> and removes every character in between, including the – Xaisoft Jun 30 '11 at 14:42

.* is greedy, so it will match as many characters as possible: in this case, from the opening of the first comment to the end of the second. Changing it to .*? will fix it, as the ? makes the match lazy, which is to say it will match as few characters as possible. [^>]* also works here, because it cannot match past the first >.
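
As a rough illustration of the negated-class alternative, the sketch below runs [^>]* against the string from the question and against a second, hypothetical string whose comment contains a > character. It is written in Python's re module as an assumption-laden stand-in for the .NET call; the `tricky` input is invented for the example and does not come from the question.

```python
import re

# The input string from the question.
html = "<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>"

# [^>]* cannot consume a ">", so each match has to end at the nearest "-->".
negated = re.sub(r"<!--[^>]*-->", "", html)
print(negated)  # <TD>6949</TD>

# Hypothetical input: a comment that itself contains ">".
tricky = "<TD><!-- a > b -->6949</TD>"

# The negated class stops at the inner ">" and finds no match at all.
negated_tricky = re.sub(r"<!--[^>]*-->", "", tricky)

# The lazy quantifier still reaches the first "-->".
lazy_tricky = re.sub(r"<!--.*?-->", "", tricky)

print(negated_tricky)  # <TD><!-- a > b -->6949</TD>
print(lazy_tricky)     # <TD>6949</TD>
```

For the data in the question both patterns give the same result; the lazy form is the more forgiving choice if comments might ever contain a >.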

zellio

Parsing HTML with regex is always going to be tricky. Instead, use something like HTML Agility Pack, which lets you query and parse HTML in a structured manner.
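
To give a feel for the parser-based approach, here is a small sketch that pulls the cell text out with a real HTML parser. It does not use HTML Agility Pack (a .NET library); it swaps in Python's standard-library html.parser purely for illustration, and the `TextOnly` class name is made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes; comments are ignored by the default handler."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; comment bodies never arrive here.
        self.chunks.append(data)


parser = TextOnly()
parser.feed("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>")
parser.close()

cell_text = "".join(parser.chunks)
print(cell_text)  # 6949
```

Because the parser recognizes comments as their own node type, there is no pattern to get wrong; the trade-off is a little more code than a one-line replace.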

Mrchief