I'm trying to retrieve all the text between <td> and </td>, but I only get the first match in my collection. Do I need a * or something? Here is my code:

string input = @"<tr class=""row0""><td>09/08/2013</td><td><a href=""/teams/nfl/new-england-patriots/results"">New England Patriots</a></td><td><a href=""/boxscore/2013090803"">L, 23-21</a></td><td align=""center"">0-1-0</td><td align=""right"">65,519</td></tr>";

string pattern = @"(?<=<td>)[^>]*(?=</td>)";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
    try
    {
        listBoxControl1.Items.Add(matches.ToString());
    }
    catch { }
}
Alan Moore
Trey Balut

3 Answers

9

Use the following regular expression, which captures the cell content in a named group:

string input = "<tr class=\"row0\"><td>09/08/2013</td><td><a href=\"/teams/nfl/new-england-patriots/results\">New England Patriots</a></td><td><a href=\"/boxscore/2013090803\">L, 23-21</a></td><td align=\"center\">0-1-0</td><td align=\"right\">65,519</td></tr>";

string pattern = "(<td>)(?<td_inner>.*?)(</td>)";

MatchCollection matches = Regex.Matches(input, pattern);

foreach (Match match in matches) {
    try {
        Console.WriteLine(match.Groups["td_inner"].Value);
    }
    catch { }
}
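One caveat: the pattern above only matches a bare <td>, so cells such as <td align="center"> are skipped, and the captured text still contains inner markup like the <a> tags. The following is a minimal sketch of one way to handle both, assuming the cells are not nested. The class name TdExtractor, the method name ExtractCellText, and the tag-stripping step are illustrative assumptions, not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the original answer.
public static class TdExtractor
{
    // <td\b[^>]*> accepts an opening tag with or without attributes;
    // (?<td_inner>.*?) lazily captures everything up to the next </td>.
    private static readonly Regex CellPattern = new Regex(
        @"<td\b[^>]*>(?<td_inner>.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag inside a cell, such as <a href="...">.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    public static List<string> ExtractCellText(string html)
    {
        var cells = new List<string>();
        foreach (Match match in CellPattern.Matches(html))
        {
            string inner = match.Groups["td_inner"].Value;
            // Strip inner tags, then decode entities such as &amp;.
            string text = TagPattern.Replace(inner, string.Empty);
            cells.Add(WebUtility.HtmlDecode(text).Trim());
        }
        return cells;
    }
}
```

Calling TdExtractor.ExtractCellText(input) on the row from the question should yield the five cell values, including the two whose <td> tags carry an align attribute. This is still regex-based, so it shares the usual limits with nested or malformed HTML.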
Richard Sitze
Gary C.
4

HTML (unlike XHTML) is not strict. In some cases

  • you could have tags that have no closing tag, and
  • you could have nested tags.

Regex is not suitable for parsing such a complex grammar. You need to use a parser.
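To make the nested-tag problem concrete, here is a small sketch of a naive lazy pattern applied to a cell that contains another table. The class name NestedTdDemo and the method name NaiveCells are made up for this illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; it only illustrates the failure mode.
public static class NestedTdDemo
{
    // Collects whatever a naive lazy pattern captures between <td> and </td>.
    public static List<string> NaiveCells(string html)
    {
        var cells = new List<string>();
        foreach (Match match in Regex.Matches(html, @"<td>(.*?)</td>"))
        {
            cells.Add(match.Groups[1].Value);
        }
        return cells;
    }
}
```

For the input <td><table><tr><td>inner</td></tr></table></td>, NaiveCells returns a single entry, <table><tr><td>inner, because the lazy match stops at the first </td> it meets rather than at the one that closes the outer cell. A parser tracks the nesting, so it does not have this problem.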

Use the HtmlAgilityPack parser.

You can use this code to retrieve the text of every cell with HtmlAgilityPack:

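using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream); // or doc.LoadHtml(yourString) if the markup is already in a string

// Note: SelectNodes returns null when no node matches the XPath.
var tdList = doc.DocumentNode.SelectNodes("//td")
                  .Select(p => p.InnerText)
                  .ToList();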
carla
Anirudha
0

I found a solution here http://geekcoder.org/js-extract-hashtags-from-text/ from Nicolas Durand - it seems to work pretty well:

#[^ :\n\t\.,\?\/’'!]+

Best regards, Phil

Philipp P