I want to collect some data from the front page of a website. I can easily iterate over each line of the page, and only one specific line interests me. I want to identify that line and extract the number from it, in this case 324. How can I do this?

<h2><a href="/mmp/it/su/">Weather</a></h2> <span class="jix_channels_count">(324)</span><br><p class="jix_channels_desc">Prog&oslash;r, su, si&oslash;r, tester</p>
Ergwun
Kasper Hansen

2 Answers

After downloading the page contents, use an HTML parser such as HTML Agility Pack to find the span element whose class is jix_channels_count.
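A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced and the code is compiled as top-level statements (C# 9 or later; in older projects, put it inside a method). The `html` string is a shortened copy of the line from the question and stands in for the downloaded page; the variable names are only illustrative.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Stand-in for the downloaded page: a shortened copy of the line from the question.
var html = "<h2><a href=\"/mmp/it/su/\">Weather</a></h2> " +
           "<span class=\"jix_channels_count\">(324)</span><br>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// XPath: the first <span> whose class attribute is exactly "jix_channels_count".
var span = doc.DocumentNode.SelectSingleNode("//span[@class='jix_channels_count']");

// SelectSingleNode returns null when nothing matches, so check before reading.
int? count = null;
if (span != null && int.TryParse(span.InnerText.Trim().Trim('(', ')'), out var n))
{
    count = n; // the span text is "(324)"; the parentheses were trimmed above
}

Console.WriteLine(count.HasValue ? count.Value.ToString() : "not found");
```

The XPath test `@class='jix_channels_count'` only matches when the attribute holds exactly that one class name; if the site adds further classes to the span, a `contains(@class, ...)` test would be needed instead.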

Another option is SgmlReader.

You tagged your question with regex. I strongly advise against going in that direction.

The suggested approach (with SgmlReader) goes more or less like so:

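// Needs the SgmlReader library (namespace Sgml) plus System.IO, System.Net,
// System.Text, System.Xml and System.Xml.Linq.
var url = "http://www.that-website.com/foo/"; // WebRequest.Create needs an absolute URI, scheme included
var myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
XElement xml;
// Dispose the response and its stream once the document has been loaded.
using (WebResponse myResponse = myRequest.GetResponse())
using (var responseStream = myResponse.GetResponseStream())
using (var sr = new StreamReader(responseStream, Encoding.Default))
{
    // SgmlReader presents the (possibly malformed) HTML as well-formed XML.
    var reader = new SgmlReader
                 {
                     DocType = "HTML",
                     WhitespaceHandling = WhitespaceHandling.None,
                     CaseFolding = CaseFolding.ToLower,
                     InputStream = sr
                 };
    var xmlDoc = new XmlDocument();
    xmlDoc.Load(reader);
    var nodeReader = new XmlNodeReader(xmlDoc);
    xml = XElement.Load(nodeReader);
}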

Now you can use LINQ to XML to find (recursively or otherwise) the span element whose class attribute equals jix_channels_count and read that element's value.
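The LINQ to XML lookup could be sketched roughly as follows. This is an illustration only: `ExtractCount` is a hypothetical helper name, and the `XElement` is parsed from a small hand-written fragment standing in for the document that SgmlReader would produce. It is written as top-level statements (C# 9 or later; in older projects, put it inside a class).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Stand-in for the XElement loaded through SgmlReader: a small, well-formed
// fragment modelled on the line from the question.
var xml = XElement.Parse(
    "<div><h2><a href=\"/mmp/it/su/\">Weather</a></h2> " +
    "<span class=\"jix_channels_count\">(324)</span><br /></div>");

int? count = ExtractCount(xml);
Console.WriteLine(count.HasValue ? count.Value.ToString() : "not found");

// Hypothetical helper: returns the number inside the first
// <span class="jix_channels_count">(N)</span>, or null if there is none.
int? ExtractCount(XElement root)
{
    // Compare local names so the lookup also works if the elements
    // carry an XML namespace (for example the XHTML one).
    var span = root.DescendantsAndSelf()
        .FirstOrDefault(e => e.Name.LocalName == "span"
                          && (string)e.Attribute("class") == "jix_channels_count");
    if (span == null) return null;

    // The element text looks like "(324)"; strip the surrounding parentheses.
    var text = span.Value.Trim().Trim('(', ')');
    int parsed;
    return int.TryParse(text, out parsed) ? parsed : (int?)null;
}
```

Casting the `XAttribute` to `string` yields null when the attribute is missing, so spans without a class attribute are skipped rather than causing an exception.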

Konrad Morawski

Parsing a whole HTML page with regular expressions is the wrong approach. Still, if you know the exact structure of a single line, you can run a regex over it and treat the line as plain text rather than as HTML.

Assuming the number is always inside parentheses, directly within the span that has the jix_channels_count class:

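// Requires: using System.Text.RegularExpressions;
// Group 1 captures the digits between the parentheses that follow the opening span tag.
Match match = Regex.Match(htmlLine, @"<span[^>]*class=""jix_channels_count[^>]*>\(([0-9]+)\)", RegexOptions.IgnoreCase);
if (match.Success)
{
    int number = int.Parse(match.Groups[1].Value);
}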
Marek Musielak