I tried to scrape data from a web page, but I get a DOM warning. So I want to know: is it possible to use regex to scrape the date, review, and rating value from this page?

http://www.yelp.com/biz/franchino-san-francisco?start=80

Here is my attempt with DOM, which gives an error: https://eval.in/143074

This works for a smaller snippet: https://eval.in/143036

Is it possible using regex?

<?php
$html = file_get_contents('http://www.yelp.com/biz/franchino-san-francisco?start=80');

// Note: the markup is no longer passed through escapeshellarg() or nl2br();
// both alter the HTML before the parser sees it.

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Parse once and reuse the same document for every query.
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$classname = 'rating-qualifier';
$results = $xpath->query("//*[@class='" . $classname . "']");

if ($results->length > 0) {
    echo $results->item(0)->nodeValue;
}

$classname = 'review_comment ieSucks';
$results = $xpath->query("//*[@class='" . $classname . "']");

if ($results->length > 0) {
    echo $results->item(0)->nodeValue;
}

// Guard against pages without any <meta> element.
$meta = $dom->getElementsByTagName('meta');
if ($meta->length > 0) {
    echo $meta->item(0)->getAttribute('content');
}
?>
tripleee
Programming_crazy
  • See this: http://stackoverflow.com/questions/13986359/test-php-script-online – aelor Apr 28 '14 at 07:54
  • Try running your code on your local machine. What error do you get there? – aelor Apr 28 '14 at 08:08
  • Using only "regular" regex, this is possible only if the site structure is guaranteed never to change and you know it exactly, because HTML is not a regular language: http://blog.codinghorror.com/parsing-html-the-cthulhu-way/ – DrCopyPaste Apr 28 '14 at 08:17
  • @aelor: it gives an error for malformed HTML, similar to `Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 756 in F:\wamp\www\htdocs\thenwat\yelp.php on line 23` – Programming_crazy Apr 28 '14 at 10:18
  • I can suppress these errors using `libxml_use_internal_errors(true)`. The solution above is taken from one of your replies on a different thread. – Programming_crazy Apr 28 '14 at 10:19
  • @DrCopyPaste: thanks, any hint for this? – Programming_crazy Apr 28 '14 at 10:20
  • @Programming_crazy don't do much php over here, but this looks promising: http://stackoverflow.com/a/3577662/2186023 – DrCopyPaste Apr 28 '14 at 10:27
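
For what it's worth, the two ideas from the comments (buffering the parser warnings with `libxml_use_internal_errors(true)` and querying the DOM rather than using regex) could be combined roughly as below. This is only a sketch under assumptions: the inline markup and its values are made up so the snippet runs without fetching the page, `firstTextByClass` is a hypothetical helper name, and the real page structure may differ. Only the class names `rating-qualifier` and `review_comment` come from the question. The XPath matches a single class token, so an element with `class="review_comment ieSucks"` is still found when the attribute contains other classes as well.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
// The unescaped "&" is deliberate: it triggers the kind of
// htmlParseEntityRef warning mentioned in the comments.
$html = <<<'HTML'
<html><head><meta name="description" content="Sample listing"></head>
<body>
  <div class="review">
    <span class="rating-qualifier">4/27/2014</span>
    <p class="review_comment ieSucks">Great pasta & friendly staff</p>
  </div>
</body></html>
HTML;

// Buffer libxml warnings instead of printing them, and remember the old setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard the buffered warnings and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// Hypothetical helper: trimmed text of the first element whose class
// attribute contains the given token, or null if nothing matches.
function firstTextByClass(DOMXPath $xpath, string $class): ?string
{
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    $nodes = $xpath->query($query);
    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}

$date   = firstTextByClass($xpath, 'rating-qualifier');
$review = firstTextByClass($xpath, 'review_comment');

echo $date, PHP_EOL;
echo $review, PHP_EOL;
```

Parsing the document once and reusing the same `DOMXPath` object for every query avoids repeating the `loadHTML()` call for each class name.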

0 Answers