2

I was just about to attempt scraping with the Simple HTML DOM framework (http://simplehtmldom.sourceforge.net/), but it turns out that opening URLs with file_get_contents is disabled in the server configuration (allow_url_fopen is off) for security reasons.

I now need to find a similar framework that uses cURL. Does anybody know of one?

The error message I get when trying to run the Slashdot example is:

Warning: file_get_contents() [function.file-get-contents]: URL file-access is disabled in the server configuration in /var/www/vhosts/domain.com/httpdocs/crawlfeed/simple_html_dom.php on line 70

martincarlin87
  • possible duplicate of [How to parse and process HTML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php) – mario Jan 13 '12 at 16:02
  • Can't you just cURL the file and then load the text string into SimpleHTMLDOM? – prodigitalson Jan 13 '12 at 16:02
  • you don't HAVE to use file_get_contents with simplehtml. You can fetch the html yourself with curl and feed the results to simplehtml directly. – Marc B Jan 13 '12 at 16:02
  • Also you could really just do the curl request separately, and pass in the string. `$dom = str_get_html(curl($url)->returntransfer(1)->exec());` – mario Jan 13 '12 at 16:05

3 Answers

6

Just pull the page down with cURL, then load the string into SimpleHTMLDOM:

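// Assumes simple_html_dom.php is on the include path
require_once 'simple_html_dom.php';

// Fetch the page with cURL instead of file_get_contents()
$ch = curl_init('http://theurl.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$htmlStr = curl_exec($ch);
if ($htmlStr === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

$html = new simple_html_dom();

// Load HTML from a string
$html->load($htmlStr);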
prodigitalson
4

If you have PHP 5.3 (you should, as PHP 5.2 isn't supported anymore), I totally recommend Goutte.

It's fairly new, and it's just a .phar to include in your project. The HTTP part is handled by Zend Http and a socket, and you get the powerful BrowserKit and DomCrawler Symfony components to help you extract information from the HTML (no regex or XPath required).

Damien
1

Just use cURL to get the HTML and then parse it using XPath or regular expressions. XPath is the better choice, as it is a query language designed specifically for selecting nodes in XML or (X)HTML documents, which is what you want to do here.

There is a good example here: http://www.2basetechnologies.com/screen-scraping-with-xpath-in-php
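
As a rough illustration of the cURL plus XPath approach, here is a minimal sketch that uses only PHP's bundled curl and DOM extensions. The helper names fetch_html() and extract_links() and the default query //a[@href] are assumptions made for this example, not part of any library, and the type declarations assume PHP 7 or later.

```php
<?php
// Sketch only: fetch_html() and extract_links() are hypothetical helper names.

/**
 * Download a page with cURL and return its body as a string.
 */
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    // The handle is released when $ch goes out of scope.
    return $body;
}

/**
 * Return the href and text of every element matched by an XPath query.
 */
function extract_links(string $html, string $query = '//a[@href]'): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query($query) as $node) {
        $links[] = [
            'href' => $node->getAttribute('href'),
            'text' => trim($node->textContent),
        ];
    }
    return $links;
}
```

With these in place, something like extract_links(fetch_html($url)) should give you an array of href/text pairs for a page. Keeping the download and the parsing in separate functions means the XPath part can be tried out on a plain HTML string without touching the network.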

Nidhin Baby
Daniel West