Is there a (PHP) Web Scraping Framework That Uses Curl Instead of file_get_contents?

Question

I was about to try scraping with the Simple HTML DOM framework (http://simplehtmldom.sourceforge.net/), but it turns out that URL access through file_get_contents is disabled in the server configuration for security reasons.

I now need to find a similar framework that uses cURL. Does anybody know of one?

The error message I get when trying to run the Slashdot example is:

Warning: file_get_contents() [function.file-get-contents]: URL file-access is disabled in the server configuration in /var/www/vhosts/domain.com/httpdocs/crawlfeed/simple_html_dom.php on line 70

possible duplicate of [How to parse and process HTML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php) — mario, Jan 13 '12 at 16:02
Can't you just cURL the file and then load the text string into SimpleHTMLDOM? — prodigitalson, Jan 13 '12 at 16:02
You don't HAVE to use file_get_contents with SimpleHTMLDOM. You can fetch the HTML yourself with cURL and feed the result to it directly. — Marc B, Jan 13 '12 at 16:02
Also you could really just do the curl request separately, and pass in the string. `$dom = str_get_html(curl($url)->returntransfer(1)->exec());` — mario, Jan 13 '12 at 16:05

score 6 · Accepted Answer · answered Jan 13 '12 at 16:05

Just pull the page down with cURL, then load the string into SimpleHTMLDOM:

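    // simple_html_dom.php is the library file named in the error message above
    require_once 'simple_html_dom.php';

    $ch = curl_init('http://theurl.com');
    // Return the response body as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $htmlStr = curl_exec($ch);
    if ($htmlStr === false) {
        die('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);

    $html = new simple_html_dom();

    // Load HTML from a string
    $html->load($htmlStr);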

answered Jan 13 '12 at 16:05 by prodigitalson

That's no excuse... I've never used it before... I KID, I KID ;-) – prodigitalson Jan 14 '12 at 02:44

score 4 · Answer 2 · answered Jan 13 '12 at 16:14

If you have PHP 5.3 (you should, as PHP 5.2 isn't supported anymore), I highly recommend Goutte.

It's fairly new, and it's just a .phar to include in your project. The HTTP part is taken care of by Zend Http and a socket, and you have the powerful BrowserKit and DomCrawler Symfony components to help you extract information from the HTML (no regex or XPath needed).

score 1 · Answer 3 · edited Sep 23 '16 at 07:22

Just use cURL to fetch the HTML and then parse it with XPath or regular expressions. XPath is a good idea, as it is a query language designed specifically for selecting nodes in XML and (X)HTML documents, which is what you want here.

There is a good example here: http://www.2basetechnologies.com/screen-scraping-with-xpath-in-php
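
As a rough illustration of the cURL plus XPath approach described in this answer, here is a minimal sketch. It assumes PHP 7 or later with the cURL and DOM extensions enabled. `fetchHtml` and `extractLinks` are hypothetical helper names chosen for this example, not part of any library, and the `//a/@href` query is just one possible thing to extract.

```php
<?php
// Hypothetical helper: download a page with cURL and return its body.
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    // Return the response body as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow HTTP redirects
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Hypothetical helper: return the href value of every <a> element.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Select the href attribute node of each anchor that has one
    foreach ($xpath->query('//a/@href') as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}
```

The two steps are kept separate so the parsing can be tried on any HTML string without a network request; a call such as `print_r(extractLinks(fetchHtml('http://theurl.com')));` would chain them.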

edited Sep 23 '16 at 07:22 by Nidhin Baby

answered Jan 13 '12 at 16:11 by Daniel West
