31

I am trying to scrape a website but I don't get some of the elements, because these elements are dynamically created.

I am using cheerio in Node.js, and my code is below.

var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";

request(url, function (err, res, html) {
    var $ = cheerio.load(html);
    $('.listMain > li').each(function () {
        console.log($(this).find('a').attr('href'));
    });
});

This code prints nothing because, when the page is first loaded, the <ul id="store_list" class="listMain"> element is empty.

The content has not been appended yet.

How can I get these elements using node.js? How can I scrape pages with dynamic content?

pguardiario
JayD
  • Use PhantomJS, a headless browser; it will load and render the page. You can then access the elements on the page through its JavaScript API. – Safi Feb 26 '15 at 10:58
  • Thanks Safi! Could you give me a code snippet or a reference for this case? – JayD Feb 26 '15 at 23:50
  • Note that the top answer on this page is from 2015 and recommends an out of date library. Puppeteer and Playwright are the preferred dynamic scraping tools as of 2021, and by the time you're reading this note, there may be other tools that have become state of the art, so please read the entire thread. OP hasn't visited SO since 2016 so I don't anticipate the checkmark changing until site policy does. – ggorlen Jul 02 '21 at 17:21

4 Answers

23

Here you go:

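var phantom = require('phantom');

// Uses the callback-style API of the (now deprecated) phantom package.
phantom.create(function (ph) {
  ph.createPage(function (page) {
    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    page.open(url, function () {
      // Inject jQuery so the page can be queried with the same selectors.
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
        // This function runs inside the page, so return the data to Node
        // rather than relying on console output from the page.
        page.evaluate(function () {
          var hrefs = [];
          $('.listMain > li').each(function () {
            hrefs.push($(this).find('a').attr('href'));
          });
          return hrefs;
        }, function (hrefs) {
          // Back in Node: print the results, then shut PhantomJS down.
          hrefs.forEach(function (href) { console.log(href); });
          ph.exit();
        });
      });
    });
  });
});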
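
Note that nothing in the snippet above explicitly waits for the page's AJAX requests to finish. One generic way to handle that is to poll for a condition with a timeout. The following is a minimal sketch in plain Node.js; the helper name waitFor and its parameters are hypothetical and are not part of the phantom package:

```javascript
// Hypothetical helper: poll `condition` until it returns a truthy value
// or `timeoutMs` elapses. Names and defaults are illustrative only.
function waitFor(condition, timeoutMs = 5000, intervalMs = 100) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    (function poll() {
      Promise.resolve()
        .then(condition) // condition may return a value or a promise
        .then((result) => {
          if (result) return resolve(result);
          if (Date.now() >= deadline) {
            return reject(new Error('waitFor: timed out'));
          }
          setTimeout(poll, intervalMs);
        })
        .catch(reject);
    })();
  });
}

// Example: simulate content that is appended 300 ms after "page load".
const items = [];
setTimeout(() => items.push('/store/1'), 300);
waitFor(() => items.length > 0).then(() => console.log(items));
```

In a real scraper, the condition would presumably wrap a call into the page (for example, checking that '.listMain > li' matches at least one element) instead of inspecting a local array.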
Artjom B.
Safi
  • This works fine!! Thank you very much. But I have another question: this page appends more items as you scroll down, so I need to know when the last group has been attached. Also, the code above passes a callback (function() { ph.exit() }), but phantom does not terminate and the terminal never returns to the prompt!! – JayD Mar 02 '15 at 07:10
  • @Safi I copied and tried the above code, but nothing happens. Can you please help me? I run node file.js and it just returns to the next line. – Sumit Sahay Apr 04 '16 at 08:09
  • Where exactly in this code is the logic to wait for ajax to finish loading? I don't understand how phantom would know. – 1mike12 Nov 30 '16 at 19:06
  • phantom: ⚠️ **This package has been deprecated** ⚠️ This package is no longer maintained. You might want to try using puppeteer instead – Fletcher Rippon Feb 15 '21 at 05:48
  • @1mike12 you can await a setTimeout promise after opening the page, or Phantom's waitFor can help you validate that a certain condition is true inside the page – kas Jun 05 '21 at 22:27
22

Check out GoogleChrome/puppeteer

Headless Chrome Node API

It makes scraping pretty trivial. The following example scrapes the headline at npmjs.com (assuming the .npm-expansions element still exists):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.npmjs.com/');

  const textContent = await page.evaluate(() => {
    return document.querySelector('.npm-expansions').textContent
  });

  console.log(textContent); /* No Problem Mate */

  await browser.close();
})();

page.evaluate runs the given function inside the page itself, so it can inspect elements that the page's own scripts have created.
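
Applied to the page from the question, the same idea might look like the sketch below. It assumes that the list items eventually match '.listMain > li' and that waiting for the first match is enough; the function names scrapeHrefs and main are hypothetical. waitForSelector and $$eval are Puppeteer Page methods.

```javascript
// Sketch: collect the first link of each list item that is rendered by
// client-side scripts. `page` is expected to be a Puppeteer Page.
async function scrapeHrefs(page, url, selector) {
  await page.goto(url);
  // Wait until at least one matching element has been added to the DOM.
  await page.waitForSelector(selector);
  // $$eval runs the callback in the page against all matching elements.
  return page.$$eval(selector, (items) =>
    items.map((li) => {
      const a = li.querySelector('a');
      return a ? a.getAttribute('href') : null;
    })
  );
}

async function main(url, selector) {
  // Required lazily so that scrapeHrefs can be exercised without a browser.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    console.log(await scrapeHrefs(page, url, selector));
  } finally {
    await browser.close();
  }
}

// Usage: node scrape.js "http://www.bdtong.co.kr/index.php?c_category=C02"
const targetUrl = process.argv[2];
if (targetUrl) {
  main(targetUrl, process.argv[3] || '.listMain > li').catch(console.error);
}
```

If the page keeps appending items as the user scrolls, waiting for the first match will not capture everything; that case would need extra scrolling logic that is not shown here.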

scniro
  • Good choice, considering this [announcement](https://groups.google.com/forum/m/#!topic/phantomjs/9aI5d-LDuNE) – slesh Jan 22 '18 at 07:46
  • I read some articles, may I say that puppeteer runs on server (node.js) not on client side (in browser)? – Felix Xu Feb 28 '21 at 13:00
12

Use the new npm module x-ray, with a pluggable web driver x-ray-phantom.

There are examples on those projects' pages, but here's how to do dynamic scraping:

var phantom = require('x-ray-phantom');
var Xray = require('x-ray');

var x = Xray()
  .driver(phantom());

x('http://google.com', 'title')(function(err, str) {
  if (err) return console.error(err);
  console.log(str); // the page title
});
Keng
  • Are you running this program as `node google_xray_code.js` or as `phantomjs google_xray_code.js` ?? In its current form, phantomjs is not a node module.. – zipzit Feb 18 '16 at 19:36
  • @zipzit phantom is not a node module; it's a driver that you install externally and export the path of if you wish to use it with x-ray. – Keng Feb 22 '16 at 21:12
  • What makes this dynamic? The page title of google.com is static, no? – 1mike12 Nov 30 '16 at 19:06
  • phantom stderr: 'phantomjs' is not recognized as an internal or external command, operable program or batch file. C:\Projects\Dealbuilder1One\node_modules\nightmare\lib\index.js:284 throw err; ^ – Urasquirrel Jul 23 '18 at 14:30
  • I tried this; x-ray works perfectly on static websites, but for dynamic ones the x-ray-phantom installation is a big headache. Instead, I found a realistic and easy solution for static and dynamic scraping, described in https://pusher.com/tutorials/web-scraper-node – Rohit Parte Oct 03 '19 at 09:17
0

The easiest and most reliable solution is to use puppeteer, as described in https://pusher.com/tutorials/web-scraper-node; it is suitable for both static and dynamic scraping.

The only change needed is to raise the timeout in Browser.js, TimeoutSettings.js, and Launcher.js from 300000 to 3000000.
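
As an alternative to editing files inside node_modules (which are replaced whenever the package is reinstalled), Puppeteer's Page exposes setDefaultTimeout and setDefaultNavigationTimeout. A minimal sketch, where the helper name applyTimeouts is hypothetical and the value is simply the one mentioned above:

```javascript
// Hypothetical helper: apply one timeout (in milliseconds) to a Puppeteer page.
function applyTimeouts(page, timeoutMs) {
  // Covers navigation calls such as goto and waitForNavigation.
  page.setDefaultNavigationTimeout(timeoutMs);
  // Covers other waits such as waitForSelector.
  page.setDefaultTimeout(timeoutMs);
  return page;
}

// Usage (inside an async function, after `const browser = await puppeteer.launch()`):
//   const page = applyTimeouts(await browser.newPage(), 3000000);
```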

Rohit Parte