Get complete web page source html with puppeteer - but some part always missing

Question

I am trying to scrape a specific string from the web page below:

https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;

The info I want to get from this page's source is the number sequence in the strings below (I can find them when I right-click and choose "View Page source"):

name="nr_rooms_4377601_232287150_0_1_0" / name="nr_rooms_4377601_232287150_1_1_0"

I am using Puppeteer, and my code is below:

const puppeteer = require('puppeteer');
(async() => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //await page.goto('https://example.com');
    const response = await page.goto("My-url-above");
    let bodyHTML = await page.evaluate(() => document.body.innerHTML);
    let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
    console.log(await response.text());
    console.log(await page.content());
    await browser.close();
})()

But I cannot find the strings I am looking for in response.text() or page.content().

Am I using the wrong methods on page?

How can I dump the actual page source, exactly the same as what I see when I right-click and view the source?

score 1 · Accepted Answer · answered Aug 27 '20 at 19:48

If you investigate where these strings appear, you can see that they are name attributes of <select> elements with a specific class (.hprt-nos-select):

<select
  class="hprt-nos-select"
  name="nr_rooms_4377601_232287150_0_1_0"
  data-component="hotel/new-rooms-table/select-rooms"
  data-room-id="4377601"
  data-block-id="4377601_232287150_0_1_0"
  data-is-fflex-selected="0"
  id="hprt_nos_select_4377601_232287150_0_1_0"
  aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
>

You can wait until this element is loaded into the DOM; it will then show up in the output of page.content() as well:

await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
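Once the element is present, the name attributes can be read straight from the live DOM rather than searched for in an HTML dump. The following is a minimal sketch, assuming the puppeteer package is installed and the page really renders the .hprt-nos-select elements. The helper names (parseRoomSelectName, getRoomSelectNames, main) are made up for this example, and the split into room id and block id is only inferred from the data-room-id and data-block-id attributes shown above.

```javascript
// Sketch only: assumes `puppeteer` is installed and that the page still
// renders <select class="hprt-nos-select"> elements.

// Split a name such as "nr_rooms_4377601_232287150_0_1_0" into its parts.
// The layout (room id first, then the rest of the block id) is inferred
// from the data-room-id / data-block-id attributes in the markup above.
function parseRoomSelectName(name) {
  const match = /^nr_rooms_((\d+)_\d+(?:_\d+)*)$/.exec(name);
  if (!match) return null;
  return { blockId: match[1], roomId: match[2] };
}

// Wait for the room selects, then read their name attributes from the DOM.
async function getRoomSelectNames(page) {
  await page.waitForSelector('.hprt-nos-select'); // default timeout: 30 s
  return page.$$eval('.hprt-nos-select', (els) => els.map((el) => el.name));
}

async function main(url) {
  const puppeteer = require('puppeteer'); // loaded only when main() is called
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    const names = await getRoomSelectNames(page);
    console.log(names.map(parseRoomSelectName));
  } finally {
    await browser.close();
  }
}

// Usage: main('<the hotel URL from the question>').catch(console.error);
```

This sketch keeps the default 30-second timeout instead of timeout: 0, so the script fails with an error instead of hanging forever if the selects never appear.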

However, your issue actually lies in the fact that the URL you are visiting has some extra parameters (?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;) that do not seem to take effect when Puppeteer loads the page: if you take a full-page screenshot, you will see that it still shows the default hotel search form without the specific hotel offers you are expecting.
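One way to check this yourself is to save a full-page screenshot and test both the raw response and the rendered DOM for the room selects. This is a rough diagnostic sketch, assuming the puppeteer package is installed. The function names (hasRoomSelects, inspectLoadedPage) and the output file name loaded-page.png are arbitrary choices for this example.

```javascript
// Sketch only: assumes `puppeteer` is installed; 'loaded-page.png' is an
// arbitrary output file name chosen for this example.

// True if the HTML contains at least one room-count select name.
function hasRoomSelects(html) {
  return /name="nr_rooms_[\d_]+"/.test(html);
}

async function inspectLoadedPage(url) {
  const puppeteer = require('puppeteer'); // loaded only when this function runs
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const response = await page.goto(url, { waitUntil: 'networkidle2' });
    // Save what the headless browser actually rendered.
    await page.screenshot({ path: 'loaded-page.png', fullPage: true });
    const rawHtml = await response.text(); // body of the initial HTTP response
    const domHtml = await page.content();  // serialized DOM after scripts ran
    return {
      finalUrl: page.url(), // shows whether the site redirected the request
      inRawHtml: hasRoomSelects(rawHtml),
      inDomHtml: hasRoomSelects(domHtml),
    };
  } finally {
    await browser.close();
  }
}

// Usage: inspectLoadedPage('<the hotel URL from the question>').then(console.log);
```

If both flags come back false and the screenshot shows only the search form, the problem is the content the site serves to this browser session, not the method used to dump the HTML.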

You should interact with the search form with puppeteer (page.click() etc.) to set the dates and the origin country yourself to achieve the expected page content.

Yes, what you said is correct: Puppeteer is not taking my URL parameters into account, so the page does not contain the info I am actually looking for. @thedavidbarton, is there a way to make Puppeteer accept my URL parameters? — Jia, Aug 28 '20 at 04:03
I am not sure if there is a way. Maybe it would work if you reused cookies from your manual page visits, but in that case you need to do a lot of things manually as well. I suggest automating the whole process from start to end with user-like actions: select the dates with page.click. That way it will work. — theDavidBarton, Aug 28 '20 at 06:55
One finding: when I disable headless mode with "const browser = await puppeteer.launch({ headless: false })", the URL parameters are still valid when I visit the page. But I don't know why yet. — Jia, Aug 28 '20 at 07:05
If headful mode helps with the query params, you can use this shady npm package to make your headless chrome act like a headful chrome: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth — theDavidBarton, Aug 28 '20 at 15:36
