I am trying to use Puppeteer to extract the title of this page: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106

I have the below code,

          (async () => {
            const browser = await puppet.launch({ headless: true });
            const page = await browser.newPage();
            await page.goto(req.params[0]); //this is the url
            title = await page.evaluate(() => {
              Array.from(document.querySelectorAll("meta")).filter(function (
                el
              ) {
                return (
                  (el.attributes.name !== null &&
                    el.attributes.name !== undefined &&
                    el.attributes.name.value.endsWith("title")) ||
                  (el.attributes.property !== null &&
                    el.attributes.property !== undefined &&
                    el.attributes.property.value.endsWith("title"))
                );
              })[0].attributes.content.value ||
                document.querySelector("title").innerText;
            });

which I have tested using the browser console and even using the { headless: false } option of Puppeteer. It works as expected in the browser, but when I actually run it with node it gives me the following error.

10:54:21 AM web.1 |  (node:10288) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'attributes' of undefined
10:54:21 AM web.1 |      at __puppeteer_evaluation_script__:14:20

So, when I run the same Array.from ...querySelectorAll("meta")... query in the browser I get the expected string:

"Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom"

I'm starting to think I'm doing something wrong with the async promises, as that is the part that is different. Can anyone point me in the right direction?

EDIT: As suggested, I tested using document.title, which should be there, but it came back as an empty string. See code and log below:

          console.log(
            "testing the return",
            (async () => {
              const browser = await puppet.launch({ headless: true });
              const page = await browser.newPage();
              await page.goto(req.params[0]); //this is the url
              try {
                title = await page.evaluate(() => {
                  const title = document.title;
                  const isTitleThere = title == null ? false : true;
                  //recently read that this checks for undefined as well as null but not an
                  //undeclared var
                  return {
                    title: title,
                    titleTitle: title.title,
                    isTitleThere: isTitleThere,
                  };
                });
              } catch (error) {
                console.log(error, "There was an error");
              }
11:54:11 AM web.1 |  testing the return Promise { <pending> }
11:54:13 AM web.1 |  { title: '', isTitleThere: true }

Does this have to do with single-page application bs? I thought puppeteer handled that because it loads everything first.

EDIT: I have added the networkidle option and an 8000-millisecond wait, as suggested. The title is still empty. Code and log below:

            await page.goto(req.params[0], { waitUntil: "networkidle2" });
            await page.waitFor(8000);
            console.log("done waiting");
            title = await page.$eval("title", (el) => el.innerText);
            console.log("title: ", title);
            console.log("done retrieving");
12:36:39 PM web.1 |  done waiting
12:36:39 PM web.1 |  title:  
12:36:39 PM web.1 |  done retrieving

EDIT: PROGRESS!! Thank you to theDavidBarton. It seems headless has to be false for it to work? Does anyone know why?

Qrow Saki

2 Answers

2

If you only need the innerText of the title element, Puppeteer's page.$eval method gets it in one call:

const title = await page.$eval('title', el => el.innerText)
console.log(title)

Output:

Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom

page.$eval(selector, pageFunction[, ...args])

The page.$eval method runs document.querySelector(selector) within the page and passes the matching element as the first argument to pageFunction.


However, your main problem is that the page you are visiting is a Single-Page App (SPA) built with React, and its title is filled in dynamically by the JavaScript bundle. Puppeteer therefore finds a valid title element in the <head>, but at that moment its content is simply "" (an empty string).
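This also explains the TypeError in the question: when no matching meta tag exists yet, the filter returns an empty array, `[0]` is undefined, and the `||` fallback is never reached. A minimal sketch of a page function that degrades gracefully instead; `extractTitle` is a hypothetical name, and the selector assumes the same "name or property ends with title" rule as the question's code:

```javascript
// Hypothetical page function: prefer a <meta> tag whose name or property
// ends with "title", then fall back to document.title, and never throw
// when the tag is missing.
function extractTitle(doc = document) {
  const meta = doc.querySelector('meta[property$="title"], meta[name$="title"]')
  const fromMeta = meta ? meta.getAttribute('content') : null
  // An empty string is treated as "not populated yet" and reported as null.
  return fromMeta || doc.title || null
}

// Usage, inside an async function with an open Puppeteer page:
//   const title = await page.evaluate(extractTitle)
```

A null result then means the page has not populated its title yet, which is easier to act on than an exception thrown inside the page.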

Normally, for SPAs you should use waitUntil: 'networkidle0' so that the JS framework has populated the DOM and the page is fully functional before you read from it:

await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle0'
  })

Unfortunately, on this specific website that throws a timeout error, because the network connections stay open past the default 30000 ms timeout. Something seems to be wrong on the page's frontend side (web worker handling?).
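If you still want to attempt 'networkidle0', one option is to treat the navigation timeout as non-fatal and carry on with whatever has loaded. This is only a sketch: `gotoTolerant` is a hypothetical helper, and it assumes that Puppeteer's timeout errors carry the name "TimeoutError".

```javascript
// Hypothetical helper: navigate with networkidle0, but report a timeout
// instead of throwing, so the caller can still inspect the page.
async function gotoTolerant(page, url, options = {}) {
  try {
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 30000, ...options })
    return true // navigation settled within the timeout
  } catch (err) {
    // Assumption: Puppeteer timeout errors have name === 'TimeoutError'.
    if (err && err.name === 'TimeoutError') return false
    throw err // anything else is a real failure
  }
}

// Usage, inside an async function:
//   const settled = await gotoTolerant(page, url)
//   if (!settled) console.log('network never went idle, continuing anyway')
```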

As a workaround, you can force Puppeteer to sleep for 8 seconds with await page.waitFor(8000) before you try to retrieve the title; by that time it will be properly populated. This is also why your script works in the DevTools Console: you are not running it immediately, so by then the page is fully loaded and the DOM is populated.
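Instead of a fixed sleep, you could wait for the condition you actually care about. A minimal sketch using page.waitForFunction; `waitForTitle` and its default timeout are hypothetical, and it assumes the page does eventually set a non-empty title:

```javascript
// Hypothetical helper: poll inside the page until document.title is
// non-empty, then return it. Rejects if the timeout elapses first.
async function waitForTitle(page, timeoutMs = 15000) {
  await page.waitForFunction(
    () => document.title.trim().length > 0,
    { timeout: timeoutMs }
  )
  return page.title()
}

// Usage, inside an async function after page.goto(...):
//   const title = await waitForTitle(page)
```

This returns as soon as the title appears, and fails loudly if it never does.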

This script will return the expected title:

const puppeteer = require('puppeteer')

async function fn() {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()

  await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle2'
  })
  await page.waitFor(8000)

  const title = await page.$eval('title', el => el.innerText)
  console.log(title)

  await browser.close()
}
fn()

Launching with puppeteer.launch({ headless: false }) instead of headless: true may affect the result as well.
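If you need to stay headless, the stealth plugin mentioned in the comments is one thing to try. A rough sketch, assuming puppeteer-extra and puppeteer-extra-plugin-stealth are installed; `getTitleWithStealth` is a hypothetical name, and I have not verified that the plugin gets past this particular site:

```javascript
// Hypothetical helper: fetch a page title in headless mode with the
// stealth plugin enabled. Requires are kept inside the function so the
// sketch only needs the packages when it is actually called.
async function getTitleWithStealth(url) {
  // Assumption: both packages are installed in the project.
  const puppeteer = require('puppeteer-extra')
  const StealthPlugin = require('puppeteer-extra-plugin-stealth')
  puppeteer.use(StealthPlugin()) // in real code, register the plugin once

  const browser = await puppeteer.launch({ headless: true })
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle2' })
    // Wait until the SPA has filled in the title (rejects on timeout).
    await page.waitForFunction(() => document.title.length > 0)
    return await page.title()
  } finally {
    await browser.close() // always release the browser
  }
}

// Usage:
//   getTitleWithStealth('https://example.com').then(console.log).catch(console.error)
```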

theDavidBarton
  • It still returns empty, even with the networkidle and 8000. Is it possible it's not fully loaded even after those waits? Or am I doing something else wrong? – Qrow Saki Sep 09 '20 at 19:31
  • How do you use networkidle? If you use networkidle0, your whole script may fail. My script is only these 3 lines (after the page.goto) and it currently returns the title. – theDavidBarton Sep 09 '20 at 19:34
  • I tried networkidle2 and networkidle0. See edit. Same result. If you say yours is getting back the title, then it's probably other parts of my code messing things up, since we have the same thing. I will get rid of those and see if it still causes a problem. Thanks for all the help! – Qrow Saki Sep 09 '20 at 19:42
  • @QrowSaki I have added my whole script at the end for clarity. I think the game changer is changing `{ headless: true }` to `{ headless: false }`. It is worth investigating why that produces different results. Glad I could help a little. – theDavidBarton Sep 09 '20 at 19:48
  • It worked! Thank you! Do you think it matters if headless is false? I'm making a web api. I don't need the UI. If I leave the headless: false in there, will that be constantly opening up Chromium? – Qrow Saki Sep 09 '20 at 19:54
  • Yes, that is "headful" Chrome. The problem is that this site can be automated/scraped only if the browser is not headless (at least that seems to be the limitation so far). You could try puppeteer-extra with its _stealth_ plugin to make your Chrome look like a headful instance without launching the UI: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth if it is worth the effort for you (and the additional dependencies in your project). – theDavidBarton Sep 09 '20 at 20:01
  • Thank you! I will look into that! – Qrow Saki Sep 09 '20 at 20:09
1

When navigating to the page, wait until the page has loaded:

await page.goto(req.params[0], { waitUntil: "networkidle2" }); //this is the url

Could you try this:

try {
  title = await page.evaluate(() => {
    const title = document.title;
    // recently read that this checks for undefined as well as null but not an
    // undeclared var
    const isTitleThere = title == null ? false : true;
    return { title: title, isTitleThere: isTitleThere };
  });
} catch (error) {
  console.log(error, "There was an error");
}

or this:

try {
  title = await page.evaluate(() => {
    const meta = document.querySelector('meta[property="og:title"]');
    // `== null` checks for undefined as well as null, but not an undeclared var
    const isTitleThere = meta == null ? false : true;
    // Return the content string: a DOM element cannot be serialized back to Node
    return { title: meta ? meta.content : null, isTitleThere: isTitleThere };
  });
} catch (error) {
  console.log(error, "There was an error");
}
chuklore
  • I tried the first one. It returned true :( but there's definitely a document title in the page I'm looking at. – Qrow Saki Sep 09 '20 at 18:43
  • You can access the title like so: `title.title` – chuklore Sep 09 '20 at 18:47
  • I'm not. Should I be? :0 I only want this one function to be asynchronous. It can do the rest while it's waiting, is what I thought. Is this wrong? Should I be wrapping my entire code in an async func? – Qrow Saki Sep 09 '20 at 18:59
  • Why networkidle2 specifically and not networkidle0 or 1? – Qrow Saki Sep 09 '20 at 19:22
  • I used the solution from this URL when I ran into that problem: [Puppeteer wait until page is completely loaded - Stack Overflow](https://stackoverflow.com/questions/52497252/puppeteer-wait-until-page-is-completely-loaded) – chuklore Sep 09 '20 at 19:31