PDFJS: Invalid PDF structure

Question

I am trying to extract plain text from a PDF document using pdf.js, but I cannot get past the Invalid PDF structure error.

My code is as follows:

const pdfjslib = require('pdfjs-dist');

// Remote PDF that triggers the error
const pdfPath = 'https://www.corenet.gov.sg/media/2268607/dc19-07.pdf';

// getDocument() returns a loading task; its promise resolves to the loaded document
const loadingTask = pdfjslib.getDocument(pdfPath);
loadingTask.promise
    .then((doc) => {
        console.log(doc);
    })
    .catch((err) => {
        console.log(err);
    });

I have tried other PDF documents from the same domain, but they all throw the same error:

...
Warning: Ignoring invalid character "34" in hex string
Warning: Ignoring invalid character "104" in hex string
Warning: Indexing all PDF objects
{ Error
    at InvalidPDFExceptionClosure (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:658:35)
    at Object.<anonymous> (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:661:2)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at Object.defineProperty.value (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:129:23)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at pdfjsVersion (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:116:18)
    at .../pdf_test/node_modules/pdfjs-dist/build/pdf.js:119:10
    at webpackUniversalModuleDefinition (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:25:20)
    at Object.<anonymous> (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:32:3)
    at Module._compile (internal/modules/cjs/loader.js:776:30)
  name: 'InvalidPDFException',
  message: 'Invalid PDF structure' }

PDFs from other domains seem to work. Note that downloading the PDF from the above domain works fine, and the file can be viewed in the Chrome browser, so I doubt that the PDF document is corrupted. I am not implementing any front-end code, as the intention is to host the above code in the cloud.
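
For context, the text extraction step that would follow a successful load can be sketched as below. This is a minimal sketch, not code from the question: the helper names itemsToText and extractText are hypothetical, and it assumes the loaded document exposes numPages, getPage() and getTextContent() as pdf.js documents normally do.

```javascript
// Join the text items of one page into a single line of plain text.
// Entries without a string `str` field are skipped.
function itemsToText(items) {
  return items
    .filter((item) => typeof item.str === 'string')
    .map((item) => item.str)
    .join(' ');
}

// Walk every page of a loaded document (pages are 1-indexed)
// and collect the text of each page on its own line.
async function extractText(doc) {
  const pages = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const content = await page.getTextContent();
    pages.push(itemsToText(content.items));
  }
  return pages.join('\n');
}
```

With the loading code above, this would be used as loadingTask.promise.then(extractText), which resolves to the plain text once the document itself loads without error.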

This answer has an alternative script for extracting data you could try - https://stackoverflow.com/a/29032269/2570277 — Nick, Nov 14 '19 at 11:27
Maybe the website you are downloading the PDF from checks the request headers. Can you try downloading the PDF with chrome and load it locally? — Cr4xy, Nov 14 '19 at 11:35
@Cr4xy yes, I could download the PDF and load it locally. It loads and extracts the plain text correctly. If the request headers are the issue, any idea how I can get around it without downloading the PDF? — Koh, Nov 14 '19 at 11:45
You could try to copy all the headers you can find with chrome dev tools, and add them to getDocument like [here](https://github.com/mozilla/pdf.js/issues/3852#issuecomment-373169304) — Cr4xy, Nov 14 '19 at 13:02
@Cr4xy I have attempted to copy all headers and add them to `httpHeaders`. However, the crucial header to add to make it work seems to be the `cookies` header. Is this expected? — Koh, Nov 16 '19 at 07:35
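
The header approach discussed in these comments can be sketched as follows. This is only an illustration under stated assumptions: buildDocumentSource and loadWithCookie are hypothetical helper names, the cookie value must be copied from a real browser session, and the sketch assumes the same CommonJS pdfjs-dist build as in the question, running in Node. The url and httpHeaders fields are parameters that getDocument accepts.

```javascript
// Build the argument object for pdfjslib.getDocument().
// `cookie` is a hypothetical value copied from a browser session,
// for example "name=value; other=value".
function buildDocumentSource(url, cookie) {
  return {
    url,
    httpHeaders: { Cookie: cookie },
  };
}

// Load the PDF with the extra request header.
// pdfjs-dist is required lazily so buildDocumentSource can be used on its own.
function loadWithCookie(url, cookie) {
  const pdfjslib = require('pdfjs-dist');
  return pdfjslib.getDocument(buildDocumentSource(url, cookie)).promise;
}
```

A call such as loadWithCookie(pdfPath, cookieFromBrowser) would then replace the plain getDocument(pdfPath) call. Session cookies usually expire, so a copied value may stop working after some time.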
