
I am trying to extract plain text from a PDF document using pdf.js, but I cannot get past an "Invalid PDF structure" error.

My code is as follows:

const pdfjslib = require('pdfjs-dist');

const pdfPath = 'https://www.corenet.gov.sg/media/2268607/dc19-07.pdf';

// getDocument() returns a loading task; its promise resolves to the document proxy.
const loadingTask = pdfjslib.getDocument(pdfPath);
loadingTask.promise
    .then((doc) => {
        console.log(doc);
    })
    .catch((err) => {
        // This is where the InvalidPDFException below is logged.
        console.log(err);
    });

I have tried other PDF documents from the same domain, but they all throw the same error:

...
Warning: Ignoring invalid character "34" in hex string
Warning: Ignoring invalid character "104" in hex string
Warning: Indexing all PDF objects
{ Error
    at InvalidPDFExceptionClosure (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:658:35)
    at Object.<anonymous> (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:661:2)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at Object.defineProperty.value (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:129:23)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at pdfjsVersion (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:116:18)
    at .../pdf_test/node_modules/pdfjs-dist/build/pdf.js:119:10
    at webpackUniversalModuleDefinition (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:25:20)
    at Object.<anonymous> (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:32:3)
    at Module._compile (internal/modules/cjs/loader.js:776:30)
  name: 'InvalidPDFException',
  message: 'Invalid PDF structure' }

PDFs from other domains seem to work. Note that downloading the PDF from the above domain works fine, and the file can be viewed in Chrome, so I doubt that the document is corrupted. I am not implementing any front-end code, as the intention is to host the above code in the cloud.
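
One way to narrow this down is to look at the bytes the server actually returns to a non-browser client. Character codes 34 and 104 in the warnings above are `"` and `h`, which may mean the parser is reading HTML-like text rather than a PDF. The sketch below is only a diagnostic under stated assumptions: `looksLikePdf` and `inspectResponse` are hypothetical helper names, and `inspectResponse` assumes Node 18 or later for the built-in `fetch`.

```javascript
// Returns true when the "%PDF-" marker appears within the first 1024 bytes,
// which is where PDF readers conventionally look for the file header.
function looksLikePdf(bytes) {
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  return head.includes('%PDF-');
}

// Hypothetical helper: downloads the URL with Node's built-in fetch (Node 18+)
// and summarizes what came back, including a short preview of the body.
async function inspectResponse(url, headers = {}) {
  const res = await fetch(url, { headers });
  const bytes = new Uint8Array(await res.arrayBuffer());
  return {
    status: res.status,
    contentType: res.headers.get('content-type'),
    isPdf: looksLikePdf(bytes),
    preview: Buffer.from(bytes.subarray(0, 80)).toString('latin1'),
  };
}
```

Calling `inspectResponse(pdfPath).then(console.log)` would show whether the body starts with a PDF header or with markup. If the bytes do turn out to be a valid PDF, they can be passed to pdf.js directly as `pdfjslib.getDocument({ data: bytes })`.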

halfer
Koh
  • This answer has an alternative script for extracting data you could try - https://stackoverflow.com/a/29032269/2570277 – Nick Nov 14 '19 at 11:27
  • Maybe the website you are downloading the PDF from checks the request headers. Can you try downloading the PDF with chrome and load it locally? – Cr4xy Nov 14 '19 at 11:35
  • @Cr4xy yes, I could download the PDF and load it locally. It loads and extracts the plain text correctly. If the request headers are the "issue", any idea how I can get around it without downloading the PDF? – Koh Nov 14 '19 at 11:45
  • You could try to copy all the headers you can find with chrome dev tools, and add them to getDocument like [here](https://github.com/mozilla/pdf.js/issues/3852#issuecomment-373169304) – Cr4xy Nov 14 '19 at 13:02
  • @Cr4xy I have attempted to copy all headers and add them to `httpHeaders`. However, the crucial header to add to make it work seems to be the `cookies` header. Is this expected? – Koh Nov 16 '19 at 07:35
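
For reference, the header-based approach discussed in these comments could look roughly like the sketch below. It uses the `url` and `httpHeaders` options of `getDocument`; the function names and the cookie value are placeholders, not something taken from the site in question, and a cookie copied from a browser session may stop working once that session expires.

```javascript
// Builds the parameter object for pdfjslib.getDocument().
// `cookie` is a placeholder for a value copied from the browser's dev tools.
function buildDocumentParams(url, cookie) {
  const params = { url };
  if (cookie) {
    // Only attach custom headers when a cookie was actually supplied.
    params.httpHeaders = { Cookie: cookie };
  }
  return params;
}

// Loads the document with the cookie attached. pdfjs-dist is required lazily
// so that buildDocumentParams can be used and inspected on its own.
function loadWithCookie(url, cookie) {
  const pdfjslib = require('pdfjs-dist');
  return pdfjslib.getDocument(buildDocumentParams(url, cookie)).promise;
}
```

This applies to Node, as in the question. In a browser, `Cookie` is a forbidden header name that scripts cannot set, so the `withCredentials` option would be the closer equivalent there.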
