I am working on a Python project that extracts text from many PDF documents. I have come across one document that neither of these projects can parse:

https://github.com/euske/pdfminer/

https://github.com/deanmalmgren/textract

Even the command-line tool pdftotext cannot extract the text from this document: it prints text at first, then starts printing garbage after about 2 minutes of extraction.

The document can be found here: https://www.aiaa.org/uploadedFiles/Events/Conferences/2013_Conferences/2013_-_GNC_Infotech/Promotional_Materials/GNC%202013%20Final%20Program.pdf

I'm interested in one of two solutions:

  1. How could I accomplish the goal of extracting the text from this document in Python?
  2. How could I detect documents like this in general, so I could avoid trying to parse them altogether?

Either of these solutions would be ideal, so thanks in advance!
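
For the second option, a rough sketch of what such a check might look like: run pdftotext with a time limit, then measure how much of the output falls outside plain printable ASCII. The function names, the 30-second timeout, and the 0.3 threshold are illustrative assumptions, not tuned values, and the ASCII test would misfire on non-English documents.

```python
import string
import subprocess

# Characters treated as "normal" text. This is a heuristic that assumes
# mostly English documents; accented or non-Latin text counts as suspicious.
_PRINTABLE = set(string.printable)


def garbage_ratio(text):
    """Return the fraction of characters outside string.printable."""
    if not text:
        return 1.0  # empty output is treated as a failed extraction
    bad = sum(1 for ch in text if ch not in _PRINTABLE)
    return bad / len(text)


def looks_like_garbage(text, threshold=0.3):
    """Flag text whose share of unusual characters exceeds the threshold."""
    return garbage_ratio(text) > threshold


def pdftotext_with_timeout(path, timeout=30):
    """Run pdftotext on path and return its output, or None on failure.

    Returns None if pdftotext is not installed, exits with an error,
    or does not finish within `timeout` seconds.
    """
    try:
        result = subprocess.run(
            ["pdftotext", path, "-"],  # "-" sends the text to stdout
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            check=True,
        )
    except (subprocess.TimeoutExpired,
            subprocess.CalledProcessError,
            FileNotFoundError):
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

A document could then be skipped when `pdftotext_with_timeout(path)` returns `None` or when `looks_like_garbage(...)` is true for the returned text.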

Has QUIT--Anony-Mousse

1 Answer

I use Jupyter with Python 3.6 on Windows 10. In that setup I have to use pdfminer.six.

I recently had to reinstall everything, and this still works for me.
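
A minimal sketch of what extraction with pdfminer.six can look like, assuming a release that ships `pdfminer.high_level.extract_text`. The wrapper name `extract_pdf_text` and its return-`None`-on-failure behaviour are my own illustrative choices, intended for batch runs where one bad file should not stop the rest.

```python
def extract_pdf_text(path, max_pages=0):
    """Return the text of the PDF at path, or None if extraction fails.

    max_pages=0 asks pdfminer.six to process every page.
    """
    try:
        # Provided by the pdfminer.six package (pip install pdfminer.six).
        from pdfminer.high_level import extract_text
    except ImportError:
        # pdfminer.six is not installed; treated like any other failure here.
        return None
    try:
        return extract_text(path, maxpages=max_pages)
    except Exception:
        # Missing file, malformed PDF, or any parser error: skip the document.
        return None
```

Calling `extract_pdf_text("some_file.pdf")` returns a string on success. A `None` result only says that something went wrong, so logging the exception instead of swallowing it may be preferable while debugging.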

pyano