Best practice to read pdf into python

Question

I'm trying to read a PDF document (I removed some content because it contained sensitive data: https://ufile.io/bgghw) into Python. I have to work with the check boxes and perform actions based on these and other text.

I tried PyPDF3, but it only gave corrupted output. After a little research I found pdfminer, which sounds promising but has the downside of requiring Python 2.7.

I'm not sure whether there are other packages, or whether there is a best practice for working with PDFs in Python, as all the information I found is several years old and much of it is contradictory. Of course, I could settle for the best package for my case :)

Thanks for any advice!

Look here: https://stackoverflow.com/q/32667398/10300416 – Nick Dima, Dec 26 '18 at 19:49

score 5 · Answer 1 · answered Dec 26 '18 at 21:10

First option: PyPDF2

First, run this in cmd to install PyPDF2 (it may work better than PyPDF3, which you already tried):

pip install PyPDF2

Then to extract text from a pdf file use the following code:

# importing required modules
import PyPDF2

# opening the pdf file in binary mode; the with block closes it automatically
with open('example.pdf', 'rb') as pdfFileObj:
    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # printing number of pages in pdf file
    print(pdfReader.numPages)

    # creating a page object for the first page
    pageObj = pdfReader.getPage(0)

    # extracting text from page
    print(pageObj.extractText())
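
The snippet above reads only the first page. Here is a minimal sketch for collecting the text of every page, assuming the same legacy PyPDF2 interface used above (numPages, getPage, extractText). The helper name extract_all_text is made up for illustration. Note that PyPDF2 3.0 and later replaced these names with PdfReader, len(reader.pages), reader.pages[i] and extract_text(), so adjust accordingly if you are on a newer version.

```python
def extract_all_text(reader):
    """Return the text of every page in `reader`, joined by form feeds.

    `reader` is assumed to expose the legacy PyPDF2 interface used above:
    a `numPages` attribute and a `getPage(index)` method whose result
    offers `extractText()`.
    """
    pages = []
    for index in range(reader.numPages):
        page = reader.getPage(index)
        # extractText() can come back empty for scanned or image-only pages
        pages.append(page.extractText() or "")
    return "\f".join(pages)


# Usage (needs PyPDF2 < 3.0 and a real file, so it is left commented out):
# import PyPDF2
# with open('example.pdf', 'rb') as pdf_file:
#     print(extract_all_text(PyPDF2.PdfFileReader(pdf_file)))
```

The helper takes the reader rather than a path so it can be tried against any object with the same attributes.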

Second option: Textract

Run this in cmd to install textract:

pip install textract

Then to read a pdf use the following code:

import textract
text = textract.process('path/to/pdf/file', method='pdfminer')
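
Note that textract.process returns a byte string rather than a str, so it usually needs decoding before you can match on the text. Here is a small sketch, assuming the output is UTF-8. The helper lines_containing and the sample bytes are hypothetical stand-ins for real output from your PDF.

```python
def lines_containing(raw, keyword, encoding="utf-8"):
    """Decode extracted bytes and return the stripped lines that mention `keyword`."""
    # errors="replace" keeps going if a few bytes are not valid in `encoding`
    text = raw.decode(encoding, errors="replace")
    keyword = keyword.lower()
    return [line.strip() for line in text.splitlines() if keyword in line.lower()]


# Hypothetical sample standing in for the bytes textract.process() would return
sample = b"Name: Jane Doe\nApproved: Yes\nRejected: No\n"
print(lines_containing(sample, "approved"))
```

With real data you would pass the text variable from the code above instead of sample.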

Good luck!

does extract support encrypted pdf? – Irshu, Apr 14 '19 at 08:22
