I'm trying to read a PDF document into Python (I removed some content because it contained sensitive data: https://ufile.io/bgghw). I have to work with the check boxes and perform actions based on them and on other text.

I tried PyPDF3, but it only gave corrupted output. After a little research I found pdfminer, which sounds promising, with the downside that it requires Python 2.7.

I'm not sure whether there are other packages, or whether there is a best practice for working with PDFs in Python: all the information I found is several years old, and much of it is contradictory. Of course, I could settle for whichever package is best for my case :)

Thanks for any advice!

Sebastian

1 Answer

First option: PyPDF2

First, run this in cmd to install PyPDF2 (it may work better than PyPDF3, which you already tried):

pip install PyPDF2

Then, to extract text from a PDF file, use the following code:

# import the reader class; these names are from current PyPDF2 releases
# (older releases used PdfFileReader, numPages, getPage and extractText)
from PyPDF2 import PdfReader

# open the PDF in binary mode; the with block closes the file afterwards
with open('example.pdf', 'rb') as pdf_file:

    # create a PDF reader object
    reader = PdfReader(pdf_file)

    # print the number of pages in the PDF file
    print(len(reader.pages))

    # get the first page (pages are indexed from 0)
    page = reader.pages[0]

    # extract and print the text of that page
    print(page.extract_text())
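
Since the question is mainly about check boxes: extracted page text may not tell you whether a box is ticked. If the boxes in your file are real interactive form fields (an assumption, they could also just be drawn symbols), PyPDF2 can list them with get_fields(). The following is only a rough sketch: checkbox_states and read_checkboxes are hypothetical helper names, and the "/Btn" field type also covers radio and push buttons, so the result may need further filtering.

```python
from typing import Any, Dict, Mapping, Optional


def checkbox_states(
    fields: Optional[Mapping[str, Mapping[str, Any]]],
) -> Dict[str, bool]:
    """Map each button-type field name to True (ticked) or False (not ticked)."""
    states: Dict[str, bool] = {}
    # get_fields() returns None when the PDF has no form fields
    for name, field in (fields or {}).items():
        # "/Btn" covers check boxes, but also radio and push buttons
        if field.get("/FT") != "/Btn":
            continue
        # an unticked box has the value "/Off" or no value at all; the name
        # of the "on" state varies between documents ("/Yes", "/On", ...)
        value = field.get("/V")
        states[name] = value is not None and str(value) != "/Off"
    return states


def read_checkboxes(path: str) -> Dict[str, bool]:
    """Read the form fields of the PDF at path and return the button states."""
    # imported here so checkbox_states() can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return checkbox_states(reader.get_fields())
```

Calling read_checkboxes('example.pdf') would then give a dictionary of field names and True/False values that you can branch on. If it comes back empty, the boxes are probably not form fields and you would have to fall back on the extracted text.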

Second option: textract

Run this in cmd to install textract:

pip install textract

Then, to read a PDF, use the following code:

import textract

# process() returns bytes, so decode them to get a str
text = textract.process('path/to/pdf/file', method='pdfminer').decode('utf-8')

Good luck!