How can I read a PDF in Python?

Question

How can I read a PDF in Python? I know one way of converting it to text, but I want to read the content directly from the PDF.

Can anyone explain which Python module is best for PDF extraction?

score 55 · Answer 1 · edited May 28 '19 at 10:50

You can use the PyPDF2 package.

# install PyPDF2
pip install PyPDF2

# import the required module
import PyPDF2

# open the PDF in binary mode; the with block closes the file afterwards
with open('example.pdf', 'rb') as file:
    # create a PDF reader object
    # (PdfFileReader and numPages are the names used by PyPDF2 releases before 3.0)
    fileReader = PyPDF2.PdfFileReader(file)

    # print the number of pages in the PDF file
    print(fileReader.numPages)

For details, see the documentation: http://pythonhosted.org/PyPDF2/
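The snippet above stops at the page count. To get at the text itself, here is a minimal sketch, assuming the pypdf package (the name under which PyPDF2 development continued) rather than the PyPDF2 1.x API used in this answer, where the equivalent calls were getPage(i).extractText(). The function name extract_pdf_text is hypothetical, picked only for illustration.

```python
def extract_pdf_text(path):
    """Return a list holding the extracted text of each page of the PDF at `path`."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported inside
    # the function so this sketch can be loaded even where the package is missing.
    from pypdf import PdfReader

    reader = PdfReader(path)  # accepts a file path or an open binary file object
    # extract_text() reads the text layer of a single page. Pages without a
    # text layer (for example scanned images) give little or nothing back,
    # so a missing result is normalised to an empty string.
    return [page.extract_text() or "" for page in reader.pages]


# Usage with a real file ('example.pdf' is the placeholder name from the answer):
#     for number, text in enumerate(extract_pdf_text("example.pdf"), start=1):
#         print(number, text)
```

If every page comes back empty, as some of the comments here report, the PDF probably has no extractable text layer, and no change to these calls will recover it.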

edited May 28 '19 at 10:50 by sentence

answered Aug 21 '17 at 10:56 by shankarj67

2

Is there a workaround for getting past the "PyPDF2.utils.PdfReadError: EOF marker not found" error? – James Stewart Jun 07 '18 at 22:24
3

You do not really say here how to get the actual text of the pdf. Your code only creates a . – Outcast Jun 06 '19 at 11:12
5

PyPDF2, PyPDF3, and PyPDF4 are not maintained. [I recommend to use pymupdf](https://stackoverflow.com/a/63518022/562769) – Martin Thoma Aug 21 '20 at 07:04
1

Tried using this package with an order form from Amazon. It found 33 pages but extractText() API was empty for all pages – retsigam Jun 15 '21 at 21:35
Yes, I have tested with a few PDFs, and extractText() was skipping some of the text. It wasn't printing all the text in the PDF. – Sanket Jan 17 '22 at 03:44

score 11 · Answer 2 · edited Jun 20 '20 at 09:12

You can use the textract module in Python.

Textract

To install:

pip install textract

To read a PDF:

import textract
# process() returns the extracted text as bytes
text = textract.process('path/to/pdf/file', method='pdfminer')

For details, see Textract.
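Since process() hands back bytes, you usually want to decode the result before working with it. A small sketch, assuming textract still installs in your environment (the comments on this answer suggest it may not) and that its default UTF-8 output encoding is in effect. The wrapper name read_pdf_with_textract is hypothetical.

```python
def read_pdf_with_textract(path, method="pdfminer", encoding="utf-8"):
    """Return the text of the PDF at `path` as a str, extracted via textract."""
    # Assumption: textract is installed (pip install textract). It is imported
    # inside the function so this sketch can be loaded without the package.
    import textract

    # process() returns bytes; method='pdfminer' matches the call in the answer.
    raw = textract.process(path, method=method)
    return raw.decode(encoding)


# Usage with a real file:
#     print(read_pdf_with_textract("path/to/pdf/file"))
```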

edited Jun 20 '20 at 09:12 by Community

answered Aug 21 '17 at 10:49 by Kallz

13

textract is broken as far as I can tell. – conner.xyz May 14 '18 at 16:58
4

Textract seems to be dead as well: https://github.com/deanmalmgren/textract/issues/350 – Martin Thoma Aug 21 '20 at 07:18
