39

How can I read a PDF in Python? I know one way is to convert it to text first, but I want to read the content directly from the PDF.

Can anyone explain which Python module is best for PDF extraction?

peterh
sg1994

2 Answers

55

You can use the PyPDF2 package:

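# install PyPDF2
pip install PyPDF2

# import the required module
import PyPDF2

# open the PDF in binary mode
file = open('example.pdf', 'rb')

# create a PDF reader object
# (PyPDF2 3.x replaced PdfFileReader with PdfReader)
fileReader = PyPDF2.PdfReader(file)

# print the number of pages in the PDF file
# (numPages was replaced by len(reader.pages))
print(len(fileReader.pages))

# close the file when done
file.close()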

See the documentation: http://pythonhosted.org/PyPDF2/
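Since the question asks for the content itself rather than the page count, here is a minimal sketch of per-page text extraction. It assumes a recent PyPDF2 release (3.x), where the method is `extract_text()`; older releases called it `extractText()`. The helper names `pages_to_text` and `read_pdf_text` are made up for this example.

```python
from typing import Iterable, List


def pages_to_text(pages: Iterable) -> List[str]:
    """Return the extracted text of each page, using '' where nothing was found."""
    # extract_text() can come back empty (or None in some versions) for
    # scanned or image-only pages, so normalise those cases to ''.
    return [(page.extract_text() or "") for page in pages]


def read_pdf_text(path: str) -> List[str]:
    """Open the PDF at `path` and return one string per page."""
    # Imported here so pages_to_text stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x

    with open(path, "rb") as fh:
        reader = PdfReader(fh)
        # Extract while the file is still open; pages are loaded lazily.
        return pages_to_text(reader.pages)
```

Calling `read_pdf_text('example.pdf')` should give a list with one entry per page. An empty string for a page usually means the page has no text layer (for example, a scanned image), which plain text extraction cannot read.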

sentence
shankarj67
  • Is there a workaround for getting past the "PyPDF2.utils.PdfReadError: EOF marker not found" error? – James Stewart Jun 07 '18 at 22:24
  • You do not really say here how to get the actual text of the pdf. Your code only creates a . – Outcast Jun 06 '19 at 11:12
  • PyPDF2, PyPDF3, and PyPDF4 are not maintained. [I recommend to use pymupdf](https://stackoverflow.com/a/63518022/562769) – Martin Thoma Aug 21 '20 at 07:04
  • Tried using this package with an order form from Amazon. It found 33 pages but extractText() API was empty for all pages – retsigam Jun 15 '21 at 21:35
  • Yes, I have tested with few of the pdf, extractText() API was skipping few texts. It wasn't printing all the text in pdf. – Sanket Jan 17 '22 at 03:44
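For the PyMuPDF route suggested in the comments, a rough sketch might look like the following. It assumes PyMuPDF is installed (`pip install pymupdf`, imported as `fitz`) and a version that provides `page.get_text()`. The helper `empty_pages` is a made-up name for this example; it flags pages where extraction returned nothing, which is the symptom reported above.

```python
from typing import List


def empty_pages(page_texts: List[str]) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is blank."""
    return [i for i, text in enumerate(page_texts, start=1) if not text.strip()]


def read_with_pymupdf(path: str) -> List[str]:
    """Return one string per page using PyMuPDF."""
    # Imported here so empty_pages works even without PyMuPDF installed.
    import fitz  # PyMuPDF's import name

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]
```

If `empty_pages(read_with_pymupdf('example.pdf'))` lists every page, the PDF is probably made of scanned images, and switching text-extraction libraries alone is unlikely to help.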
11

You can use the textract module in Python.

Textract

To install:

pip install textract

To read a PDF:

import textract
# process() returns bytes, so decode to get a str
text = textract.process('path/to/pdf/file', method='pdfminer').decode('utf-8')

For details, see the Textract documentation.
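Because `textract.process` returns raw bytes, a little post-processing is usually needed before the text is usable. The sketch below decodes the output and drops blank lines; `bytes_to_lines` and `read_with_textract` are hypothetical helper names, and UTF-8 is assumed as the output encoding.

```python
from typing import List


def bytes_to_lines(raw: bytes, encoding: str = "utf-8") -> List[str]:
    """Decode extractor output and return its non-empty, stripped lines."""
    # errors="replace" keeps going if a byte sequence is not valid UTF-8.
    text = raw.decode(encoding, errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_with_textract(path: str) -> List[str]:
    """Extract a PDF with textract's pdfminer backend and tidy the result."""
    # Imported here so bytes_to_lines works even without textract installed.
    import textract

    return bytes_to_lines(textract.process(path, method="pdfminer"))
```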

Community
Kallz