3

I used this code to convert a PDF to text:

input1 = '//Home//Sai Krishna Dubagunta.pdf'
output = '//Home//Me.txt'
os.system(("pdftotext %s %s") %( input1, output))

I have created the Home directory and pasted the source file in it.

The output I get is

1

No .txt file was created. Where is the problem?

Krishna

3 Answers

8

There are various Python packages to extract the text from a PDF with Python.

pdftotext

The pdftotext package seems to work pretty well, but it has no options for things such as extracting bounding boxes.

Installation

For Ubuntu:

sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev

Minimal Working Example

import pdftotext

with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)

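# Just read the second page (pages are indexed from 0;
# PDF.read was removed in pdftotext 2.0)
print(pdf[1])

# Or read all the text at once (PDF.read_all was removed in pdftotext 2.0)
print("\n\n".join(pdf))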

PDF miner

Install it with pip install pdfminer.six. A minimal working example is here.

Martin Thoma
  • `pdf.read(2)` and `pdf.read_all()` won't work from version 2.0 on. The changelog says "Remove PDF.page_count, PDF.read, and PDF.read_all" (https://github.com/jalan/pdftotext/blob/0a1cc5ccefd603ea646bc07afcba4d581a134039/CHANGES.md) – Netro Dec 01 '17 at 06:36
  • This is the best answer. FYI, `pdftotext` requires you to [first install `poppler`, which is a little painful on Windows](https://stackoverflow.com/questions/45912641/unable-to-install-pdftotext-on-python-3-6-missing-poppler) – smci Apr 04 '19 at 22:11
4

Your expression

("pdftotext %s %s") %( input1, output)

will translate to

pdftotext //Home//Sai Krishna Dubagunta.pdf //Home//Me.txt

which means that the first parameter passed to pdftotext is //Home//Sai, and the second parameter is Krishna. That obviously won't work.
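
The splitting described above can be reproduced without running anything. This is a minimal sketch that assumes a POSIX-style shell and uses the standard-library `shlex.split`, which tokenizes a string by those rules; the variable names `command` and `tokens` are only illustrative.

```python
import shlex

input1 = "//Home//Sai Krishna Dubagunta.pdf"
output = "//Home//Me.txt"

# Build the command exactly as in the question, with no quoting.
command = "pdftotext %s %s" % (input1, output)

# shlex.split tokenizes a string using POSIX shell rules,
# so every unquoted space starts a new argument.
tokens = shlex.split(command)
print(tokens)
# → ['pdftotext', '//Home//Sai', 'Krishna', 'Dubagunta.pdf', '//Home//Me.txt']
```

The file name with spaces ends up as three separate arguments, so pdftotext never sees the real path.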

Enclose the parameters in quotes:

os.system("pdftotext '%s' '%s'" % (input1, output))
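
As an alternative to hand-written quotes, here is a hedged sketch that assumes Python 3.7+ and passes the arguments as a list to `subprocess.run`, so no shell is involved and spaces need no quoting. It also captures stderr, which usually says why the exit status was non-zero. The helper names `run_command` and `pdftotext_args` are hypothetical, and the demo uses the Python interpreter as a stand-in command so it runs even where pdftotext is not installed.

```python
import subprocess
import sys


def pdftotext_args(pdf_path, txt_path):
    """Build the argument list for the pdftotext command-line tool."""
    return ["pdftotext", pdf_path, txt_path]


def run_command(args):
    """Run a command given as a list; return (exit status, stderr text)."""
    # Each list element reaches the program as one argument,
    # even when it contains spaces.
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stderr.strip()


# Stand-in command (an assumption for the demo): a child Python process
# that fails with status 1 and writes a message to stderr.
status, err = run_command(
    [sys.executable, "-c", "import sys; sys.exit('no such file')"]
)
print(status, err)
```

With pdftotext installed, the call would be `run_command(pdftotext_args(input1, output))`. A status of 0 means success, and anything else comes with an error message instead of a bare `1`.
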
Tim Pietzcker
  • That didn't work, @Tim Pietzcker. – Krishna May 23 '14 at 05:23
  • "Didn't work" is not really helpful. What exactly were the results when you used that command? I'm not a Unix person, but are there really supposed to be double slashes in paths? What happens if you type `pdftotext '//Home//Sai Krishna Dubagunta.pdf' '//Home//Me.txt'` in the directory that you're running the Python script in? – Tim Pietzcker May 23 '14 at 05:30
  • The double slashes specify a single slash in the input string, the same way that in C we write // to print or specify /. The result is 1, which according to the error codes means Invalid Function. – Krishna May 23 '14 at 05:38
  • @Krishna: Are you sure you're not confusing slashes `"/"` and backslashes `"\"`? – Tim Pietzcker May 23 '14 at 05:39
  • Confused. Always had a problem with that. – Krishna May 23 '14 at 05:42
  • Best to use a raw string `r''` so you don't need to escape backslashes: `r'/Home/Me'`. @TimPietzcker et al.: since 1995, Windows has accepted `/` as equivalent to `\`. – smci Apr 04 '19 at 22:09
  • Anyway it's better to use the Python wrapper to `pdftotext` as @MartinThoma shows. – smci Apr 04 '19 at 22:12
0

I think the pdftotext command takes only one argument. Try using:

os.system(("pdftotext %s") % input1)

and see what happens. Hope this helps.

haraprasadj
  • Then where does the output go? I have to give an output path, right? Some place to store the file. And I get the same output, sorry. – Krishna May 23 '14 at 05:23
  • I came across your question while searching for some info on PDF automation (testing). I based my remark on http://en.wikipedia.org/wiki/Pdftotext, which says: "$ pdftotext file.pdf This usage produces a text file with the same name as the input file. Wildcards (*), for example $ pdftotext *pdf, for converting multiple files, cannot be used because pdftotext expects only one file name." I could have misunderstood the question. – haraprasadj May 23 '14 at 05:59
  • According to a user in another forum, I missed a package that must be installed: [link](http://bytes.com/topic/python/answers/500078-convert-pdf-files-txt-files). I couldn't try it because I don't know how to install that package. I'll try it using PyCharm. – Krishna May 23 '14 at 06:16