I'm converting PDF files to text with the PDFMiner Python library, using the code snippet from this SO answer. The PDF is laid out in three columns, and I need to read each line. However, the text I get is out of order: sometimes it mixes the first and second columns, sometimes the third one gets mixed in as well. Since the text doesn't follow any logical order, I can't parse it line by line. Is there any way to get each individual line of the PDF file using PDFMiner?

EDIT:

PDFMiner comes with a command line tool, pdf2txt.py, to convert PDF to text. Playing with it and setting the word margin to 0.05, I got better-formatted text, but still could not achieve the goal.

davids

1 Answer

I had a similar problem when parsing tables*. What worked for me was to extract HTML instead of plain text. You can then parse the HTML and use the table tags to recover the structure (see the Python documentation for HTMLParser). I only had tables to find, though.
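Roughly, the parsing step could look like the sketch below. This is not the code I used at work, just a minimal illustration that assumes Python 3 (where the module is `html.parser`) and assumes the exported HTML really contains `<table>`, `<tr>` and `<td>`/`<th>` tags. The names `TableExtractor` and `extract_tables` are made up for the example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> as a list of rows.

    Hypothetical helper for illustration; nested tables are not supported.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, each a list of rows
        self._rows = None   # rows of the table being read, or None
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None

    def handle_data(self, data):
        # Text outside of a table cell is ignored.
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    """Return all tables found in the HTML string as nested lists."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Calling `extract_tables(html_string)` gives one list per table, each holding one list of cell strings per row, so every row can then be handled on its own.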

My two cents :)

*Tables from Word copied into a Qt TextEdit widget. The widget accepts rich text, but the tables got mucked up when exported as plain text. Exported as HTML, parsed the HTML, got the data :) I did this at work, so I don't have the code here.

Petter TB