PDFMiner - Get text lines

Question

I'm converting PDF files to text with the PDFMiner Python library, using the code snippet provided in this SO answer. The problem is that the PDF has a three-column layout, and I need to read each line. However, the text I get is out of order: sometimes it mixes the first and second columns, sometimes it mixes in the third. Because the text does not follow any logical order, I can't parse each line. Is there any way to get each individual line of the PDF file using PDFMiner?

EDIT:

PDFMiner comes with a command-line tool, pdf2txt.py, that converts PDF to text. Playing with it and setting the word margin to 0.05, I got better-formatted text, but still could not achieve the goal.
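
For reference, here is a rough sketch of the kind of per-line access being asked about. It rests on several assumptions that are not part of the question: that the pdfminer.six fork is installed (it provides `extract_pages` and `LAParams`), that the page has equal-width columns, and that a line belongs to the column containing its left edge. The helper names `order_lines_by_column` and `pdf_lines` are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Line = Tuple[float, float, str]  # (x0, y1, text): left edge, top edge, content


def order_lines_by_column(
    lines: Iterable[Line], page_width: float, n_columns: int = 3
) -> List[str]:
    """Return line texts column by column, top to bottom within each column.

    Assumption: columns are equally wide and start at x = 0.
    """
    column_width = page_width / n_columns

    def sort_key(line: Line) -> Tuple[int, float]:
        x0, y1, _ = line
        # Clamp so stray coordinates still land in a valid column.
        column = min(max(int(x0 // column_width), 0), n_columns - 1)
        # PDF y coordinates grow upwards, so negate to sort top to bottom.
        return (column, -y1)

    return [text for _, _, text in sorted(lines, key=sort_key)]


def pdf_lines(
    path: str, n_columns: int = 3, word_margin: float = 0.05
) -> Iterator[str]:
    """Yield the text lines of a PDF, page by page, in column order."""
    # Imported here so the sorting helper above works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    laparams = LAParams(word_margin=word_margin)
    for page in extract_pages(path, laparams=laparams):
        lines = [
            (line.x0, line.y1, line.get_text().strip())
            for box in page
            if isinstance(box, LTTextBox)
            for line in box
            if isinstance(line, LTTextLine)
        ]
        yield from order_lines_by_column(lines, page.width, n_columns)
```

Calling `for line in pdf_lines("input.pdf"): print(line)` would then print one line at a time (the file name is a placeholder). Real layouts with unequal columns, headers spanning several columns, or page margins would need different column boundaries than the equal split used here.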

score 0 · Answer 1 · answered Aug 06 '13 at 08:22


I had a similar problem when parsing tables*. What worked for me was to extract HTML. Then you can parse the HTML and use the table tags to recover the structure (see the Python documentation for HTMLParser). I only had tables to find, though.

My two cents :)

*Tables from Word copied into a Qt TextEdit widget. The widget accepts rich text, but the tables would be mucked up if exported as plain text. Exported as HTML, parsed the HTML, got the data :) Did this at work, don't have the code here.
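
Since the original code is not available, the following is only a minimal sketch of the idea, not the answerer's implementation. It uses the standard-library `html.parser.HTMLParser` (the Python 3 location of HTMLParser) and assumes the HTML contains real `<table>`, `<tr>`, `<td>`/`<th>` tags, as in the Qt export described above; HTML produced from a PDF may not contain table markup at all. The names `TableParser` and `parse_table` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableParser(HTMLParser):
    """Collect the cell text of every table row into self.rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # row being built, if any
        self._cell: Optional[List[str]] = None  # text pieces of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def parse_table(html: str) -> List[List[str]]:
    """Return all table rows found in the HTML as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For example, `parse_table("<table><tr><td>a</td><td>b</td></tr></table>")` yields one row with two cells. Nested tables are not handled separately: their rows are simply appended to the same list.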

answered Aug 06 '13 at 08:22

Petter TB


Can you please add a link to the documentation for HTMLParser? Thanks! – yishairasowsky Feb 19 '20 at 09:27
Do you not mean pdfminer.converter.HTMLConverter? The link is https://programtalk.com/python-examples/pdfminer.converter.HTMLConverter/ – yishairasowsky Feb 19 '20 at 09:31
