
PDFMiner's documentation says:

PDFMiner allows one to obtain the exact location of text in a page

However, I have not been able to find out how to do this. PDFMiner's documentation is rather sparse and does not seem to cover it.

technillogue
  • Possible duplicate of [How to extract text and text coordinates from a pdf file?](https://stackoverflow.com/questions/22898145/how-to-extract-text-and-text-coordinates-from-a-pdf-file) – Martin Thoma Jul 13 '17 at 13:36

2 Answers


You are looking for the bbox property on every layout object. There is a little bit of information on how to parse the layout hierarchy in the PDFMiner documentation, but it doesn't cover everything.

Here's an example:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure


def parse_layout(layout):
    """Function to recursively parse the layout tree."""
    for lt_obj in layout:
        print(lt_obj.__class__.__name__)
        print(lt_obj.bbox)
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            print(lt_obj.get_text())
        elif isinstance(lt_obj, LTFigure):
            parse_layout(lt_obj)  # Recursive


# Keep the file open for the whole loop: pages are read lazily from fp.
with open('example.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()  # default layout-analysis parameters
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()  # layout tree of the page just processed
        parse_layout(layout)

If you are interested in the location of individual LTChar objects, you can recursively parse into the child layout objects of LTTextBox and LTTextLine just like what is done with LTFigure in the above example.
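
The recursion described above could be sketched roughly as follows. This is only an illustration, not part of PDFMiner: `walk_layout` is a hypothetical helper name, and it relies on duck typing (anything with a `bbox` is reported, anything iterable is descended into) rather than on specific `LT*` classes. Objects without a `bbox`, such as `LTAnno`, are skipped.

```python
from typing import Any, Iterator, Tuple

BBox = Tuple[float, float, float, float]


def walk_layout(obj: Any, depth: int = 0) -> Iterator[Tuple[int, str, BBox, str]]:
    """Yield (depth, class name, bbox, text) for obj and all its descendants.

    Hypothetical helper: it only assumes that layout objects may expose
    `bbox`, may expose `get_text()`, and may be iterable over children.
    """
    bbox = getattr(obj, "bbox", None)
    if bbox is not None:
        get_text = getattr(obj, "get_text", None)
        text = get_text() if callable(get_text) else ""
        yield depth, type(obj).__name__, tuple(bbox), text

    try:
        children = iter(obj)
    except TypeError:
        return  # leaf object (not iterable), nothing to descend into

    for child in children:
        yield from walk_layout(child, depth + 1)
```

Assuming `layout` is the object returned by `device.get_result()` in the example above, `for depth, name, bbox, text in walk_layout(layout): ...` would then visit every box, line and character in turn.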

Matt Swain
  • 1) Could you explain what LAParams() does, please? 2) Isn't it more pythonic to try to get text and then try to recurse rather than using isinstance? – technillogue Aug 12 '14 at 16:27
  • Aren't there other types of containers other than LTFigure? – technillogue Aug 12 '14 at 16:28
  • LAParams contains the parameters used for the layout analysis that merges characters into words and lines based on their locations. You can pass initialization parameters like line_overlap, char_margin, line_margin, word_margin, boxes_flow, detect_vertical. See PDFMiner docs for explanation and default values. – Matt Swain Aug 12 '14 at 16:38
  • Other than `LTFigure` there's also `LTTextBox` that contains `LTTextLine` which in turn contains `LTChar` and `LTAnno`. The [PDFMiner docs](https://euske.github.io/pdfminer/programming.html) have a diagram of the hierarchy. – Matt Swain Aug 12 '14 at 16:39
  • Things seem to work without passing LAParams, so why are they needed? Isn't it more Pythonic to use EAFP rather than isinstance? – technillogue Aug 12 '14 at 17:01
  • `LAParams` is really just a way to modify the parameters used by the layout analyser. It's good practice to pass it to `PDFPageAggregator` even if you just use the default parameters, because otherwise some of the layout analysis may not be performed. You probably can make my `parse_layout` function more Pythonic. Every `LT*` object should be iterable even if it doesn't have any children, so the `LTFigure` isinstance check is probably unnecessary. Similarly, you could just attempt `get_text()` for all and catch the failure if it's not implemented on that `LT*` object. – Matt Swain Aug 13 '14 at 12:06
  • Is there any way to parse just the first LTTextBox of each page? (Actually, I want the box header.) – sunny Jan 24 '18 at 21:38
  • What's your basis for thinking that recursing into `LTFigure`s like this works? Over at https://stackoverflow.com/a/53360415/1709587, I claim it's broken because an `LTFigure` cannot contain an `LTTextBox`... but if I'm wrong, I'd appreciate you proving me so. – Mark Amery Nov 18 '18 at 11:44
  • Rather than using `LTTextBox`, is there another parameter that will just find coordinates for individual words? – Starbucks Dec 04 '19 at 20:17
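
On the last comment, about coordinates for individual words: one possible approach is to group the characters of a line into words yourself and take the union of their boxes. The sketch below is an illustration, not a PDFMiner feature. `words_with_bboxes` is a hypothetical helper, and it assumes the input is a sequence of `(text, bbox)` pairs where `bbox` is `None` for items without a position (as with `LTAnno`).

```python
from typing import Iterable, List, Optional, Tuple

BBox = Tuple[float, float, float, float]


def words_with_bboxes(
    chars: Iterable[Tuple[str, Optional[BBox]]]
) -> List[Tuple[str, BBox]]:
    """Group (text, bbox) pairs into words, splitting on whitespace.

    Hypothetical helper: whitespace characters and items whose bbox is
    None end the current word and are not part of any word's box.
    """
    words: List[Tuple[str, BBox]] = []
    text = ""
    box: Optional[BBox] = None

    for ch, bbox in chars:
        if bbox is None or not ch.strip():
            if text and box is not None:
                words.append((text, box))
            text, box = "", None
            continue
        text += ch
        if box is None:
            box = bbox
        else:
            # Union of the word's box so far and this character's box.
            box = (min(box[0], bbox[0]), min(box[1], bbox[1]),
                   max(box[2], bbox[2]), max(box[3], bbox[3]))

    if text and box is not None:
        words.append((text, box))
    return words
```

For a given `LTTextLine` object `line`, the input pairs could be built with something like `[(c.get_text(), getattr(c, "bbox", None)) for c in line]`. Note that the result depends on how the layout analysis inserted spaces, so text with wide letter spacing may come out as one-letter "words".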

Full disclosure: I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for Python 3.

Nowadays, pdfminer.six has multiple APIs to extract text and information from a PDF. For programmatically extracting information I would advise using extract_pages(). This allows you to inspect all of the elements on a page, ordered in a meaningful hierarchy created by the layout algorithm.

The following example is a Pythonic way of showing all the elements in the hierarchy. It uses the simple1.pdf from the samples directory of pdfminer.six.

from pathlib import Path
from typing import Iterable, Any

from pdfminer.high_level import extract_pages


def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        x1  y1  x2  y2   text')
        print('------------------------------ --- --- --- ---- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_bbox(o)} '
        f'{get_optional_text(o)}'
    )

    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)


def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of LTItem"""
    return '  ' * depth + o.__class__.__name__


def get_optional_bbox(o: Any) -> str:
    """Bounding box of LTItem if available, otherwise empty string"""
    if hasattr(o, 'bbox'):
        return ''.join(f'{i:<4.0f}' for i in o.bbox)
    return ''


def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return o.get_text().strip()
    return ''


path = Path('~/Downloads/simple1.pdf').expanduser()

pages = extract_pages(path)
show_ltitem_hierarchy(pages)

The output shows the elements in the hierarchy, the bounding box of each, and the text that each element contains.

element                        x1  y1  x2  y2   text
------------------------------ --- --- --- ---- -----
generator                       
  LTPage                       0   0   612 792  
    LTTextBoxHorizontal        100 695 161 719  Hello
      LTTextLineHorizontal     100 695 161 719  Hello
        LTChar                 100 695 117 719  H
        LTChar                 117 695 131 719  e
        LTChar                 131 695 136 719  l
        LTChar                 136 695 141 719  l
        LTChar                 141 695 155 719  o
        LTChar                 155 695 161 719  
        LTAnno                  
    LTTextBoxHorizontal        261 695 324 719  World
      LTTextLineHorizontal     261 695 324 719  World
        LTChar                 261 695 284 719  W
        LTChar                 284 695 297 719  o
        LTChar                 297 695 305 719  r
        LTChar                 305 695 311 719  l
        LTChar                 311 695 324 719  d
        LTAnno                  
    LTTextBoxHorizontal        100 595 161 619  Hello
      LTTextLineHorizontal     100 595 161 619  Hello
        LTChar                 100 595 117 619  H
        LTChar                 117 595 131 619  e
        LTChar                 131 595 136 619  l
        LTChar                 136 595 141 619  l
        LTChar                 141 595 155 619  o
        LTChar                 155 595 161 619  
        LTAnno                  
    LTTextBoxHorizontal        261 595 324 619  World
      LTTextLineHorizontal     261 595 324 619  World
        LTChar                 261 595 284 619  W
        LTChar                 284 595 297 619  o
        LTChar                 297 595 305 619  r
        LTChar                 305 595 311 619  l
        LTChar                 311 595 324 619  d
        LTAnno                  
    LTTextBoxHorizontal        100 495 211 519  H e l l o
      LTTextLineHorizontal     100 495 211 519  H e l l o
        LTChar                 100 495 117 519  H
        LTAnno                  
        LTChar                 127 495 141 519  e
        LTAnno                  
        LTChar                 151 495 156 519  l
        LTAnno                  
        LTChar                 166 495 171 519  l
        LTAnno                  
        LTChar                 181 495 195 519  o
        LTAnno                  
        LTChar                 205 495 211 519  
        LTAnno                  
    LTTextBoxHorizontal        321 495 424 519  W o r l d
      LTTextLineHorizontal     321 495 424 519  W o r l d
        LTChar                 321 495 344 519  W
        LTAnno                  
        LTChar                 354 495 367 519  o
        LTAnno                  
        LTChar                 377 495 385 519  r
        LTAnno                  
        LTChar                 395 495 401 519  l
        LTAnno                  
        LTChar                 411 495 424 519  d
        LTAnno                  
    LTTextBoxHorizontal        100 395 211 419  H e l l o
      LTTextLineHorizontal     100 395 211 419  H e l l o
        LTChar                 100 395 117 419  H
        LTAnno                  
        LTChar                 127 395 141 419  e
        LTAnno                  
        LTChar                 151 395 156 419  l
        LTAnno                  
        LTChar                 166 395 171 419  l
        LTAnno                  
        LTChar                 181 395 195 419  o
        LTAnno                  
        LTChar                 205 395 211 419  
        LTAnno                  
    LTTextBoxHorizontal        321 395 424 419  W o r l d
      LTTextLineHorizontal     321 395 424 419  W o r l d
        LTChar                 321 395 344 419  W
        LTAnno                  
        LTChar                 354 395 367 419  o
        LTAnno                  
        LTChar                 377 395 385 419  r
        LTAnno                  
        LTChar                 395 395 401 419  l
        LTAnno                  
        LTChar                 410 395 424 419  d
        LTAnno                  
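
The boxes printed above are `(x0, y0, x1, y1)` tuples in PDF points, and as far as I know the origin is at the bottom-left of the page, which is why the first line of text has the largest y values. Many image and GUI toolkits put the origin at the top-left instead, so a conversion along these lines may be useful. `to_top_left` is a hypothetical helper name, and the page height is assumed to come from the `LTPage` box (792 in the output above).

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def to_top_left(bbox: BBox, page_height: float) -> BBox:
    """Convert a bottom-left-origin bbox to top-left-origin coordinates.

    Hypothetical helper: returns (left, top, right, bottom) with y
    measured downwards from the top edge of the page.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


# The first "Hello" text box from the output above, on a 792-point-high page.
print(to_top_left((100, 695, 161, 719), 792))  # (100, 73, 161, 97)
```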

(Similar answers and questions here, here and here; I'll try to keep them in sync.)

Pieter