I know that

pdftotext -bbox foobar.pdf

creates an HTML file that contains content like

<word xMin="301.703800" yMin="104.483700" xMax="309.697000" yMax="115.283700">is</word>
<word xMin="313.046200" yMin="104.483700" xMax="318.374200" yMax="115.283700">a</word>
<word xMin="321.603400" yMin="104.483700" xMax="365.509000" yMax="115.283700">universal</word>
<word xMin="368.858200" yMin="104.483700" xMax="384.821800" yMax="115.283700">file</word>
<word xMin="388.291000" yMin="104.483700" xMax="420.229000" yMax="115.283700">format</word>

Hence every word has its own bounding box.
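To make the structure of that output concrete, here is a minimal sketch of reading such `<word>` elements into Python tuples with the standard library. It only parses markup that `pdftotext -bbox` has already produced; it does not extract anything from the PDF itself. The function name `parse_word_boxes`, the `SAMPLE` string, and the enclosing `<page>` element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two <word> lines from the output above, wrapped in an
# assumed <page> element so the fragment is well-formed XML.
SAMPLE = """<page>
<word xMin="301.703800" yMin="104.483700" xMax="309.697000" yMax="115.283700">is</word>
<word xMin="313.046200" yMin="104.483700" xMax="318.374200" yMax="115.283700">a</word>
</page>"""


def parse_word_boxes(markup):
    """Return a list of (text, x_min, y_min, x_max, y_max) tuples."""
    root = ET.fromstring(markup)
    boxes = []
    for elem in root.iter():
        # Drop any "{namespace}" prefix so the check also works when the
        # document declares a default XML namespace.
        if elem.tag.rsplit("}", 1)[-1] != "word":
            continue
        boxes.append((
            elem.text or "",
            float(elem.get("xMin")),
            float(elem.get("yMin")),
            float(elem.get("xMax")),
            float(elem.get("yMax")),
        ))
    return boxes


for text, x_min, y_min, x_max, y_max in parse_word_boxes(SAMPLE):
    print(text, x_min, y_min, x_max, y_max)
```

This still depends on running the external `pdftotext` tool first, so it is a workaround rather than a pure-Python solution.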

The Python package PDFMiner, in contrast, seems to give only the position of a whole block of text (see example).

How can I get the bounding boxes for each word in Python?

Martin Thoma