Is it possible to get the bounding boxes for each word with Python?

Asked Jul 13 '17 at 13:41

Active Jul 13 '17 at 13:41

I know that

pdftotext -bbox foobar.pdf

creates an HTML file that contains content like

<word xMin="301.703800" yMin="104.483700" xMax="309.697000" yMax="115.283700">is</word>
<word xMin="313.046200" yMin="104.483700" xMax="318.374200" yMax="115.283700">a</word>
<word xMin="321.603400" yMin="104.483700" xMax="365.509000" yMax="115.283700">universal</word>
<word xMin="368.858200" yMin="104.483700" xMax="384.821800" yMax="115.283700">file</word>
<word xMin="388.291000" yMin="104.483700" xMax="420.229000" yMax="115.283700">format</word>

Hence every single word has its own bounding box.
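
One workaround would be to call the command-line tool from Python and parse its output. This is only a minimal sketch: it assumes `pdftotext` (from Poppler) is on the PATH and that the `<word>` attributes always appear in the order shown above. The helper names `parse_bbox_words` and `words_from_pdf` are made up for illustration. It still shells out to an external program rather than using a Python library.

```python
import html
import re
import subprocess

# Matches one <word ...>text</word> element in the format shown above.
# Assumption: the attributes always come in the order xMin, yMin, xMax, yMax.
WORD_RE = re.compile(
    r'<word xMin="([^"]+)" yMin="([^"]+)" xMax="([^"]+)" yMax="([^"]+)">(.*?)</word>'
)


def parse_bbox_words(bbox_html):
    """Return a list of (text, (x_min, y_min, x_max, y_max)) tuples."""
    words = []
    for x_min, y_min, x_max, y_max, text in WORD_RE.findall(bbox_html):
        box = (float(x_min), float(y_min), float(x_max), float(y_max))
        # Undo HTML escaping such as &amp; in the word text.
        words.append((html.unescape(text), box))
    return words


def words_from_pdf(pdf_path):
    """Run `pdftotext -bbox` on a PDF and parse what it prints.

    Passing "-" as the output file sends the result to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_bbox_words(result.stdout)
```

Calling `words_from_pdf("foobar.pdf")` would then give one tuple per word, but only as long as the external tool is installed.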

In contrast, the Python package PDFMiner seems to be able to give only the position of a whole block of text (see example).

How can I get the bounding boxes for each word in Python?

Martin Thoma

0 Answers