
I'd like to print all objects present in a PDF file: text blocks, images, fonts, page objects, but also vector shapes (if any).

I hoped to see all of them with PyMuPDF:

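import fitz  # PyMuPDF

doc = fitz.open('test.pdf')
# xref 0 is reserved, so numbering starts at 1
for xref in range(1, doc.xref_length()):
    # prints the object's definition (its dictionary source)
    print(doc.xref_object(xref))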

but not everything is there. For example, text is not there. Text can be obtained with doc.load_page(page_number).get_text('dict'), but I'm looking for a general method, rather than one specific to text elements, another for other objects, etc.

Question: how to print all objects present in a PDF file? (text blocks, images, vector shapes, etc.)

Notes:

  • I've already read How to extract text from a PDF file? and similar questions, but these are specific to text, whereas I'm looking for all objects / attributes.

  • I've already read How to open PDF raw? but it did not help here.

  • When opening a PDF with a text editor, we see a lot of human-unreadable binary data (it seems that it is not only for images).
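
On that last note: as far as I understand, much of that binary data is Flate (zlib) compressed stream content, and page content streams (where text and vector drawing operators live) are usually among them. Here is a rough, standard-library-only sketch of how such streams could be inflated. It assumes an unencrypted file, it only handles the Flate filter, and it scans with a regex instead of following the xref table; `iter_inflated_streams` is a hypothetical helper name, not part of any library.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each object.
# Assumption: a regex scan is good enough for a quick look; a real parser
# would follow the xref table and each stream's /Length entry instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def iter_inflated_streams(pdf_bytes):
    """Yield (index, inflated_bytes) for each stream that zlib can inflate."""
    for index, match in enumerate(STREAM_RE.finditer(pdf_bytes)):
        try:
            # decompressobj() tolerates the end-of-line bytes before "endstream".
            yield index, zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate data (for example a JPEG image) or another filter is used.
            continue
```

Reading the file with `open('test.pdf', 'rb').read()` and looping over `iter_inflated_streams(...)` should show, for page content streams, operators such as `Tj` (show text) or `re` and `l` (rectangle and line paths). This still gives raw operators rather than a unified list of objects.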

Basj
  • You can't. Or at least, you won't be able to do it perfectly in most situations. First, the PDF format is layout-based, not object-based. Second, doing that "sequentially" does not mean anything in this context, since the order depends on the layout, which ultimately depends on each user. Let's say you have a two-column document with a footnote: the footnote is likely to end up right after the first column and pollute the output. So, to my knowledge, there is no simple way to do this, and any solution must be specific to the kind of document you're working with. – Synthase Nov 15 '21 at 11:31
  • @Synthase thanks for your comment. I edited and removed the word 'sequentially' from my question, because in fact I don't necessarily care about the layout. As long as I can get every element in a big loop it's fine (if optionally, I can get the x,y position it's great, but not necessary). – Basj Nov 15 '21 at 11:36
