I'd like to print all objects present in a PDF file: text blocks, images, fonts, page objects, but also vector shapes (if any).
I hoped to see all of them with PyMuPDF:
import fitz
doc = fitz.open('test.pdf')
for xref in range(1, doc.xref_length()):
    print(doc.xref_object(xref))
but not everything is there. For example, the text is missing. Text can be obtained with doc.loadPage(page_number).getText('dict'), but I'm looking for a general method rather than one method for text elements, another for other objects, etc.
Question: how to print all objects present in a PDF file? (text blocks, images, vector shapes, etc.)
Notes:
I've already read "How to extract text from a PDF file?" and similar questions, but these are specific to text, whereas I'm looking for all objects / attributes.
I have also read "How to open PDF raw?", but it did not help here.
When opening a PDF with a text editor, we see a lot of human-unreadable binary data (it seems that it is not only for images).
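To illustrate that last note: as far as I understand, much of this binary data is stream content compressed with the Flate (zlib) filter, and page content such as text operators is stored in such streams. Below is a minimal sketch of what I mean, using only the standard library. The helper name iter_flate_streams and the hand-built sample bytes are my own assumptions for illustration; the regex scan is naive, is not a real PDF parser, and ignores other filters, encryption and objects packed inside object streams.

```python
import re
import zlib

# Naive pattern: "<num> <gen> obj", a dictionary that does not cross an
# "endobj", then the raw bytes between "stream" and "endstream".
STREAM_RE = re.compile(
    rb"(\d+)\s+\d+\s+obj\s*(<<(?:(?!endobj).)*?>>)\s*stream\r?\n(.*?)endstream",
    re.DOTALL,
)


def iter_flate_streams(pdf_bytes):
    """Yield (object_number, decompressed_bytes) for Flate-compressed streams.

    Hypothetical helper for illustration only, not a full PDF parser.
    """
    for match in STREAM_RE.finditer(pdf_bytes):
        number, dictionary, raw = match.groups()
        if b"/FlateDecode" not in dictionary:
            continue  # other filters (e.g. image codecs) are skipped here
        try:
            # decompressobj tolerates the end-of-line bytes that trail the
            # compressed data just before "endstream".
            yield int(number), zlib.decompressobj().decompress(raw)
        except zlib.error:
            continue  # not plain zlib data after all


# Hand-built fragment that mimics one compressed page content stream.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"4 0 obj\n<< /Length " + str(len(compressed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed + b"\nendstream\nendobj\n"
)

for number, data in iter_flate_streams(sample):
    print(number, data)
```

On a real file, pdf_bytes would be the result of open('test.pdf', 'rb').read(). What I am after is a general way to get at this decompressed content for every object, not only the object dictionaries.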