
I have a PDF that contains a long list of numbers and was compressed using the JBIG2 algorithm. When I look at the internal structure of the file, I can see that my pages are built from two different XObjects:

(Pictured is Adobe Acrobat Preflight -> Internal structure.)

I can easily inspect the first one, called "XIPLAYER0" (not pictured); Preflight even shows me its contents bit by bit if I want. The second one is the one I am interested in, though. In it I can see that the image is built using two "Symbol Dictionaries" (the first one is marked grey). Is it possible to see the individual entries in such a dictionary, or at least get some metadata for just one of them?

Sample PDF (outside link)

SirHawrk
  • Can you include a sample PDF? Also, how do you want to view the symbols, in Acrobat? – Zach Young May 24 '22 at 13:33
  • @ZachYoung I don't really care about where I can see the symbols. I am comfortable with python and I'd guess that would be the most used language for something like this. I also included a sample PDF. It is an outside Link tho – SirHawrk May 24 '22 at 13:40
  • @KJ I am not entirely certain I follow, but I am interested in the specific files as this is a faulty Xerox scan (yes, from that story ~ 9 years ago) – SirHawrk May 24 '22 at 17:25
  • Ah no it really is faulty. The numbers are not the same ones as in the original that was scanned lol – SirHawrk May 24 '22 at 19:25
  • This input of yours is not helpful. I __know__ that it is faulty. I am writing a paper about __why__ it is faulty and what __mistakes__ were made by the printer company – SirHawrk May 25 '22 at 04:53

1 Answer


This is not really about PDF; PDF is just the container for the JBIG2 format and its symbol dictionary, which is what you're really interested in.

But, as a first step, you'll need to get the JBIG2 images out of the PDF:

Extract images from PDF, how to handle JBIG2 encoded

That SO question mentions poppler, and poppler does have a Python binding/wrapper:

https://pypi.org/project/python-poppler/

Once you get those JBIG2 files, maybe this can help:

jbig2_symbol_dict.c

The bigger project has a command-line util which has a "dump" option, but the source says it's not implemented:

case dump:
    fprintf(stderr, "Sorry, segment dump not yet implemented\n");
    break;

So if you're just curious/this is an academic question, the answer looks like "not really". If you need to read the text, how about OCR?
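Short of decoding the actual glyph bitmaps, some per-segment metadata can still be listed from the raw JBIG2 bytes. Below is a minimal Python sketch, not a full decoder. It assumes you already have the raw stream bytes (the image stream or its JBIG2Globals stream) in the embedded organisation that PDF uses, i.e. segments back to back with no JBIG2 file header. The names `parse_segments` and `symbol_dict_counts` are made up for this example, and the byte layout is my reading of the JBIG2 spec (ITU-T T.88) and of jbig2_symbol_dict.c, so please check it against those before relying on it.

```python
import struct
from dataclasses import dataclass

# A few segment type codes from the JBIG2 spec; anything else prints as a number.
SEGMENT_TYPES = {
    0: "symbol dictionary",
    6: "immediate text region",
    7: "immediate lossless text region",
    38: "immediate generic region",
    39: "immediate lossless generic region",
    48: "page information",
    49: "end of page",
    51: "end of file",
}


@dataclass
class Segment:
    number: int
    type: int
    referred_to: list
    page: int
    data: bytes


def parse_segments(buf: bytes) -> list:
    """Walk the segment headers of an embedded JBIG2 stream (all big-endian)."""
    segments = []
    pos = 0
    while pos < len(buf):
        number, flags = struct.unpack_from(">IB", buf, pos)
        pos += 5
        seg_type = flags & 0x3F                  # low 6 bits: segment type
        page_size = 4 if flags & 0x40 else 1     # bit 6: page association size
        count = buf[pos] >> 5                    # short form: count in top 3 bits
        if count == 7:                           # long form: 29-bit count + retain bits
            count = struct.unpack_from(">I", buf, pos)[0] & 0x1FFFFFFF
            pos += 4 + (count + 8) // 8          # i.e. 4 + ceil((count + 1) / 8)
        else:
            pos += 1
        # Width of each referred-to segment number depends on this segment's number.
        ref_size = 1 if number <= 256 else 2 if number <= 65536 else 4
        refs = [int.from_bytes(buf[pos + i * ref_size:pos + (i + 1) * ref_size], "big")
                for i in range(count)]
        pos += count * ref_size
        page = int.from_bytes(buf[pos:pos + page_size], "big")
        pos += page_size
        length = struct.unpack_from(">I", buf, pos)[0]
        pos += 4
        if length == 0xFFFFFFFF:
            raise ValueError("unknown-length segment is not handled in this sketch")
        segments.append(Segment(number, seg_type, refs, page, buf[pos:pos + length]))
        pos += length
    return segments


def symbol_dict_counts(data: bytes) -> tuple:
    """Return (exported symbols, new symbols) from a symbol dictionary segment."""
    flags = struct.unpack_from(">H", data, 0)[0]
    huffman = flags & 1
    refagg = (flags >> 1) & 1
    template = (flags >> 10) & 3
    rtemplate = (flags >> 12) & 1
    pos = 2
    if not huffman:                  # adaptive template pixels, arithmetic coding only
        pos += 8 if template == 0 else 2
    if refagg and not rtemplate:     # refinement adaptive template pixels
        pos += 4
    return struct.unpack_from(">II", data, pos)
```

Looping over `parse_segments(raw_bytes)` and calling `symbol_dict_counts(seg.data)` for every segment whose `type` is 0 would give you the number of symbols each dictionary defines and exports, plus which text regions refer to which dictionary via `referred_to`. Getting at the individual symbol bitmaps still needs a real arithmetic/Huffman decoder, which this sketch does not attempt.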

Zach Young
  • This sadly is an academic question in the sense that I need this for university. I will check these things out tomorrow; I am already at home but big thanks already – SirHawrk May 24 '22 at 17:26