Extracting images from pdf using Python

Question

How can I extract the images (and only the images) from a PDF?

I have tried many online tools, but none of them works universally. For most PDFs, they take a screenshot of the whole image instead of the image. PDF link -> sg.inflibnet.ac.in:8080/jspui/bitstream/10603/121661/9/09_chapter 4.pdf

I have used some websites: http://www.pdfaid.com/ExtractImages.aspx https://pdfcandy.com/extract-images.html https://www.pdf-online.com/osa/extract.aspx — Yash Sharma, May 30 '19 at 09:09
"whole image instead of the image" what do you mean by this? I would really recommend you post screenshots showing what you got, and clearly indicating what you wanted to get. — Ryan, Jun 06 '19 at 06:52

score 3 · Answer 1 · answered May 30 '19 at 09:07

Here is some code that reads a PDF file using pyPdf, extracts the images, and yields each one as a PIL.Image. You will need to adapt it to your needs; it is only here to demonstrate how to walk the object tree.

import io
import pyPdf
import PIL.Image


def iter_pdf_images(infile_name):
    """Yield every embedded image of a PDF as a PIL.Image."""
    with open(infile_name, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)

            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap
                    buf = vobj.getData()
                    # Notice that we need metadata from the object
                    # so we can make sense of the image data
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    # 'RGB' assumes 8-bit DeviceRGB samples
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A compressed (JPEG) image
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img


infile_name = 'my.pdf'
for i, img in enumerate(iter_pdf_images(infile_name)):
    # Do something with `img` here, e.g. img.save(...)
    print('%d: %s %s' % (i, img.size, img.mode))
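
The FlateDecode branch above hard-codes the 'RGB' mode, which only fits bitmaps stored as 8-bit DeviceRGB samples. As a rough sketch of how the object's metadata could be used to pick the mode instead, the helper below maps a few device colour spaces to PIL modes and checks that the buffer length matches. The function name `pil_mode_for` and the mapping table are illustrative assumptions, not part of pyPdf or PIL, and indexed or ICC-based colour spaces are simply reported as unsupported.

```python
# Assumed mapping from PDF device colour spaces to (PIL mode, channels).
# Only 8-bit samples are covered; anything else is treated as unsupported.
_MODES = {
    '/DeviceRGB': ('RGB', 3),
    '/DeviceGray': ('L', 1),
    '/DeviceCMYK': ('CMYK', 4),
}


def pil_mode_for(color_space, bits_per_component, width, height, data_len):
    """Return a PIL mode for a raw bitmap, or None if it cannot be decoded naively."""
    # Array-valued colour spaces (e.g. indexed or ICC-based) are not handled here.
    if not isinstance(color_space, str):
        return None
    if bits_per_component != 8:
        return None
    entry = _MODES.get(color_space)
    if entry is None:
        return None
    mode, channels = entry
    # The decoded buffer must hold exactly one byte per channel per pixel.
    if data_len != width * height * channels:
        return None
    return mode
```

In the loop above, this could be called with `vobj.get('/ColorSpace')`, `vobj.get('/BitsPerComponent')`, the image size, and `len(buf)`, skipping the image when it returns None rather than producing a garbled bitmap.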

Welcome to StackOverflow. Please frame your questions and comments with as much information as possible. "It does not work" is in no way helpful. Update your question to state what precisely you've tried and what "it does not work" mean. — user2722968, May 30 '19 at 09:14
It works. I have some installation issues with pyPdf. I used PyPDF2 instead of pyPdf and replaced "yield" parts with img.save(..). (Win x64 - Python 3.8) — M.Selman SEZGİN, Jan 19 '20 at 13:23

score 3 · Answer 2 · answered Jun 07 '19 at 20:29

Here's a solution with PyMuPDF:

#!python3.6
import fitz  # PyMuPDF


def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.pageCount):
        for image in doc.getPageImageList(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.writePNG(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')
write_pixmaps_to_pngs(pixmaps)
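
The snippet above uses the camelCase method names that PyMuPDF offered in 2019. Later releases renamed them to snake_case, so on a current install a variant along the following lines may be needed. This is only a sketch: it assumes a PyMuPDF version that provides `page_count`, `get_page_images` and `extract_image`, and the helper names `image_filename` and `extract_images` are made up for illustration. It writes each image in the format in which it is stored instead of converting it to PNG.

```python
from pathlib import Path


def image_filename(pdf_path, xref, ext):
    """Build an output name such as 'report_xref12.png' (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}_xref{xref}.{ext}"


def extract_images(pdf_path, out_dir="."):
    """Write each distinct embedded image to out_dir and return the paths written."""
    # Imported lazily so image_filename() stays usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    seen = set()
    with fitz.open(pdf_path) as doc:
        for page_index in range(doc.page_count):
            for image in doc.get_page_images(page_index):
                xref = image[0]
                if xref in seen:  # the same image can be used on several pages
                    continue
                seen.add(xref)
                info = doc.extract_image(xref)  # dict with the raw bytes and extension
                target = out_dir / image_filename(pdf_path, xref, info["ext"])
                target.write_bytes(info["image"])
                written.append(target)
    return written
```

Calling `extract_images('09_chapter 4.pdf', 'images')` would then leave files such as `09_chapter 4_xref12.png` in the `images` folder, with the xref number depending on the document.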
