Extract a page from a pdf as a jpeg

Question

In Python, how can I efficiently save a specific page of a PDF as a JPEG file? (Use case: I have a Python Flask web server where PDFs will be uploaded and a JPEG corresponding to each page is stored.)

This solution is close, but the problem is that it does not convert the entire page to jpeg.

Depending on the image, it may be better to extract as a png. This would apply if the page contains mainly text. — Paul Rooney, Jun 18 '20 at 05:54

score 189 · Accepted Answer · edited Jan 21 '20 at 17:42

The pdf2image library can be used.

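Install it with:

pip install pdf2image

Once installed, the following code converts the PDF into a list of page images (500 is the DPI):

from pdf2image import convert_from_path
pages = convert_from_path('pdf_file', 500)

Save the pages in JPEG format, giving each page its own file name so that later pages do not overwrite earlier ones:

for i, page in enumerate(pages, start=1):
    page.save('out_%d.jpg' % i, 'JPEG')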

Edit: the Github repo pdf2image also mentions that it uses pdftoppm and that it requires other installations:

pdftoppm is the piece of software that does the actual magic. It is distributed as part of a greater package called poppler. Windows users will have to install poppler for Windows. Mac users will have to install poppler for Mac. Linux users will have pdftoppm pre-installed with the distro (Tested on Ubuntu and Archlinux) if it's not, run sudo apt install poppler-utils.

You can install the latest version under Windows using anaconda by doing:

conda install -c conda-forge poppler

Note: Windows versions up to 0.67 are available at http://blog.alivate.com.au/poppler-windows/ but 0.68 was released in Aug 2018, so you will not be getting the latest features or bug fixes.
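
Since the question asks for a single page, here is a minimal sketch of that case. It assumes pdf2image's `first_page` and `last_page` arguments (1-based), which are also mentioned in the comments below. The helper names `page_output_path` and `save_page_as_jpeg` and the `<name>-page<n>.jpg` naming scheme are made up for illustration, and the default DPI of 200 is only an assumption.

```python
from pathlib import Path


def page_output_path(pdf_path, page_number, out_dir="."):
    """Build a distinct JPEG path such as out/report-page3.jpg (hypothetical naming scheme)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}-page{page_number}.jpg"


def save_page_as_jpeg(pdf_path, page_number, out_dir=".", dpi=200):
    """Render one page (1-based) with pdf2image and save it as a JPEG."""
    # Imported here so the naming helper above can be used without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(
        str(pdf_path), dpi=dpi, first_page=page_number, last_page=page_number
    )
    target = page_output_path(pdf_path, page_number, out_dir)
    pages[0].save(target, "JPEG")
    return target
```

Calling `save_page_as_jpeg("upload.pdf", 3, "static/pages")` would then write only the third page, provided Poppler is installed as described above.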

edited Jan 21 '20 at 17:42 by Sam Mason

answered Feb 02 '18 at 12:51 by Keval Dave

5

Hi, the poppler is just a zipped file, doesn't install anything, what is one supposed to do with the dll's or the bin files ? – gaurwraith Aug 26 '18 at 21:59
@gaurwraith: Use the following [link to poppler](https://blog.alivate.com.au/poppler-windows/). For some reason the link in the description from Rodrigo is not the same as in the github repo. – Tobias Oct 09 '18 at 07:20
@Keval Dave Have you installed poppler and tried pdf2image on Windows machine? Which Windows please? – SKR Nov 27 '18 at 15:08
@SKR I have used this with windows 10 and 64bit machine. Find installation of poppler in windows from answer. – Keval Dave Nov 29 '18 at 09:56
This packages gives a white border to the image so removed it following this [stackoverflow question](https://stackoverflow.com/questions/10615901/trim-whitespace-using-pil?answertab=votes#tab-top) – hru_d May 06 '19 at 14:00
I've installed it but got an error: `jpeg8.dll` not found – Peter.k May 29 '19 at 11:20
I've pretty easily run out of memory doing this - anyone know of a way to just convert a single page (without loading the whole thing, then just using [0] or something)? – elPastor Jun 04 '19 at 23:16
1

@elPastor you can pass first_page and last_page as arguments to the convert_from_path function to convert only the specified pages – Keval Dave Jun 05 '19 at 09:57
Thanks for the heads up on those arguments, however I still get the same issue (I believe it's with memory, the traceback isn't helpful). I'm wondering if `first_page` / `last_page` still requires loading the full PDF into memory and then internally just parses out the required pages. – elPastor Jun 05 '19 at 10:22
Is the '500' the dpi? Just wondering what your reason for going to 500 dpi would be, it looks like 300 is the standard. – Sam Jul 25 '19 at 01:09
1

@Jacob 500 is the dpi. It is a tradeoff between the resolution required and the computation available. In my experiments, 500 worked well in most cases, while 300 gave me low-resolution images. – Keval Dave Jul 25 '19 at 08:41
1

I used `conda install -c conda-forge poppler` to install poppler and it worked. – MNA Sep 18 '19 at 08:43
2

For converting the first page of the PDF and nothing else, this works: `from pdf2image import convert_from_path; pages = convert_from_path('file.pdf', 500, single_file=True); pages[0].save('file.jpg', 'JPEG')` – helgis Nov 12 '19 at 09:37
And there is a nice line in poppler docs: "You will then have to add the bin/ folder to PATH or use poppler_path = r"C:\path\to\poppler-xx\bin" as an argument in convert_from_path." though in my case (conda install) it was actually C:\ProgramData\Anaconda3\pkgs\poppler-21.09.0-h24fffdf_1\Library\bin. – Rustam A. Oct 31 '21 at 21:21
If using mac, you can install both packages needed using conda `conda install poppler` `conda install pdf2image` – Emad Goahri Nov 06 '21 at 22:30
It gets stuck on some PDFs – Elia Weiss Apr 11 '22 at 09:19

score 103 · Answer 2 · edited Nov 02 '21 at 09:50

I found this simple solution using PyMuPDF, which writes the page to a PNG file. Note the library is imported as "fitz", a historical name for the rendering engine it uses.

import fitz

pdffile = "infile.pdf"
doc = fitz.open(pdffile)
page = doc.load_page(0)  # zero-based page index
pix = page.get_pixmap()
output = "outfile.png"
pix.save(output)
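
The default pixmap is rendered at 72 DPI, which is often too coarse. The sketch below scales the page with `fitz.Matrix`, following the 150/72 approach mentioned in the comments. The helper names `zoom_for_dpi` and `render_page_png` are hypothetical, and the snake_case method names assume a reasonably recent PyMuPDF.

```python
def zoom_for_dpi(dpi):
    """PDF user space uses 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72


def render_page_png(pdf_path, page_index, out_path, dpi=150):
    """Render one page (zero-based index) at the requested DPI and write a PNG."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    doc = fitz.open(pdf_path)
    try:
        page = doc.load_page(page_index)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(out_path)
    finally:
        doc.close()
    return out_path
```

For example, `render_page_png("infile.pdf", 0, "outfile.png", dpi=300)` would render the first page at roughly four times the default linear resolution.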

edited Nov 02 '21 at 09:50 by Jossef Harush Kadouri

answered Apr 02 '19 at 17:27 by JJPty

2

Please add explanation to your answer. – Shanteshwar Inde Apr 02 '19 at 17:31
3

A good library and it installs on Windows 10 without problems (no wheels required). https://github.com/pymupdf – Comrade Che Jan 23 '20 at 09:27
19

This is the BEST answer. This was the only code that didn't require an additional installation onto my OS. Python scripts should focus on working within the Python system. I did not need to install poppler, pdftoppm, imageMagick or ghostscript, etc. (Python 3.6) – ZStoneDPM Feb 04 '20 at 22:11
6

Actually it requires another installation (fitz library, imported without even being referred to and its dependencies), this answer is incomplete (like all of the answers at this question) – Tommaso Guerrini Feb 06 '20 at 12:36
1

@TommasoGuerrini no. From the docs: "The standard Python import statement for this library is import fitz. This has a historical reason..." is another library, something about neuroimaging. The code works as expected. – TEH EMPRAH Feb 18 '20 at 08:49
1

@JJPty Instead of pdf file taken from the path, can we take from pdfurl? Also, is it possible for the png file to be in-stream data rather than output-png file? – Shubham Agrawal Mar 04 '20 at 06:23
6

`image = page.getPixmap(matrix=fitz.Matrix(150/72,150/72))` extracts the image at 150 DPI. [Issue question on this topic.](https://github.com/pymupdf/PyMuPDF/issues/181) – Josiah Yoder Jul 20 '20 at 21:21
2

This solution uses code licensed commercially by Artifex Software, as well as open-source under the AGPL. Be wary of using this on your project, especially if it's commercial in nature. You may need to dig deeper into the legal implications. – Milo Persic Mar 07 '21 at 18:44
The perfect solution: it needs no dependencies, no Poppler, nothing else – Zain Ul Abidin Apr 02 '22 at 12:16

Basj · Answer 3 · 2021-12-10T07:32:11.597

24

The Python library pdf2image (used in the other answer) in fact doesn't do much more than launch pdftoppm with subprocess.Popen, so here is a short version doing it directly:

PDFTOPPMPATH = r"D:\Documents\software\____PORTABLE\poppler-0.51\bin\pdftoppm.exe"
PDFFILE = "SKM_28718052212190.pdf"

import subprocess
subprocess.Popen('"%s" -png "%s" out' % (PDFTOPPMPATH, PDFFILE))

Here is the Windows installation link for pdftoppm (contained in a package named poppler): http://blog.alivate.com.au/poppler-windows/.
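
To render just one page as a JPEG, pdftoppm's `-f`, `-l`, `-r`, `-jpeg` and `-singlefile` options can be combined. The following is a sketch under assumptions: the function names are hypothetical, `pdftoppm` is assumed to be on PATH unless a full path is passed, and the default of 150 DPI is arbitrary.

```python
import subprocess


def build_pdftoppm_command(pdf_path, out_prefix, page, dpi=150, pdftoppm="pdftoppm"):
    """Build the argument list for rendering one page (1-based) as a JPEG."""
    return [
        pdftoppm,
        "-jpeg",              # write JPEG instead of the default PPM
        "-r", str(dpi),       # resolution in DPI
        "-f", str(page),      # first page to convert
        "-l", str(page),      # last page to convert
        "-singlefile",        # write <out_prefix>.jpg without a page-number suffix
        pdf_path,
        out_prefix,
    ]


def render_page(pdf_path, out_prefix, page, dpi=150, pdftoppm="pdftoppm"):
    """Run pdftoppm and return the name of the JPEG it is expected to write."""
    command = build_pdftoppm_command(pdf_path, out_prefix, page, dpi, pdftoppm)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return out_prefix + ".jpg"
```

Passing the arguments as a list avoids manual quoting of paths that contain spaces, and `check=True` surfaces a failed conversion instead of silently continuing.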

edited Dec 10 '21 at 07:32

answered May 22 '18 at 21:33 by Basj

4

Hi, the Windows installation link for pdftoppm is just a bunch of zipped files, what do you have to do with them to make them work ? Thanks! – gaurwraith Aug 27 '18 at 11:05

score 16 · Answer 4 · edited Dec 05 '19 at 11:25

There is no need to install Poppler on your OS. This will work:

pip install Wand

from wand.image import Image

f = "somefile.pdf"
with Image(filename=f, resolution=120) as source:
    for i, image in enumerate(source.sequence):
        newfilename = f[:-4] + str(i + 1) + '.jpeg'
        Image(image).save(filename=newfilename)
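
If only one page is wanted, ImageMagick's `file.pdf[index]` notation (zero-based) can be used so that the other pages are not read. This is a sketch, not a tested recipe: the helper names are hypothetical, and as the comments below point out, Wand still needs ImageMagick and, for PDFs, Ghostscript to be installed.

```python
def pdf_page_spec(pdf_path, page_index):
    """ImageMagick reads a single page when the file name ends in [index] (zero-based)."""
    return f"{pdf_path}[{page_index}]"


def save_page_with_wand(pdf_path, page_index, out_path, resolution=120):
    """Render one page of a PDF through Wand and save it as a JPEG."""
    # Imported here so pdf_page_spec can be used without Wand installed.
    from wand.image import Image

    with Image(filename=pdf_page_spec(pdf_path, page_index), resolution=resolution) as img:
        img.format = "jpeg"
        img.save(filename=out_path)
    return out_path
```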

edited Dec 05 '19 at 11:25 by normanius

answered Feb 06 '19 at 01:15 by DevB2F

16

[ImageMagick library](http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-on-windows) needs to be installed to work on wand. – Neeraj Gulia Mar 13 '19 at 12:32
4

I tried this and needed to install Ghostscript as well (using Windows 10 and Python 3.7). Did it and it worked perfectly. – jcf Jul 01 '19 at 07:55
1

whats the f[:-4] for? its not referenced anywhere else – Ari Sep 14 '19 at 23:27
@Ari f[:-4] will cut off ".pdf" from the filename (string slicing) to create a new filename with another extension. – Fabian Nov 01 '19 at 19:10

photek1944 · Answer 5 · 2018-12-01T08:57:19.700

@gaurwraith, install poppler for Windows and use pdftoppm.exe as follows:

Download zip file with Poppler's latest binaries/dlls from http://blog.alivate.com.au/poppler-windows/ and unzip to a new folder in your program files folder. For example: "C:\Program Files (x86)\Poppler".
Add "C:\Program Files (x86)\Poppler\poppler-0.68.0\bin" to your SYSTEM PATH environment variable.
From cmd line install pdf2image module -> "pip install pdf2image".
Or alternatively, directly execute pdftoppm.exe from your code using Python's subprocess module as explained by user Basj.
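
After editing PATH as in the steps above, a quick check from Python can confirm that the Poppler binaries are actually visible. This is a small sketch using only the standard library; the function names are hypothetical.

```python
import shutil


def find_poppler_tool(name="pdftoppm"):
    """Return the full path of a Poppler tool found on PATH, or None if it is missing."""
    return shutil.which(name)


def poppler_status(name="pdftoppm"):
    """Describe whether the tool can be found, for logging or a startup check."""
    path = find_poppler_tool(name)
    if path is None:
        return f"{name} not found on PATH"
    return f"{name} found at {path}"


if __name__ == "__main__":
    print(poppler_status())
```

If the tool is reported as missing, either PATH has not been updated for the current process (a new terminal is usually needed) or the `bin` folder given does not contain the executable.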

@vishvAs vAsuki, this code should generate the jpgs you want through the subprocess module for all pages of one or more pdfs in a given folder:

import os, subprocess

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

pdftoppm_path = r"C:\Program Files (x86)\Poppler\poppler-0.68.0\bin\pdftoppm.exe"

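for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf"):

        # Use the PDF's own name as the output prefix so that different PDFs
        # do not overwrite each other's images; passing a list keeps file
        # names with spaces intact.
        subprocess.run([pdftoppm_path, "-jpeg", pdf_file, pdf_file[:-4]], check=True)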

Or using the pdf2image module:

import os
from pdf2image import convert_from_path

pdf_dir = r"C:\yourPDFfolder"
os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):

    if pdf_file.endswith(".pdf"):

        pages = convert_from_path(pdf_file, 300)
        pdf_name = pdf_file[:-4]

        # enumerate yields the page index directly (zero-based)
        for page_index, page in enumerate(pages):

            page.save("%s-page%d.jpg" % (pdf_name, page_index), "JPEG")

This should actually be the accepted answer. Shows what to do with the installed binaries for Poppler — Kunj Mehta, Dec 14 '19 at 06:43

mara004 · Answer 6 · 2022-06-01T10:51:54.137

Using pypdfium2:

python3 -m pip install pypdfium2

import pypdfium2 as pdfium

pdffile = 'path/to/your_doc.pdf'

# render multiple pages concurrently (in this case: all)
for image, suffix in pdfium.render_pdf_topil(pdffile):
    image.save('output_%s.jpg' % suffix)

# render a single page (in this case: the first one)
with pdfium.PdfContext(pdffile) as pdf:
    image = pdfium.render_page_topil(pdf, 0)
    image.save('output.jpg')

Advantages:

PDFium is liberal-licensed (BSD 3-Clause or Apache 2.0, at your choice)
It is fast, outperforming Poppler. In terms of speed, pypdfium2 can almost reach PyMuPDF
Returns PIL.Image.Image, bytes, or a ctypes array, depending on your needs
Is capable of processing encrypted (password-protected) PDFs
No runtime dependencies except PIL, which is optional
Supports Python >= 3.5
Setup infrastructure complies with PEP 517/518, while legacy setup still works as well

Wheels are currently available for

Windows amd64, win32, arm64
macOS x86_64, arm64
Linux (glibc) x86_64, i686, aarch64, armv7l
Linux (musl) x86_64, i686

There is a script to build from source, too.

(Disclaimer: I'm the author)

score 5 · Answer 7 · answered Jan 07 '20 at 12:29

GhostScript performs much faster than Poppler for a Linux based system.

The following code converts a single page of a PDF to a PNG image:

import subprocess

RESOLUTION = 300  # DPI; assumed value, the original answer does not define RESOLUTION

def get_image_page(pdf_file, out_file, page_num):
    # page_num is zero-based; Ghostscript counts pages from 1
    page = str(page_num + 1)
    command = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r" + str(RESOLUTION), "-dPDFFitPage",
               "-sOutputFile=" + out_file, "-dFirstPage=" + page, "-dLastPage=" + page,
               pdf_file]
    # discard Ghostscript's console output
    subprocess.call(command, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
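
Because the question asks for JPEG output, here is a variant sketch that uses Ghostscript's `jpeg` device and its `-dJPEGQ` quality setting. The function names and the default DPI and quality values are assumptions, and the executable is assumed to be called `gs` (on Windows it is typically named differently, for example `gswin64c`).

```python
import subprocess


def build_gs_jpeg_command(pdf_file, out_file, page_num, dpi=150, quality=90, gs="gs"):
    """Build a Ghostscript command that renders one page (zero-based) as a JPEG."""
    page = str(page_num + 1)  # Ghostscript counts pages from 1
    return [
        gs, "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=jpeg",
        "-dJPEGQ=" + str(quality),
        "-r" + str(dpi),
        "-sOutputFile=" + out_file,
        "-dFirstPage=" + page,
        "-dLastPage=" + page,
        pdf_file,
    ]


def save_page_as_jpeg_gs(pdf_file, out_file, page_num, dpi=150, quality=90, gs="gs"):
    """Run Ghostscript and raise CalledProcessError if the conversion fails."""
    command = build_gs_jpeg_command(pdf_file, out_file, page_num, dpi, quality, gs)
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    return out_file
```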

GhostScript can be installed on macOS using brew install ghostscript

If it is not already installed on your system, installation information for other platforms can be found here.

answered Jan 07 '20 at 12:29 by Keval Dave

1

Just to let everyone know, Ghostscript is based on AGPL License and might need permissions in case used within commercial projects. For more reference, read https://www.ghostscript.com/license.html. – Abhishek Jain Jul 06 '21 at 18:27
How do you get to the conclusion that Ghostscript is "much faster" than Poppler? I can't reproduce this observation in my personal benchmarks. In fact, I found Ghostscript to be slightly slower. – mara004 Apr 14 '22 at 14:19

score 4 · Answer 8 · answered Jul 30 '18 at 15:17

There is a utility called pdf2jpg that can be used to convert a PDF to images.

You can find the code here: https://github.com/pankajr141/pdf2jpg

from pdf2jpg import pdf2jpg
inputpath = r"D:\inputdir\pdf1.pdf"
outputpath = r"D:\outputdir"
# To convert single page
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1")
print(result)

# To convert multiple pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="1,0,3")
print(result)

# to convert all pages
result = pdf2jpg.convert_pdf2jpg(inputpath, outputpath, pages="ALL")
print(result)

answered Jul 30 '18 at 15:17 by duck

4

did this java thing just delete my whole folder full of pdf manipulating python scripts....? – Ulf Gjerdingen Nov 26 '18 at 13:40
An alternative binding to Apache PDFBox is https://github.com/lebedov/python-pdfbox – mara004 Apr 14 '22 at 14:32

score 2 · Answer 9 · answered Dec 10 '20 at 14:19

One problem everyone will face is installing Poppler. My approach is a bit of a workaround, but it works well. First, download Poppler here and extract it. Then, in your code, pass poppler_path=r'C:\Program Files\poppler-0.68.0\bin' (for example) to convert_from_path, as below:

from pdf2image import convert_from_path
images = convert_from_path("mypdf.pdf", 500,poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i, image in enumerate(images):
    fname = 'image'+str(i)+'.png'
    image.save(fname, "PNG")

This will produce an image per page with the i argument. It works really well. Thank you! — Harry, Jan 08 '21 at 15:45

dpacman · Answer 10 · 2021-01-17T08:38:37.083

Here is a function that does the conversion of a PDF file with one or multiple pages to a single merged JPEG image.

import os
import tempfile
from pdf2image import convert_from_path
from PIL import Image

def convert_pdf_to_image(file_path, output_path):
    # save temp image files in temp dir, delete them after we are finished
    with tempfile.TemporaryDirectory() as temp_dir:
        # convert pdf to multiple image
        images = convert_from_path(file_path, output_folder=temp_dir)
        # save images to temporary directory
        temp_images = []
        for i in range(len(images)):
            image_path = f'{temp_dir}/{i}.jpg'
            images[i].save(image_path, 'JPEG')
            temp_images.append(image_path)
        # read images into pillow.Image
        imgs = list(map(Image.open, temp_images))
        # Image.open is lazy, so merge while the temporary files still exist
        # find minimum width of images
        min_img_width = min(i.width for i in imgs)
        # find total height of all images
        total_height = sum(img.height for img in imgs)
        # create new image object with width and total height
        merged_image = Image.new(imgs[0].mode, (min_img_width, total_height))
        # paste images together one by one
        y = 0
        for img in imgs:
            merged_image.paste(img, (0, y))
            y += img.height
    # save merged image
    merged_image.save(output_path)
    return output_path

Example usage:

convert_pdf_to_image("path_to_Pdf/1.pdf", "output_path/output.jpeg")

Just curious, why `for i, img in enumerate(imgs): total_height += imgs[i].height` instead of simply `for img in imgs: total_height += img.height` ? — Vladimir Prudnikov, Jul 05 '21 at 09:55

Christopher Creveling · Answer 11 · 2021-05-18T18:53:29.227

I wrote this script to convert a folder of single-page PDFs to PNGs.

import os
from pathlib import PurePath
import glob
from pdf2image import convert_from_path

# In[file list]

wd = os.getcwd()

# filter images
fileListpdf = glob.glob(f'{wd}//*.pdf')

# In[Convert pdf to images]

for i in fileListpdf:
    
    images = convert_from_path(i, dpi=300)
    
    path_split = PurePath(i).parts
    fileName, ext = os.path.splitext(path_split[-1])
    
    images[0].save(f'{fileName}.png', 'PNG')

Hopefully, this helps if you need to convert PDFs to PNGs!

score 0 · Answer 12 · edited Sep 15 '19 at 04:44

from pdf2image import convert_from_path
import glob

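pdf_dir = glob.glob(r'G:\personal\pdf\*.pdf')  # your pdf folder path
img_dir = "G:\\personal\\img\\"               # your dest img path

for pdf_ in pdf_dir:
    pages = convert_from_path(pdf_, 500)
    base_name = pdf_.split("\\")[-1][:-4]
    for page_number, page in enumerate(pages, start=1):
        # include the page number so pages of one PDF do not overwrite each other
        page.save(img_dir + base_name + "_" + str(page_number) + ".jpg", 'JPEG')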

edited Sep 15 '19 at 04:44 by Ari

answered May 23 '19 at 07:07 by Saiprasad Bhatwadekar

1

This would be a better answer if you explained how the code you provided answers the question. – pppery Sep 15 '19 at 00:39
2

@pppery Python is fairly readable, the comments do indicate the source folder and output folder, the rest reads like english. – Ari Sep 15 '19 at 10:36

score 0 · Answer 13 · answered Jul 30 '19 at 06:48

I use a (maybe) much simpler option of pdf2image:

cd "$dir"
for f in *.pdf
do
  if [ -f "${f}" ]; then
    n="${f%.pdf}"
    pdftoppm -scale-to 1440 -png "$f" "$conv/$n"
    rm "$f"
    mv "$conv"/*.png "$dir"
  fi
done

This is a small part of a bash script in a loop for the use of a narrow casting device. Checks every 5 seconds on added pdf files (all) and processes them. This is for a demo device, at the end converting will be done at a remote server. Converting to .PNG now, but .JPG is possible too.

This conversion, together with transitions on A4 format, displaying a video, two smooth scrolling texts and a logo (with transition in three versions), sets the Pi3 to almost 4x 100% cpu-load ;-)

The question is about rendering a PDF with Python, not bash. — mara004, Dec 05 '21 at 10:25

score -1 · Answer 14 · answered Mar 17 '20 at 11:31

Here is a solution which requires no additional libraries and is very fast. This was found from: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html# I have added the code in a function to make it more convenient.

def convert(filepath):
    with open(filepath, "rb") as file:
        pdf = file.read()

    startmark = b"\xff\xd8"
    startfix = 0
    endmark = b"\xff\xd9"
    endfix = 2
    i = 0

    njpg = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        jpg = pdf[istart:iend]
        newfile = "{}jpg".format(filepath[:-3])
        with open(newfile, "wb") as jpgfile:
            jpgfile.write(jpg)

        njpg += 1
        i = iend

        return newfile

Call convert with the pdf path as the argument and the function will create a .jpg file in the same directory
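
The same marker-scanning idea can be written so that it collects every embedded JPEG stream instead of stopping at the first one. This is only a simplified sketch (the function name `extract_embedded_jpegs` is hypothetical), and like the code above it pulls out images that are already stored as JPEG inside the PDF; it does not render a page.

```python
def extract_embedded_jpegs(pdf_bytes):
    """Return the bytes of every PDF stream that begins with a JPEG start marker."""
    soi, eoi = b"\xff\xd8", b"\xff\xd9"  # JPEG start-of-image / end-of-image markers
    jpegs = []
    pos = 0
    while True:
        stream_at = pdf_bytes.find(b"stream", pos)
        if stream_at < 0:
            break
        # The image data should start within a few bytes of the "stream" keyword.
        start = pdf_bytes.find(soi, stream_at, stream_at + 20)
        if start < 0:
            pos = stream_at + 20
            continue
        end_stream = pdf_bytes.find(b"endstream", start)
        if end_stream < 0:
            break
        # Take the last end marker before "endstream" as the end of the JPEG.
        end = pdf_bytes.rfind(eoi, start, end_stream)
        if end >= 0:
            jpegs.append(pdf_bytes[start:end + len(eoi)])
        pos = end_stream + len(b"endstream")
    return jpegs


if __name__ == "__main__":
    sample = b"%PDF stream\n\xff\xd8abc\xff\xd9\nendstream"
    print(len(extract_embedded_jpegs(sample)))
```

Each returned item could be written to its own file, for example with a counter in the file name, rather than reusing a single output path.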

This technique looks like it extracts images that have been embedded in the file, rather than rasterizing a page of the file as an image which is what the questioner wanted. — Josh Gallagher, Mar 20 '20 at 16:43

score -1 · Answer 15 · answered Mar 15 '21 at 17:11

For a pdf file with multiple pages, the following is the best & simplest (I used pdf2image-1.14.0):

from pdf2image import convert_from_path
from pdf2image.exceptions import (
     PDFInfoNotInstalledError,
     PDFPageCountError,
     PDFSyntaxError
     )
        
images = convert_from_path(r"path/to/input/pdf/file", output_folder=r"path/to/output/folder", fmt="jpg",) #dpi=200, grayscale=True, size=(300,400), first_page=0, last_page=3)
        
images.clear()

Note:

"images" is a list of PIL images.
The saved images in the output folder will have system generated names; one can later change them, if required.

answered Mar 15 '21 at 17:11 by SKG

2

Why is this "the best" ? – Nik O'Lai Mar 25 '21 at 18:41
1) Fast, as no loop is required. 2) All the required parameters (like dpi, format, grayscale option, size etc.) are processed in one run. 3) Built-in exception handling is there. 4) The core function call is only a single-line statement. 5) You can get images as 'saved' files as well as a 'list' of 'matrices'. – SKG Mar 26 '21 at 12:34

score -1 · Answer 16 · answered Apr 12 '22 at 09:57

This easy script can convert a folder directory that contains PDFs (single/multiple pages) to jpeg.

from pdf2image import convert_from_path
import os
from os import listdir
from os.path import isfile, join
import shutil

def move_processed_file(file, doc_path, download_processed):
    # move a converted PDF into the "processed" folder
    try:
        shutil.move(doc_path + '/' + file, download_processed + '/' + file)
    except Exception as e:
        print(e)
        raise


def run_conversion():
    root_dir = os.path.abspath(os.curdir)

    doc_path = root_dir + r"\data\download"
    pdf_processed = root_dir + r"\data\download\pdf_processed"
    results_folder = doc_path

    files = [f for f in listdir(doc_path) if isfile(join(doc_path, f))]

    pdf_files = [f for f in listdir(doc_path) if isfile(join(doc_path, f)) and f.lower().endswith('.pdf')]

    # check OS type
    if os.name == 'nt':
        # if is windows or a graphical OS, change this poppler path with your own path
        poppler_path = r"C:\poppler-0.68.0\bin"
    else:
        poppler_path = None  # use the Poppler binaries found on PATH

    for file in pdf_files:

        ''' 
        # Converting PDF to images 
        '''

        # Store all the pages of the PDF in a variable
        pages = convert_from_path(doc_path + '/' + file, 500, poppler_path=poppler_path)

        # Counter to store images of each page of PDF to image
        image_counter = 1

        filename, file_extension = os.path.splitext(file)

        # Iterate through all the pages stored above
        for page in pages:
            # Declaring filename for each page of PDF as JPG
            # PDF page n -> <name>_n.jpg
            page_filename = filename + '_' + str(image_counter) + ".jpg"

            # Save the image of the page in system
            page.save(results_folder + '/' + page_filename, 'JPEG')

            # Increment the counter to update filename
            image_counter += 1

        move_processed_file(file, doc_path, pdf_processed)

score -3 · Answer 17 · edited Nov 16 '21 at 11:33

from pdf2image import convert_from_path

PDF_file = 'Statement.pdf'
pages = convert_from_path(PDF_file, 500,userpw='XXX')

image_counter = 1

for page in pages:

    filename = "foldername/page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

edited Nov 16 '21 at 11:33 by Vito Gentile

answered Apr 14 '21 at 05:36 by madan maram

3

Posting a poorly formatted, incorrectly indented answer with no explanation as to how your answer works or what benefits it offers compared to the 13 existing answers, is of very little value as it stands. Please [edit] your answer, fix the formatting (the [formatting help](https://stackoverflow.com/editing-help) can assist you), fix the indentation, and add some explanation. – David Buck Apr 14 '21 at 06:15
