Generate flattened PDF with Python

Question

When I print a PDF from any of my source PDFs, the file size drops and the text boxes present in the form are removed. In short, printing flattens the file. This is the behavior I want to achieve.

The following code creates a PDF using another PDF as a source (the one I want to flatten), but it writes the text box form fields as well.

Can I get a PDF without the text boxes, that is, a flattened one? Just like Adobe does when I print a PDF as a PDF.

My code looks something like this, with some parts left out:

import os
import StringIO
from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

directory = os.path.join(os.getcwd(), "source")  # dir we are interested in
fif = [f for f in os.listdir(directory) if f[-3:] == 'pdf'] # get the PDFs
for i in fif:
    packet = StringIO.StringIO()
    can = canvas.Canvas(packet, pagesize=letter)
    can.rotate(-90)
    can.save()

    packet.seek(0)
    new_pdf = PdfFileReader(packet)
    fname = os.path.join('source', i)
    existing_pdf = PdfFileReader(file(fname, "rb"))
    output = PdfFileWriter()
    nump = existing_pdf.getNumPages()
    page = existing_pdf.getPage(0)
    for l in range(nump):
        output.addPage(existing_pdf.getPage(l))
    page.mergePage(new_pdf.getPage(0))
    outputStream = file("out-"+i, "wb")
    output.write(outputStream)
    outputStream.close()
    print fname + " written as", i

Summing up: I have a PDF, I add a text box to it, covering up info and adding new info, and then I print a PDF from that PDF. After printing, the text box is no longer editable or movable. I wanted to automate that process, but everything I tried still left the text box editable.

Also looking for a solution to this. I have a watermarking Python script, but the watermark gets in the way when trying to select or highlight text in the document. If I could generate a flattened watermark PDF and then merge it in with the source PDFs, that would solve it. — Joseph Mansfield, Nov 18 '15 at 15:59
Do the file names follow some specific convention? If so, what do the parts mean? What is the purpose of splitting the file name by space, and then by comma? (Otherwise the script fails, but I am unsure whether that is relevant to the problem you are facing.) — gpoo, Nov 22 '15 at 18:48
+MakeCents I cannot reproduce the issue. I get no boxes. Could you paste an image of the result you get and of the expected result? — gpoo, Nov 22 '15 at 20:51
@gpoo I think the boxes exist in the originals, but I don't know what kind of box it is either. I have a PDF with a box on the first page, but I cannot remove it by printing (maybe Acrobat Pro does that). — rll, Nov 23 '15 at 11:45
@gpoo What I was going for at that time was: I have a pdf, I add a text box to it, covering up info and adding new info, and then I print a pdf from that pdf. The text box becomes not editable or moveable any longer. I wanted to automate that process but everything I tried still allowed that text box to be editable. I hope that clears it up. I'm using Acrobat 9.5 — MakeCents, Nov 23 '15 at 16:09

naktinis · Accepted Answer · 2021-07-09T07:17:00.880

16

If installing an OS package is an option, then you could use pdftk with its Python wrapper pypdftk like this:

import pypdftk
pypdftk.fill_form('filled.pdf', out_file='flattened.pdf', flatten=True)

You would also need to install the pdftk package, which on Ubuntu could be done like this:

sudo apt-get install pdftk

The pypdftk library can be installed from PyPI:

pip install pypdftk

Update: pdftk was briefly removed from Ubuntu in version 18.04, but it seems it is back since 20.04.
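
If the wrapper is not wanted, the same flattening can be sketched by calling the pdftk binary directly. This is only a minimal sketch: it assumes pdftk is installed and on the PATH, the helper names build_flatten_command and flatten_pdf are hypothetical, and the file names are placeholders. It relies on pdftk's flatten output option, which merges form fields into the page content.

```python
import subprocess


def build_flatten_command(src, dst):
    """Return the argument list for: pdftk <src> output <dst> flatten."""
    return ["pdftk", src, "output", dst, "flatten"]


def flatten_pdf(src, dst):
    """Run pdftk on src and write the flattened result to dst.

    Raises subprocess.CalledProcessError if pdftk exits with an error,
    and FileNotFoundError if the pdftk binary is not installed.
    """
    subprocess.run(build_flatten_command(src, dst), check=True)


# Usage (placeholder file names):
# flatten_pdf("filled.pdf", "flattened.pdf")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.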

edited Jul 09 '21 at 07:17

answered Nov 23 '15 at 20:19


is there a way to do it without pdftk? I ask because I am attempting to write a pdftk clone as pdftk doesn't work on centos7. Any help would be greatly appreciated. – Oscar Smith Aug 08 '16 at 19:38
this doesn't work on ubuntu 18.04 as `pdftk` is no longer in the repo – Fabrizio Miano Apr 30 '20 at 21:40
@FabrizioMiano I've seen people discuss workarounds here: https://askubuntu.com/a/1029451/70751 also qpdf might be an alternative. – naktinis May 01 '20 at 09:52
The `pdftk` functions as a pdf printer, so the `cups` might work. – caot Apr 20 '21 at 00:45
It seems pdftk is back in Ubuntu 20.04 and later. It has been ported to Java, and can be installed as either `pdftk` or `pdftk-java`. Last checked in 21.04. – naktinis Jul 09 '21 at 07:26

Tyler Houssian · Answer 2 · 2021-03-25T21:51:52.600

4

A simple but more roundabout way is to convert the PDF to images and then put those images into a new PDF.

You'll need pdf2image and PIL.

Like so:

from pdf2image import convert_from_path 
from PIL import Image

images = convert_from_path('temp.pdf') 
im1 = images[0]
images.pop(0)

pdf1_filename = "flattened.pdf"

im1.save(pdf1_filename, "PDF" ,resolution=100.0, save_all=True, append_images=images)
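
Because this approach rasterizes every page, the chosen resolution decides both sharpness and file size. The sketch below is a hedged variant of the code above, not a definitive version: the helper names page_size_in_pixels and rasterize_pdf are hypothetical, and it assumes that passing the same value as dpi to convert_from_path and as resolution to Pillow's PDF writer keeps the physical page size unchanged. As far as I know, pdf2image renders at 200 dpi unless told otherwise.

```python
def page_size_in_pixels(width_pt, height_pt, dpi):
    """Convert a page size in PDF points (1/72 inch) to pixels at dpi."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def rasterize_pdf(src, dst, dpi=200):
    """Render every page of src to an image and save the images as dst."""
    # Imported here so the size helper above works without these packages.
    from pdf2image import convert_from_path

    images = convert_from_path(src, dpi=dpi)
    first, rest = images[0], images[1:]
    # Use the render dpi as the PDF resolution so pages keep their size.
    first.save(dst, "PDF", resolution=float(dpi), save_all=True,
               append_images=rest)


# A US Letter page (612 x 792 points) at two resolutions:
letter_at_100 = page_size_in_pixels(612, 792, 100)
letter_at_200 = page_size_in_pixels(612, 792, 200)
```

Doubling the dpi quadruples the number of pixels per page, so a higher value gives crisper text at the cost of a larger file.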

Edit:

I created a library to do this called fillpdf

pip install fillpdf

from fillpdf import fillpdfs
fillpdfs.flatten_pdf('input.pdf', 'newflat.pdf')

edited Mar 25 '21 at 21:51

answered Nov 20 '20 at 22:38


I think OP is still looking to preserve the vector properties of the page contents, whereas this suggestion would convert the document to a (low resolution) image? – AllanLRH Nov 20 '20 at 23:21
@AllanLRH Correct, the vector properties of the page will be lost and this will result in a lower resolution image, but the image is still perfectly readable and can be used in many cases where readability is the only requirement. – Tyler Houssian Nov 23 '20 at 20:00

score 3 · Answer 3 · edited May 22 '22 at 07:18

Per the Adobe Docs, you can set bit position 1 of the field flags (/Ff) of the editable form fields to make the fields ReadOnly. I provided a full solution here, but it uses Django:

https://stackoverflow.com/a/55301804/8382028

Adobe Docs (page 552):

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf

Use PyPDF2 to fill the fields, then loop through the annotations to change the bit position:

from io import BytesIO

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, NumberObject

# open the pdf
input_stream = open("YourPDF.pdf", "rb")
reader = PdfFileReader(input_stream, strict=False)
if "/AcroForm" in reader.trailer["/Root"]:
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )

writer = PdfFileWriter()
writer.set_need_appearances_writer()
if "/AcroForm" in writer._root_object:
    # Acro form is form field, set needs appearances to fix printing issues
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )

data_dict = dict()  # this is a dict of your form values

writer.addPage(reader.getPage(0))
page = writer.getPage(0)
# update form fields
writer.updatePageFormFieldValues(page, data_dict)
for j in range(0, len(page["/Annots"])):
    writer_annot = page["/Annots"][j].getObject()
    for field in data_dict:
        if writer_annot.get("/T") == field:
            # make ReadOnly:
            writer_annot.update({NameObject("/Ff"): NumberObject(1)})
output_stream = BytesIO()
writer.write(output_stream)

# output_stream now holds the PDF with the filled fields marked read-only
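
One detail worth noting: the loop above writes the value 1 to /Ff, which sets the ReadOnly bit but also clears any other flags the field had. A minimal sketch of setting only that bit is shown below. The names READ_ONLY, MULTILINE and with_read_only are hypothetical; the bit positions (1 for ReadOnly, 13 for Multiline on text fields) follow the field flag tables in the PDF Reference.

```python
READ_ONLY = 1 << 0   # bit position 1 in the PDF field flags (/Ff)
MULTILINE = 1 << 12  # bit position 13, used by text fields


def with_read_only(flags):
    """Return flags with the ReadOnly bit set and all other bits kept."""
    return flags | READ_ONLY


plain_field = with_read_only(0)
multiline_field = with_read_only(MULTILINE)
```

In the loop above, this could be applied as NumberObject(with_read_only(int(writer_annot.get("/Ff", 0)))), assuming the flags are stored on the annotation itself rather than inherited from a parent field.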

score 1 · Answer 4 · edited Apr 20 '21 at 00:35

A solution that works on Windows as well, converts PDFs with many pages, and flattens the checkbox values too. For some reason @ViaTech's code did not work on my PC (Windows 7, Python 3.8).

I followed @ViaTech's suggestions and made extensive use of @hchillon's code from this post

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject, TextStringObject, NumberObject


def set_need_appearances_writer(writer):

    try:
        catalog = writer._root_object
        # get the AcroForm tree and add the "/NeedAppearances" attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer



class PdfFileFiller(object):

    def __init__(self, infile):

        self.pdf = PdfFileReader(open(infile, "rb"), strict=False)
        if "/AcroForm" in self.pdf.trailer["/Root"]:
            self.pdf.trailer["/Root"]["/AcroForm"].update(
            {NameObject("/NeedAppearances"): BooleanObject(True)})

    # the keys of newvals and newchecks have to be filled in; '' is not accepted
    def update_form_values(self, outfile, newvals=None, newchecks=None):

        self.pdf2 = MyPdfFileWriter()


        trailer = self.pdf.trailer['/Root'].get('/AcroForm', None)
        if trailer:
            self.pdf2._root_object.update({
                NameObject('/AcroForm'): trailer})

        set_need_appearances_writer(self.pdf2)
        if "/AcroForm" in self.pdf2._root_object:
            self.pdf2._root_object["/AcroForm"].update(
            {NameObject("/NeedAppearances"): BooleanObject(True)})

        for i in range(self.pdf.getNumPages()):
            self.pdf2.addPage(self.pdf.getPage(i))

            self.pdf2.updatePageFormFieldValues(self.pdf2.getPage(i), newvals)
            for j in range(0, len(self.pdf.getPage(i)['/Annots'])):
                writer_annot = self.pdf.getPage(i)['/Annots'][j].getObject()
                for field in newvals:
                    writer_annot.update({NameObject("/Ff"): NumberObject(1)})

            self.pdf2.updatePageFormCheckboxValues(self.pdf2.getPage(i), newchecks)

        with open(outfile, 'wb') as out:
            self.pdf2.write(out)


class MyPdfFileWriter(PdfFileWriter):

    def __init__(self):
        super().__init__()

    def updatePageFormCheckboxValues(self, page, fields):

        for j in range(0, len(page['/Annots'])):
            writer_annot = page['/Annots'][j].getObject()
            for field in fields:
                writer_annot.update({NameObject("/Ff"): NumberObject(1)})




origin = "input.pdf"  # placeholder: put the input PDF path here
destination = "output.pdf"  # placeholder: put the output PDF path here, even if the file does not exist yet

newchecks = {} # A dict with all checkbox values that need to be changed
newvals = {'':''} # A dict with all entry values that need to be changed
# newvals dict has to be equal to {'':''} in case that no changes are needed

c = PdfFileFiller(origin)
c.update_form_values(outfile=destination, newvals=newvals, newchecks=newchecks)
print('PDF has been created\n')

score 0 · Answer 5 · answered Jan 08 '19 at 16:01

I had trouble flattening a form that I had entered content into using pdfrw (How to Populate Fillable PDF's with Python) and found that I had to add an additional step using generate_fdf (pdftk flatten loses fillable field data).

os.system('pdftk '+outtemp+' generate_fdf output '+outfdf)
os.system('pdftk '+outtemp+' fill_form '+outfdf+' output '+outpdf)

I came to this solution because I was able to flatten a file just fine using ghostscript's pdf2ps followed by ps2pdf on my Mac, but the output had low resolution when I ran the same steps on an Amazon Linux instance. I couldn't figure out why that was the case and so moved to the pdftk solution.
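
The two pdftk calls above can also be sketched with subprocess instead of os.system, which avoids quoting problems when a path contains spaces and raises an error if pdftk fails. This is a sketch under assumptions: the helper names build_fdf_commands and run_fdf_round_trip are hypothetical, pdftk is assumed to be on the PATH, and the commands mirror the two lines above without adding any options.

```python
import subprocess


def build_fdf_commands(filled_pdf, fdf_path, out_pdf):
    """Return the generate_fdf and fill_form argument lists, in run order."""
    generate = ["pdftk", filled_pdf, "generate_fdf", "output", fdf_path]
    fill = ["pdftk", filled_pdf, "fill_form", fdf_path, "output", out_pdf]
    return generate, fill


def run_fdf_round_trip(filled_pdf, fdf_path, out_pdf):
    """Export the field data to an FDF file, then fill the form from it."""
    for command in build_fdf_commands(filled_pdf, fdf_path, out_pdf):
        # check=True raises CalledProcessError if pdftk exits non-zero.
        subprocess.run(command, check=True)


# Usage (placeholder file names):
# run_fdf_round_trip("filled.pdf", "data.fdf", "out.pdf")
```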
