6

I have a PDF file and I want to replace some text in it and generate a new PDF. How can I do that in Python? I have tried ReportLab, but it does not have any function to search for text and replace it. What other module can I use?

Dax Amin
  • please explain what you have tried – adao7000 Jul 29 '15 at 14:15
  • Hi @Dax! Welcome to Stack Overflow. As @adao7000 mentioned, could you please give us an example of what you've tried? Please check out the guidelines on creating a "Minimal, Complete, and Verifiable" example here: http://stackoverflow.com/help/mcve . – OldTinfoil Jul 29 '15 at 14:24
  • 1
    I'm the upvoter. Note to previous comments: @Dax is not asking for code, but for a python module. Note that http://stackoverflow.com/help/on-topic clearly states that "but if your question generally covers…a practical, answerable problem that is unique to software development… then you’re in the right place to ask your question!" I just got here looking for the same thing. If someone were to point us in the right direction, that would be enough. – Roy Falk Feb 02 '16 at 11:22
  • 1
    That page you link to also contains the following: "Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it." – David van Driessche Mar 03 '16 at 17:12

3 Answers

3

You can try the Aspose.PDF Cloud SDK for Python. Aspose.PDF Cloud is a REST API for PDF processing. It is a paid API, and its free plan provides 50 credits per month.

I'm a developer evangelist at Aspose.

import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi

# Get App key and App SID from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    app_sid='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx')

pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
copied_file = '02_pages_new.pdf'

# Upload the PDF file to storage
pdf_api.upload_file(remote_name, filename)

# Copy the uploaded file so the original stays untouched
pdf_api.copy_file(remote_name, copied_file)

# Replace text in the copy
text_replace = asposepdfcloud.models.TextReplace(old_value='origami', new_value='polygami', regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace])

response = pdf_api.post_document_text_replace(copied_file, text_replace_list)
print(response)
Tilal Ahmad
0

Have a look at this thread for one of the many ways to read text from a PDF. You'll then need to create a new PDF, because, as far as I know, these methods will not retrieve any formatting for you.
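
A minimal sketch of that extract-then-rebuild approach, assuming the third-party packages pypdf (for reading) and reportlab (for writing) are installed. Neither choice comes from the linked thread, and the helper names are made up for illustration. Fonts, positions and images are lost: each page is re-laid out as plain lines of text.

```python
def replace_in_pages(pages, old, new):
    """Return a copy of the page texts with every `old` replaced by `new`."""
    return [text.replace(old, new) for text in pages]


def extract_pages(pdf_path):
    """Return the plain text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party, imported only when needed

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages
    return [page.extract_text() or "" for page in reader.pages]


def write_pages(pages, pdf_path):
    """Write each text as one plain page (assumes reportlab is installed)."""
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    pdf = canvas.Canvas(pdf_path, pagesize=A4)
    _, height = A4
    for text in pages:
        block = pdf.beginText(40, height - 40)  # start near the top-left corner
        for line in text.splitlines():
            block.textLine(line)  # no wrapping or page-overflow handling
        pdf.drawText(block)
        pdf.showPage()  # finish the current page
    pdf.save()


def replace_text_in_pdf(src, dst, old, new):
    """Hypothetical end-to-end helper: extract, replace, rebuild."""
    write_pages(replace_in_pages(extract_pages(src), old, new), dst)
```

It could be called as replace_text_in_pdf("in.pdf", "out.pdf", "origami", "polygami"), where both file names are placeholders.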

Stiffo
-1

The CAM::PDF Perl library can output text that's not too hard to parse (although it seems to split lines of text fairly randomly). I couldn't be bothered to learn too much Perl, so I wrote two really basic Perl command-line scripts. The first reads a single-page PDF to a text file: perl read.pl pdfIn.pdf textOut.txt. The second writes the text (which you can modify in the meantime) to a PDF: perl write.pl pdfIn.pdf textIn.txt pdfOut.pdf.

#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";

$pdfIn = $ARGV[0];
$textOut = $ARGV[1];

$pdf = CAM::PDF->new($pdfIn);
$page = $pdf->getPageContent(1);

open(my $fh, '>', $textOut) or die "cannot open file $textOut";
print $fh $page;
close $fh;

exit;

and

#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";

$pdfIn = $ARGV[0];
$textIn = $ARGV[1];
$pdfOut = $ARGV[2];

$pdf = CAM::PDF->new($pdfIn);

# Slurp the whole (possibly edited) text file into $page
my $page;
open(my $fh, '<', $textIn) or die "cannot open file $textIn";
{
    local $/;
    $page = <$fh>;
}
close($fh);

$pdf->setPageContent(1, $page);

$pdf->cleanoutput($pdfOut);

exit;

You can call these from Python, before and after doing regex (or other) edits on the output text file.
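
One way the Python side could look, as a sketch only. It assumes the two scripts above are saved as read.pl and write.pl in the working directory, that perl is on the PATH, and that the string being replaced appears contiguously in the dumped page content (PDFs often split visible text into fragments, in which case a plain regex will not match). The function names and the temporary file name are hypothetical.

```python
import re
import subprocess


def edit_page_text(text, pattern, replacement):
    """Apply one regex substitution to the dumped page content."""
    return re.sub(pattern, replacement, text)


def replace_in_pdf(pdf_in, pdf_out, pattern, replacement, tmp="page.txt"):
    # Dump page 1 of the PDF to a text file with the first Perl script.
    subprocess.run(["perl", "read.pl", pdf_in, tmp], check=True)

    # latin-1 maps every byte to exactly one character, so bytes we do not
    # understand survive the read/modify/write round trip unchanged.
    with open(tmp, encoding="latin-1", newline="") as fh:
        text = fh.read()
    with open(tmp, "w", encoding="latin-1", newline="") as fh:
        fh.write(edit_page_text(text, pattern, replacement))

    # Write the edited content back into a new PDF with the second script.
    subprocess.run(["perl", "write.pl", pdf_in, tmp, pdf_out], check=True)
```

For example, replace_in_pdf("pdfIn.pdf", "pdfOut.pdf", r"origami", "polygami") would run both scripts around a single substitution. check=True makes a failing Perl call raise an exception instead of passing silently.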

If you're completely new to Perl (like I was), make sure that Perl and CPAN are installed, then run sudo cpan and, at the prompt, enter install "CAM::PDF"; to install the required modules.

Also, I realise that I should probably be using stdout etc, but I was in a hurry :-)

Also, any idea what format CAM::PDF outputs? Is there any documentation for it?

leontrolski
  • 1
    There is some more useful documentation here: http://search.cpan.org/dist/CAM-PDF/lib/CAM/PDF.pm . If I get round to it, I might write some kind of Python wrapper. – leontrolski Mar 03 '16 at 16:49