I am trying to translate PDF files using a translation API and output the result as a PDF with the same formatting. My approach is to convert the PDF to a Word document, translate that file, and then convert it back to PDF. The problem is that I have not found an efficient way to convert PDF to Word. I am trying to write my own converter, but the PDFs come in many different layouts, so handling all of them will take considerable effort. So my question is: is there an efficient way to translate these PDFs without losing the formatting, or an efficient way to convert them to docx? I am using Python as the programming language.

TinMan
  • Try referring to this answer: https://stackoverflow.com/questions/26358281/convert-pdf-to-doc-python-bash – Daniel Isaac Jul 12 '18 at 11:03
  • @DanielIsaac Thanks for the reply, but I tried that solution and the current LibreOffice doesn't support this feature. –  Jul 12 '18 at 11:10

2 Answers

Probably not.

PDFs aren't meant to be machine readable or editable, really; they describe formatted, laid-out, printable pages.
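
To make that concrete, here is a rough sketch of what the text on a PDF page typically looks like once the page's content stream is decompressed: positioning operators and string fragments, with no paragraphs or styles attached. The sample stream and the `text_fragments` helper are made up for illustration, and the parser only understands the `BT`, `Td` and `Tj` operators, so it is not a real PDF reader.

```python
import re

# Hand-written fragment in the style of a PDF content stream (illustrative only).
SAMPLE_STREAM = """BT
/F1 12 Tf
72 720 Td
(Hello) Tj
0 -14 Td
(world) Tj
ET
"""

# Matches BT (begin text), "tx ty Td" (move text position) and "(string) Tj" (show text).
TOKEN = re.compile(
    r"(?P<bt>\bBT\b)"
    r"|(?P<tx>-?\d+(?:\.\d+)?)\s+(?P<ty>-?\d+(?:\.\d+)?)\s+Td\b"
    r"|\((?P<text>[^)]*)\)\s*Tj\b"
)


def text_fragments(stream):
    """Return (x, y, text) for every Tj operator, tracking Td offsets only."""
    x = y = 0.0
    fragments = []
    for match in TOKEN.finditer(stream):
        if match.group("bt") is not None:
            x = y = 0.0  # BT resets the text position
        elif match.group("text") is not None:
            fragments.append((x, y, match.group("text")))
        else:
            # Td offsets are relative to the start of the current line
            x += float(match.group("tx"))
            y += float(match.group("ty"))
    return fragments


for fragment in text_fragments(SAMPLE_STREAM):
    print(fragment)
```

All the file records is "draw this string at these coordinates", so anything like paragraphs, columns or tables has to be guessed back from positions, which is why converting to docx is so unreliable.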

AKX

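You can use pdfminer to extract the text yourself instead of relying on a conversion API. Note that this gives you plain text only; the original layout is not preserved. Here is an example: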

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of the PDF at `path` as a single string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    # LAParams turns on layout analysis, which groups characters into lines and text blocks.
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        # The with-block closes the file even if a page fails to parse.
        with open(path, 'rb') as fp:
            # An empty pagenos set and maxpages=0 mean "process every page".
            for page in PDFPage.get_pages(fp, set(), maxpages=0, password="",
                                          caching=True, check_extractable=True):
                interpreter.process_page(page)
        return retstr.getvalue()
    finally:
        device.close()
        retstr.close()
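
Once you have the text, you still need to send it to the translation API, and most of those limit how much text a single request can carry. Below is a minimal sketch of splitting the extracted text on blank lines and translating it piece by piece. The `translate` argument is a placeholder for whatever client function your translation API provides, and the 4000-character limit is an assumed value, so check your API's actual limit. It also assumes that the extracted text blocks are separated by blank lines.

```python
def split_into_chunks(text, max_chars=4000):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own
    oversized chunk; splitting inside a paragraph is left to the caller.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def translate_text(text, translate, max_chars=4000):
    """Translate text chunk by chunk with the caller-supplied translate function."""
    return "\n\n".join(translate(chunk) for chunk in split_into_chunks(text, max_chars))


# str.upper stands in for a real translation call here.
print(translate_text("first paragraph\n\nsecond paragraph", str.upper, max_chars=20))
```

With the function from the example above, the call would look like `translate_text(convert_pdf_to_txt(path), my_translate)`, where `my_translate` is your own wrapper around the API. Splitting on paragraph boundaries rather than at a fixed character offset avoids cutting sentences in half, which usually hurts translation quality.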