How can I convert PDF to HTML?

Question

What good libraries are there, in any common language, for converting PDF to HTML?

Attempted to turn it into a programming question. And I see lots of questions going from HTML to PDF but not the reverse so probably worthwhile keeping it? — Cruachan, Oct 28 '09 at 18:01
This is completely subjective. Please reword your question as to not be subjective and give a little info about what you are trying to do. — Russ Bradberry, Oct 28 '09 at 18:05
I've de-subjectified the question and reworded it to what I think the OP is asking. It's a pity SO doesn't have a feature to remove close votes. — Ether, Oct 28 '09 at 18:11
Good work, Ether. BTW, unknown - if you're just looking for a program (not a library), please see: http://stackoverflow.com/questions/1531699/pdf-to-html-convertor (which... should probably be migrated to SU now) — Shog9, Oct 28 '09 at 19:05

score 5 · Answer 1 · answered Nov 23 '09 at 17:47

PDFBox at apache has an html extraction capability. http://pdfbox.apache.org/

John Thorhauer

score 3 · Answer 2 · answered Oct 29 '09 at 19:01

If you are working on a Windows box, I think Amyuni has a library for this as well. Their PDF Document Converter is accessible as a DLL, can be used from most of the languages supported by Visual Studio, and can convert to RTF, HTML, Excel, JPEG, and TIFF.

score 2 · Answer 3 · answered Apr 10 '14 at 04:51

On Linux, install pdftohtml. For batch conversion of all files in a folder, use:

for f in *.pdf; do pdftohtml "$f"; done

This creates an HTML site with all references and images from the original documents, with every page in a separate HTML file. It is very useful for converting project documentation so that you can find files by phrase using the system's ordinary file search.
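The same batch run can be driven from a script. The following is only a minimal sketch in Python, assuming pdftohtml is on the PATH; the helper names `pdftohtml_commands` and `convert_folder` are made up for illustration and are not part of any tool.

```python
import subprocess
from pathlib import Path


def pdftohtml_commands(file_names):
    """Build one pdftohtml command per PDF name, sorted by name.

    Hypothetical helper: non-PDF names are skipped, and each command is an
    argument list, so names containing spaces need no shell quoting.
    """
    return [
        ["pdftohtml", name]
        for name in sorted(file_names)
        if name.lower().endswith(".pdf")
    ]


def convert_folder(folder):
    """Run pdftohtml on every PDF in `folder`; return the names that failed."""
    names = [p.name for p in Path(folder).iterdir() if p.is_file()]
    failed = []
    for cmd in pdftohtml_commands(names):
        # Run inside the folder, like the shell one-liner run from that folder.
        result = subprocess.run(cmd, cwd=folder)
        if result.returncode != 0:
            failed.append(cmd[1])
    return failed
```

Calling `convert_folder("docs")` would process every PDF in a folder named `docs` (a placeholder) and return the files for which pdftohtml reported an error.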

score 2 · Answer 4 · answered Oct 04 '10 at 07:56

The pdftohtml program converts PDF to HTML and XML and preserves the position information of the text, which is helpful for scraping tables.

It seems to be based on the xpdf library and has a Windows binary, too.
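To illustrate why the position information helps with tables, here is a rough sketch in Python that groups text by vertical position. It assumes XML shaped like the output of `pdftohtml -xml`, with `<page>` elements containing `<text>` elements that carry `top` and `left` attributes. The sample document is hand-written for the example, and `rows_from_pdf2xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the assumed pdftohtml -xml shape (not real output).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="300" width="40" height="12" font="0">Qty</text>
<text top="100" left="50" width="60" height="12" font="0"><b>Item</b></text>
<text top="120" left="50" width="60" height="12" font="0">Widget</text>
<text top="120" left="300" width="40" height="12" font="0">3</text>
</page>
</pdf2xml>"""


def rows_from_pdf2xml(xml_text, page_number=1):
    """Group the <text> elements of one page into rows by their `top` value."""
    root = ET.fromstring(xml_text)
    rows = defaultdict(list)
    for page in root.iter("page"):
        if int(page.get("number")) != page_number:
            continue
        for node in page.iter("text"):
            top = int(node.get("top"))
            left = int(node.get("left"))
            # itertext() also collects text nested in <b> or <i> children.
            content = "".join(node.itertext()).strip()
            rows[top].append((left, content))
    # Rows ordered top to bottom, cells ordered left to right.
    return [[cell for _, cell in sorted(cells)] for _, cells in sorted(rows.items())]


print(rows_from_pdf2xml(SAMPLE))  # [['Item', 'Qty'], ['Widget', '3']]
```

Grouping on an exact `top` match is a simplification. Real documents may need a small tolerance, since cells in the same visual row can differ by a pixel or two.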

Karsten W.

This is now included as part of the `poppler` utilities. `yum install poppler` if it isn't already installed. – a coder Sep 15 '14 at 14:14
This does not do a good job of keeping position or retaining background images – Daniel Kats Sep 02 '21 at 19:04

score 2 · Answer 5 · answered Mar 06 '20 at 09:30

You can use a module in Python called PDFMiner.

You can install it like this:

pip install pdfminer

Use this module as below:

pdf2txt.py -o output.html -t html file.pdf

Link to the module: https://pypi.org/project/pdfminer/
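The conversion can also be called from Python code instead of the command line. The sketch below assumes the maintained fork pdfminer.six (installed with `pip install pdfminer.six`), which provides `pdfminer.high_level.extract_text_to_fp`; the older `pdfminer` package may not expose the same function. The helper names `html_path_for` and `convert_pdf_to_html` are invented for this example.

```python
from pathlib import Path


def html_path_for(pdf_path):
    """Return the output path: same name as the PDF, with an .html suffix."""
    return Path(pdf_path).with_suffix(".html")


def convert_pdf_to_html(pdf_path, html_path=None):
    """Write an HTML rendering of `pdf_path` and return the output path."""
    # Imported lazily; these names come from pdfminer.six (an assumption,
    # since the answer above installs the original pdfminer package).
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    out_path = Path(html_path) if html_path else html_path_for(pdf_path)
    # The HTML converter encodes its output (UTF-8 by default),
    # so the output file is opened in binary mode.
    with open(pdf_path, "rb") as pdf_file, open(out_path, "wb") as html_file:
        extract_text_to_fp(
            pdf_file, html_file, output_type="html", laparams=LAParams()
        )
    return out_path
```

Calling `convert_pdf_to_html("file.pdf")` would write `file.html` next to the input, roughly what the `pdf2txt.py` command above does.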

Code J

This does not preserve layout – Daniel Kats Sep 02 '21 at 19:04

score 1 · Answer 6 · edited May 23 '17 at 11:48

In Perl, you can use the SWISH::Filter plugin SWISH::Filters::Pdf2HTML. (It requires the xpdf package.)

For the reverse (HTML to PDF), see this question.

Community

answered Oct 28 '09 at 18:07

Ether


score 1 · Answer 7 · answered Oct 30 '09 at 04:26

iText (http://www.lowagie.com/iText/) is an open-source library for both Java and C#.

AZ_

This is probably your best bet. Parse the PDF using the library and generate HTML from the data. – TJB Oct 30 '09 at 05:44

score 0 · Answer 8 · answered Oct 28 '09 at 18:22

If you're looking for a way to convert PDF to HTML once or twice, then I recommend Adobe Online Conversion.

If it's an API you're after then http://www.pdfonline.com/ has an SDK that should suit your needs.

If it's a library you're after then please let us know which server-side language you prefer.

Russ Bradberry

Thanks Russ! I'm using Adobe Online so far. I tried the website and the results are difficult to gauge. But thanks for the help! – user178644 Oct 28 '09 at 18:47
It seems that it no longer works. It redirects to PDF Creator. – Mikhail Golubtsov Jul 10 '13 at 18:39

score 0 · Answer 9 · answered Oct 30 '09 at 02:04

Given the vagueness of the original question, I'm going to go ahead and give a solution that will work with any language that can execute command-line apps. Although it can be a little tricky to set up, OpenOffice can be run in headless mode on a server and, with the help of jodconverter, can convert any file format to any other file format (well, any format conversion that OpenOffice can handle, that is).

Here are a couple of links that help with the setup:
