How to extract text from a PDF document?

Question

How can I extract text from a PDF document using PHP?

(I can't use other tools because I don't have root access.)

I've found some functions that work for plain text, but they don't handle Unicode characters well:

I don't see why this question is considered off-topic. It is very useful, and even if it may attract 'opinionated' answers, it is always better to see different points of view. It has a lot of hits too. – user3574492, Jun 04 '15 at 23:54

Pedro Lobito · Accepted Answer · 2020-04-10T19:19:07.893

54

Code:

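// Load the PDF2Text class (class.pdf2text.php must be on the include path)
require_once 'class.pdf2text.php';

$a = new PDF2Text();
$a->setFilename('filename.pdf'); // path to the PDF to read
$a->decodePDF();                 // parse the file and decode its text
echo $a->output();               // print the extracted text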

class.pdf2text.php Project Home
The pdf2text class doesn't work with all the PDFs I've tested. If it doesn't work for you, try PDF Parser.
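For the PDF Parser fallback mentioned above, a minimal sketch could look like the following. It assumes the library is the `smalot/pdfparser` package, installed with Composer into a local `vendor/` directory (Composer itself does not need root access). The `extractPdfText` helper name and the `vendor/autoload.php` location are illustrative assumptions, not part of the original answer.

```php
<?php
// Assumption: smalot/pdfparser was installed with Composer next to this script.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

/**
 * Hypothetical helper: return the text of a PDF using PDF Parser.
 * Throws if the library is missing or the file cannot be read.
 */
function extractPdfText($path)
{
    if (!class_exists(\Smalot\PdfParser\Parser::class)) {
        throw new RuntimeException('smalot/pdfparser is not installed');
    }
    if (!is_readable($path)) {
        throw new InvalidArgumentException("Cannot read PDF: $path");
    }

    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($path); // parse the whole document

    return $pdf->getText();           // text of all pages, concatenated
}
```

Usage would then be `echo extractPdfText('filename.pdf');`. Like the pdf2text class, this only reads text stored in the PDF; text inside scanned images would need OCR, which this sketch does not attempt.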

edited Apr 10 '20 at 19:19

answered Aug 09 '11 at 18:53

Pedro Lobito

2

If there is a table in the PDF file, it doesn't show it. I want to extract the content as it is displayed in the PDF, including the text of scanned images attached to the PDF. Is there any solution for that? – Aug 23 '12 at 05:36
Thanks a lot. That class is very useful. I just want to extract a URL from the PDF. Is there any way to find it? – CJ Ramki Apr 05 '14 at 06:30
The class includes an output buffer flush that can cause 'headers already sent' errors. Seemingly no ill-effects if you disable it (for any reasonable size of document). – Geoff Kendall Mar 15 '16 at 10:59
@CJRamki Extract the text and then use a regex to match URLs. – Pedro Lobito Dec 08 '16 at 01:56
1

Yes, the class does not work for all PDFs. Do you have any other suggestions? – Kamaldeep singh Bhatia Apr 06 '17 at 15:24
3

You may want to try http://pdfparser.org/ . – Pedro Lobito Apr 06 '17 at 20:58
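As a rough illustration of the "extract the text, then use a regex to match URLs" suggestion in the comments above, the sketch below pulls http(s) URLs out of a string. The `extractUrls` name and the pattern are assumptions chosen for this example; the pattern is deliberately simple and will not cover every valid URL form.

```php
<?php
/**
 * Hypothetical helper: return the unique http(s) URLs found in a block of text.
 */
function extractUrls($text)
{
    // Match "http://" or "https://" followed by characters that are not
    // whitespace, angle brackets, quotes, ")" or "]".
    preg_match_all('~https?://[^\s<>"\')\]]+~i', $text, $matches);

    // Drop trailing punctuation that usually belongs to the sentence, not the URL.
    $urls = array_map(function ($url) {
        return rtrim($url, '.,;:!?');
    }, $matches[0]);

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($urls));
}
```

The input would be the string returned by whichever extractor is in use, for example `extractUrls($a->output())`. This only finds URLs that appear as visible text; links stored purely as PDF annotations would not show up in the extracted text.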
