C# Extract text from PDF using PdfSharp

Question

Is there a possibility to extract plain text from a PDF-File with PdfSharp? I don't want to use iTextSharp because of its license.

Just wondering, why downvotes? (There are no clarifying comments to help author to improve the question.) — TN., Dec 11 '12 at 07:28
You need to extract the ToUnicode CMaps from the document to convert the binary indexes of the text-strings, unless you're lucky and the binary indexes are ASCII values themselves. — R.J. Dunnill, Jun 03 '19 at 03:56

score 50 · Answer 1 · answered Jun 04 '14 at 19:37

Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.

using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

public static class PdfSharpExtensions
{
    // Yields the string operands of every Tj/TJ operator on the page.
    public static IEnumerable<string> ExtractText(this PdfPage page)
    {
        var content = ContentReader.ReadContent(page);
        return content.ExtractText();
    }

    public static IEnumerable<string> ExtractText(this CObject cObject)
    {
        if (cObject is COperator cOperator)
        {
            // Only the operands of the Tj and TJ operators are inspected.
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                    foreach (var txt in ExtractText(cOperand))
                        yield return txt;
            }
        }
        else if (cObject is CSequence cSequence)
        {
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString cString)
        {
            // Raw character codes as stored in the PDF; not necessarily readable text.
            yield return cString.Value;
        }
    }
}
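
For completeness, here is a minimal sketch of how the extension method above might be called. It assumes the `PdfSharpExtensions` class from this answer is compiled into the same project; the `Program` class and the `input.pdf` path are placeholders, not part of the original answer.

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class Program
{
    public static void Main(string[] args)
    {
        // "input.pdf" is a placeholder; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        // ReadOnly is enough here because the document is only inspected.
        using (PdfDocument document = PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly))
        {
            int pageNumber = 1;
            foreach (PdfPage page in document.Pages)
            {
                Console.WriteLine("--- page " + pageNumber + " ---");
                pageNumber++;

                // ExtractText() is the extension method defined in this answer.
                foreach (string fragment in page.ExtractText())
                    Console.WriteLine(fragment);
            }
        }
    }
}
```

Each fragment is printed on its own line, which is one reason words can appear split up in the output.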

answered Jun 04 '14 at 19:37

Ronnie Overby


I am using the PDFsharp library, but it says ContentReader Class is out of context. What could be the problem? – Sudarshan Taparia Aug 31 '16 at 13:33
ContentReader Class is out of context. – Ronnie Overby Sep 01 '16 at 20:42
Couldn't resist. IDK what that means or how to fix it. I try to avoid working with PDFs like the plague because the tools to work with them are crap and pretending that a human-readable format is machine-readable is a total fool's errand. – Ronnie Overby Sep 01 '16 at 20:43
PdfSharp v1.32.3057 has a bug where `ContentReader.ReadContent` hangs. To fix, there are some changes needed (see [here](http://forum.pdfsharp.net/viewtopic.php?p=7911#p7911)). After fixing the bug, I can confirm this works. :-) – Nicholas Miller Apr 05 '17 at 15:33
Namespace for `ContentReader` : `PdfSharp.Pdf.Content.ContentReader`. – NoOne Jul 09 '18 at 15:11
Although this is promising, it does not work for Unicode texts. – NoOne Jul 09 '18 at 15:52
It seems that it works fine when `OpCode.Name == "Tj"` (which, I guess, is related to ASCII) and returns gibberish when `OpCode.Name == "TJ"` (which, I guess, is Unicode). – NoOne Jul 09 '18 at 16:08
TJ allows glyph-spacing, and to that end, will have integer values between its strings. Neither Tj nor TJ are related to ASCII: both use binary indexes which cannot be depended on to be ASCII. – R.J. Dunnill Jun 03 '19 at 03:54
This works OOTB copy/paste for me in Jan 2021 against PDFs we create from our AS400 / i5. Many Thanks – bkwdesign Jan 29 '21 at 18:04
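
To illustrate the point made in the comments about ToUnicode CMaps, here is a rough, self-contained sketch of what the decoding step could look like once you have the CMap text and the raw character codes. It does not use PdfSharp at all; how to get the CMap stream and the code bytes out of the document is left open. The names `ToUnicodeSketch`, `ParseBfChars` and `Decode` are made up for this example. It only reads `beginbfchar` sections (not `beginbfrange`) and assumes fixed-width codes, two bytes by default, so treat it as a starting point rather than a complete decoder.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeSketch
{
    // Collects the "<code> <unicode>" pairs found inside beginbfchar ... endbfchar.
    // Assumption: bfrange sections are ignored in this sketch.
    public static Dictionary<int, string> ParseBfChars(string cmapText)
    {
        var map = new Dictionary<int, string>();
        foreach (Match section in Regex.Matches(
            cmapText, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
        {
            foreach (Match pair in Regex.Matches(
                section.Groups[1].Value, @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"))
            {
                int code = int.Parse(pair.Groups[1].Value, NumberStyles.HexNumber);
                map[code] = DecodeUtf16BeHex(pair.Groups[2].Value);
            }
        }
        return map;
    }

    // The destination of a bfchar entry is UTF-16BE written as hex digits.
    private static string DecodeUtf16BeHex(string hex)
    {
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = byte.Parse(hex.Substring(i * 2, 2), NumberStyles.HexNumber);
        return Encoding.BigEndianUnicode.GetString(bytes);
    }

    // Maps raw character codes to text.
    // Assumption: every code has the same width (bytesPerCode).
    public static string Decode(byte[] codes, Dictionary<int, string> map, int bytesPerCode = 2)
    {
        var text = new StringBuilder();
        for (int i = 0; i + bytesPerCode <= codes.Length; i += bytesPerCode)
        {
            int code = 0;
            for (int j = 0; j < bytesPerCode; j++)
                code = (code << 8) | codes[i + j];

            // Codes without a mapping become U+FFFD (replacement character).
            string mapped;
            text.Append(map.TryGetValue(code, out mapped) ? mapped : "\uFFFD");
        }
        return text.ToString();
    }
}
```

With a CMap containing `<0001> <0048>` and `<0002> <0069>`, decoding the bytes `00 01 00 02` gives "Hi".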

score 21 · Answer 2 · edited Jun 01 '22 at 22:00

I implemented it in much the same way David did. Here is my code:

...
{
    // ....
    // Pages is zero-based, so index 1 is the second page of the document.
    var page = document.Pages[1];
    CObject content = ContentReader.ReadContent(page);
    var extractedText = ExtractText(content);
    // ...
}

private IEnumerable<string> ExtractText(CObject cObject)
{
    var textList = new List<string>();
    if (cObject is COperator)
    {
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
            {
                textList.AddRange(ExtractText(cOperand));
            }
        }
    }
    else if (cObject is CSequence)
    {
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence)
        {
            textList.AddRange(ExtractText(element));
        }
    }
    else if (cObject is CString)
    {
        var cString = cObject as CString;
        textList.Add(cString.Value);
    }
    return textList;
}

You shouldn't have stripped down the StringBuilder, PDFs are quite big and that solution will cause a huge unnecessary memory consumption. — Ivan Ičin, Aug 20 '16 at 14:37
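
Regarding the memory concern in the comment above, here is a sketch of the same traversal writing into one `StringBuilder` instead of building a list at every recursion level. The class name `StringBuilderExtraction` and the choice to end each Tj/TJ operator with a line break are my own assumptions, not something PdfSharp prescribes.

```csharp
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

public static class StringBuilderExtraction
{
    public static string ExtractText(PdfPage page)
    {
        var builder = new StringBuilder();
        Append(ContentReader.ReadContent(page), builder);
        return builder.ToString();
    }

    // Same recursion as in the answer, but all text goes into a single buffer.
    private static void Append(CObject cObject, StringBuilder builder)
    {
        if (cObject is COperator cOperator)
        {
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (CObject operand in cOperator.Operands)
                    Append(operand, builder);

                // Assumption: one output line per text-showing operator.
                builder.AppendLine();
            }
        }
        else if (cObject is CSequence cSequence)
        {
            foreach (CObject element in cSequence)
                Append(element, builder);
        }
        else if (cObject is CString cString)
        {
            builder.Append(cString.Value);
        }
    }
}
```

The iterator version in the other answer avoids the intermediate lists as well, so which one fits better depends on whether you want a stream of fragments or one string per page.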

score 12 · Answer 3 · answered Aug 01 '13 at 08:36

PDFSharp provides all the tools to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from TJ/Tj operators.

I've uploaded a simple implementation to GitHub.

answered Aug 01 '13 at 08:36

David Schmitt


On many PDFs CString.Value returns just some junk (e.g. create a PDF using OpenOffice.org and try to import it using this method). – Ivan Ičin Aug 20 '16 at 14:52
No, PdfSharp does not provide all the tools for text extraction. Functionality has yet to be added for ToUnicode CMaps, which are necessary to extract the text of Unicode PDFs. – R.J. Dunnill Jun 03 '19 at 03:59
Because that's the choice I made. – David Schmitt Dec 02 '19 at 11:01
It doesn't seem to be perfect: one word can be split across several lines, e.g. "Pre dic t ion i s ve". – hazjack Sep 28 '20 at 08:24
@hazjack Yeah, you'll need a strong AI then to salvage the text from your PDF. – David Schmitt Sep 29 '20 at 09:07
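
On the split-words problem: part of it may come from treating every string inside a TJ array as a separate piece of text. A TJ array mixes strings with numeric position adjustments, so one possible mitigation is to join all strings of one operator into a single run and only insert a space when an adjustment is large. The sketch below does that. `TextRunSketch`, `JoinOperands` and the `GapThreshold` value are hypothetical and would need tuning per document. It will not help when a word is spread over several separate Tj operators, and it does nothing about the character-code issue discussed in the comments.

```csharp
using System.Text;
using PdfSharp.Pdf.Content.Objects;

public static class TextRunSketch
{
    // Hypothetical threshold, in thousandths of a unit of text space.
    // Assumption: adjustments more negative than this represent a word gap.
    private const double GapThreshold = 200;

    // Joins all string operands of one Tj/TJ operator into a single run.
    public static string JoinOperands(COperator textOperator)
    {
        var run = new StringBuilder();
        foreach (CObject operand in textOperator.Operands)
            Append(operand, run);
        return run.ToString();
    }

    private static void Append(CObject operand, StringBuilder run)
    {
        if (operand is CString cString)
        {
            run.Append(cString.Value);
        }
        else if (operand is CSequence cSequence)
        {
            // The array operand of TJ: strings interleaved with numbers.
            foreach (CObject item in cSequence)
                Append(item, run);
        }
        else if (operand is CInteger cInteger)
        {
            // Negative adjustments move the next glyph further along the line.
            if (cInteger.Value < -GapThreshold)
                run.Append(' ');
        }
        else if (operand is CReal cReal)
        {
            if (cReal.Value < -GapThreshold)
                run.Append(' ');
        }
    }
}
```

You would call `JoinOperands` from the operator branch of the extraction code instead of recursing into each operand separately.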
