55

Is it possible to extract plain text from a PDF file with PdfSharp? I don't want to use iTextSharp because of its license.

skeletank
der_chirurg
  • 8
    Just wondering, why the downvotes? (There are no clarifying comments to help the author improve the question.) – TN. Dec 11 '12 at 07:28
  • You need to extract the ToUnicode CMaps from the document to convert the binary indexes of the text-strings, unless you're lucky and the binary indexes are ASCII values themselves. – R.J. Dunnill Jun 03 '19 at 03:56
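
To make the ToUnicode point above concrete, here is a minimal, self-contained sketch of the lookup step. It is not part of PdfSharp: `ToUnicodeSketch`, `ParseBfChars` and `Decode` are hypothetical names. It assumes you already have the CMap text and the raw string bytes in hand, that glyph codes are two bytes wide (common for Identity-H fonts, but not guaranteed), and it only reads `beginbfchar` sections with single UTF-16 destinations. `bfrange` sections and ligature mappings are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeSketch
{
    // Reads "<src> <dst>" pairs between beginbfchar and endbfchar.
    // Assumption: only 4-hex-digit destinations (one UTF-16 unit) are kept.
    public static Dictionary<int, char> ParseBfChars(string cmap)
    {
        var map = new Dictionary<int, char>();
        var sections = Regex.Matches(cmap, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline);
        foreach (Match section in sections)
        {
            var pairs = Regex.Matches(section.Groups[1].Value,
                                      @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>");
            foreach (Match pair in pairs)
            {
                string dst = pair.Groups[2].Value;
                if (dst.Length != 4)
                    continue; // multi-character destinations are not handled in this sketch
                int code = int.Parse(pair.Groups[1].Value, NumberStyles.HexNumber);
                map[code] = (char)int.Parse(dst, NumberStyles.HexNumber);
            }
        }
        return map;
    }

    // Decodes big-endian two-byte glyph codes; unmapped codes become U+FFFD.
    public static string Decode(byte[] raw, Dictionary<int, char> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            int code = (raw[i] << 8) | raw[i + 1];
            char ch;
            sb.Append(map.TryGetValue(code, out ch) ? ch : '\uFFFD');
        }
        return sb.ToString();
    }
}
```

With a CMap containing `<0003> <0048>` and `<0004> <0069>`, the bytes `00 03 00 04` decode to `Hi`. Locating the font's ToUnicode stream and getting the raw bytes out of a string operand depend on the PdfSharp version and are not covered here.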

3 Answers

50

Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.

using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

public static class PdfSharpExtensions
{
    // Yields every string operand of the text-showing operators on the page.
    public static IEnumerable<string> ExtractText(this PdfPage page)
    {
        var content = ContentReader.ReadContent(page);
        return content.ExtractText();
    }

    public static IEnumerable<string> ExtractText(this CObject cObject)
    {
        if (cObject is COperator)
        {
            var cOperator = cObject as COperator;
            // Tj shows a single string; TJ shows an array of strings and spacing numbers.
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                    foreach (var txt in ExtractText(cOperand))
                        yield return txt;
            }
        }
        else if (cObject is CSequence)
        {
            // Recurse into nested sequences (page content, TJ arrays).
            var cSequence = cObject as CSequence;
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString)
        {
            var cString = cObject as CString;
            yield return cString.Value;
        }
    }
}
Ronnie Overby
  • I am using the PDFsharp library, but it says the ContentReader class is out of context. What could be the problem? – Sudarshan Taparia Aug 31 '16 at 13:33
  • ContentReader Class is out of context. – Ronnie Overby Sep 01 '16 at 20:42
  • 5
    Couldn't resist. IDK what that means or how to fix it. I try to avoid working with PDFs like the plague because the tools to work with them are crap and pretending that a human-readable format is machine-readable is a total fool's errand. – Ronnie Overby Sep 01 '16 at 20:43
  • 1
    PdfSharp v1.32.3057 has a bug where `ContentReader.ReadContent` hangs. To fix, there are some changes needed (see [here](http://forum.pdfsharp.net/viewtopic.php?p=7911#p7911)). After fixing the bug, I can confirm this works. :-) – Nicholas Miller Apr 05 '17 at 15:33
  • Namespace for `ContentReader` : `PdfSharp.Pdf.Content.ContentReader`. – NoOne Jul 09 '18 at 15:11
  • 3
    Although this is promising, it does not work for Unicode texts. – NoOne Jul 09 '18 at 15:52
  • It seems that it works fine when `OpCode.Name == "Tj"` (which, I guess, is related to ASCII) and returns gibberish when `OpCode.Name == "TJ"` (which, I guess, is Unicode). – NoOne Jul 09 '18 at 16:08
  • TJ allows glyph spacing and, to that end, has numeric values between its strings. Neither Tj nor TJ is related to ASCII: both use binary indexes which cannot be depended on to be ASCII. – R.J. Dunnill Jun 03 '19 at 03:54
  • This works OOTB copy/paste for me in Jan 2021 against PDFs we create from our AS400 / i5. Many Thanks – bkwdesign Jan 29 '21 at 18:04
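
The comments above note that a TJ array interleaves strings with spacing numbers. The following self-contained sketch shows one way those numbers could be used when joining the pieces. It does not use PdfSharp types: `TjArraySketch` and `Join` are hypothetical names, operands are modelled as plain `string`, `int` and `double` values, and the strings are assumed to be already decoded. The `-200` threshold is a guessed heuristic, not a value from the PDF specification.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TjArraySketch
{
    // Numbers in a TJ array are in thousandths of a text-space unit and are
    // subtracted from the current position, so a large negative value pushes
    // the next glyph to the right. This sketch treats such a gap as a space.
    // Assumption: gapThreshold is a heuristic that may need tuning per document.
    public static string Join(IEnumerable<object> operands, double gapThreshold = -200)
    {
        var sb = new StringBuilder();
        foreach (var operand in operands)
        {
            if (operand is string s)
            {
                sb.Append(s);
            }
            else if (operand is int i)
            {
                if (i < gapThreshold) sb.Append(' ');
            }
            else if (operand is double d)
            {
                if (d < gapThreshold) sb.Append(' ');
            }
            // Anything else is ignored in this sketch.
        }
        return sb.ToString();
    }
}
```

For example, `TjArraySketch.Join(new object[] { "Hel", -20, "lo", -300, "World" })` returns `Hello World`: the small kerning adjustment is dropped and the large one becomes a space. The extension methods in the answer above simply yield the strings and discard the numbers, which is why words can run together in their output.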
21

I have implemented it in a way somewhat similar to how David did it. Here is my code:

...
{
    // ....
    var page = document.Pages[1]; // Pages is zero-based, so this is the second page
    CObject content = ContentReader.ReadContent(page);
    var extractedText = ExtractText(content);
    // ...
}

private IEnumerable<string> ExtractText(CObject cObject)
{
    var textList = new List<string>();
    if (cObject is COperator)
    {
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
            {
                textList.AddRange(ExtractText(cOperand));
            }
        }
    }
    else if (cObject is CSequence)
    {
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence)
        {
            textList.AddRange(ExtractText(element));
        }
    }
    else if (cObject is CString)
    {
        var cString = cObject as CString;
        textList.Add(cString.Value);
    }
    return textList;
}
Sergio
  • 1
    You shouldn't have stripped out the StringBuilder. PDFs are quite big, and this solution will cause huge, unnecessary memory consumption. – Ivan Ičin Aug 20 '16 at 14:37
12

PDFSharp provides all the tools to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from TJ/Tj operators.

I've uploaded a simple implementation to GitHub.

David Schmitt