Is it possible to extract plain text from a PDF file with PdfSharp? I don't want to use iTextSharp because of its license.
Just wondering, why the downvotes? (There are no clarifying comments to help the author improve the question.) – TN. Dec 11 '12 at 07:28
You need to extract the ToUnicode CMaps from the document to convert the binary indexes of the text-strings, unless you're lucky and the binary indexes are ASCII values themselves. – R.J. Dunnill Jun 03 '19 at 03:56
3 Answers
50
Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.
using System.Collections.Generic;
using PdfSharp.Pdf;                  // PdfPage
using PdfSharp.Pdf.Content;          // ContentReader
using PdfSharp.Pdf.Content.Objects;  // CObject, COperator, CSequence, CString, OpCodeName

public static class PdfSharpExtensions
{
    // Lazily yields the raw string operands of every Tj/TJ operator on the page.
    public static IEnumerable<string> ExtractText(this PdfPage page)
    {
        var content = ContentReader.ReadContent(page);
        var text = content.ExtractText();
        return text;
    }

    public static IEnumerable<string> ExtractText(this CObject cObject)
    {
        if (cObject is COperator)
        {
            // Only the text-showing operators Tj and TJ carry strings.
            var cOperator = cObject as COperator;
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                    foreach (var txt in ExtractText(cOperand))
                        yield return txt;
            }
        }
        else if (cObject is CSequence)
        {
            // Recurse into nested sequences (page content, TJ arrays).
            var cSequence = cObject as CSequence;
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString)
        {
            var cString = cObject as CString;
            yield return cString.Value;
        }
    }
}
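For completeness, here is a minimal usage sketch. It assumes the `PdfSharpExtensions` class above is compiled into the same project; the `ExtractTextDemo` class name and the `input.pdf` fallback path are placeholders, not part of the original answer. As the comments in this thread point out, the strings come back unmapped, so PDFs that rely on ToUnicode CMaps may still produce gibberish.

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class ExtractTextDemo
{
    public static void Main(string[] args)
    {
        // Placeholder path (assumption); pass a real file as the first argument.
        var path = args.Length > 0 ? args[0] : "input.pdf";

        // Read-only mode should be sufficient because the document is never modified.
        using (PdfDocument document = PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly))
        {
            for (int i = 0; i < document.Pages.Count; i++)
            {
                // ExtractText() is the extension method defined above.
                // The fragments are concatenated without a separator because
                // a single word may be split across several operands.
                string pageText = string.Concat(document.Pages[i].ExtractText());

                Console.WriteLine("--- page " + (i + 1) + " ---");
                Console.WriteLine(pageText);
            }
        }
    }
}
```

Because `ExtractText` is an iterator, nothing is read from the content stream until `string.Concat` enumerates it, so only one page's text is held in memory at a time.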
Ronnie Overby
I am using the PDFsharp library, but it says the ContentReader class is out of context. What could be the problem? – Sudarshan Taparia Aug 31 '16 at 13:33
Couldn't resist. IDK what that means or how to fix it. I try to avoid working with PDFs like the plague because the tools to work with them are crap and pretending that a human-readable format is machine-readable is a total fool's errand. – Ronnie Overby Sep 01 '16 at 20:43
PdfSharp v1.32.3057 has a bug where `ContentReader.ReadContent` hangs. To fix, there are some changes needed (see [here](http://forum.pdfsharp.net/viewtopic.php?p=7911#p7911)). After fixing the bug, I can confirm this works. :-) – Nicholas Miller Apr 05 '17 at 15:33
It seems that it works fine when `OpCode.Name == "Tj"` (which, I guess, is related to ASCII) and returns gibberish when `OpCode.Name == "TJ"` (which, I guess, is Unicode). – NoOne Jul 09 '18 at 16:08
TJ allows glyph spacing and, to that end, has numeric values between its strings. Neither Tj nor TJ is related to ASCII: both use binary indexes, which cannot be relied on to be ASCII. – R.J. Dunnill Jun 03 '19 at 03:54
This works OOTB copy/paste for me in Jan 2021 against PDFs we create from our AS400 / i5. Many Thanks – bkwdesign Jan 29 '21 at 18:04
21
I implemented it in much the same way David did. Here is my code:
...
{
    // ....
    var page = document.Pages[1];
    CObject content = ContentReader.ReadContent(page);
    var extractedText = ExtractText(content);
    // ...
}

private IEnumerable<string> ExtractText(CObject cObject)
{
    var textList = new List<string>();
    if (cObject is COperator)
    {
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
            {
                textList.AddRange(ExtractText(cOperand));
            }
        }
    }
    else if (cObject is CSequence)
    {
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence)
        {
            textList.AddRange(ExtractText(element));
        }
    }
    else if (cObject is CString)
    {
        var cString = cObject as CString;
        textList.Add(cString.Value);
    }
    return textList;
}
Diego Montania
Sergio
You shouldn't have stripped out the StringBuilder. PDFs can be quite big, and this solution will cause a lot of unnecessary memory consumption. – Ivan Ičin Aug 20 '16 at 14:37
12
PDFSharp provides all the tools to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from TJ/Tj operators.
I've uploaded a simple implementation to GitHub.
David Schmitt
On many PDFs CString.Value returns just some junk (e.g. create a PDF using OpenOffice.org and try to import it using this method). – Ivan Ičin Aug 20 '16 at 14:52
No, PdfSharp does not provide all the tools for text extraction. Functionality has yet to be added for ToUnicode CMaps, which are necessary to extract the text of Unicode PDFs. – R.J. Dunnill Jun 03 '19 at 03:59
It doesn't seem to be perfect: a single word can be split into several pieces, e.g.: Pre dic t ion i s ve – hazjack Sep 28 '20 at 08:24
@hazjack Yeah, you'll need a strong AI then to salvage the text from your PDF. – David Schmitt Sep 29 '20 at 09:07