
I am creating PDF documents from user inputs that are UTF-8.

Before I even get to displaying the PDFs, the creation itself fails with java.lang.IllegalArgumentException: U+039B is not available in this font's encoding: WinAnsiEncoding.
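
The failure is easy to reproduce; here is a minimal sketch, assuming one of the standard WinAnsi-encoded Type 1 fonts and the Greek capital lambda (U+039B) as an example character:

// Minimal sketch of how the exception shows up (illustrative only): the standard
// Type 1 fonts in PDFBox 2.0.x use WinAnsiEncoding, so any character outside that
// encoding makes showText() throw an IllegalArgumentException.
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class WinAnsiFailureDemo {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);
            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(PDType1Font.HELVETICA, 12);
                content.newLineAtOffset(50, 700);
                content.showText("\u039B"); // throws: U+039B is not available in this font's encoding
                content.endText();
            }
            document.save("demo.pdf");
        }
    }
}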

Most answers here point to "using a font with better UTF-8 support", but as I have no control over user input, that support is never going to cover everything, and I need a bullet-proof solution (as in: print something rather than error out).
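
For reference, that usual suggestion amounts to embedding a TrueType font as a composite (Type 0) font; a minimal sketch, where the font file path is just an assumption:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class EmbedUnicodeFont {
    // Load a TrueType font as a Type 0 (composite) font; PDFBox 2.0.x embeds a
    // subset containing only the glyphs actually used. The font path is just a
    // placeholder for illustration.
    static PDFont loadUnicodeFont(PDDocument document) throws IOException {
        return PDType0Font.load(document, new File("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"));
    }
}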

The answer Using PDFBox to write unicode strings to a PDF suggests that the text should be sanitised before it is added to the PDF.

The issue is that I cannot find a valid example of how to achieve this. All the examples seem to point at removed code (font.setToUnicode, or some method in Encoding that converts characters one at a time).

So, in a nutshell: given a string, I want a bullet-proof method to write as much of it as possible to a PDFBox document (obviously, characters missing from the font will be replaced or simply not printed).

Many thanks, JM

  • Which PDFBox version do you use? As the answer you refer to points out, the situation differs for versions 1.8.x and 2.0.x – mkl Feb 24 '17 at 17:25
  • I am using 2.0.3 (the last one published). – jmc34 Feb 24 '17 at 18:58
  • Which font do you use? How do you use it? Pdfbox 2.0.x allows you to embed font subsets which contain the glyphs you need. – mkl Feb 25 '17 at 09:29
  • @mkl yes, I tried with the Ubuntu fonts, which improved things up to a point, but it is never going to be good enough as I cannot know in advance what characters will be printed. I am printing text that is user input, and basically users have access to the whole UTF-8 set. Is there a way to know which glyphs a font has for which code points? That would be massively inefficient, but I could scan all strings and replace missing characters with a placeholder... – jmc34 Feb 25 '17 at 13:02
  • https://stackoverflow.com/a/31424164/3977077 This helps a lot to remove non-printable characters – Sherlock Aug 24 '17 at 02:04

1 Answer


I ended up doing character-by-character sanitization.

Here is what my sanitization function looks like.

To avoid reprocessing characters, I cache the availability of each code point for each given font.

When a code point is not available in the font, I try the "standard" replacement character (U+FFFD), and if that is not available either, I fall back to a question mark.

It is admittedly inefficient, but I have not found a more efficient way to do this, bearing in mind that I have no control over, and no advance knowledge of, what is being printed.

There might be a lot of things to improve but this works for my use case.

// Cache of code point availability per font, keyed on (font name, code point).
// Declared once in the enclosing class (requires java.util.Map / java.util.HashMap).
private final Map<Integer, Boolean> codePointAvailCache = new HashMap<>();

private String getPrintableString(String string, PDFont font) {

    StringBuilder sb = new StringBuilder();

    for (int i = 0; i < string.length(); ) {

        int codePoint = string.codePointAt(i);
        // Supplementary code points (e.g. emoji) occupy two chars, so advance accordingly.
        int charCount = Character.charCount(codePoint);

        // Keep line feeds as-is; they are handled separately when laying out text.
        if (codePoint == 0x000A) {
            sb.appendCodePoint(codePoint);
            i += charCount;
            continue;
        }

        // Cache key combining the font name and the code point.
        String fontName = font.getName();
        int cpKey = fontName.hashCode();
        cpKey = 31 * cpKey + codePoint;

        // Probe the font once per (font, code point): encode() throws if the
        // code point cannot be encoded with this font.
        if (codePointAvailCache.get(cpKey) == null) {

            try {
                font.encode(string.substring(i, i + charCount));
                codePointAvailCache.put(cpKey, true);
            } catch (Exception e) {
                codePointAvailCache.put(cpKey, false);
            }
        }

        if (!codePointAvailCache.get(cpKey)) {

            // Need to make sure our font has a replacement character (U+FFFD);
            // if not, fall back to a plain question mark.
            try {
                codePoint = 0xFFFD;
                font.encode(new String(new int[] { codePoint }, 0, 1));
            } catch (Exception e) {
                codePoint = 0x003F;
            }
        }

        sb.appendCodePoint(codePoint);
        i += charCount;
    }

    return sb.toString();
}
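
For illustration, a minimal sketch of how the helper can be used when writing a line of text (the method name, font path, coordinates and output file are just placeholders; it lives in the same class as getPrintableString and codePointAvailCache):

// Sanitize the user input before handing it to showText(), so a missing glyph
// becomes a replacement character instead of an IllegalArgumentException.
void writeUserText(String userInput) throws IOException {
    try (PDDocument document = new PDDocument()) {
        PDPage page = new PDPage();
        document.addPage(page);

        PDFont font = PDType0Font.load(document, new File("/path/to/a-unicode-font.ttf"));

        try (PDPageContentStream content = new PDPageContentStream(document, page)) {
            content.beginText();
            content.setFont(font, 12);
            content.newLineAtOffset(50, 700);
            content.showText(getPrintableString(userInput, font));
            content.endText();
        }

        document.save("output.pdf");
    }
}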