
I am creating PDF documents from user inputs that are UTF-8.

Before I even get to displaying the PDFs, the creation itself fails with java.lang.IllegalArgumentException: U+039B is not available in this font's encoding: WinAnsiEncoding.
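
The failure is easy to reproduce; here is a minimal sketch, assuming one of the standard WinAnsi-encoded Type 1 fonts and the Greek capital lambda (U+039B) as an example character:

// Minimal sketch of how the exception shows up (illustrative only): the standard
// Type 1 fonts in PDFBox 2.0.x use WinAnsiEncoding, so any character outside that
// encoding makes showText() throw an IllegalArgumentException.
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class WinAnsiFailureDemo {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);
            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(PDType1Font.HELVETICA, 12);
                content.newLineAtOffset(50, 700);
                content.showText("\u039B"); // throws: U+039B is not available in this font's encoding
                content.endText();
            }
            document.save("demo.pdf");
        }
    }
}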

Most answers here point to "using a font with better UTF-8 support", but as I have no control over user input, that support is never going to cover everything, and I need a bullet-proof solution (as in: print something rather than error out).
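
For reference, that usual suggestion amounts to embedding a TrueType font as a composite (Type 0) font; a minimal sketch, where the font file path is just an assumption:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class EmbedUnicodeFont {
    // Load a TrueType font as a Type 0 (composite) font; PDFBox 2.0.x embeds a
    // subset containing only the glyphs actually used. The font path is just a
    // placeholder for illustration.
    static PDFont loadUnicodeFont(PDDocument document) throws IOException {
        return PDType0Font.load(document, new File("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"));
    }
}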

The answer Using PDFBox to write unicode strings to a PDF suggests that the text should be sanitised before it is added to the PDF.

The issue is that I cannot find a valid example of how to achieve this. All the examples seem to point at removed code (font.setToUnicode, or some method in Encoding that converts characters one at a time).

So, in a nutshell: given a string, I want a bullet-proof method to write as much of it as possible to a PDFBox document (obviously, characters missing from the font will be replaced or simply not printed).

Many thanks, JM

  • Which PDFBox version do you use? As the answer you refer to points out, the situation differs for versions 1.8.x and 2.0.x – mkl Feb 24 '17 at 17:25
  • I am using 2.0.3 (the last one published). – jmc34 Feb 24 '17 at 18:58
  • Which font do you use? How do you use it? Pdfbox 2.0.x allows you to embed font subsets which contain the glyphs you need. – mkl Feb 25 '17 at 09:29
  • @mkl yes, I tried with the Ubuntu fonts, which improved things up to a point, but it is never going to be good enough as I cannot know in advance what characters will be printed. I am printing text that is user input, and basically users have access to the whole UTF-8 set. Is there a way to know which glyphs a font has for which code points? That would be massively inefficient, but I could scan all strings and replace missing characters with a placeholder... – jmc34 Feb 25 '17 at 13:02
  • https://stackoverflow.com/a/31424164/3977077 This helps a lot to remove non-printable characters – Sherlock Aug 24 '17 at 02:04

1 Answer


I ended up doing character-by-character sanitization.

Here is what my sanitization function looks like.

To avoid reprocessing characters, I cache the availability of each code point for each given font.

When a code point is not available in the font, I try the "standard" replacement character (U+FFFD), and if that is not available either, I fall back to a question mark.

It is admittedly inefficient, but I have not found a more efficient way to do this, bearing in mind that I have no control over, and no advance knowledge of, what is being printed.

There might be a lot of things to improve but this works for my use case.

// Cache of code point availability per font, keyed on (font name, code point).
// Declared once in the enclosing class (requires java.util.Map / java.util.HashMap).
private final Map<Integer, Boolean> codePointAvailCache = new HashMap<>();

private String getPrintableString(String string, PDFont font) {

    StringBuilder sb = new StringBuilder();

    for (int i = 0; i < string.length(); ) {

        int codePoint = string.codePointAt(i);
        // Supplementary code points (e.g. emoji) occupy two chars, so advance accordingly.
        int charCount = Character.charCount(codePoint);

        // Keep line feeds as-is; they are handled separately when laying out text.
        if (codePoint == 0x000A) {
            sb.appendCodePoint(codePoint);
            i += charCount;
            continue;
        }

        // Cache key combining the font name and the code point.
        String fontName = font.getName();
        int cpKey = fontName.hashCode();
        cpKey = 31 * cpKey + codePoint;

        // Probe the font once per (font, code point): encode() throws if the
        // code point cannot be encoded with this font.
        if (codePointAvailCache.get(cpKey) == null) {

            try {
                font.encode(string.substring(i, i + charCount));
                codePointAvailCache.put(cpKey, true);
            } catch (Exception e) {
                codePointAvailCache.put(cpKey, false);
            }
        }

        if (!codePointAvailCache.get(cpKey)) {

            // Need to make sure our font has a replacement character (U+FFFD);
            // if not, fall back to a plain question mark.
            try {
                codePoint = 0xFFFD;
                font.encode(new String(new int[] { codePoint }, 0, 1));
            } catch (Exception e) {
                codePoint = 0x003F;
            }
        }

        sb.appendCodePoint(codePoint);
        i += charCount;
    }

    return sb.toString();
}
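
For illustration, a minimal sketch of how the helper can be used when writing a line of text (the method name, font path, coordinates and output file are just placeholders; it lives in the same class as getPrintableString and codePointAvailCache):

// Sanitize the user input before handing it to showText(), so a missing glyph
// becomes a replacement character instead of an IllegalArgumentException.
void writeUserText(String userInput) throws IOException {
    try (PDDocument document = new PDDocument()) {
        PDPage page = new PDPage();
        document.addPage(page);

        PDFont font = PDType0Font.load(document, new File("/path/to/a-unicode-font.ttf"));

        try (PDPageContentStream content = new PDPageContentStream(document, page)) {
            content.beginText();
            content.setFont(font, 12);
            content.newLineAtOffset(50, 700);
            content.showText(getPrintableString(userInput, font));
            content.endText();
        }

        document.save("output.pdf");
    }
}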