
I'm using Spring Batch to read CSV files. When I open these files with Notepad++ I see that the encoding used is ANSI. When reading a line from one of these files, I notice that the accented characters are not shown correctly. For example, let's take this line:

Données issues de la reprise des données

It gets transformed into something like this, with some special characters:

[screenshot: the same line with each accented character replaced by a garbled symbol]

So as a first solution I set the encoding for my ItemReader to UTF-8, but the problem still exists.

  • I thought that with UTF-8 encoding all my accented characters would be recognized. Is that not true? From what I've heard, UTF-8 is the best encoding to handle all characters, on a web page for example. (See the sketch below.)
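A minimal standalone sketch (outside Spring Batch) of what seems to be going on, assuming the file was saved in an ANSI/ISO-8859-1-style encoding: the bytes of such a file are not valid UTF-8, so decoding them as UTF-8 mangles every accented character instead of recognizing it.

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String original = "Données issues de la reprise des données";

        // Bytes as they would be written by an ANSI / ISO-8859-1 editor
        byte[] ansiBytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding those bytes as UTF-8: the single byte 0xE9 ("é" in ISO-8859-1)
        // is not a valid UTF-8 sequence, so it turns into the replacement character
        String decodedAsUtf8 = new String(ansiBytes, StandardCharsets.UTF_8);

        System.out.println(decodedAsUtf8); // Donn�es issues de la reprise des donn�es
    }
}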

After setting my ItemReader encoding to ISO-8859-1:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.core.io.FileSystemResource;

public class TestItemReader extends FlatFileItemReader<TestFileRow> {

    private static final Logger log = LoggerFactory.getLogger(TestItemReader.class);

    public TestItemReader(String path) {
        this.setResource(new FileSystemResource(path + "/Test.csv"));
        this.setEncoding("ISO-8859-1");
    }
}

I can see that these characters are now displayed correctly.

  • As output I should write with UTF-8 encoding. Is it correct to use ISO-8859-1 as the input encoding and UTF-8 as the output encoding? (See the writer sketch below.)
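For illustration, a hypothetical writer configured with UTF-8 output, independent of the reader's encoding. The output file name and the PassThroughLineAggregator are placeholders, not part of the original job; only TestFileRow comes from the question.

import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.core.io.FileSystemResource;

public class TestItemWriter extends FlatFileItemWriter<TestFileRow> {

    public TestItemWriter(String path) {
        this.setResource(new FileSystemResource(path + "/Test-out.csv")); // placeholder output file
        this.setEncoding("UTF-8"); // items are plain Java Strings in memory, so the output encoding is independent of the input
        this.setLineAggregator(new PassThroughLineAggregator<>()); // placeholder aggregator
    }
}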
Feres.o
  • "My question is that why when i try to set the itemReader encoding to utf-8 still persist ?" - Um, because the file isn't in UTF-8. Its not clear what you're asking, to be honest. – Jon Skeet Nov 15 '17 at 09:42
  • I updated the post – Feres.o Nov 15 '17 at 09:47
  • I suspect you don't understand how encodings work. If a file is encoded in ISO-8859-1 and you try to read it using UTF-8, it's a bit like trying to use a PNG reader to load a JPEG image. UTF-8 can represent every character in Unicode, but that doesn't mean you can arbitrarily use it for files that are encoded in a different encoding. – Jon Skeet Nov 15 '17 at 09:48
  • You might want to read http://csharpindepth.com/Articles/General/Unicode.aspx - it's phrased in terms of C#, but the concepts are the same. – Jon Skeet Nov 15 '17 at 09:49
  • OK, so there is no concept of a universal encoding that can read any format. I should use the same encoding that Notepad++ reports – Feres.o Nov 15 '17 at 09:50
  • Well, "ANSI" isn't a single encoding either. If you can change what's producing the CSV files to output UTF-8, that would be the best thing. But if you can't change that, you should find out what encoding it's using (without just relying on Notepad++). – Jon Skeet Nov 15 '17 at 09:51
  • Thank you very much for your kind help and clarifications – Feres.o Nov 15 '17 at 09:53

2 Answers


I had the same problem. The input file is ANSI, and "ü" gets displayed as a square in the output.

That's because your input file is encoded in ANSI, but by default, Spring Batch assumes ISO-8859-1 encoding (6.6.2 FlatFileItemReader).

Therefore, you have to set the encoding for your reader to "Cp1252" (setEncoding("Cp1252")) - that's the name Java uses for the Windows "ANSI" code page.

Furthermore, you will have to set your writer's encoding to "utf-8". I'm not entirely sure why it doesn't work with other encodings (that are generally able to display "ü", such as ISO-8859-1), but it works with UTF-8, so that's what I'm using.
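Putting both pieces of advice together, a minimal sketch of the reader side, reusing the reader shape from the question (TestFileRow and the file name come from the original post; the class name here is just illustrative):

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.core.io.FileSystemResource;

public class AnsiItemReader extends FlatFileItemReader<TestFileRow> {

    public AnsiItemReader(String path) {
        this.setResource(new FileSystemResource(path + "/Test.csv"));
        // "Cp1252" is Java's name for the Windows "ANSI" code page (windows-1252)
        this.setEncoding("Cp1252");
    }
}

On the writer side, the matching call is setEncoding("UTF-8"), as described above.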

PixelMaster

I used the same encoding, "ISO-8859-1", and all characters are displayed correctly.

zedtimi