
I'm using Spring Batch to read CSV files. When I open these files with Notepad++ I see that the encoding used is ANSI. When reading a line from one of these files, I notice that the accented characters are not shown correctly. For example, let's take this line:

Données issues de la reprise des données

It gets transformed into something like this, with some special characters:

[screenshot: the same line with each accented character replaced by a garbled symbol]

So as a first solution I set the encoding for my ItemReader to UTF-8, but the problem still exists.

  • I thought that with UTF-8 encoding all my accented characters would be recognized. Is that not true? From what I've heard, UTF-8 is the best encoding to handle all characters, on a web page for example. (See the sketch below.)
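A minimal standalone sketch (outside Spring Batch) of what seems to be going on, assuming the file was saved in an ANSI/ISO-8859-1-style encoding: the bytes of such a file are not valid UTF-8, so decoding them as UTF-8 mangles every accented character instead of recognizing it.

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String original = "Données issues de la reprise des données";

        // Bytes as they would be written by an ANSI / ISO-8859-1 editor
        byte[] ansiBytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding those bytes as UTF-8: the single byte 0xE9 ("é" in ISO-8859-1)
        // is not a valid UTF-8 sequence, so it turns into the replacement character
        String decodedAsUtf8 = new String(ansiBytes, StandardCharsets.UTF_8);

        System.out.println(decodedAsUtf8); // Donn�es issues de la reprise des donn�es
    }
}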

After setting my ItemReader encoding to ISO-8859-1:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.core.io.FileSystemResource;

public class TestItemReader extends FlatFileItemReader<TestFileRow> {

    private static final Logger log = LoggerFactory.getLogger(TestItemReader.class);

    public TestItemReader(String path) {
        this.setResource(new FileSystemResource(path + "/Test.csv"));
        this.setEncoding("ISO-8859-1");
    }
}

I can see that these characters are now displayed correctly.

  • As output I should write with UTF-8 encoding. Is it correct to use ISO-8859-1 as the input encoding and UTF-8 as the output encoding? (See the writer sketch below.)
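For illustration, a hypothetical writer configured with UTF-8 output, independent of the reader's encoding. The output file name and the PassThroughLineAggregator are placeholders, not part of the original job; only TestFileRow comes from the question.

import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.core.io.FileSystemResource;

public class TestItemWriter extends FlatFileItemWriter<TestFileRow> {

    public TestItemWriter(String path) {
        this.setResource(new FileSystemResource(path + "/Test-out.csv")); // placeholder output file
        this.setEncoding("UTF-8"); // items are plain Java Strings in memory, so the output encoding is independent of the input
        this.setLineAggregator(new PassThroughLineAggregator<>()); // placeholder aggregator
    }
}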
Feres.o
  • "My question is that why when i try to set the itemReader encoding to utf-8 still persist ?" - Um, because the file isn't in UTF-8. Its not clear what you're asking, to be honest. – Jon Skeet Nov 15 '17 at 09:42
  • I updated the post – Feres.o Nov 15 '17 at 09:47
  • I suspect you don't understand how encodings work. If a file is encoded in ISO-8859-1 and you try to read it using UTF-8, it's a bit like trying to use a PNG reader to load a JPEG image. UTF-8 can represent every character in Unicode, but that doesn't mean you can arbitrarily use it for files that are encoded in a different encoding. – Jon Skeet Nov 15 '17 at 09:48
  • You might want to read http://csharpindepth.com/Articles/General/Unicode.aspx - it's phrased in terms of C#, but the concepts are the same. – Jon Skeet Nov 15 '17 at 09:49
  • OK, so there is no concept of a universal encoding that can read any format. I should use the same encoding that Notepad++ reports – Feres.o Nov 15 '17 at 09:50
  • Well, "ANSI" isn't a single encoding either. If you can change what's producing the CSV files to output UTF-8, that would be the best thing. But if you can't change that, you should find out what encoding it's using (without just relying on Notepad++). – Jon Skeet Nov 15 '17 at 09:51
  • Thank you very much for your kind help and clarifications – Feres.o Nov 15 '17 at 09:53

2 Answers


I had the same problem. The input file is ANSI, and "ü" gets displayed as a square in the output.

That's because your input file is encoded in ANSI, but by default, Spring Batch assumes ISO-8859-1 encoding (6.6.2 FlatFileItemReader).

Therefore, you have to set the encoding for your reader to "Cp1252" (setEncoding("Cp1252")) - that's the name Java uses for the Windows "ANSI" code page.

Furthermore, you will have to set your writer's encoding to "utf-8". I'm not entirely sure why it doesn't work with other encodings (that are generally able to display "ü", such as ISO-8859-1), but it works with UTF-8, so that's what I'm using.
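Putting both pieces of advice together, a minimal sketch of the reader side, reusing the reader shape from the question (TestFileRow and the file name come from the original post; the class name here is just illustrative):

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.core.io.FileSystemResource;

public class AnsiItemReader extends FlatFileItemReader<TestFileRow> {

    public AnsiItemReader(String path) {
        this.setResource(new FileSystemResource(path + "/Test.csv"));
        // "Cp1252" is Java's name for the Windows "ANSI" code page (windows-1252)
        this.setEncoding("Cp1252");
    }
}

On the writer side, the matching call is setEncoding("UTF-8"), as described above.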

PixelMaster

I used the same encoding, "ISO-8859-1", and all characters are displayed correctly.

zedtimi