Can I fall back to latin1 if there are illegal bytes?

Question

The help text for fileencodings says this:

This is a list of character encodings considered when starting to edit an existing file. When a file is read, Vim tries to use the first mentioned character encoding. If an error is detected, the next one in the list is tried. When an encoding is found that works, 'fileencoding' is set to it.

Since all byte strings are valid latin1 text, but utf-8 is more common, I have set my fileencodings as:

set fileencodings=utf-8,latin1

However, vim appears to use utf-8 even when there are decoding errors. A minimal example is the file containing the bytes 0x00 0xfd:

% xxd test.in
00000000: 00fd                                     ..
% vim test.in
"test.in" [noeol][ILLEGAL BYTE in line 1] 1L, 2C
:set fileencoding?
fileencoding=utf-8

Why is this? How can I ask vim to fall back to latin1 when it sees illegal bytes?
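The behavior the help text describes can be sketched outside Vim. The following Python snippet is only an illustration of "try each encoding in order and keep the first one that works"; it is not Vim's implementation, and the function name first_working_encoding is hypothetical. It uses Python's strict decoders, which may not agree with Vim's own checks in every case.

```python
from typing import Sequence


def first_working_encoding(data: bytes, candidates: Sequence[str]) -> str:
    """Return the first encoding in `candidates` that decodes `data` without error.

    Hypothetical helper: mirrors the idea in the 'fileencodings' help text,
    not Vim's actual code.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on any illegal byte
        except UnicodeDecodeError:
            continue  # this encoding failed, try the next one in the list
        return name
    raise ValueError("no candidate encoding could decode the data")


# The two bytes from the example file: 0x00 0xfd.
sample = bytes([0x00, 0xFD])
chosen = first_working_encoding(sample, ["utf-8", "latin-1"])
print(chosen)
```

With a strict decoder, utf-8 is rejected for these bytes and latin-1 is chosen, which is the result the question is asking Vim to produce.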

score 2 · Answer 1 · answered Mar 29 '18 at 20:26

The biggest problem with your example is [noeol]. The bytes 00 FD alone do not make up a file that a text editor expects. If you add a newline to the end (00 FD 0A), you won't get the [noeol] message, and I suspect the file will then open correctly as latin1 (it does in my tests); you'll see something like <00>ý or ^@ý. I'm not sure whether this is what you expect, but latin1 is an 8-bit encoding, so your file consists of a literal NUL (which vim has no good way to print, so it shows <00> if you have display=uhex set, or ^@ if not) followed by the character at FD, ý.

Note that 00 is NUL in ASCII. I'm not even sure whether it's valid in latin1, but (for me at least) vim doesn't choke on it.
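As a quick check outside Vim, the two bytes from the question (0x00 0xfd) can be decoded with Python's latin-1 codec. This is a small sketch that says nothing about Vim's internals; it only shows which characters the bytes map to, and that a trailing newline does not turn them into valid UTF-8 as far as Python's strict decoder is concerned.

```python
data = bytes([0x00, 0xFD])

# latin-1 maps every byte value 0x00-0xFF to the code point with the same number.
text = data.decode("latin-1")
code_points = [ord(ch) for ch in text]
print([hex(cp) for cp in code_points])

# The second character is U+00FD, LATIN SMALL LETTER Y WITH ACUTE.
second_char = text[1]
print(second_char)

# Appending a newline (0x0a) still leaves 0xfd as an illegal byte for strict UTF-8.
try:
    (data + b"\n").decode("utf-8")
    valid_utf8_with_newline = True
except UnicodeDecodeError:
    valid_utf8_with_newline = False
print(valid_utf8_with_newline)
```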

answered Mar 29 '18 at 20:26

brhfl

I'm not super concerned about how it displays. The file always loads correctly for me if I instead request latin-1; my question is explicitly about how to set up the editor so that it automatically falls back to latin-1 if utf-8 doesn't work (as it doesn't, and can't, here). – Daniel Wagner Mar 29 '18 at 20:40
[noeol] makes it invalid for everything in fileencodings. When this fails, the value of encoding is used. encoding defaults to latin1 unless a suitable value is found in the environment variable $LANG, which is likely utf-8 these days. echo $LANG is likely to reveal the source of the ultimate utf-8 fallback. This shouldn't happen with a file that ends in a newline, however... and you won't get the [noeol] message. set nofixeol in your .vimrc will 'fix' this by ignoring files that don't end in newlines. – brhfl Mar 29 '18 at 21:16
@DanielWagner While I've seen vim go a little bonkers with [noeol] before, it almost seems to be the combination of the leading NULL byte and the lack of a trailing eol in this case. Regardless, $LANG is almost certainly where the ultimate fallback is coming from. – brhfl Mar 29 '18 at 21:38
Ah, that is interesting! I agree: with a final newline, it does fall back to latin1. I can't get your nofixeol trick to work though: even after setting it, a file with no final eol doesn't fall back to latin1 (and the documentation for nofixeol seems to imply it is used only when writing, not reading). – Daniel Wagner Mar 29 '18 at 23:28
@DanielWagner Not working for me in a different environment either - so either that was a quirk of my earlier environment, or I accidentally changed more than one test variable. I've run into this sort of thing before, and thought that I had solved it with something along those lines… but I guess that wasn't it. Sorry for the misinfo. – brhfl Mar 30 '18 at 01:44

score 0 · Answer 2 · answered Feb 27 '18 at 18:58

I think this is because Vim uses the utf-8 encoding. It should work when manually forcing encoding=latin1. However, I would not recommend setting this option. An alternative is to force Vim to read that file with a latin1 encoding, like this: :e ++enc=latin1
