Why does sed fail with international characters, and how can I fix it?

Question

GNU sed version 4.1.5 seems to fail with international characters. Here is my input file:

Gras Och Stenar Trad - From Moja to Minneapolis DVD [G2007DVD] 7812 | X
Gras Och Stenar Trad - From Möja to Minneapolis DVD [G2007DVD] 7812 | Y

(Note the umlaut in the second line.)

And when I do

sed 's/.*| //' < in

I would expect to see only the X and Y, as I've asked sed to remove all characters up to and including the '|' and the space after it. Instead, I get:

X
Gras Och Stenar Trad - From M? Y

I know I can use tr to remove the international characters first, but is there a way to do this with sed alone?

This problem seems to have been solved in GNU sed (tested on version 4.2.2). (comment by done, Nov 23 '16 at 22:36)

score 26 · Accepted Answer · answered Sep 15 '08 at 22:18

I think the error occurs if the input encoding of the file is different from the preferred encoding of your environment.
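To check whether such a mismatch is in play, you can compare the encoding your locale expects with what the file actually contains. The following is only a sketch: is_utf8 and check are hypothetical helper names, the two test files are shortened stand-ins for the input in the question, and it assumes iconv and mktemp are installed.

```shell
#!/bin/sh
# is_utf8 is a hypothetical helper: it succeeds only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

# check is a hypothetical helper that turns the exit status into a readable verdict.
check() {
    if is_utf8 "$1"; then echo "valid UTF-8"; else echo "not valid UTF-8"; fi
}

latin1_file=$(mktemp)
utf8_file=$(mktemp)
printf 'From M\366ja | Y\n' > "$latin1_file"     # ö as the single ISO-8859-1 byte 0xF6
printf 'From M\303\266ja | Y\n' > "$utf8_file"   # ö as the two UTF-8 bytes 0xC3 0xB6

latin1_result=$(check "$latin1_file")
utf8_result=$(check "$utf8_file")

# locale charmap prints the character encoding of the current locale.
echo "locale encoding: $(locale charmap)"
echo "ISO-8859-1 sample: $latin1_result"
echo "UTF-8 sample:      $utf8_result"

rm -f "$latin1_file" "$utf8_file"
```

If the locale reports UTF-8 but the file is not valid UTF-8, you are in the failing case shown in the second example.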

Example: the file in is UTF-8

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X 
Y

UTF-8 can safely be interpreted as ISO-8859-1, because every byte value is a valid ISO-8859-1 character. You'll get strange characters in place of the umlaut, but apart from that everything works.
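A small sketch of that effect, assuming iconv is available: it feeds the UTF-8 bytes for "Möja" through a decoder that treats them as ISO-8859-1. The variable names are made up for illustration.

```shell
#!/bin/sh
# "Möja" encoded as UTF-8: ö is the two bytes 0xC3 0xB6 (octal \303 \266).
# Decode those bytes as ISO-8859-1 and re-encode the result as UTF-8.
# Each input byte becomes one character, so the conversion has nothing to reject.
mojibake=$(printf 'M\303\266ja' | iconv -f ISO-8859-1 -t UTF-8)
status=$?

# On a UTF-8 terminal the two bytes of ö show up as two separate characters.
printf '%s (iconv exit status %s)\n' "$mojibake" "$status"
```

The text is garbled, but no byte is rejected, which is why sed can still match across it.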

Example: the file in is ISO-8859-1

$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X 
Y

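ISO-8859-1 cannot in general be interpreted as UTF-8: the single byte that encodes ö (0xF6) is not a valid UTF-8 sequence, so decoding the input file fails. The strange match is probably due to the fact that sed tries to recover rather than fail completely. In a UTF-8 locale, '.' does not match a byte that is not part of a valid character, so '.*' can only match the text after the ö.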
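If converting the file is not an option, one common workaround is to run sed in the C locale, where every byte counts as one character. This is a sketch under that assumption, using a shortened ISO-8859-1 version of the input from the question. The temporary file and variable names are illustrative only.

```shell
#!/bin/sh
# Build a two-line ISO-8859-1 test file; \366 is the single byte for ö in that encoding.
tmp=$(mktemp)
printf 'From Moja | X\nFrom M\366ja | Y\n' > "$tmp"

# LC_ALL=C overrides LANG and the other LC_* variables for this one command,
# so sed no longer tries to decode the input as UTF-8 and ".*" can cross the 0xF6 byte.
result=$(LC_ALL=C sed 's/.*| //' < "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

The trade-off is that sed then works on bytes rather than characters, which is fine for this pattern but may matter for expressions that count characters or use character classes.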

The answer is based on Debian Lenny/Sid and sed 4.1.5.

score 11 · Answer 2 · edited Feb 16 '15 at 18:57

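sed is not very well set up for non-ASCII text. However, you can use (almost) the same code in Perl and get the result you want. The only change is that the | must be escaped, because it is the alternation operator in Perl regular expressions: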

perl -pe 's/.*\| //' x

edited Feb 16 '15 at 18:57 by Jonas Stein · answered Sep 15 '08 at 22:02
