Treatment of spaces in sort command. Difference between LC_COLLATE=c and LC_COLLATE="en_US.UTF-8"

Question

I tried to look this up in the man pages of the sort command, but could not find anything. So consider the following text file t.txt:

 11
1 0

Hex dump of t.txt:

$ xxd -p t.txt
2031310a3120300a

Using LC_COLLATE="en_US.UTF-8" with sort on this file gives:

$ LC_COLLATE="en_US.UTF-8" sort t.txt
1 0
 11

If we examine the first character position (or column) in the file, we see that the first row has a space and the second row has a 1. Since a space has the hexadecimal value 0x20, which is less than the hexadecimal value of 1 (0x31), I would expect sort to give:

 11
1 0

It turns out that the expected sorting order can be obtained using LC_COLLATE=c:

$ LC_COLLATE=c sort t.txt
 11
1 0

What is the reason for the difference between LC_COLLATE="en_US.UTF-8" and LC_COLLATE=c for this case?

Edit:

Some more information about this issue was found here:

It depends on your locale. Check for example `LC_ALL=C sort file`, that gives `A 11` first. See http://www.manpagez.com/info/coreutils/coreutils_196.php#SEC196 — fedorqui, May 14 '14 at 16:34
@fedorqui But why does it not work without `LC_ALL=C` ? (`echo $LANG` gives `en_US.UTF-8`) — Håkon Hægland, May 14 '14 at 16:45
@HåkonHægland The simple answer is "because the sorting rules are different in different locales". The full answer is probably quite a bit more complex... — twalberg, May 14 '14 at 19:25

score 3 · Accepted Answer · answered May 18 '14 at 17:11

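Punctuation, including spaces, is ignored when ordering in the en_US locale, so " 11" and "1 0" are effectively compared as "11" and "10", and "1 0" sorts first.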

Note that sort can explicitly skip leading whitespace with the -b option, but that is tricky to use, so I'd advise using the sort --debug option when doing so.
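To make the difference concrete, here is a rough sketch that reproduces both orderings without depending on en_US.UTF-8 being installed. The first sort is the plain byte-wise comparison of the C locale. The second only approximates the "spaces are ignored" behaviour by sorting on a copy of each line with the spaces stripped out; it is not the actual locale collation algorithm, and the variable names (tmp, c_order, nospace_order) are just made up for this illustration.

```shell
# Recreate the two-line file from the question in a temporary file.
tmp=$(mktemp)
printf ' 11\n1 0\n' > "$tmp"

# Byte-wise ordering: LC_ALL overrides LC_COLLATE and LANG.
c_order=$(LC_ALL=C sort "$tmp")

# Approximation of "spaces are ignored" (an assumption for illustration):
# prepend a key with all spaces removed, sort on that key only,
# then cut the key off again.
tab=$(printf '\t')
nospace_order=$(
  awk '{ k = $0; gsub(/ /, "", k); print k "\t" $0 }' "$tmp" |
    LC_ALL=C sort -t "$tab" -k1,1 |
    cut -f2-
)

printf 'C locale:\n%s\n' "$c_order"            # " 11" first, then "1 0"
printf 'spaces ignored:\n%s\n' "$nospace_order" # "1 0" first, then " 11"

rm -f "$tmp"
```

With the spaces removed the keys are "11" and "10", which is why "1 0" comes out first, the same order the question shows for en_US.UTF-8.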

pixelbeat

Thanks! That is interesting. I also found some more information here: [In utf-8 collation, why 11- is less then 1-?](http://superuser.com/questions/227925/in-utf-8-collation-why-11-is-less-then-1) and [UNICODE COLLATION ALGORITHM](http://unicode.org/reports/tr10/). – Håkon Hægland May 18 '14 at 17:41
