
I tried to look this up in the man page for the sort command, but could not find anything. Consider the following text file t.txt:

 11
1 0

(Binary representation of t.txt

$ xxd -p t.txt
2031310a3120300a

)
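To reproduce the file exactly, something like the following should work. This is a minimal sketch assuming a POSIX shell; `od` is used only as a widely available stand-in for `xxd`, and the variable name `size` is just for illustration.

```shell
# Write the two lines byte for byte:
# space, '1', '1', newline, then '1', space, '0', newline.
printf ' 11\n1 0\n' > t.txt

# Dump the bytes in hex (od is a POSIX alternative to xxd).
od -An -tx1 t.txt

# The file should be 8 bytes long, matching the xxd dump above.
size=$(wc -c < t.txt)
echo "$size"
```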

Using LC_COLLATE="en_US.UTF-8" with sort on this file gives:

$  LC_COLLATE="en_US.UTF-8" sort t.txt
1 0
 11

If we examine the first character position (column) of each line in the file, the first row has a space and the second row has a 1. Since the space has the hexadecimal value 0x20, which is less than the value of 1 (0x31), I would expect sort to give:

 11
1 0

It turns out that the expected sorting order can be obtained using LC_COLLATE=c:

$ LC_COLLATE=c sort t.txt
 11
1 0

What is the reason for the difference between LC_COLLATE="en_US.UTF-8" and LC_COLLATE=c for this case?

See also:

Edit:

Some more information about this issue was found here:

Håkon Hægland

1 Answer


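Punctuation (including spaces) is ignored at the first comparison level when ordering in the en_US locale, so the two lines are initially compared as 11 and 10, and 10 sorts first.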
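A rough way to see this, assuming a POSIX shell with `tr` and `sort`: delete the spaces by hand and sort bytewise in the C locale. This only models the effect as "drop the blanks, then compare"; it is not how the locale's collation is actually implemented, and the variable name `stripped` is just for illustration.

```shell
# Model "spaces are ignored" by deleting them, then compare bytewise.
# ' 11' becomes '11' and '1 0' becomes '10'.
stripped=$(printf ' 11\n1 0\n' | tr -d ' ' | LC_ALL=C sort)
printf '%s\n' "$stripped"
```

Here 10 comes out before 11, which matches 1 0 being placed before  11 in the en_US output.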

Note that sort can explicitly skip leading whitespace with the -b option, but that option is tricky to use, so I'd advise combining it with the sort --debug option.
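As a sketch of that advice, assuming GNU sort (`--debug` is a GNU extension and its exact output varies between versions): with `-b` the leading blank is dropped from the key, so in the C locale the lines are compared as `11` and `1 0`. The variable name `sorted_b` is just for illustration.

```shell
# -b: ignore leading blanks when locating the sort key (here the whole line).
# LC_ALL=C forces bytewise comparison regardless of other locale settings.
sorted_b=$(printf ' 11\n1 0\n' | LC_ALL=C sort -b)
printf '%s\n' "$sorted_b"

# --debug annotates each line to show which part was used as the key
# (GNU sort only; diagnostics are written to stderr).
printf ' 11\n1 0\n' | LC_ALL=C sort -b --debug
```

With `-b`, even the C locale puts 1 0 first, because the space in 1 0 (0x20) is now compared against the second 1 of 11 (0x31).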

pixelbeat
  • Thanks! That is interesting. I also found some more information here: [In utf-8 collation, why 11- is less then 1-?](http://superuser.com/questions/227925/in-utf-8-collation-why-11-is-less-then-1) and [UNICODE COLLATION ALGORITHM](http://unicode.org/reports/tr10/). – Håkon Hægland May 18 '14 at 17:41