8

I'm porting a Bash script to Python. The script sets LC_ALL=C and uses the Linux sort command to ensure the native byte order instead of locale-specific sort orders (http://stackoverflow.com/questions/28881/why-doesnt-sort-sort-the-same-on-every-machine).

In Python, I want to use Python's list sort() or sorted() functions (without the key= option). Will I always get the same results as Linux sort with LC_ALL=C?

Sicco
tahoar

4 Answers

10

In Python 2, sorting should behave as you expect if you pass locale.strcoll as the cmp argument to list.sort() or sorted():

import locale
locale.setlocale(locale.LC_ALL, "C")
yourList.sort(cmp=locale.strcoll)

In Python 3 the cmp argument no longer exists, so wrap the comparison function with functools.cmp_to_key (from this answer):

import locale
from functools import cmp_to_key
locale.setlocale(locale.LC_ALL, "C")
yourList.sort(key=cmp_to_key(locale.strcoll))
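As a possible alternative to cmp_to_key, locale.strxfrm can be passed directly as the key function. The sketch below assumes the process locale is set to "C" (mirroring LC_ALL=C in the shell) and uses a made-up, ASCII-only list named words purely for illustration.

```python
import locale

# Assumption: use the "C" locale, mirroring LC_ALL=C in the original Bash script.
locale.setlocale(locale.LC_ALL, "C")

# Hypothetical sample data (ASCII only), not taken from the question.
words = ["banana", "Cherry", "apple", "Banana"]

# locale.strxfrm transforms each string so that plain comparison of the
# transformed values follows the current locale's collation rules.
by_strxfrm = sorted(words, key=locale.strxfrm)

print(by_strxfrm)
```

If key= is already needed for something else, one option is to return a tuple from the key function, for example (other_value, locale.strxfrm(name)).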
jtpereyda
Frédéric Hamidi
  • Thanks everyone. My data are all Unicode and I said "without the key= option" because I use it for another purpose. This solution works great. @nabucosound, your solution is interesting, but installing PyICU is a bit heavy for my purpose. Thanks again. – tahoar Jan 08 '12 at 12:43
  • In my case, I needed to use a different locale option: `locale.setlocale(locale.LC_ALL, ('en_US', 'UTF-8'))` – jtpereyda May 13 '22 at 22:06
1

Because you can supply a comparison function, you can make sure the sort is equivalent to LC_ALL=C. From the docs, though, it looks like if all the characters are 7-bit, it sorts in this manner by default; otherwise it uses locale-specific sorting.

If you have 8-bit or Unicode characters, locale-specific sorting makes a lot of sense.

Petesh
1

Non-Unicode strings in Python versions before 3 are actually bytes. The list.sort() method and the sorted() function do nothing to enforce a locale; locale-aware sorting has to be requested explicitly through the locale module's functions.

Unicode strings in Python 2, and all strings in Python 3.x, are no longer bytes. Python 3 has a separate "bytes" type.
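To make the bytes-versus-text distinction concrete, here is a minimal Python 3 sketch. It assumes the data would be stored as UTF-8 (as a file fed to sort with LC_ALL=C typically would be) and uses a made-up list named names. Python 3 compares str values by Unicode code point, and UTF-8 preserves code point order, so for well-formed text both sorts below should agree.

```python
# Hypothetical sample data mixing ASCII and non-ASCII text.
names = ["zebra", "Apple", "éclair", "apple", "Zoo", "日本"]

# Default Python 3 sort: compares str values code point by code point.
by_code_point = sorted(names)

# Explicit byte-order sort: compare the UTF-8 encoded bytes of each string,
# which is what a byte-wise comparison of UTF-8 input amounts to.
by_utf8_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

print(by_code_point)
print(by_code_point == by_utf8_bytes)
```

The equivalence is a property of UTF-8; other encodings, or Python 2 unicode objects on narrow builds, are not covered by this sketch.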

Roman Susi
1

I have been using International Components for Unicode (ICU), along with the PyICU bindings, to sort things with sorted() using my own locale (Catalan in my case). For example, to order a list of user profiles by their name property:

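import PyICU  # newer PyICU releases expose the module as "icu" instead
# Python 2 only: sorted() no longer accepts cmp= in Python 3
collator = PyICU.Collator.createInstance(PyICU.Locale('ca_ES.UTF-8'))
sorted(user_profiles, key=lambda x: x.name, cmp=collator.compare)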
nabucosound