Wrong sorting in cs_CZ.UTF-8 locale - humansorted #140

michalskop · 2021-12-01T16:47:58Z

Describe the bug
Sorting with humansorted() under the cs_CZ.UTF-8 locale gives the wrong order.

Expected behavior

a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']
humansorted(a)

# result:
# ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
# expected result:
# ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
# i.e., 'Ch' is correctly sorted as a single letter after 'H', but letters with a háček (Č, Ž) are not sorted after the corresponding letter without it
# it is correctly sorted using: sorted(a, key=functools.cmp_to_key(locale.strcoll))

Environment (please complete the following information):

Python Version: 3.8.8.
OS: Ubuntu 20.04
If the bug involves LOCALE or humansorted:
- Is PyICU installed? No
- Do you have a locale set? If so, to what? The system locale is en_US.UTF-8, but I set it to cs_CZ.UTF-8 in Python

To Reproduce

import functools
import locale
import natsort

locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')  # this locale is installed on the computer

a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']
natsort.humansorted(a)  # ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
sorted(a, key=functools.cmp_to_key(locale.strcoll))  # ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']


SethMMorton · 2021-12-08T05:58:51Z

I can reproduce. I have isolated this to the fact that I normalize Unicode input to the decomposed compatibility form (NFKD) using unicodedata.normalize before anything else.

In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
Out[11]: 'cs_CZ.UTF-8'

In [12]: sorted(a, key=locale.strxfrm)
Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']

In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

In [15]: natsort.humansorted(a)
Out[15]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
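
To illustrate what the two normalization forms do to the strings themselves, here is a minimal sketch that uses only unicodedata and needs no locale. It shows that NFKD splits 'Č' into a plain 'C' plus a combining caron (U+030C), whereas NFKC keeps the single code point U+010C. Why the cs_CZ collation rules treat the two representations differently is not established in this thread; the snippet only shows the input that strxfrm ends up receiving.

```python
import unicodedata

# "Česko", written with an escape so the literal is guaranteed precomposed.
word = "\u010cesko"

nfkd = unicodedata.normalize("NFKD", word)
nfkc = unicodedata.normalize("NFKC", word)

# NFKD: 'Č' becomes 'C' followed by U+030C (COMBINING CARON),
# so the string is one code point longer than the original.
print([hex(ord(c)) for c in nfkd[:2]])  # ['0x43', '0x30c']
print(len(word), len(nfkd), len(nfkc))  # 5 6 5

# NFKC: the letter stays precomposed as U+010C.
print(hex(ord(nfkc[0])))  # 0x10c
```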

Interestingly, this happens only with your locale. The other locales I tested (en_US and de_DE) do not show this behavior.

In [1]: import locale, natsort, unicodedata

In [2]: a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']

In [3]: locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
Out[3]: 'en_US.UTF-8'

In [4]: sorted(a, key=locale.strxfrm)
Out[4]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [5]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[5]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [6]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[6]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [7]: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
Out[7]: 'de_DE.UTF-8'

In [8]: sorted(a, key=locale.strxfrm)
Out[8]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [9]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[9]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [10]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[10]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
Out[11]: 'cs_CZ.UTF-8'

In [12]: sorted(a, key=locale.strxfrm)
Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']

In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

So, the question I am faced with is this: Do I make Unicode normalization use "NFKC" whenever ns.LOCALE is used, or do I let the user select "NFKC" instead of "NFKD" if they so choose? I am leaning towards the former.
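
The first option could be sketched roughly as follows. This is a hypothetical helper written for illustration, not natsort's actual code; the name normalize_for_sort and the use_locale parameter are assumptions.

```python
import unicodedata


def normalize_for_sort(text: str, use_locale: bool) -> str:
    """Hypothetical helper (not natsort's real implementation).

    Picks the composed form (NFKC) when locale-aware sorting is requested,
    so that precomposed letters such as 'Č' reach the collation function
    intact, and the decomposed form (NFKD) otherwise.
    """
    form = "NFKC" if use_locale else "NFKD"
    return unicodedata.normalize(form, text)


# With locale handling on, 'Č' stays a single code point.
print(len(normalize_for_sort("\u010c", use_locale=True)))   # 1
# Without it, 'Č' is split into 'C' + combining caron.
print(len(normalize_for_sort("\u010c", use_locale=False)))  # 2
```

A locale-aware sort key would then be something like lambda x: locale.strxfrm(normalize_for_sort(x, True)), which corresponds to the In [14] experiment above.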

SethMMorton · 2021-12-11T05:17:25Z

Released as 8.0.1

SethMMorton added the bug label Dec 8, 2021

SethMMorton mentioned this issue Dec 10, 2021

Fix sorting in ce locale #141

Merged

SethMMorton closed this as completed Dec 11, 2021
