Wrong sorting in cs_CZ.UTF-8 locale - humansorted #140

Closed
michalskop opened this issue Dec 1, 2021 · 2 comments

@michalskop

Describe the bug
humansorted() returns an incorrect order when the locale is set to cs_CZ.UTF-8.

Expected behavior

a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']
humansorted(a)

# result:
# ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
# expected result:
# ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
# i.e., it correctly sorts 'Ch' as a single letter after 'H', but it does not place letters with a háček (caron) after the corresponding letter without one
# it is correctly sorted using: sorted(a, key=functools.cmp_to_key(locale.strcoll))

Environment (please complete the following information):

  • Python Version: 3.8.8.
  • OS: Ubuntu 20.04
  • If the bug involves LOCALE or humansorted:
    • Is PyICU installed? No
    • Do you have a locale set? If so, to what? The system locale is en_US.UTF-8, but I set it to cs_CZ.UTF-8 in Python

To Reproduce

import functools
import locale
import natsort

locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')  # this locale is installed on the computer

a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']
natsort.humansorted(a)  # ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
sorted(a, key=functools.cmp_to_key(locale.strcoll))  # ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

@SethMMorton
Owner

I can reproduce this. I have isolated it to the fact that I normalize the input to the Unicode compatibility decomposed form (NFKD) using unicodedata.normalize before doing anything else.

In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
Out[11]: 'cs_CZ.UTF-8'

In [12]: sorted(a, key=locale.strxfrm)
Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']

In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

In [15]: natsort.humansorted(a)
Out[15]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
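
To make the difference between the two forms concrete, here is a minimal sketch (not natsort's actual code; code_points is a hypothetical helper) that lists the code points 'Č' turns into under each form. NFKD splits it into a plain 'C' plus the combining caron U+030C, so the collation routine never sees the single code point 'Č'. How a given locale then weights that combining mark is up to the C library, so this only illustrates the input difference, not the collation rules themselves.

```python
import unicodedata


def code_points(text, form):
    """Return the code points of `text` after normalizing it to `form`."""
    normalized = unicodedata.normalize(form, text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


# "\u010c" is the precomposed 'Č' (LATIN CAPITAL LETTER C WITH CARON).
print(code_points("\u010c", "NFKD"))  # ['U+0043', 'U+030C']
print(code_points("\u010c", "NFKC"))  # ['U+010C']
```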

Interestingly, this happens only with your locale. The other locales I tested (en_US and de_DE) sort the list the same way under every normalization form.

In [1]: import locale, natsort, unicodedata

In [2]: a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']

In [3]: locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
Out[3]: 'en_US.UTF-8'

In [4]: sorted(a, key=locale.strxfrm)
Out[4]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [5]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[5]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [6]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[6]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [7]: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
Out[7]: 'de_DE.UTF-8'

In [8]: sorted(a, key=locale.strxfrm)
Out[8]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [9]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[9]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [10]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[10]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']

In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
Out[11]: 'cs_CZ.UTF-8'

In [12]: sorted(a, key=locale.strxfrm)
Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']

In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']

In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
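
The session above can be wrapped in a small helper so the same check is easy to repeat for other locales. This is only a sketch: sort_with_forms is a hypothetical name, and it assumes that switching LC_COLLATE alone (rather than LC_ALL, as in the session) is enough for locale.strxfrm. It restores the previous setting afterwards and returns None when the requested locale is not installed.

```python
import locale
import unicodedata


def sort_with_forms(words, locale_name, forms=("NFKD", "NFKC")):
    """Sort `words` with locale.strxfrm after each normalization form.

    Returns {form: sorted_list}, or None if `locale_name` is unavailable.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed on this machine
    try:
        return {
            form: sorted(
                words,
                key=lambda w: locale.strxfrm(unicodedata.normalize(form, w)),
            )
            for form in forms
        }
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore


if __name__ == "__main__":
    a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']
    for name in ("en_US.UTF-8", "de_DE.UTF-8", "cs_CZ.UTF-8"):
        print(name, sort_with_forms(a, name))
```

If the two lists for a locale differ, that locale is sensitive to the normalization form.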

So the question I face is this: should Unicode normalization always use "NFKC" whenever ns.LOCALE is used, or should the user be able to choose "NFKC" instead of "NFKD"? I am leaning towards the former.
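
The first option could look roughly like the sketch below. The names normalize_for_sort and use_locale are hypothetical, not part of natsort's API; the boolean simply stands in for "ns.LOCALE was requested".

```python
import unicodedata


def normalize_for_sort(text, use_locale):
    """Hypothetical helper: pick the normalization form from a locale flag.

    Composed form (NFKC) when locale-aware sorting is requested,
    decomposed form (NFKD) otherwise.
    """
    form = "NFKC" if use_locale else "NFKD"
    return unicodedata.normalize(form, text)


# With the flag set, 'Č' stays a single code point for the collation routine.
print(len(normalize_for_sort("\u010cesko", use_locale=True)))   # 5
print(len(normalize_for_sort("\u010cesko", use_locale=False)))  # 6
```

The second option would expose the form as a user-facing setting instead of deriving it from the flag.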

@SethMMorton
Owner

Released as 8.0.1
