Wrong sorting in cs_CZ.UTF-8 locale - humansorted #140
Comments
I can reproduce. I can isolate this to the fact that I am normalizing unicode input to the decomposed compatibility form (NFKD):
In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
Out[11]: 'cs_CZ.UTF-8'
In [12]: sorted(a, key=locale.strxfrm)
Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
In [15]: natsort.humansorted(a)
Out[15]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
Interestingly, it's only with your locale. The other locales I have tested with (en_US and de_DE) do not show this behavior.
In [1]: import locale, natsort, unicodedata
In [2]: a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']
In [3]: locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
Out[3]: 'en_US.UTF-8'
In [4]: sorted(a, key=locale.strxfrm)
Out[4]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [5]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[5]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [6]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[6]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [7]: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
Out[7]: 'de_DE.UTF-8'
In [8]: sorted(a, key=locale.strxfrm)
Out[8]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [9]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[9]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [10]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[10]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
Out[11]: 'cs_CZ.UTF-8'
In [12]: sorted(a, key=locale.strxfrm)
Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKD", x)))
Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFKC", x)))
Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
So, the question I am faced with is this: Do I make it so that the unicode normalizations use "NFKC" whenever …
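For readers following along, here is a small sketch of what the two normalization forms do to one of the words above. It only uses the standard `unicodedata` module; the helper `code_points` is a hypothetical name introduced for illustration. NFKD splits the precomposed "Č" into a plain "C" plus a combining caron, while NFKC keeps it as a single code point. That is presumably why `locale.strxfrm` gives a different order for the NFKD input under cs_CZ, although the sketch does not prove that the collation rules are the cause.

```python
import unicodedata

# 'Česko' written with the precomposed letter U+010C (LATIN CAPITAL LETTER C WITH CARON)
word = "\u010cesko"

nfkd = unicodedata.normalize("NFKD", word)  # decomposed compatibility form
nfkc = unicodedata.normalize("NFKC", word)  # composed compatibility form


def code_points(s):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in s]


print(code_points(nfkd)[:2])  # ['U+0043', 'U+030C'] -> 'C' + combining caron
print(code_points(nfkc)[:1])  # ['U+010C']           -> single precomposed letter
print(len(word), len(nfkd), len(nfkc))  # 5 6 5
```

So a sort key built from the NFKD string starts with an ordinary "C", not with the distinct letter "Č".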
Released as 8.0.1.
Describe the bug
Sorting with humansorted() in the cs_CZ.UTF-8 locale is not correct.
Expected behavior
Environment (please complete the following information):
LOCALE or humansorted: PyICU installed? No
To Reproduce
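A minimal reproduction sketch, pieced together from the session in the comments and not taken from the original report. The helper names `locale_sorted` and `compare_forms` are assumptions made for this example. It uses only `locale` and `unicodedata` from the standard library, so `natsort.humansorted` itself is not called, and any locale that is not installed on the machine is skipped. The output depends on the system's locale data, so no expected output is shown.

```python
import locale
import unicodedata

# Same word list as in the session from the comments.
words = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']


def locale_sorted(items, form=None):
    """Sort with locale.strxfrm, optionally normalizing each item first."""
    def key(s):
        if form is not None:
            s = unicodedata.normalize(form, s)
        return locale.strxfrm(s)
    return sorted(items, key=key)


def compare_forms(locale_name, items):
    """Return {form: sorted list} for one locale, or None if it is unavailable."""
    previous = locale.setlocale(locale.LC_ALL)  # remember the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        return None  # locale not installed on this machine
    try:
        return {form: locale_sorted(items, form) for form in (None, "NFKD", "NFKC")}
    finally:
        locale.setlocale(locale.LC_ALL, previous)  # restore the previous setting


for name in ("en_US.UTF-8", "de_DE.UTF-8", "cs_CZ.UTF-8"):
    results = compare_forms(name, words)
    if results is None:
        print(f"{name}: not installed, skipped")
        continue
    for form, ordered in results.items():
        print(f"{name} {form or 'raw'}: {ordered}")
```

If the cs_CZ.UTF-8 line for NFKD differs from the raw and NFKC lines while the other locales agree across all three, the behavior described in the comments is reproduced.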