Description
Bug report
Python 3.8 adds the is_normalized function to the unicodedata module; it is also available as a method on the legacy unicodedata.ucd_3_2_0 database. It is supposed to check whether a string is equal to its normalization in a given form, but without having to normalize and compare.
However, the legacy version does not maintain the expected invariant. In fact, it reports that every single-character string is not normalized, regardless of the normalization form chosen. Presumably, the result is the same for every non-empty string. (The empty string appears to work only because it is special-cased at lines 871-874.)
Example:
>>> import unicodedata
>>> unicodedata.ucd_3_2_0.normalize('NFC', '!') == '!'
True
>>> unicodedata.ucd_3_2_0.is_normalized('NFC', '!')
False
>>> any(unicodedata.ucd_3_2_0.is_normalized(form, chr(x)) for form in ('NFC', 'NFD', 'NFKC', 'NFKD') for x in range(0x110000))
False
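The invariant described above can also be checked programmatically. The following is a minimal sketch, not part of the original report: find_violations and the sample strings are hypothetical names chosen for illustration. It compares is_normalized against an explicit normalize-and-compare for each form and returns the cases where they disagree.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def find_violations(db, samples, forms=FORMS):
    """Return (form, string) pairs where is_normalized disagrees with normalize.

    `db` is any object exposing normalize() and is_normalized(), such as the
    unicodedata module itself or unicodedata.ucd_3_2_0.
    """
    violations = []
    for form in forms:
        for s in samples:
            # The expected answer, computed the slow way.
            expected = db.normalize(form, s) == s
            if db.is_normalized(form, s) != expected:
                violations.append((form, s))
    return violations

# Illustrative samples: empty string, ASCII, precomposed and decomposed "e acute".
samples = ["", "!", "abc", "\u00e9", "e\u0301"]

print(find_violations(unicodedata, samples))            # current database
print(find_violations(unicodedata.ucd_3_2_0, samples))  # legacy database
```

On an affected interpreter, the second call would be expected to list every non-empty sample that is in fact normalized for the given form, while the first should return an empty list.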
The bug appears to be at lines 801-804 of unicodedata.c:
/* UCD 3.2.0 is requested, quickchecks must be disabled. */
if (UCD_Check(self)) {
    return NO;
}
I believe the NO should say MAYBE instead. The NO value appears to indicate that the quickcheck has determined that the string is not normalized, contrary to both the comment and expected behaviour.
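To make the difference between NO and MAYBE concrete, here is a simplified Python model of how a caller might act on a three-valued quickcheck result. It is a sketch under assumptions: QuickCheck and is_normalized_model are hypothetical names, and the real logic lives in C in unicodedata.c and may differ in detail. The point is only that NO short-circuits to "not normalized", whereas MAYBE falls back to a full normalize-and-compare.

```python
import unicodedata
from enum import Enum

class QuickCheck(Enum):
    """Hypothetical stand-in for the three quickcheck outcomes."""
    YES = "yes"      # definitely normalized
    MAYBE = "maybe"  # undecided, the slow path is required
    NO = "no"        # definitely not normalized

def is_normalized_model(form, text, quickcheck_result, db=unicodedata):
    """Model of a caller that trusts YES and NO and only normalizes on MAYBE."""
    if quickcheck_result is QuickCheck.YES:
        return True
    if quickcheck_result is QuickCheck.NO:
        return False
    # MAYBE: fall back to normalizing and comparing.
    return db.normalize(form, text) == text

legacy = unicodedata.ucd_3_2_0

# Quickcheck reporting NO for the legacy database: the answer is always False.
print(is_normalized_model("NFC", "!", QuickCheck.NO, legacy))     # False

# Quickcheck reporting MAYBE instead: the slow path gives the right answer.
print(is_normalized_model("NFC", "!", QuickCheck.MAYBE, legacy))  # True
```

Under this model, returning MAYBE is what "quickchecks must be disabled" would mean in practice: the fast path is skipped and the result comes from the full comparison.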
Your environment
$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.