unicodedata: is_normalized claims nothing is normalized in any form when using the 3.2.0 database #101372

Open
@zahlman

Bug report

Python 3.8 added the is_normalized function to the unicodedata module; it is also available as a method on the legacy unicodedata.ucd_3_2_0 database object. It is supposed to check whether a string is equal to its normalization in a given form, without having to normalize and compare.
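The invariant described above can be written down directly. This is a minimal sketch: `holds_invariant` is a hypothetical helper name (not part of unicodedata), and `db` is assumed to be any object exposing `normalize` and `is_normalized`, such as the `unicodedata` module itself or `unicodedata.ucd_3_2_0`.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def holds_invariant(db, form, s):
    """Return True if db.is_normalized agrees with normalize-and-compare."""
    return db.is_normalized(form, s) == (db.normalize(form, s) == s)

# With the current (non-legacy) database the two approaches should agree.
samples = ("!", "\u00e9", "e\u0301", "\ufb01")
print(all(holds_invariant(unicodedata, form, s)
          for form in FORMS
          for s in samples))
```

Passing `unicodedata.ucd_3_2_0` as `db` runs the same check against the legacy database.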

However, the legacy version does not maintain the expected invariant. In fact, it reports that every single-character string is not normalized, regardless of the normalization form chosen. Presumably, the result is the same for every non-empty string. (It appears that the empty string works because it is special-cased at lines 871-874.)

Example:

>>> import unicodedata
>>> unicodedata.ucd_3_2_0.normalize('NFC', '!') == '!'
True
>>> unicodedata.ucd_3_2_0.is_normalized('NFC', '!')
False
>>> any(unicodedata.ucd_3_2_0.is_normalized(form, chr(x)) for form in ('NFC', 'NFD', 'NFKC', 'NFKD') for x in range(0x110000))
False
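Until the legacy method behaves as expected, one possible workaround is to fall back to normalize-and-compare, which the method is meant to make unnecessary. A minimal sketch, where `is_normalized_3_2_0` is a hypothetical helper name and not part of the module:

```python
import unicodedata

def is_normalized_3_2_0(form, s):
    """Normalize-and-compare check against the Unicode 3.2.0 database."""
    return unicodedata.ucd_3_2_0.normalize(form, s) == s

# Same comparison as the normalize() call in the example above.
print(is_normalized_3_2_0("NFC", "!"))  # True
```

This costs a full normalization on every call, so it trades speed for a correct answer.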

The bug appears to be at lines 801-804 of unicodedata.c:

    /* UCD 3.2.0 is requested, quickchecks must be disabled. */
    if (UCD_Check(self)) {
        return NO;
    }

I believe this should return MAYBE instead of NO. The NO value appears to indicate that the quickcheck has determined that the string is not normalized, which contradicts both the comment and the expected behaviour.
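To illustrate why the choice between NO and MAYBE matters, here is a small Python model of a three-valued quickcheck result. It is not the C implementation: the names `QuickCheck` and `is_normalized_model` are hypothetical, and the model assumes the caller treats YES and NO as final answers and only performs the full normalization when the result is MAYBE.

```python
import unicodedata
from enum import Enum

class QuickCheck(Enum):
    YES = "yes"      # definitely normalized
    MAYBE = "maybe"  # inconclusive, a full check is needed
    NO = "no"        # definitely not normalized

def is_normalized_model(quickcheck_result, form, s, db=unicodedata.ucd_3_2_0):
    """Model of a caller that trusts YES/NO and falls back on MAYBE."""
    if quickcheck_result is QuickCheck.MAYBE:
        # Inconclusive: do the full normalization and compare.
        return db.normalize(form, s) == s
    return quickcheck_result is QuickCheck.YES

# A disabled quickcheck that answers NO versus one that answers MAYBE, for '!':
print(is_normalized_model(QuickCheck.NO, "NFC", "!"))     # False
print(is_normalized_model(QuickCheck.MAYBE, "NFC", "!"))  # True
```

Under this assumption, an unconditional NO makes every non-empty string look unnormalized, while MAYBE defers to the slower but correct comparison.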

Your environment

$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
