gh-101372: Fix unicodedata.is_normalized to properly handle the UCD 3… #101388

Merged
merged 5 commits into from
Feb 6, 2023

@corona10 corona10 commented Jan 28, 2023

@corona10
Member Author

@serhiy-storchaka

Every range of characters is a candidate for testing,
so I decided to use a sampling approach rather than pick specific cases.

Test script

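import unicodedata

# For every code point and normalization form, normalize with the UCD 3.2.0
# database and record each case where is_normalized() rejects the result.
with open('foo.out', 'w') as f:
    for x in range(0x110000):
        for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
            norm = unicodedata.ucd_3_2_0.normalize(form, chr(x))
            if not unicodedata.ucd_3_2_0.is_normalized(form, norm):
                f.write(f'{x},{form}\n')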

AS-IS

(.oss) ➜  cpython git:(main) ✗ ./python.exe gh-101372.py
(.oss) ➜  cpython git:(main) ✗ wc -l foo.out
 4456448 foo.out

TO-BE


(.oss) ➜  cpython git:(gh-101372) ✗ ./python.exe gh-101372.py
(.oss) ➜  cpython git:(gh-101372) ✗ wc -l foo.out
       0 foo.out
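
As a quick check on the AS-IS count: the script writes at most one line per (code point, form) pair. The short calculation below is only an illustration, and it suggests that 4456448 is exactly that maximum, i.e. every pair failed before the fix.

```python
# Upper bound on the lines the test script can write:
# one line per (code point, normalization form) pair.
CODE_POINTS = 0x110000                     # range(0x110000) in the script
FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')     # the four forms the script loops over

max_lines = CODE_POINTS * len(FORMS)
print(max_lines)                           # 4456448, the AS-IS line count
```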

@corona10
Member Author

corona10 commented Feb 3, 2023

@serhiy-storchaka I will merge this PR by next week; please let me know if any changes are needed.

@serhiy-storchaka
Member

I am not happy with the provided tests.

Testing the whole range of Unicode characters is slow (a few seconds on my computer); if it is performed, it should be decorated with @requires_resource('cpu'). Testing only a small random sample can miss errors, and the test result will be hard to reproduce. It works for this issue because is_normalized() was broken for most characters, but it might not work for other types of bugs.
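
To illustrate the reproducibility concern, here is a minimal sketch of a sampled check whose sample is derived from a fixed seed, so a failing run can be replayed. The names `find_failures` and `seeded_sample` and the seed value are hypothetical and are not part of the PR. The sketch runs against the current database; passing `unicodedata.ucd_3_2_0` instead would exercise the code path this PR fixes. Running `find_failures` over all of `range(0x110000)` would be the slow exhaustive variant.

```python
import random
import unicodedata

FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')

def find_failures(ucd, code_points):
    """Return (code point, form) pairs whose normalized output is not
    reported as normalized by the same database object."""
    failures = []
    for cp in code_points:
        for form in FORMS:
            norm = ucd.normalize(form, chr(cp))
            if not ucd.is_normalized(form, norm):
                failures.append((cp, form))
    return failures

def seeded_sample(seed, size):
    """Hypothetical helper: a reproducible sample of code points."""
    return random.Random(seed).sample(range(0x110000), size)

# Same seed, same sample, so any failure can be replayed exactly.
sample = seeded_sample(seed=101372, size=1000)
failures = find_failures(unicodedata, sample)
print(len(failures))
```

A fixed seed makes the run repeatable, but it does not address the other objection: a sample can still miss the affected characters entirely.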

The test for multicharacter strings is not what I meant. It should test not only normalized sequences, but also non-normalized sequences. For example, '\ufb2c' is normalized to '\u05e9\u05bc\u05c1'. Therefore, is_normalized() should return False for '\ufb2c' and True for '\u05e9\u05bc\u05c1'. But '\u05e9\u05c1\u05bc', created by swapping the last two characters, is normalized to the same sequence '\u05e9\u05bc\u05c1', so is_normalized() should return False for it as well, even though it looks exactly the same as the original character. I think we need tests of this kind.
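
The example above can be sketched as a small check. This is an illustration rather than the test that was eventually written; the function name `check` is hypothetical, and the sketch assumes the example holds for all four normalization forms. It runs against the current database; `unicodedata.ucd_3_2_0` could be passed instead.

```python
import unicodedata

FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')

PRECOMPOSED = '\ufb2c'            # single precomposed character
CANONICAL = '\u05e9\u05bc\u05c1'  # the sequence it normalizes to
SWAPPED = '\u05e9\u05c1\u05bc'    # same sequence, last two characters swapped

def check(ucd):
    """Return a list of readable mismatches (empty when everything holds)."""
    problems = []
    for form in FORMS:
        for s in (PRECOMPOSED, CANONICAL, SWAPPED):
            # All three spellings should normalize to the same sequence.
            if ucd.normalize(form, s) != CANONICAL:
                problems.append(f'{form}: normalize({s!a}) != {CANONICAL!a}')
            # Only the canonical sequence should count as normalized.
            expected = (s == CANONICAL)
            if ucd.is_normalized(form, s) != expected:
                problems.append(f'{form}: is_normalized({s!a}) != {expected}')
    return problems

problems = check(unicodedata)
print(problems)
```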

I tried to write more interesting tests for is_normalized(), and found that UCD 3.2.0 is mostly untested. Also, there are not many tests for the differences between UCD 3.2.0 and the current version. I am writing new tests.
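
One possible way to look for such differences, shown only as a rough sketch: compare the output of the two database objects code point by code point. The helper name `differing_code_points` is hypothetical, and this is not necessarily how the new tests were written.

```python
import unicodedata

def differing_code_points(form, code_points):
    """Code points whose normalization differs between the UCD 3.2.0
    database and the current one."""
    old, new = unicodedata.ucd_3_2_0, unicodedata
    return [cp for cp in code_points
            if old.normalize(form, chr(cp)) != new.normalize(form, chr(cp))]

print(unicodedata.ucd_3_2_0.unidata_version)   # 3.2.0

# ASCII normalizes to itself in both databases, so no differences here.
ascii_diffs = differing_code_points('NFC', range(0x80))
print(ascii_diffs)                             # []
```

Calling the helper with `range(0x110000)` would list candidates worth covering with explicit test cases.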

I propose merging your PR without tests. The bugfix itself is obvious, and I will add the tests later.

@corona10
Member Author

corona10 commented Feb 6, 2023

I propose merging your PR without tests. The bugfix itself is obvious, and I will add the tests later.

Okay, got it. Please let me know once you submit the patch with the tests. I may learn a lot from it.

@corona10 corona10 merged commit 9ef7e75 into python:main Feb 6, 2023
@miss-islington
Contributor

Thanks @corona10 for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10, 3.11.
🐍🍒⛏🤖

@corona10 corona10 deleted the gh-101372 branch February 6, 2023 04:58
@bedevere-bot

GH-101597 is a backport of this pull request to the 3.11 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.11 only security fixes label Feb 6, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 6, 2023
… UCD 3… (pythongh-101388)

(cherry picked from commit 9ef7e75)

Co-authored-by: Dong-hee Na <donghee.na@python.org>
@bedevere-bot

GH-101598 is a backport of this pull request to the 3.10 branch.

@bedevere-bot bedevere-bot removed the needs backport to 3.10 only security fixes label Feb 6, 2023
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 6, 2023
… UCD 3… (pythongh-101388)

(cherry picked from commit 9ef7e75)

Co-authored-by: Dong-hee Na <donghee.na@python.org>
miss-islington added a commit that referenced this pull request Feb 6, 2023
gh-101388)

(cherry picked from commit 9ef7e75)

Co-authored-by: Dong-hee Na <donghee.na@python.org>
miss-islington added a commit that referenced this pull request Feb 6, 2023
gh-101388)

(cherry picked from commit 9ef7e75)

Co-authored-by: Dong-hee Na <donghee.na@python.org>
4 participants