Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix installation of wheel with non-ASCII entries #8223

Merged
merged 9 commits into from
May 19, 2020

Conversation

uranusjr
Copy link
Member

@uranusjr uranusjr commented May 12, 2020

This mainly deals with correctly recording the wheel content in the RECORD metadata. This metadata file must be written in UTF-8, but the actual files need to be installed to the filesystem, the encoding of which is not (always) UTF-8. So we need to carefully handle file name encoding/decoding when comparing RECORD entries to the actual file.

The fix here makes sure we always use the correct encoding by adding strict type hints. The entries in RECORD is decoded/encoded with UTF-8 on the read/write boundaries to make sure we always deal with text types. A type-hint-only type RecordPath is introduced to make sure this is enforced (because Python 2 "helpfully" coerces str to unicode with the wrong encoding).

Close #7138, close #7574.

@uranusjr uranusjr added OS: windows Windows specific C: encoding Related to text encoding and likely, UnicodeErrors type: bugfix labels May 12, 2020
@uranusjr
Copy link
Member Author

Oh wow, it seems like flake8 just released a new version, and the new rules borked all our linting jobs.

@pfmoore
Copy link
Member

pfmoore commented May 12, 2020

We should scream at them, so they can tell us we should have pinned our dependency and done a more controlled rollout of the new version 🙂

Seriously we should pin for now, and discuss the new rules. I'm not sure I want to just knee-jerk "fix" the failures here (why shouldn't we call a variable in a comprehension "l" for a line?).

@pradyunsg
Copy link
Member

Fix merged. :)

Rebasing on master will make the warnings vanish! :)

@pradyunsg
Copy link
Member

Seriously we should pin for now, and discuss the new rules. I'm not sure I want to just knee-jerk "fix" the failures here (why shouldn't we call a variable in a comprehension "l" for a line?).

So... The issue was the flake8 started providing it's pre-commit hook properly, and we needed to update to use those instead of a deprecated hook. #8227 did that. :)

tests/unit/test_wheel.py Show resolved Hide resolved
@pradyunsg pradyunsg mentioned this pull request May 12, 2020
@pradyunsg pradyunsg self-requested a review May 12, 2020 22:58
@julienmalard
Copy link

Non-ASCII names are not weird

Thank you, thank you for saying that!

Regarding checks, one possible thing to check for would be that abugidas can be well represented. Because of a bug in re, many Python programs, even ones not using the dreaded [A-Za-z_] regex, still fail with vowel diactrics.

I am not sure how the new tests were implemented here, but would it be possible to include testing for modules with the following names? Even if re was not the culprit here, I have a feeling that, since it is often used in pip, it may end up causing problems elsewhere if we don't explicitly test:

  1. اسالام # Often works
  2. 你好 # Often works
  3. வணக்கம் # Often fails because of the diactrics

Thanks!

@uranusjr
Copy link
Member Author

Let me use pytest.mark.parametrize on the test so we can add more cases when we know about them.

@pradyunsg pradyunsg added the needs rebase or merge PR has conflicts with current master label May 19, 2020
@pradyunsg
Copy link
Member

@uranusjr FYI: This needs a rebase.

This mainly deals with correctly recording the wheel content in the
RECORD metadata. This metadata file must be written in UTF-8, but the
actual files need to be installed to the filesystem, the encoding of
which is not (always) UTF-8. So we need to carefully handle file name
encoding/decoding when comparing RECORD entries to the actual file.

The fix here makes sure we always use the correct encoding by adding
strict type hints. The entries in RECORD is decoded/encoded with UTF-8
on the read/write boundaries to make sure we always deal with text
types. A type-hint-only type RecordPath is introduced to make sure this
is enforced (because Python 2 "helpfully" coerces str to unicode with
the wrong encoding).
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label May 19, 2020
@uranusjr uranusjr requested a review from pradyunsg May 19, 2020 08:05
@pradyunsg pradyunsg merged commit 15f0863 into pypa:master May 19, 2020
@uranusjr uranusjr deleted the unicode-wheel branch May 21, 2020 05:12
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
C: encoding Related to text encoding and likely, UnicodeErrors OS: windows Windows specific
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants