-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always consider metadata as utf-8. #22
Conversation
This is documented near the beginning of https://packaging.python.org/en/latest/specifications/core-metadata/ and warehouse currently serves these as `content-type: application/octet-stream` with no encoding, so requests spends a surprising amount of time on what are generally lower-ascii files trying to guess an encoding.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #22 +/- ##
==========================================
+ Coverage 94.80% 95.80% +1.00%
==========================================
Files 10 10
Lines 616 620 +4
Branches 112 112
==========================================
+ Hits 584 594 +10
+ Misses 30 22 -8
- Partials 2 4 +2 ☔ View full report in Codecov by Sentry. |
src/pypi_simple/client.py
Outdated
@@ -383,7 +383,7 @@ def get_package_metadata( | |||
r.raise_for_status() | |||
digester.update(r.content) | |||
digester.finalize() | |||
return r.text | |||
return r.content.decode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return r.content.decode("utf-8") | |
return r.content.decode("utf-8", "surrogateescape") |
Because we don't want to crash on the inevitable case where someone somehow created a wheel with non-UTF-8 metadata.
23fa4e8
to
4fdb8f6
Compare
I moved the bulk of the code to a bytes api, because that's really what I was after. Added surrogateescape as requested, and there are tests. |
This is now released in pypi-simple 1.5.0. Thank you for the contribution! |
This is documented near the beginning of
https://packaging.python.org/en/latest/specifications/core-metadata/ and warehouse currently serves these as
content-type: application/octet-stream
with no encoding, so requests spends a surprising amount of time on what are generally lower-ascii files trying to guess an encoding.I think this is 100% safe -- files that can't be understood as utf-8 appear to not work in pip, which reads them as bytes, saves them as bytes and importlib always uses utf-8. A better API for pypi-simple might be to return bytes, but that's more disruptive.
Sample debug log pydantic-2.6.1-py3-none-any.whl.metadata showing more than 300ms spent trying to load the thing, only to give up and treat it as utf-8 anyway.
You might not notice this in a normal application, but I'm fetching these in a loop, with cachecontrol as the session, with a hot cache.
Here are traces for this change (the green block is
get_package_metadata
and the total time is visible top-right):Before:
After: