
Conversation

@charliermarsh (Member)

Summary

The idea here is to always compute at least a SHA256 hash for all wheels, then store unzipped wheels in a content-addressed location in the archive directory. This will help with disk space (since we'll avoid storing multiple copies of the same wheel contents) and with cache reuse, since we can now reuse unzipped distributions from uv pip install in uv sync commands (which already always require hashes).
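As a sketch of what "content-addressed" means here (hypothetical layout; the `archive-v0` bucket name is taken from a traceback later in this thread, and uv's real key derivation may differ):

```python
import hashlib
from pathlib import Path


def archive_path(cache_dir: Path, wheel: Path) -> Path:
    """Derive a content-addressed unpack location for a wheel.

    Sketch only: the key is the SHA256 of the wheel's raw bytes,
    hashed in chunks so large wheels aren't read into memory at once.
    """
    sha256 = hashlib.sha256()
    with open(wheel, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
    return cache_dir / "archive-v0" / sha256.hexdigest()
```

Two wheels with identical bytes map to the same directory, so the unpacked contents are stored once and shared.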

Closes #1061.

Closes #13995.

Closes #16786.

@charliermarsh (Member Author)

Still a few things I want to improve here.

Ok(temp_dir)
}
})
.await??;
@charliermarsh (Member Author)

This change I am a little worried about, because it will be a regression to move away from our parallel synchronous zip reader (https://github.com/GoogleChrome/ripunzip) to streaming. On the other hand, it means we'll no longer have two zip implementations.
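The motivation for streaming is that a streaming reader can hash and unpack in a single pass over the bytes. A minimal Python sketch of that coupling (in-memory for brevity; with a real network stream, the hash would be updated chunk by chunk as bytes arrive):

```python
import hashlib
import io
import zipfile


def unzip_and_hash(data: bytes, dest: str) -> str:
    """Hash the raw archive bytes and extract them in one logical pass.

    Sketch only: a production implementation would interleave hashing
    with download/extraction rather than buffering the whole archive.
    """
    digest = hashlib.sha256(data).hexdigest()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(dest)
    return digest
```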

@charliermarsh (Member Author)

Unfortunately I probably need to benchmark this.

@charliermarsh (Member Author), Nov 22, 2025

A huge performance degradation for large files (300ms to 1.3s):

unzip_sync_small        time:   [5.4237 ms 5.5621 ms 5.7425 ms]
                        change: [-13.499% -7.9934% -2.6150%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

unzip_sync_medium       time:   [12.587 ms 13.109 ms 13.788 ms]
                        change: [+3.3102% +8.3958% +15.174%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking unzip_sync_large: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 34.9s, or reduce sample count to 10.
unzip_sync_large        time:   [328.01 ms 331.20 ms 334.39 ms]
                        change: [-2.9970% +0.4267% +3.5014%] (p = 0.80 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

unzip_stream_small      time:   [5.5436 ms 5.6239 ms 5.7148 ms]
                        change: [-9.3797% -6.8046% -4.0946%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

unzip_stream_medium     time:   [16.696 ms 17.381 ms 18.161 ms]
                        change: [+16.309% +21.820% +27.116%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

Benchmarking unzip_stream_large: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 118.7s, or reduce sample count to 10.
unzip_stream_large      time:   [1.2877 s 1.3015 s 1.3146 s]
                        change: [+12.395% +14.194% +15.823%] (p = 0.00 < 0.05)
                        Performance has regressed.

@charliermarsh (Member Author)

What we could do instead is: keep our parallel unzip, then use blake3's parallelized mmap hash for files that we have on disk (at least for wheels we build ourselves, since we never validate hashes for those).
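A sketch of that alternative: hash the on-disk file in a step that is independent of the (still parallel) unzip. blake3 is a third-party library, so `hashlib.sha256` stands in here, and the mmap is single-threaded rather than blake3's parallelized version:

```python
import hashlib
import mmap


def hash_file(path: str) -> str:
    """Hash a file that already exists on disk via mmap.

    Stand-in sketch: the thread proposes blake3's parallelized mmap
    hashing; sha256 over a plain mmap illustrates the same decoupling
    of "hash the archive" from "unzip the archive".
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return hashlib.sha256(m).hexdigest()
```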

@charliermarsh (Member Author)

I guess this wouldn't work for path-based wheels that are provided in uv add, though. (Although in that case we already hash them for uv add, even if not for uv pip install, so the regression only affects uv pip.)

@zanieb (Member), Nov 22, 2025

Ah, that's pretty unfortunate. I'm curious about content-addressing by Blake3 and only computing the SHA256 where needed; that idea sounds compelling.

@charliermarsh (Member Author)

We'd still have to compute the SHA256 for any local wheels used in uv add or similar (unless we changed the lockfile to also use Blake3, which could be a good idea?). The only benefit would be for wheels we build ourselves (since we'd no longer need to hash those).

Member

Trying to weigh this regression against the gains of content-addressable caching and one fewer unzip implementation: there are three cases where we have local wheels:

  • A wheel comes from a build: building the wheel should already be much slower than our hashing and unpacking.
  • A local path dependency as an index override: there is a regression, but it only applies to one or a few wheels. If it's the torch wheel, it becomes noticeable.
  • Vendoring with a large find-links directory: this is the case where we lose the most compared to parallel unzipping.

@charliermarsh (Member Author)

I think this is correct. However, the latter two only apply to uv pip, since we already pay that cost in uv sync et al (to include a hash in the lockfile).

Member

With the increased focus on the project API, this only being a slowdown for uv pip makes the tradeoff sound even better.

@charliermarsh charliermarsh marked this pull request as ready for review November 22, 2025 14:46
raise BackendUnavailable(data.get('traceback', ''))
pip._vendor.pyproject_hooks._impl.BackendUnavailable: Traceback (most recent call last):
File "/Users/example/.cache/uv/archive-v0/3783IbOdglemN3ieOULx2/lib/python3.13/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend
File "/Users/example/.cache/uv/archive-v0/97de8790030bbd5c2d96b7ec782fc2f7820ef8dba6db909ccf95449f2d062d4b/lib/python3.13/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 77, in _build_backend
@charliermarsh (Member Author)

Another risk here is that this component is significantly longer, which hurts path length.

@charliermarsh (Member Author)

We could base64.urlsafe_b64encode it, which would be ~43 characters (less than the 64 here, but more than the 21 we used before).
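The ~43-character estimate checks out: a 32-byte SHA256 digest is 64 hex characters, but 43 urlsafe-base64 characters once padding is stripped (the input bytes below are a stand-in, not a real wheel):

```python
import base64
import hashlib

# A 32-byte digest of some stand-in bytes.
digest = hashlib.sha256(b"example wheel bytes").digest()

# Hex: 2 characters per byte.
assert len(digest.hex()) == 64

# urlsafe base64: 4 characters per 3 bytes, one "=" of padding dropped.
encoded = base64.urlsafe_b64encode(digest).rstrip(b"=")
assert len(encoded) == 43
```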

@zanieb (Member), Nov 22, 2025

A few ideas...

  1. base64 encoding seems reasonable
  2. We might want to store it as {:2}/{2:}? git and npm do this to shard directories. I guess we don't have that problem today, but if we're changing it maybe we should consider it? It looks like you did in 3bf79e2?
  3. We could do a truncated hash with a package id for collisions? {:8}/{package-id} (I guess the package-id could come first?). We could persist the full hash to a file as a safety check too.

@charliermarsh (Member Author)

Yes. I did it as {:2}/{2:4}/{4:} in an earlier commit then rolled it back because it makes various things more complicated (e.g., for cache prune we have to figure out if we can prune the directories recursively). I can re-add it if it seems compelling.

@charliermarsh (Member Author)

We could do a truncated hash with a package id for collisions?

I'd prefer not to couple the content-addressed storage to a concept like "package names" if possible. It's meant to be more general (e.g., we also use it for cached environments).

@charliermarsh (Member Author), Nov 22, 2025

({:2}/{2:4}/{4:} is what PyPI uses; it looks like pip does {:2}/{2:4}/{4:6}/{6:}?)
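Both schemes are just a hex digest split at fixed widths; a hypothetical helper to illustrate:

```python
def shard(digest: str, widths: tuple[int, ...] = (2, 2)) -> str:
    """Split a hex digest into nested directories.

    widths=(2, 2) gives the PyPI-style {:2}/{2:4}/{4:} layout;
    widths=(2, 2, 2) gives pip's {:2}/{2:4}/{4:6}/{6:}.
    """
    parts, rest = [], digest
    for w in widths:
        parts.append(rest[:w])
        rest = rest[w:]
    parts.append(rest)
    return "/".join(parts)
```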

Member

then rolled it back because it makes various things more complicated

Fair enough. I think people do it to avoid directory size limits (i.e., the number of items allowed in a single directory). I think we'd have had this problem already though if it was a concern for us? It seems fairly trivial to check both locations in the future if we determine we need it.

I'd prefer not to couple the content-addressed storage to a concept like "package names" if possible.

I think the idea that there's a "disambiguating" component for collisions if we truncate the hash doesn't need to be tied to "package names" specifically. The most generic way to do it would be to have /0, /1, ... directories with /{id}/HASH files and iterate over them? I sort of don't like that though :)

It's broadly unclear to me how much engineering we should do to avoid a long path length.

@charliermarsh (Member Author)

It may not really matter. I can't remember the specifics, but what ends up happening here is: we create a temp dir, unzip into it, then move the temp dir into this location and hardlink from this location. So I don't think we end up referencing paths within these archives?
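A rough sketch of that flow (all names hypothetical: "deadbeef" stands in for the content hash, and writing a single file stands in for unzipping a wheel):

```python
import os
import shutil
import tempfile
from pathlib import Path


def publish_and_link(payload: bytes, archive_dir: Path, site_packages: Path) -> None:
    archive_dir.mkdir(parents=True, exist_ok=True)

    # 1. Unpack into a private temp dir next to the archive.
    tmp = Path(tempfile.mkdtemp(dir=archive_dir))
    (tmp / "module.py").write_bytes(payload)

    # 2. Move the temp dir into its content-addressed slot
    #    ("deadbeef" is a hypothetical hash key).
    slot = archive_dir / "deadbeef"
    if not slot.exists():
        shutil.move(str(tmp), str(slot))
    else:
        shutil.rmtree(tmp)  # another process published first

    # 3. Hardlink from the slot into the environment, so no path
    #    *within* the archive needs to be referenced afterwards.
    site_packages.mkdir(parents=True, exist_ok=True)
    os.link(slot / "module.py", site_packages / "module.py")
```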

@konstin konstin added enhancement New feature or improvement to existing functionality performance Potential performance improvement labels Dec 1, 2025
@zanieb (Member) commented Dec 2, 2025

How does this relate to #888?

@konstin (Member) commented Dec 2, 2025

IIRC the hash-checking ideas behind RECORD never materialized: pip doesn't check the RECORD and neither does uv, and there are plans to remove it (https://discuss.python.org/t/discouraging-deprecating-pep-427-style-signatures/94968). The consensus has shifted to using hashes and signatures for the entire archive, presented outside of the archive (on the index page) instead of being shipped with it.


Development

Successfully merging this pull request may close these issues:

  • weird behaviour of uv sync and the uv cache
  • Store distributions in cache with content addressed keys
  • Store symlinked directories under their SHA
