Description
This issue collects thoughts and facts about the state of shallow clones for git repositories when used by cargo.
Here is a list of steps to take in cargo to support step-wise integration of gitoxide.
Terminology
Let's be sure we are on the same page, so I repeat here this comment by @Eh2406 to set a baseline.
- The "crates.io index": The index that backs crates.io. It lives at https://github.com/rust-lang/crates.io-index .
- An "alternative git index": A git repo with the same structure as the "crates.io index", but for a different set of crates.
- A "git dependency": A git repo cargo clones because of a git = "<url>" dependency in a Cargo.toml.
Another source of miscommunication is that there are two interconnected potential changes.
- Switching from libgit2 -> gitoxide
- Adding new functionality that is only available in gitoxide (specifically "shallow clones")
Of course, one depends on the other.
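To make the dependency between the two changes concrete, here is a minimal sketch. The type and field names (`Backend`, `FetchSettings`) are hypothetical and exist only for illustration; they are not cargo's or gitoxide's API.

```rust
/// Which git implementation performs a fetch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Libgit2,
    Gitoxide,
}

/// Hypothetical fetch settings; the field names are illustrative only.
#[derive(Debug, Clone, Copy)]
struct FetchSettings {
    backend: Backend,
    shallow: bool,
}

impl FetchSettings {
    /// Shallow clones are only available with gitoxide, so the
    /// combination of libgit2 and shallow is rejected.
    fn validate(self) -> Result<Self, String> {
        if self.shallow && self.backend == Backend::Libgit2 {
            Err("shallow clones require the gitoxide backend".to_string())
        } else {
            Ok(self)
        }
    }
}

fn main() {
    let ok = FetchSettings { backend: Backend::Gitoxide, shallow: true };
    let bad = FetchSettings { backend: Backend::Libgit2, shallow: true };
    println!("gitoxide + shallow accepted: {}", ok.validate().is_ok());
    println!("libgit2 + shallow rejected: {}", bad.validate().is_err());
}
```

Switching the backend alone is always valid in this sketch; only the shallow flag carries a precondition.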
Tracking issues
Cloning crates.io + crates (non-shallow)
It would be most straightforward to implement git::fetch(…) using gitoxide. This includes all transports and all credentials options that git2 supports for maximum usability.
Note that checkouts would still be performed by git2.
Requirements
All requirements are to be validated with the cargo team, and a checkmark means it's indeed a requirement.
- a cargo-config setting to control if gitoxide should be used. Use an unstable flag as suggested by Josh.
Cloning crates.io + crates (shallow)
Add a parameter to support shallow fetches that maintain shallow-ness.
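As a rough sketch of what such a parameter could look like, the following models the shallow setting as an enum and translates it to `git fetch` command-line arguments. The `Shallow` type and `fetch_args` helper are hypothetical names for this example; only the `--depth` and `--unshallow` options of `git fetch` are taken from git itself.

```rust
use std::num::NonZeroU32;

/// How a fetch should treat the shallow boundary (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Shallow {
    /// Fetch normally without passing any depth option.
    NoChange,
    /// Limit history to `depth` commits from each remote tip.
    Depth(NonZeroU32),
    /// Fetch the complete history, turning a shallow repo into a full one.
    Unshallow,
}

/// Build the argument list for `git fetch` against `remote`.
fn fetch_args(remote: &str, shallow: Shallow) -> Vec<String> {
    let mut args = vec!["fetch".to_string()];
    match shallow {
        Shallow::NoChange => {}
        Shallow::Depth(depth) => args.push(format!("--depth={}", depth)),
        Shallow::Unshallow => args.push("--unshallow".to_string()),
    }
    args.push(remote.to_string());
    args
}

fn main() {
    let depth_one = Shallow::Depth(NonZeroU32::new(1).expect("1 is non-zero"));
    println!("git {}", fetch_args("origin", depth_one).join(" "));
    println!("git {}", fetch_args("origin", Shallow::Unshallow).join(" "));
    println!("git {}", fetch_args("origin", Shallow::NoChange).join(" "));
}
```

Passing the same depth on every fetch is what keeps a depth-1 clone at depth 1 instead of letting its history grow with each update.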
Issues
Don't forget about the general considerations of shallow clones for database-like repositories by ehuss in a comment, which might make this option unusable. It's something to validate first. If it truly is an issue, shallow can be turned off for crates.io but can be used for crates clones.
Assumptions
These should be validated to see if they may indeed be considered issues or risks one day in case they are proven true.
- Shallow clones are not inherently slower to serve anymore and are thus desirable without increasing the risk of being throttled by GitHub. Work on this started 6 years ago and is likely mature by now.
- ✅ Shallow clones actually save time and bandwidth: a maximally shallow clone is ~5.3x faster and uses about 1/4 of the disk space. See details for the source data.
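The two ratios can be re-derived from the `git clone --bare` runs quoted at the end of this issue. This sketch assumes the received pack size is a fair proxy for the disk space of a bare clone; the `ratio` helper is just a name chosen here.

```rust
/// Ratio of the full clone's cost to the shallow clone's cost.
fn ratio(full: f64, shallow: f64) -> f64 {
    full / shallow
}

fn main() {
    // Wall-clock times of the two clone runs.
    let full_secs = 2.0 * 60.0 + 59.0; // 2m59s
    let shallow_secs = 34.0;

    // Received object sizes in MiB (assumed to approximate disk usage).
    let full_mib = 209.38;
    let shallow_mib = 53.77;

    println!("speedup: {:.2}x", ratio(full_secs, shallow_secs)); // speedup: 5.26x
    println!("size ratio: {:.2}x", ratio(full_mib, shallow_mib)); // size ratio: 3.89x
}
```

These are single measurements on one network connection, so the figures are indicative rather than a benchmark.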
Questions
- How to 'unshallow' a crates index? Some might want it for research. In any case, there should be a known path for this, so probably there must be an option for this in the cargo config no matter what will be the default.
- No need, it's a special case and those who need it can always recreate the index from scratch. The index is an implementation detail.
Requirements
- a cargo-config setting to turn shallow clones on or off - maybe enabled by default for crates and disabled for the crates index.
- documentation on ways to change shallow-ness of crates.io clones (purposefully omitting such documentation for crates clones merely because I consider them private to cargo)
- validate that older cargo versions can still work with such an index. It's likely they can as git2 can open them (and we only access a single tree which has complete objects)
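One way to picture the setting is an explicit user choice that overrides a per-repository-kind default. The sketch below is hypothetical throughout: the `RepoKind` enum mirrors the terminology section, and the defaults (shallow for git dependencies, full history for indexes) are merely the "maybe" from the first requirement, not a decision.

```rust
/// The three kinds of repositories from the terminology section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RepoKind {
    CratesIoIndex,
    AlternativeIndex,
    GitDependency,
}

/// Decide whether to fetch shallowly. An explicit user setting wins;
/// otherwise fall back to a per-kind default (assumed here, not decided).
fn use_shallow(kind: RepoKind, user_setting: Option<bool>) -> bool {
    user_setting.unwrap_or(match kind {
        RepoKind::GitDependency => true,
        RepoKind::CratesIoIndex | RepoKind::AlternativeIndex => false,
    })
}

fn main() {
    for kind in [
        RepoKind::CratesIoIndex,
        RepoKind::AlternativeIndex,
        RepoKind::GitDependency,
    ] {
        println!("{:?}: default shallow = {}", kind, use_shallow(kind, None));
    }
    // An explicit setting overrides the default for any kind.
    println!(
        "index with override: {}",
        use_shallow(RepoKind::CratesIoIndex, Some(true))
    );
}
```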
Notes by @Eh2406
- "shallow clones" of the "crates.io index" we could experiment with. But stabilizing requires careful communication with GitHub to make sure we don't abuse their generosity. With sparse indexes coming along, I don't know that the coordination is worth setting up.
- "shallow clones" of the "alternative git index"s we could experiment with. However, it's not very motivating as I suspect a lot of alternative indexes will switch to sparse indexes.
Interesting reading
- an HTTP based protocol to get index information (with some research on the size of worktrees under various conditions)
- this comment by a GitHub engineer to explain what happens with crates.io like repositories (2016)
Checkout worktrees (without submodules)
This effectively is an implementation of git reset --hard as used in GitCheckout::reset(…).
Questions
- Does cargo manipulate existing checkouts to match different versions as needed, or is each version of a clone in its own worktree, along with a git repository copy? It's probably the latter, but let's validate that. - YES, with hard-links if available.
- Cargo splits checkouts (git/checkouts) and their source (git/db, bare repos), and each checkout is a full clone of its source. Worktrees should help here, saving quite a bit of space.
- Does cargo update these db clones or always create a new one? It's the question of how to update worktrees with submodules properly after changes were pulled. I have a feeling the current setup works around this.
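The split between the bare database repositories and their checkouts can be sketched as two path helpers. The `git/db` and `git/checkouts` directories come from the notes above; the `ident` and `short_rev` path components are assumptions about the naming and are only here to show that several checkouts can hang off one database repository.

```rust
use std::path::{Path, PathBuf};

/// Bare database repository for one git dependency.
/// `ident` is a hypothetical per-repository directory name.
fn db_path(cargo_home: &Path, ident: &str) -> PathBuf {
    cargo_home.join("git").join("db").join(ident)
}

/// Checkout of one revision, cloned from the database repository.
/// `short_rev` is a hypothetical per-revision directory name.
fn checkout_path(cargo_home: &Path, ident: &str, short_rev: &str) -> PathBuf {
    cargo_home
        .join("git")
        .join("checkouts")
        .join(ident)
        .join(short_rev)
}

fn main() {
    let home = Path::new(".cargo");
    println!("{}", db_path(home, "dep-0123").display());
    println!("{}", checkout_path(home, "dep-0123", "abc1234").display());
    println!("{}", checkout_path(home, "dep-0123", "def5678").display());
}
```

With worktrees, the per-revision directories would share the object database of the `git/db` repository instead of each holding a full clone.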
Notes by @Eh2406
- "shallow clones" of "git dependency"s is definitely worth striving for. I don't think it needs an opt-in, unless there are practical use cases where people might need the full history.
Checkout submodules
Update submodules as in GitCheckout::update_submodules(…).
Out of scope
Reducing the local size of the .cargo directory seems very doable even without great effort, but we chose to tackle these separately.
- optimize crates clones by using worktree checkouts instead of local file://… clones.
- use a bare clone of the crates.io index and extract file content directly from git. cargo is doing that already
bare shallow clones vs non-shallow ones
❯ git clone --bare https://github.com/rust-lang/crates.io-index index-full-history.git
Cloning into bare repository 'index-full-history.git'...
remote: Total 457133 (delta 151), reused 69 (delta 0), pack-reused 456913
Receiving objects: 100% (457133/457133), 209.38 MiB | 1.21 MiB/s, done.
Resolving deltas: 100% (319566/319566), done.
~/.cargo/registry took 2m59s
❯ git clone --depth 1 --bare https://github.com/rust-lang/crates.io-index index-shallow-depth-1.git
Cloning into bare repository 'index-shallow-depth-1.git'...
remote: Total 108481 (delta 57698), reused 92572 (delta 47615), pack-reused 0
Receiving objects: 100% (108481/108481), 53.77 MiB | 2.05 MiB/s, done.
Resolving deltas: 100% (57698/57698), done.
~/.cargo/registry took 34s
worktree checkout sizes (compressed, uncompressed)
.cargo/registry/index-shallow-depth-1.git ( master)
❯ l
.rw-r--r-- 703Mi byron staff 1 Jul 11:40 archive.tar
.rw-r--r-- 44Mi byron staff 1 Jul 11:40 archive.tar.gz