[integration] Shallow clones for cargo #449

Open
@Byron

This issue collects thoughts and facts about the state of shallow clones for git repositories when used by cargo.

Here is a list of steps to take in cargo to support step-wise integration of gitoxide.

Terminology

Let's be sure we are on the same page, so I repeat here this comment by @Eh2406 to set a baseline.

  • The "crates.io index": The index that backs crates.io. It lives at https://github.com/rust-lang/crates.io-index .
  • An "alternative git index": A git repo with the same structure as the "crates.io index", but for a different set of crates.
  • A "git dependency": A git repo cargo clones because of a git = "<url>" dependency in a Cargo.toml.

Another source of miscommunication is that there are two interconnected potential changes.

  • Switching from libgit2 -> gitoxide
  • Adding new functionality that is only available in gitoxide (Specifically "shallow clones")

Of course, the second depends on the first.

Tracking issues

Cloning crates.io + crates (non-shallow)

It would be most straightforward to implement git::fetch(…) using gitoxide. This includes all transports and all credentials options that git2 supports for maximum usability.

Note that checkouts would still be performed by git2.

Requirements

All requirements are to be validated with the cargo-team, and a checkmark means it is indeed a requirement.

Cloning crates.io + crates (shallow)

Add a parameter to support shallow fetches that maintain shallow-ness.
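As a minimal sketch of what "maintaining shallow-ness" means at the git level, the following uses the plain git CLI (not gitoxide or cargo) against a throwaway local repository. The temporary paths, the `commit` helper and the demo identity are assumptions made purely for illustration.

```shell
# Build a throwaway upstream repository with two commits (all paths are temporary).
work=$(mktemp -d)
git init -q "$work/upstream"

# Hypothetical helper: create one empty commit in the upstream repository.
commit() {
  git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "$1"
}
commit first
commit second

# Shallow clone: only the tip commit is transferred.
# The file:// URL matters, a plain local path would ignore --depth.
git clone -q --bare --depth 1 "file://$work/upstream" "$work/shallow.git"

# Upstream moves on; fetching with --depth 1 again picks up the new tip
# and keeps the history truncated instead of deepening it.
commit third
git -C "$work/shallow.git" fetch -q --depth 1 origin

is_shallow=$(git -C "$work/shallow.git" rev-parse --is-shallow-repository)
commits_at_tip=$(git -C "$work/shallow.git" rev-list --count FETCH_HEAD)
echo "shallow: $is_shallow, commits visible from fetched tip: $commits_at_tip"
```

After the second fetch the clone still reports itself as shallow and sees a single commit from the fetched tip, which is the behaviour such a parameter would presumably have to preserve.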

Issues

Don't forget about the general considerations of shallow clones for database-like repositories by ehuss in a comment, which might make this option unusable. It's something to validate first. If it truly is an issue, shallow can be turned off for crates.io but can be used for crates clones.

Assumptions

These should be validated, since any assumption that does not hold may turn into an issue or risk one day.

  • Shallow clones are not inherently slower to serve anymore and are thus desirable without increasing the risk of being throttled by GitHub. Work on this started about 6 years ago and is likely mature by now.
  • ✅ Shallow clones actually save time and bandwidth: a maximally shallow clone is ~5.3x faster and uses about a quarter of the disk space. See details for the source data.

Questions

  • How to 'unshallow' a crates index? Some might want it for research. In any case, there should be a known path for this, so probably there must be an option for this in the cargo config no matter what will be the default.
    • No need, it's a special case and those who need it can always recreate the index from scratch. The index is an implementation detail.
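For those who do want the full history without recreating the index from scratch, plain git already offers a path. The sketch below is an assumption about how this could be done by hand with the git CLI; it is not a cargo option, and the local stand-in repository and paths are hypothetical.

```shell
# Throwaway upstream with three commits, standing in for an index repository.
work=$(mktemp -d)
git init -q "$work/upstream"
for msg in first second third; do
  git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "$msg"
done

# Start from a maximally shallow bare clone.
git clone -q --bare --depth 1 "file://$work/upstream" "$work/index.git"
before=$(git -C "$work/index.git" rev-parse --is-shallow-repository)

# --unshallow asks the remote for all missing history and removes the shallow boundary.
git -C "$work/index.git" fetch -q --unshallow origin
after=$(git -C "$work/index.git" rev-parse --is-shallow-repository)
history_len=$(git -C "$work/index.git" rev-list --count HEAD)
echo "shallow before: $before, after: $after, commits now reachable: $history_len"
```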

Requirements

  • a cargo-config setting to turn shallow clones on or off; it could be enabled by default for crates and disabled for the crates index.
  • documentation on ways to change shallow-ness of crates.io clones (purposefully omitting such documentation for crates clones merely because I consider them private to cargo)
  • validate that older cargo versions can still work with such an index. It's likely they can as git2 can open them (and we only access a single tree which has complete objects)
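The point about only accessing a single tree with complete objects can be checked by hand. The following is a sketch with the git CLI on a local stand-in repository; the file name `demo` and its contents are made up and do not reflect the real index entry format.

```shell
# Stand-in upstream with two commits touching one file.
work=$(mktemp -d)
git init -q "$work/upstream"
# Hypothetical helper: run git in the upstream repo with a demo identity.
g() { git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com "$@"; }

echo 'entry for 0.1.0' > "$work/upstream/demo"
g add demo
g commit -q -m "add demo 0.1.0"
echo 'entry for 0.2.0' >> "$work/upstream/demo"
g commit -q -am "add demo 0.2.0"

# Depth-1 bare clone: history is cut off, but the tip commit's tree is complete.
git clone -q --bare --depth 1 "file://$work/upstream" "$work/index.git"

# Listing the tree and reading a blob at HEAD needs no history at all.
tree_listing=$(git -C "$work/index.git" ls-tree --name-only HEAD)
last_line=$(git -C "$work/index.git" show HEAD:demo | tail -n 1)
echo "tree: $tree_listing, newest entry: $last_line"
```

This only shows that the git object model permits such reads; whether older cargo versions with their bundled git2 behave the same still needs the validation listed above.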

Notes by @Eh2406

  • "shallow clones" of the "crates.io index" we could experiment with. But stabilizing requires careful communication with GitHub to make sure we don't abuse their generosity. With sparse indexes coming along, I don't know that the coordination is worth setting up.
  • "shallow clones" of the "alternative git index"s we could experiment with. However, it's not very motivating as I suspect a lot of alternative indexes will switch to sparse indexes.

Interesting reading

Checkout worktrees (without submodules)

This effectively is an implementation of git reset --hard as used in GitCheckout::reset(…).
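For reference, this is the observable behaviour of `git reset --hard` that such an implementation would need to match, sketched with the git CLI on a throwaway repository. The file name `lib.rs`, the paths and the `g` helper are illustrative assumptions.

```shell
work=$(mktemp -d)
git init -q "$work/checkout"
# Hypothetical helper: run git in the demo checkout with a demo identity.
g() { git -C "$work/checkout" -c user.name=demo -c user.email=demo@example.com "$@"; }

echo 'v1' > "$work/checkout/lib.rs"
g add lib.rs
g commit -q -m "v1"
rev_v1=$(g rev-parse HEAD)

echo 'v2' > "$work/checkout/lib.rs"
g commit -q -am "v2"

# A local, uncommitted modification that the reset is expected to discard.
echo 'local edit' >> "$work/checkout/lib.rs"

# Move HEAD to the requested revision and make index and worktree match it exactly.
g reset -q --hard "$rev_v1"
content=$(cat "$work/checkout/lib.rs")
echo "worktree content after reset: $content"
```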

Questions

  • Does cargo manipulate existing checkouts to match different versions as needed, or is each version of a clone in its own worktree, along with a git repository copy? It's probably the latter, but let's validate that. - YES, with hard-links if available.
    • Cargo splits checkouts (git/checkouts) and their source, and does a full clone from these to the sources (git/db, bare repos). Worktrees should help here, saving quite a bit of space.
  • Does cargo update these db clones or always create a new one? This is the question of how to update worktrees with submodules properly after changes were pulled. I have a feeling the current setup works around this.

Notes by @Eh2406

  • "shallow clones" of "git dependency"s are definitely worth striving for. I don't think it needs an opt-in, unless there are practical use cases where people might need the full history.

Checkout submodules

Update submodules as in GitCheckout::update_submodules(…).

Out of scope

Reducing the local size of the .cargo directory seems very doable even without great effort, but we chose to tackle these separately.

  • optimize crates clones by using worktree checkouts instead of local file://… clones.
  • use a bare clone of the crates.io index and extract file content directly from git.
    • cargo is doing that already
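To illustrate the first item, here is a sketch of a worktree checkout attached to a bare repository, using the git CLI. The directory names `db.git` and `checkout` merely stand in for cargo's git/db and git/checkouts and are assumptions, as is the `g` helper.

```shell
work=$(mktemp -d)
# Hypothetical helper: git with a demo identity.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# A small source repository for the bare "db" clone to copy from.
g init -q "$work/src"
echo 'fn main() {}' > "$work/src/main.rs"
g -C "$work/src" add main.rs
g -C "$work/src" commit -q -m "initial"
git clone -q --bare "$work/src" "$work/db.git"

# Instead of cloning db.git again, attach a worktree that shares its object database.
git -C "$work/db.git" worktree add --detach "$work/checkout" HEAD >/dev/null 2>&1

# In a linked worktree, .git is a small file pointing back at db.git, not a full repository.
if [ -f "$work/checkout/.git" ]; then link_kind=file; else link_kind=directory; fi
echo ".git in the checkout is a $link_kind"
```

Because the worktree stores no objects of its own, the space per checkout is limited to the checked-out files plus a little bookkeeping inside the bare repository.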

bare shallow clones vs non-shallow ones

❯ git clone --bare https://github.com/rust-lang/crates.io-index index-full-history.git
Cloning into bare repository 'index-full-history.git'...
remote: Total 457133 (delta 151), reused 69 (delta 0), pack-reused 456913
Receiving objects: 100% (457133/457133), 209.38 MiB | 1.21 MiB/s, done.
Resolving deltas: 100% (319566/319566), done.

~/.cargo/registry took 2m59s
❯ git clone --depth 1 --bare https://github.com/rust-lang/crates.io-index index-shallow-depth-1.git
Cloning into bare repository 'index-shallow-depth-1.git'...
remote: Total 108481 (delta 57698), reused 92572 (delta 47615), pack-reused 0
Receiving objects: 100% (108481/108481), 53.77 MiB | 2.05 MiB/s, done.
Resolving deltas: 100% (57698/57698), done.

~/.cargo/registry took 34s
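
As a quick sanity check of the numbers above, the ratios can be computed directly from the two transcripts. This sketch uses the received pack sizes reported by git as a proxy for disk usage, which is an assumption.

```shell
# Timings from the two clones above: 2m59s (full history) vs 34s (depth 1).
full_secs=$((2 * 60 + 59))
shallow_secs=34
speedup=$(awk -v a="$full_secs" -v b="$shallow_secs" 'BEGIN { printf "%.2f", a / b }')

# Received pack sizes in MiB: 209.38 (full history) vs 53.77 (depth 1).
size_ratio=$(awk 'BEGIN { printf "%.2f", 209.38 / 53.77 }')

echo "speedup: ${speedup}x, full clone is ${size_ratio}x larger"
```

This gives a speedup of about 5.26x and a size factor of about 3.89, in line with the "~5.3x faster" and "1/4th" figures quoted earlier. Note that the two clones ran at different transfer rates (1.21 MiB/s vs 2.05 MiB/s), so the timing ratio reflects more than the difference in data volume.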

worktree checkout sizes (compressed, uncompressed)

.cargo/registry/index-shallow-depth-1.git ( master)
❯ l
.rw-r--r-- 703Mi byron staff  1 Jul 11:40 archive.tar
.rw-r--r--  44Mi byron staff  1 Jul 11:40 archive.tar.gz
