Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for shallow clones and fetches with gitoxide #11840

Merged
merged 17 commits into from
May 3, 2023

Conversation

Byron
Copy link
Member

@Byron Byron commented Mar 13, 2023

This PR makes it possible to enable shallow clones and fetches for git dependencies and crate indices independently with the -Zgitoxide=fetch,shallow_deps and -Zgitoxide=fetch,shallow_index respectively.

Tasks

  • setup the shallow option when fetching, differentiated by 'registry' and 'git-dependency'
  • validate registries are cloned shallowly and fetched shallowly
  • validate git-dependencies are cloned shallowly and fetched shallowly
  • a test to show what happens if a shallow index is opened with git2 (it can open it and fetch like normal, no issues)
  • assure that git2 can safely operate on a shallow clone - we unshallow it beforehand, both for registries and git dependencies
  • assure git-deps with revisions are handled correctly (they should just not be shallow, and they should unshallow themselves if they are)
  • make sure shallow index clones aren't seen by older cargo's
  • make sure shallow git dependency clones aren't seen by older cargo's
  • shallow.lock test and more test-suite runs with shallow clones enabled for everything
  • release new version of gix with full shallow support and use it here
  • check why shallow files remain after unshallowing. Should they not rather be deleted if empty? - Yes, git does so as well, implemented with this commit
  • see if it can be avoided to ever unshallow an existing -shallow clone by using the right location from the start. If not, test that we can go shallow->unshallow->shallow without a hitch. Cannot happen anymore as it can predict the final location perfectly.
  • Cargo.lock files don't prevent shallow clones
  • assure all other tests work with shallow cloning enabled (or fix the ones that don't with regression protection)
  • can the 'split-brain' issue be solved for good?

Review Notes

  • there is a chance of 'split brain' in git-dependencies as the logic for determining whether the clone/fetch is shallow is repeated in two places. This isn't the case for registries though.

Notes

  • I am highlighting that this is the gitoxide version of shallow clones as the git2 version might soon be available as well. Having that would be good as it would ensure interoperability remains intact.
  • Maybe for when git2 has been phased out, i.e. everything else is working, I think (unscientifically) there might be benefits in using worktrees for checkouts. Admittedly I don't know the history of why they weren't used in the first place. Also: gitoxide doesn't yet support local clones and might not have to if worktrees were used instead.

@rustbot
Copy link
Collaborator

rustbot commented Mar 13, 2023

r? @epage

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added A-git Area: anything dealing with git A-registries Area: registries S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 13, 2023
@Byron Byron marked this pull request as draft March 13, 2023 10:58
@Byron Byron changed the title [DRAFT] support for shallow clones and fetches with gitoxide support for shallow clones and fetches with gitoxide Mar 13, 2023
@Byron Byron force-pushed the shallow-support branch 2 times, most recently from 4459329 to 367074c Compare March 13, 2023 17:16
.all()?
.count(),
2,
"it fetched only the new portion of the history, even though it was unaware of the shallow boundary"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is truly amazing. Do we know that this will always work correctly?

I'm worried about the case where a repository returns to an earlier version of the file. Say we have commit 1 with "a", 2 with "b", and 3 with "a". We do a shallow clone of 2. Then we do a libgit2 full update from 2->3, the server tells us "the contents of the file is the same as it was in commit 1 which you already have" but we don't have it.

I am concerned that we always correctly handle any errors from libgit2 and that we never deepen a shallow clone.

Copy link
Member Author

@Byron Byron Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, a valid point! But is it that exhaustive?

If that was true, it wouldn't only have to align the commits to send along with the changes within them, but also back-track through the entire history to do some sort of tree optimization. Not having implemented that part yet definitely leaves my knowledge spotty there, but having implemented pack generation I know that when building the smallest possible pack of a set of commits, I do diff each of them to only send changes. Each of these object candidates to send would now have to pass through a filter that answers the question of "does this object exist already in the portion of the commit graph that the client has? If yes, don't send it." However, what speaks for it being able to do that is its ability to base a delta entry in the back off an object that it knows the other side has (so it doesn't have to send it).

I am aware that the server side has caches which facilitate all kinds of operations (which are mostly non-functional when doing a shallow clone), but that level of sophistication still baffles me :D.

Excuse my ramblings, I think the answer here is that it's still early and I am writing tests to see what happens, and more testing has to be done to see if it's truly safe. If there is any doubt, and I will see if I can get clarification, it would be better to nuke the repository and re-clone it with git2.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of researching this further, I thought it's safest to reinitialize repo instead so using git2 on top of a shallow repo will unshallow it effectively and in a controlled fashion. I am also watching the 'shallow' PR over at the libgit2 repo so this can be adjusted once it becomes available there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will always be older cargoes that do not know that shallow clones are a thing. We need a mechanism to ensure that if we shallow clone a repository older cargoes will ether "work entirely correctly" or "not see the checkout".

There are two big things that would not qualify as "work entirely correctly". The obvious one, is crashes with an error. The less obvious one is unshallow the checkout in one of the ways that is extremely expensive for GitHub.

A undesirable scenario that I think still qualifies as "work entirely correctly" is, do a Delta fetch (which as you demonstrate almost always works) and if there's an error blow away the repository and do a full fetch. If we can determine what errors are likely from libgit2 and we have a re-fetch fallback for those errors, I will feel significantly better.

If for any of these reasons we need to make sure that old cargoes do not see shallow clones we will have to make sure that they are put in a different location. Either by adding it to the hash in the repo name or by /git/shallow-db.

All of this applies just as well whether were adding shallowness from git2 or from gix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this to the forefront of my mind.

Having seen what happens, I think the default would be that git2 can operate on shallow repos and receive only an incremental pack. This might not contain all objects they need due to the scenario you mentioned if git really works that well (which it probably does), resulting in some probability of hard errors. With that being a possibility (and even if it is not right now it might be in the future), I feel like it would be best to not let git2 touch shallow repos at all by placing them in their own directory.

Something I also thought about is if it's an option to 'cut' the feature down to an MVP level that yields value while not being used by everyone. Specifically, what if those shallow clones of the index would only be 'for CI' which makes them one-time-use only. Then there is no worry of older cargo's seeing them.

Thinking about it more, git-dependencies might be major benefactors after all, and having them on CI only is certainly a start, but ultimately people probably also want them locally which makes me arrive with the original problem that seemed like changing their hash (to make them invisible) is the safest bet and easiest to implement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having made sure that git-dependencies with revisions also work as they should by disabling shallow clones or unshallowing existing shallow clones it became clear that these situations can't be handled correctly by older cargos - instead they will find the revision they try to checkout missing and fail with an error.

The only way around this that I see would be to adjust the place where shallow clones are stored or make them unusable in other ways, I don't know yet.

A problem I see with the current implementation is that it itself makes a decision what to do about the shallowness of a clone, at a stage where the destination repository is already set and with it its location. I will see if it's feasible to move the whole clone into another location then once it's observable whether or not a clone is still shallow.

Running all tests against the 'make all clone and fetches shallow' has proven itself to be very useful to see more special cases and make them work, so when changing the location of clones I will use that as a measure of feasibility as well even though as of now there is no automated test that runs like this.

@Byron Byron force-pushed the shallow-support branch 2 times, most recently from 45f874a to e81797d Compare March 14, 2023 13:07
Copy link
Member Author

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will add one more test today, but besides that I believe shallow clones and fetches are working :). I will also run the entire test suite with shallow clones to see if there is unusual failures.

Please feel free to try it yourself.

&msg,
)?;
} else if let Some((action, remote)) =
find_in(&tasks, |t| progress_by_id(remote_progress, t))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specifically for avoiding pauses when shallow cloning the index, which is when the remote does far more work.

src/cargo/sources/git/utils.rs Show resolved Hide resolved
src/cargo/sources/git/utils.rs Show resolved Hide resolved
@Byron Byron force-pushed the shallow-support branch 5 times, most recently from 0ee9c5d to 44924a7 Compare March 15, 2023 18:46
@Byron
Copy link
Member Author

Byron commented Mar 15, 2023

A couple of additional tests have been added to particularly validate that locked revisions of git dependencies work as they should, and furthermore all shallow clones are now placed under a name with the -shallow suffix (I wasn't particularly creative). That way, these shallow repositories aren't going to be visible to git2 or to older cargo's which should make shallow repos backwards compatible. The changes to clone locations have been placed into their own commits to make clearer how tests had to adjusted to accommodate.

All in all, this seems to address all issues that were brought up so far and I am looking forward to receiving more feedback.

@Byron Byron marked this pull request as ready for review March 15, 2023 18:52
@bors
Copy link
Contributor

bors commented Apr 1, 2023

☔ The latest upstream changes (presumably #11881) made this pull request unmergeable. Please resolve the merge conflicts.

@Byron Byron force-pushed the shallow-support branch from 44924a7 to 518b909 Compare April 1, 2023 12:28
@bors
Copy link
Contributor

bors commented Apr 3, 2023

☔ The latest upstream changes (presumably #11928) made this pull request unmergeable. Please resolve the merge conflicts.

@Byron Byron force-pushed the shallow-support branch from 518b909 to 29f0760 Compare April 4, 2023 11:42
Copy link
Member

@weihanglo weihanglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work! It is so close to complete that only tiny suggestions can I give 😆.

As of dealing with the blocker, we're getting closer of making cargo a workspace I believe.

src/cargo/sources/git/source.rs Show resolved Hide resolved
src/cargo/sources/git/utils.rs Outdated Show resolved Hide resolved
src/cargo/sources/git/utils.rs Outdated Show resolved Hide resolved
if matches!(self, RemoteKind::GitDependencyForbidShallow) {
return History::Unshallow;
} else {
gix::remote::fetch::Shallow::NoChange
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it put pressure on server side? According to this post from GitHub, they suggest avoiding subsequent fetches on a shallow cloned repo. Is this a thing we need to consider?

Copy link
Member Author

@Byron Byron Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the gist of this advice is that unshallowing shallow clones intentionally or accidentally incurs high cost on the server as it bypasses caches.

Unintentionally unshallowing is something that depends on branch configuration and refspecs (assuming the client code supports shallow), and the fact that cargo limits refspecs for fetches means that unintentional unshallowing would be the exception, not the rule. That could happen nonetheless when the history is rewritten past the shallow boundary for example.

For the crates index, I think this is really something to think about as users would run into this very situation every couple of months. But when they do, the server would traverse only a small history, so the cost incurred is not on the worst end of the spectrum.

Now that we have an implementation to show for, we might even pass it maybe along with an analysis to GitHub to be sure, and I think it's worth seeing this on a timescale as well. We are talking about maybe a year from now when this code might become available by default once git2 is phased out. And by then, the HTTP index is probably already cemented as default so the amount of clones we't actually see like this would be greatly reduced.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for the record. Here are some concern of shallow clones from GitHub Infra Team in the past.

We will continue the conversation on Zulip and post a summary here when we make progess.

src/cargo/sources/git/oxide.rs Show resolved Hide resolved
tests/testsuite/git.rs Outdated Show resolved Hide resolved
tests/testsuite/git.rs Outdated Show resolved Hide resolved
tests/testsuite/git.rs Outdated Show resolved Hide resolved
tests/testsuite/git.rs Outdated Show resolved Hide resolved
src/cargo/sources/git/utils.rs Show resolved Hide resolved
Byron added a commit to Byron/cargo that referenced this pull request Apr 17, 2023
@Byron
Copy link
Member Author

Byron commented Apr 17, 2023

Thanks a lot for the review, it's much appreciated.

In todays session I didn't address everything yet, but replied wherever I did. Tomorrow (or in the next days), I'd expect to address everything.

@Byron Byron force-pushed the shallow-support branch 2 times, most recently from 439c4f2 to c700901 Compare April 18, 2023 06:11
Ok(())
}

fn find_lexicographically_first_bar_checkout() -> std::path::PathBuf {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one not important style suggestion: move these helper functions either to the top or the bottom.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for calling it out - fixed in c207407551ca7888bff0ab06ea86ea879997f81a.

}

#[cargo_test]
fn gitoxide_unshallows_git_dependencies_that_may_not_be_shallow_anymore() -> anyhow::Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more unshallow, so title needs a chance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good catch, done in in c207407551ca7888bff0ab06ea86ea879997f81a.

// There is a specific commit to fetch and we will just do so in shallow-mode only
// to not disturb the previous logic. Note that with typical settings for shallowing,
// we will just fetch a specific slice of the history.
refspecs.push(format!("+{0}:refs/remotes/origin/HEAD", rev));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! We still have room to optimize with fallbacks and retries. I am not too concerned with that at this moment, but perhaps we mark it as unresolved in the tracking issue.

Copy link
Member

@weihanglo weihanglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some highlights

  • Two unstable options for -Zgitoxide
    • -Zgitoxide=fetch,shallow_deps for fetching git dependencies
    • -Zgitoxide=fetch,shallow_index for fetching registry index
  • Shallow-cloned and shallow-checked-out git repositories reside at their own -shallow suffixed directories, i.e,
    • ~/.cargo/registry/index/*-shallow
    • ~/.cargo/git/db/*-shallow
    • ~/.cargo/git/checkouts/*-shallow
  • When the unstable feature is on, fetching/cloning a git repository is always a shallow fetch. This roughly equals to git fetch --depth 1 everywhere.
  • Even with the presence of Cargo.lock or specifying a commit { rev = "…" }, gitoxide is still smart enough to shallow fetch without unshallowing the existing repository.

I am going to merge this sooner or later.

Complete,
}

#[cargo_test]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really your fault. I found git-related tests are all a bit hard to follow, because you need to understand git2 and each helper functions beforehand. I wonder if there is a way to make it simple to learn. Maybe add some equivalent git command in comments to explain what it tries to do?

Not a blocker. I don't expect we to do more sizable updates in this PR 😆.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker. I don't expect we to do more sizable updates in this PR 😆

Maybe no change is done in this PR, but I'd love to do better. My only means of describing what's going on right now is a wordy title for the #[test] function, and messages as last argument in assertions. Maybe a leading paragraph could describe the journey from start to end?

@Byron
Copy link
Member Author

Byron commented May 2, 2023

Some highlights

  • Two unstable options for -Zgitoxide

    • -Zgitoxide=fetch,shallow_deps for fetching git dependencies
    • -Zgitoxide=fetch,shallow_index for fetching registry index
  • Shallow-cloned and shallow-checked-out git repositories reside at their own -shallow suffixed directories, i.e,

    • ~/.cargo/registry/index/*-shallow
    • ~/.cargo/git/db/*-shallow
    • ~/.cargo/git/checkouts/*-shallow
  • When the unstable feature is on, fetching/cloning a git repository is always a shallow fetch. This roughly equals to git fetch --depth 1 everywhere.

  • Even with the presence of Cargo.lock or specifying a commit { rev = "…" }, gitoxide is still smart enough to shallow fetch without unshallowing the existing repository.

I am going to merge this sooner or later.

I wish this summary would be retained in a more prominent spot as it is so very useful :). The tracking issue might be a spot, but maybe there are other options as well.

In any case, I am excited to see this one merged :).

@arlosi
Copy link
Contributor

arlosi commented May 2, 2023

I wish this summary would be retained in a more prominent spot as it is so very useful

Could we add the summary to the existing notes in unstable.md? At, minimum "(planned)" should be removed from shallow-index and shallow-deps as it's being implemented here :)

Co-authored-by: Weihang Lo <me@weihanglo.tw>
@rustbot rustbot added the A-documenting-cargo-itself Area: Cargo's documentation label May 3, 2023
@Byron
Copy link
Member Author

Byron commented May 3, 2023

@arlosi I think that's just the right place :) - please take a look at the updated unstable.md, which now also contains a modified version of @weihanglo 's original comment.

@weihanglo
Copy link
Member

Seems great enough for an unstable feature. We can always tweak it if needed afterward. Thank you!

@bors r+

@bors
Copy link
Contributor

bors commented May 3, 2023

📌 Commit d2734d3 has been approved by weihanglo

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 3, 2023
@bors
Copy link
Contributor

bors commented May 3, 2023

⌛ Testing commit d2734d3 with merge 2f06c80...

@bors
Copy link
Contributor

bors commented May 3, 2023

☀️ Test successful - checks-actions
Approved by: weihanglo
Pushing 2f06c80 to master...

@bors bors merged commit 2f06c80 into rust-lang:master May 3, 2023
@Byron Byron deleted the shallow-support branch May 3, 2023 08:15
bors added a commit that referenced this pull request May 4, 2023
Update lock to normalize `home` dep

I'm not sure what happened, but in #11840 `Cargo.lock` was updated in such a way that the `home` dependency had a qualified source, even though it was equivalent to crates.io. This should only happen if there is some ambiguity (like if something had a path source to the in-tree home).

This causes the `Cargo.lock` file to be modified whenever running cargo commands locally.  This doesn't fail `--locked` because the resolve is equivalent, pointing to the the same source.

Closes #12082
bors added a commit to rust-lang-ci/rust that referenced this pull request May 5, 2023
Update cargo

10 commits in ac84010322a31f4a581dafe26258aa4ac8dea9cd..569b648b5831ae8a515e90c80843a5287c3304ef
2023-05-02 13:41:16 +0000 to 2023-05-05 15:49:44 +0000
- xtask-unpublished: output a markdown table (rust-lang/cargo#12085)
- fix: hack around `libsysroot` instead of `libtest` (rust-lang/cargo#12088)
- Optimize usage under rustup. (rust-lang/cargo#11917)
- Update lock to normalize `home` dep (rust-lang/cargo#12084)
- fix:  doc-test failures (rust-lang/cargo#12055)
- feat(cargo-metadata): add `workspace_default_members` (rust-lang/cargo#11978)
- doc: clarify implications of `cargo-yank` (rust-lang/cargo#11862)
- chore: Use `[workspace.dependencies]` (rust-lang/cargo#12057)
- support for shallow clones and fetches with `gitoxide` (rust-lang/cargo#11840)
- Build by PackageIdSpec, not name, to avoid ambiguity (rust-lang/cargo#12015)

r? `@ghost`
@ehuss ehuss added this to the 1.71.0 milestone May 5, 2023
@rfcbot rfcbot added finished-final-comment-period FCP complete to-announce and removed final-comment-period FCP — a period for last comments before action is taken labels May 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-documenting-cargo-itself Area: Cargo's documentation A-git Area: anything dealing with git A-registries Area: registries disposition-merge FCP with intent to merge finished-final-comment-period FCP complete S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-cargo Team: Cargo
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants