-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTTP registry implementation #10470
HTTP registry implementation #10470
Conversation
r? @ehuss (rust-highfive has picked a reviewer for you, use r? to override) |
I would strongly recommend not benchmarking against |
Oh, and just for reference, I dumped the benchmarking suite I used for the prototype implementation here: https://github.com/jonhoo/cargo-http-bench. You'll probably have to adjust paths/URLs to match your setup, but at least most of the test cases are there. |
I think the old endpoint is still around (but obviously with stale data), though I forget the exact name -- maybe @jonhoo remembers :) In any case, my recollection is that all I had to do was just copy crates.io-index to a place that's internet-accessible (much like the githubusercontent.com link); it may be that e.g. copying crates.io-index to a GitHub Pages site is good enough. I can also refresh and/or setup a new copy of the index on Rust's infrastructure (i.e., Cloudfront), but that might take at least a few days; it would be a one-off copy without significant investment work. I expect that's probably good for benchmarking in any case, though. |
In that case I think the URL is already somewhere in the repo I linked above, and the benchmark might just work as is! |
Looks like I also had a local nginx proxy for doing benchmarks with various throughout throttles. I'll post that config tomorrow! |
Added the nginx config. |
I just read all the comments on #8890. This recommendation was made hear #8890 (comment). It definitely sounds like a good follow up PR, especially on windows where the filesystem tends to be a bottle neck, but not a blocker to landing this PR.
This one can also be a follow up PR, if it is at all tricky. |
I repeated @jonhoo's test to compare git vs the original (greedy) fetch implementation and the new resolver based one using the local nginx server limited to 20MBps and 150 rps. The resolver-based implementation is significantly faster, primarily because it's fetching way fewer crates. The greedy fetch is fetching about 1000 crates for each test, while the resolver fetch is usually 100-200. Both are faster than git. If we increase the rps limit, or reduce the bandwidth then both http implementations do even better compared to git. Note that the git-timings aren't run for most of the test cases, because they'll always be the same. |
That looks really good! Thank you for the analysis! |
The PR description used to have text about how to test against For posterity here it is: To test with crates.io, the following
|
☔ The latest upstream changes (presumably #10482) made this pull request unmergeable. Please resolve the merge conflicts. |
Looking forward to either the code duplication being dealt with, or documentation about why it's not worth the effort. Then I will do a full review. |
I'm working on an update to remove the duplication around caching and tests. There are some differences in test output between git and http backends - specifically around errors such as network unavailability. I'll try to get them as close as possible, but there may be some tests that need to be split. |
simple(cargo_stable); | ||
} | ||
|
||
fn simple(cargo: fn(&Project, &str) -> Execs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The repeated boilerplate in these tests is acceptable. And far better than it was. And still kind of unfortunate. I don't know enough about proc macros, I wonder if we could have #[cargo_test(HTTP)] fn simple
generate all three functions?
Definitely not something we have to fix now, but something I want to remember to discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent a significant amount of time reading this over carefully today and I think it's ready to be merged. Especially given that it is adding to an unstable feature, we can always change or fix things.
But because of the importance of this change, I want to talk over at tomorrow's cargo meeting before we merge.
if ptr == 0 { | ||
f(None) | ||
} else { | ||
// Safety: * `ptr` is only set by `set` below which ensures the type is correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not actually the type I was worried about here. It's the lifetime. What prevents us using downloads after it's been freed? It took me trying to write a counter example to realize why this is safe. ptr
is only set during the call back in set
. The way we use it is a little confusing, because curl needs us to have "callbacks at a distance". But our callbacks are only called within the call back to set.
I don't know how to document this better, nor do I think it has to be done before merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this is less a review and me coming up to speed on cargo. None of this is critical, so if people can't get to these questions, thats ok.
For what I understand of the code, this change looks correct to me.
Ok(SourceId::new(SourceKind::Registry, url, None)? | ||
.with_precise(Some("locked".to_string()))) | ||
} | ||
"sparse" => { | ||
let url = string.into_url()?; | ||
Ok(SourceId::new(SourceKind::Registry, url, None)? | ||
.with_precise(Some("locked".to_string()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own understanding, how come sparse
passes in the full string
while registry passes in the trailing bit (url
)? Do these show up somewhere that we have to maintain compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http registries are still SourceKind::Registry
, so the sparse+
prefix is how it knows we're in http mode. Otherwise it couldn't tell them apart.
I may be able to make the "registry" one also pass the registry+
prefix for consistency -- as long as it doesn't show up somewhere external as you mentioned.
if let Some(index_version) = index_version { | ||
trace!( | ||
"local cache of {} is available at version `{}`", | ||
path.display(), | ||
index_version | ||
); | ||
if self.is_fresh(path) { | ||
return Poll::Ready(Ok(LoadResponse::CacheValid)); | ||
} | ||
} else if self.fresh.contains(path) { | ||
debug!( | ||
"cache did not contain previously downloaded file {}", | ||
path.display() | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there is some context I'm missing but the way this reads is a bit odd.
self.fresh.contains(path)
is a subset of self.is_fresh
. Why are we only doing a subset of the checks in the else
and why are we still processing things (and logging that we don't have it) when its in self.fresh
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.fresh
only keeps track of which files have been downloaded in the current session, not the contents of the file.
The "cache did not contain previously downloaded file {}" is a case when the cache isn't working, so we go ahead and get the file again. This could possibly be made into an assert, since it shouldn't happen...
if self.config.offline() { | ||
return Poll::Ready(Err(anyhow::anyhow!( | ||
"can't download index file from '{}': you are in offline mode (--offline)", | ||
self.url | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding, do we need to also check self.config.frozen()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line below ops::http_handle(self.config)?;
will do the appropriate check for frozen (and offline).
So for consistency, we should either remove this check, or add the frozen one.
} | ||
} | ||
|
||
mod tls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my information, I see that src/cargo/core/package.rs
also uses TLS for sharing state with downloads. How come cargo uses that instead of an Arc<Mutex<T>>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure. I think an Arc<Mutex<>>
would also work here. This change is based on what was done in package.rs
.
if let Some(index_version) = index_version { | ||
if let Some((key, value)) = index_version.split_once(':') { | ||
match key { | ||
ETAG => headers.append(&format!("If-None-Match: {}", value.trim()))?, | ||
LAST_MODIFIED => { | ||
headers.append(&format!("If-Modified-Since: {}", value.trim()))? | ||
} | ||
_ => debug!("unexpected index version: {}", index_version), | ||
} | ||
} | ||
} | ||
handle.http_headers(headers)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I'm understanding correctly, index_version
for a registry is registry defined?
This was something that wasn't obvious to me and took some digging. I had assumed this was a schema version number rather than a data revision number.
Digging in some more, it seems like this is pre-established within cargo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index_version
is used for cache validation.
For git-based registries index_version
is the commit id of the HEAD of the repo when the cache file was created. This allows cargo to skip talking to libgit2 when the git repo hasn't changed.
For http-based registries index_version
is a string ETag: <etag value>
or Last-Modified: <last-modified value>
or Unknown
(in that preference order) based on what the server sent when fetching the index. This allows cargo to send If-None-Match
or If-Modified-Since
headers and receive an HTTP 304 (not modified) when fetching metadata.
The cache in http mode also allows --offline
mode to work (where the cache is used regardless of the index_version
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming may no longer match how we use it. If you can think of a better name now would be a great time to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming may no longer match how we use it
The schema versioning added in https://rust-lang.github.io/rfcs/3143-cargo-weak-namespaced-features.html#index-changes is what led to my confusion. it also doesn't help that we are storing structured data in a string.
As for names, its hard to say.
Unfortunately, I can't come up with a clear winner. The best I can come up with is index_digest
as that doesn't convey an Ord
version scheme but a Eq
identifier.
If you can think of a better name now would be a great time to change it.
For me, I disagree as making unrelated improvements to a PR can derail it and make it harder to review. Soon after this PR, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do people like any of these names?
cache_key
cache_identifier
cache_identity
cache_version
cache_tag
index_cache_<one of the above options>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it also doesn't help that we are storing structured data in a string
That's true. I can probably make an improvement here.
📌 Commit 412b633 has been approved by |
Thanks for the review! I've been unavailable this week with limited internet access, so I'll get a follow up PR created early next week to address any open issues from this one. |
☀️ Test successful - checks-actions |
Update cargo 13 commits in 109bfbd055325ef87a6e7f63d67da7e838f8300b..1ef1e0a12723ce9548d7da2b63119de9002bead8 2022-03-17 21:43:09 +0000 to 2022-03-31 00:17:18 +0000 - Support `-Zmultitarget` in cargo config (rust-lang/cargo#10473) - doc: Fix document url for libcurl format (rust-lang/cargo#10515) - Fix wrong info in "Environment variables" docs (rust-lang/cargo#10513) - Use the correct flag in --locked --offline error message (rust-lang/cargo#10512) - Don't treat host/target duplicates as duplicates (rust-lang/cargo#10466) - Unstable --keep-going flag (rust-lang/cargo#10383) - Part 1 of RFC2906 - Packages can inherit fields from their root workspace (rust-lang/cargo#10497) - Remove unused profile support for -Zpanic-abort-tests (rust-lang/cargo#10495) - HTTP registry implementation (rust-lang/cargo#10470) - Add a notice about review capacity. (rust-lang/cargo#10501) - Add tests for ignoring symlinks (rust-lang/cargo#10047) - Update doc string for deps_of/compute_deps. (rust-lang/cargo#10494) - Consistently use crate::display_error on errors during drain (rust-lang/cargo#10394)
Implement HTTP registry support described in RFC 2789.
Adds a new unstable flag
-Z http-registry
which allows cargo to interact with remote registries served over http rather than git. These registries can be identified by urls starting withsparse+http://
orsparse+https://
.When fetching index metadata over http, cargo only downloads the metadata for needed crates, which can save significant time and bandwidth over git.
The format of the http index is identical to a checkout of a git-based index.
This change is based on @jonhoo's PR #8890.
cc @Eh2406
Remaining items:
http_registry.rs
)