
Download crates in parallel with HTTP/2 #6005

Merged
merged 15 commits into master from download-parallel on Sep 19, 2018

Conversation

alexcrichton
Member

@alexcrichton alexcrichton commented Sep 11, 2018

This PR revives some of the work of #5161 by refactoring Cargo to make it much easier to add parallel downloads, and then it does so using the curl crate's new http2 feature, which compiles nghttp2 as a backend.

The primary refactoring done here is to remove the concept of "download this one package" deep within a Source. A Source still has a download method, but it is now considered largely non-blocking: if a crate needs to be downloaded, it immediately returns information saying so. The PackageSet abstraction is now the central location for all parallel downloads, and all users of it have been refactored to be amenable to parallel downloads once they are added.
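
Roughly, the new shape of the download path looks something like this (a simplified sketch with placeholder types and approximate names, not Cargo's exact signatures):

struct PackageId;   // placeholder for Cargo's real PackageId type
struct Package;     // placeholder for Cargo's real Package type
type CargoResult<T> = Result<T, Box<dyn std::error::Error>>;

// What a non-blocking `Source::download` can hand back.
enum MaybePackage {
    // Already cached locally; nothing to download.
    Ready(Package),
    // The caller (the downloader living in `PackageSet`) should fetch this URL
    // and then feed the bytes back to the source.
    Download { url: String, descriptor: String },
}

trait Source {
    // Largely non-blocking: either returns the cached package or says what to fetch.
    fn download(&mut self, pkg: PackageId) -> CargoResult<MaybePackage>;
    // Called once the downloader has the tarball bytes in hand.
    fn finish_download(&mut self, pkg: PackageId, body: Vec<u8>) -> CargoResult<Package>;
}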

Many more details are in the commits...

@rust-highfive

r? @matklad

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton
Member Author

r? @ehuss

cc @ishitatsuyuki

@rust-highfive rust-highfive assigned ehuss and unassigned matklad Sep 11, 2018
@Mark-Simulacrum
Member

Hm, so I've wondered about this in the past -- I sort of intuitively always thought that parallel downloads from the "same" server wouldn't really help much because in most cases the bandwidth limitation is somewhere in the middle (likely closer to the client than the server, in fact, I'd imagine).

Not that I don't think this is a good idea to refactor in general if only to have downloads and builds potentially running in parallel or parallel downloads from different registries. Mostly just interested if there's perhaps some resource anyone knows about that would help clarify this for me :)

@alexcrichton
Member Author

The claims in #5161 (comment) and #467 provide what I think are some pretty strong data for pursuing something like this. I think it's fine to hold off on landing this until there's at least a proof-of-concept to test out as well.

@ishitatsuyuki
Contributor

ishitatsuyuki commented Sep 11, 2018

Great work. This is what I was stuck on (or maybe just frustrated by), and it makes implementing the actual downloading part much easier. Thanks!

@Mark-Simulacrum, the whole point of this is to parallelize the initial round trips. The crates.io server in the US takes non-trivial time to reach from Japan, so most of the time is wasted waiting for a trivial redirect. If we someday absorb that redirect at the CDN layer, that would be an improvement as well.

@alexcrichton
Member Author

So to add to this dataset, I've got a proof-of-concept working that uses http/2 with libcurl to do downloads in Cargo itself. On my machine in the Mozilla office (connected to a presumably very fast network) I removed my ~/.cargo/registry/{src,cache} folders and then executed cargo fetch in Cargo itself. On nightly this takes about 18 seconds. With this PR it takes about 3. That's... wow!
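
The core of the proof-of-concept is just a curl Multi session with HTTP/2 multiplexing turned on (built with the curl crate's http2 feature). A heavily simplified sketch, with illustrative URLs, discarded bodies, and no error recovery; not the code in this PR:

use std::time::Duration;
use curl::easy::{Easy, HttpVersion};
use curl::multi::Multi;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut multi = Multi::new();
    // Multiplex transfers over a shared HTTP/2 connection instead of opening
    // one TCP connection per crate.
    multi.pipelining(false, true)?;

    // Example crate URLs; in Cargo these come from the registry source.
    let urls = [
        "https://static.crates.io/crates/libc/libc-0.2.43.crate",
        "https://static.crates.io/crates/serde/serde-1.0.80.crate",
    ];

    let mut handles = Vec::new();
    for url in &urls {
        let mut easy = Easy::new();
        easy.url(url)?;
        easy.follow_location(true)?;
        easy.http_version(HttpVersion::V2TLS)?; // prefer HTTP/2 when available
        easy.write_function(|data| Ok(data.len()))?; // discard bodies in this sketch
        handles.push(multi.add(easy)?);
    }

    // Drive all transfers concurrently until every download has completed.
    while multi.perform()? > 0 {
        multi.wait(&mut [], Duration::from_secs(1))?;
    }
    for handle in handles {
        multi.remove(handle)?;
    }
    Ok(())
}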

Contributor

@ishitatsuyuki ishitatsuyuki left a comment


Some nitpicks.

Overall, this PR introduces a rather complicated abstraction, which is probably a consequence of the way we currently resolve dependencies.

The approach is more like uncontrolled concurrency (like goroutines, or whatever you call them), which IMO should be replaced by explicit logic boundaries. Do you know what information is missing from the index and therefore requires fetching the crate instead? I would like to be aware of it if possible. If you don't know, no worries; this approach will work for the near future, and we can tackle it when we change the way the index works.

/// Completes at least one downloading, maybe waiting for more to complete.
///
/// This function will block the current thread waiting for at least one
/// create to finish downloading. The function may continue to download more
@ishitatsuyuki
Contributor

Typo?

/// crates if it looks like there's a long enough queue of crates to keep
/// downloading. When only a handful of packages remain this function
/// returns, and it's hoped that by returning we'll be able to push more
/// packages to download into the queu.
@ishitatsuyuki
Contributor

Typo?

@alexcrichton
Member Author

@ishitatsuyuki yeah it's not a great abstraction because it's not as simple as "download everything in this list". I initially didn't want to disturb the existing code too much because unit_dependencies has historically been very complicated and attempting to emulate it anywhere else inevitably led to bugs (so I wanted to use the existing code as-is).

While the unit dependencies may be complicated, though, package dependencies may be much simpler, so if desired we can try that instead and always use an interface that looks like "download this list of packages".

I'm not sure what you mean about missing index information though? The index is all Cargo looks at before it starts downloading crates

@ehuss
Contributor

ehuss commented Sep 11, 2018

I had to take the new code for a test drive. On my (slow) network, downloading all the crates for a fresh cargo build goes from 48s to 22s!!

@ehuss
Contributor

ehuss commented Sep 11, 2018

If you want some early feedback: The progress indicator only works at the resolution of entire packages. If you have only 1 package to download, and it takes a long time to download, cargo prints nothing until after the download is complete. It would be nice if it could display some progress based on the package size.

@alexcrichton
Member Author

@ehuss good to hear! That's not quite as large an improvement as I would have hoped for, but it's still nice :)

For progress indicators I'm not really sure what to do myself. Previously we'd download one crate at a time, which made it relatively obvious how to indicate progress, but now there isn't a great way to indicate progress because it's a whole soup of crates and we don't know how big they all are. I initially tried a progress bar of "total bytes downloaded" vs "total bytes needed to download", but it was very jumpy and non-continuous, so I wasn't really sure what was going on there.
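
For the curious, the "total bytes" attempt amounts to wiring libcurl's per-transfer progress callback into a pair of shared counters; a hypothetical sketch (names and structure are illustrative, not what's in this PR):

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use curl::easy::Easy;

// Each transfer reports (dltotal, dlnow); fold the deltas into shared counters
// that the UI can render. Note that dltotal stays 0 until the server reports a
// Content-Length, which is part of why the aggregate bar jumps around.
fn track(easy: &mut Easy, done: Arc<AtomicU64>, total: Arc<AtomicU64>) -> Result<(), curl::Error> {
    easy.progress(true)?;
    let (mut last_now, mut last_total) = (0u64, 0u64);
    easy.progress_function(move |dltotal, dlnow, _ultotal, _ulnow| {
        let (now, tot) = (dlnow as u64, dltotal as u64);
        done.fetch_add(now.saturating_sub(last_now), Ordering::SeqCst);
        total.fetch_add(tot.saturating_sub(last_total), Ordering::SeqCst);
        last_now = now;
        last_total = tot;
        true // returning false would abort the transfer
    })
}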

I don't think we'll be able to land this with only package-level progress resolution, though; as you've seen, with larger crates it's just not enough information compared to what we have today.

@ishitatsuyuki
Contributor

I'm not sure what you mean about missing index information though? The index is all Cargo looks at before it starts downloading crates

It looks like we currently have to download some tarballs in order to determine unit dependencies? (Is this correct?) If so, what particular information does it need that is not in the index?

It seems that having the list of unit dependencies ahead of time would also be good for displaying better progress percentages. We don't know the file sizes anyway, but we can address that some day by changing the index format.

@alexcrichton
Member Author

Oh, so the index gives us the "maximal" crate graph, more or less independent of activated features and the target platform. The unit_dependencies section of code, however, is specifically constructing a set of units that uses the minimal crate graph, which doesn't include unnecessary platform-specific dependencies or unactivated feature dependencies.

In that sense what we're doing, creating a minimal-resolution crate graph, can require more information than what's in the index. The index doesn't contain information about things like the crate's library targets, for example. I don't think we need to change the index for this, though; we can probably just get smarter in Cargo about conservatively constructing the minimal crate graph.
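
As a purely hypothetical example of the difference (not taken from this PR): both dependencies below show up in a crate's index entry, i.e. the maximal graph, but neither belongs in the minimal unit graph for a default Linux build.

# Hypothetical Cargo.toml fragment
[target.'cfg(windows)'.dependencies]
winapi = "0.3"                                # in the index, pruned on non-Windows targets

[dependencies]
serde = { version = "1.0", optional = true }  # in the index, only built if a feature activates it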

@alexcrichton
Member Author

Looking into things a bit, Heroku doesn't support HTTP/2 (ref), which means we won't get multiplexing when talking to crates.io. Our CDN for static.crates.io, CloudFront, does support HTTP/2 though.

@ishitatsuyuki
Contributor

@alexcrichton In my former PR I used pipelining, which reliably works with Heroku's reverse proxy. It achieves similar efficiency improvements without creating too many TCP connections.
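
For reference, with the curl crate both strategies hang off the same multi-handle knob; roughly (a sketch, not this PR's code):

use curl::multi::Multi;

// Pick between the two connection-reuse strategies discussed here.
fn configure(multi: &mut Multi, use_http2: bool) -> Result<(), Box<dyn std::error::Error>> {
    if use_http2 {
        // Multiplex many transfers over one HTTP/2 connection.
        multi.pipelining(false, true)?;
    } else {
        // HTTP/1.1 pipelining on a small number of connections, as in #5161.
        multi.pipelining(true, false)?;
    }
    Ok(())
}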

@alexcrichton alexcrichton force-pushed the download-parallel branch 3 times, most recently from c7cb406 to 23b7a29 on September 13, 2018 23:33
@alexcrichton alexcrichton changed the title Lay some groundwork for downloading crates in parallel Download crates in parallel with HTTP/2 Sep 13, 2018
@alexcrichton alexcrichton force-pushed the download-parallel branch 2 times, most recently from 26374a9 to 955ae8d on September 13, 2018 23:52
@alexcrichton
Member Author

Ok! I've done various bits of tinkering locally and I think this is in a good spot. It now works on all platforms (I've tested on each), and I'm relatively happy with the new progress output. The output looks like this:

Fast CPU with fast internet

asciicast

Not as fast computer with throttled LTE-like internet

asciicast

Some notable points of interest

  • The "jerkiest" part of the UI is the fact that we extract tarballs synchronously, and we don't have much opportunity to update the UI during this time. This is a bit of a bummer, but something we can fix later! (and would make this even faster from the looks of it).
  • When we're doing a cargo build we don't actually know how many crates we're downloading. This can cause crate numbers to jump up and down.

Definitely some room for improvement, but I don't think we're hit by any showstoppers!

With the latest commit this should also be good to go in terms of landing; I don't know of any further blockers.

@ehuss
Contributor

ehuss commented Sep 14, 2018

Does this need testing when behind a proxy?

The status is still a little jumpy, but better than before. And I definitely don't think it should hold things up, but maybe we can tweak it in the future. I like the new final summary.

@alexcrichton
Member Author

Ah right yeah, I should test behind some non-http-2 proxies! I'll do that today.

I also plan on filing a few follow-up issues after this lands to cover the remaining work items of inflating in parallel as well as computing the set of crates to download up-front.

Contributor

@ehuss ehuss left a comment


I think the typos pointed out by ishitatsuyuki still need to be fixed as well.

// We can continue to iterate on this, but this is hopefully good enough
// for now.
if !self.progress.borrow().as_ref().unwrap().is_enabled() {
self.set.config.shell().status("Downloading", &dl.descriptor)?;
@ehuss
Contributor

What do you think about saying "Downloaded" instead?

Also, just as a general issue, in this mode nothing is displayed until the first crate is finished, which might be a long time. I'm wondering how common this scenario would be. At least for me, my editor integration does not use ttys, so I only get this mode. Would it be possible to print some generic "Downloading crates..." message at the start when progress is disabled?

packages.get(pkgid)
})
.collect::<CargoResult<Vec<_>>>()?;
let mut pkgs = Vec::new();
@ehuss
Contributor

Can these 10 lines be shared with the duplicate in compile_ws?

It seems to fix some issues perhaps!
@alexcrichton
Member Author

Also, at the recommendation of the curl folks I've disabled HTTP/1 pipelining, which may also fix the bug you were seeing, @ehuss?

@ehuss
Contributor

ehuss commented Sep 18, 2018

I can confirm that the pipelining change fixes the problems I was seeing.

With the pipelining on, I tried a newer version of squid but got the same problems. It was version 4.2, and a very basic conf file:

http_port 8081
http_access allow localhost manager
http_access deny manager
http_access allow localhost

@alexcrichton
Member Author

Ok! In that case I think we'll leave pipelining turned off but keep HTTP/2 enabled even when proxies are in use.

Mind taking a final look at the code for an r+?

@ehuss
Contributor

ehuss commented Sep 18, 2018

@bors r+

What do you think about asking the broader community to help test this after it has been on nightly for a little bit? I'm a little concerned about people with weird networks/proxies possibly having issues.

@bors
Collaborator

bors commented Sep 18, 2018

📌 Commit 9da2e70 has been approved by ehuss

@alexcrichton
Member Author

Certainly!

Do you think that we should perhaps move the multiplexing (which I think is what enables HTTP/2?) behind a config option? That way it would be off by default, and the call for testing could include instructions on how to enable it.

Hm... actually I'm gonna do that.
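
Concretely I'm imagining an opt-in in .cargo/config along these lines (the key name here is illustrative; see the follow-up commit for the actual name):

[http]
multiplexing = true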

@bors: r-

Let's hopefully weed out any configuration issues before turning it on by default!
@alexcrichton
Member Author

@bors: r=ehuss

@bors
Collaborator

bors commented Sep 18, 2018

📌 Commit f4dcb5c has been approved by ehuss

@bors
Collaborator

bors commented Sep 18, 2018

⌛ Testing commit f4dcb5c with merge 57ac392...

bors added a commit that referenced this pull request Sep 18, 2018
Download crates in parallel with HTTP/2

@bors
Collaborator

bors commented Sep 19, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: ehuss
Pushing 57ac392 to master...

@bors bors merged commit f4dcb5c into rust-lang:master Sep 19, 2018
@alexcrichton alexcrichton deleted the download-parallel branch September 23, 2018 19:41
alexcrichton added a commit to alexcrichton/cargo that referenced this pull request Sep 24, 2018
This commit brings back the old one-line-per-crate output that Cargo's
had since the beginning of time but was removed in rust-lang#6005. This was
requested on our [users forum][1]

[1]: https://internals.rust-lang.org/t/testing-cargos-parallel-downloads/8466/2
alexcrichton added a commit to alexcrichton/cargo that referenced this pull request Sep 24, 2018
Regressed in rust-lang#6005: it looks like the build plan requires all packages to
be downloaded rather than just those coming out of `unit_dependencies`,
so let's make sure to download everything!

Closes rust-lang#6082
bors added a commit that referenced this pull request Sep 24, 2018
Print a line per downloaded crate

This commit brings back the old one-line-per-crate output that Cargo's
had since the beginning of time but was removed in #6005. This was
requested on our [users forum][1]

[1]: https://internals.rust-lang.org/t/testing-cargos-parallel-downloads/8466/2
bors added a commit that referenced this pull request Sep 24, 2018
Fix `--build-plan` with dev-dependencies

Regressed in #6005: it looks like the build plan requires all packages to
be downloaded rather than just those coming out of `unit_dependencies`,
so let's make sure to download everything!

Closes #6082
@ehuss ehuss added this to the 1.31.0 milestone Feb 6, 2022