Remote execution via DOCKER_HOST incurs 50x ping latency. #440
Hi @lynaghk, thanks for filing this issue and for all the detailed information! I would guess the latency is coming from a combination of your first two guesses. From your data, it looks like most of the Docker commands are experiencing about ~200 ms of extra overhead in the Mac <> EC2 case. (The words "remote" and "local" seem to have swapped meanings in the issue description compared to the table, so I will try to avoid using those terms to prevent confusion.)

That amounts to about ~8 round trips given your ping estimates. TCP itself takes a few round trips to set up a connection, send a packet of data, and terminate the connection, and some of the messages are probably broken into multiple packets, which might add a bit more latency (especially due to TCP slow start). All told, my guess is that the root cause here is likely just the fact that Docker is setting up a fresh TCP connection for every request, resulting in extra round trips for the handshake and suboptimal behavior due to resetting the congestion control parameters. I am not a networking expert, however, so I could be mistaken.

Regarding the protocol itself, I think Docker uses gRPC (and protobufs). My understanding is that gRPC normally tries to reuse TCP connections, but in this case the Docker CLI is being spawned as a separate process by Toast for every interaction, since that's how the CLI works. So it's a new connection every time. I doubt DNS resolution is the culprit, since I would expect it to be cached after the first query.

I think the only possible solution to this would be to switch Toast to using the Docker API directly rather than the Docker CLI, e.g., by using bollard. However, a while back there was an effort to switch a related project, Docuum, to bollard, and it had to be reverted due to some difficult-to-debug issues we found with both bollard and Docker (1, 2). So I'm hesitant to attempt that for Toast.
I'll leave this issue open for a little while in case there is anything to add to the discussion, then I'll close it if it seems like a lost cause.
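The round-trip arithmetic above is easy to check. A minimal sketch, assuming a 25 ms midpoint of the reported 20-30 ms ping (the exact RTT value is an assumption, not from the measurements):

```rust
fn main() {
    // Assumed midpoint of the reported 20-30 ms Mac <-> EC2 ping.
    let ping_ms: f64 = 25.0;
    // Approximate per-command overhead observed in the measurements.
    let overhead_ms: f64 = 200.0;
    // Overhead expressed as network round trips.
    let round_trips = overhead_ms / ping_ms;
    println!("{round_trips}"); // prints 8
}
```

With a three-way handshake, request, response, and teardown each costing round trips, ~8 per command is consistent with a fresh TCP connection per request.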
Thanks for the quick response @stepchowfun! You're correct, the spreadsheet labels are swapped; that's what I get for debugging instead of going to sleep =D

The downside is that, yeah, it'd be a pretty substantial change to the Toast internals to reduce a few hundred milliseconds of latency that most users may not even notice. How would you feel about supporting a Docker API implementation behind a flag? Whether the effort is worth it probably depends on a few things. E.g., I'd love it if toast "felt" as fast as ripgrep, but I have no idea if that's feasible.
I'd love for Toast to be snappier; however, I'm concerned about the maintenance burden of having to support both implementations (e.g., testing with new Docker releases, etc.). Even if it's behind a flag, I wouldn't want to neglect the code and let it bitrot, so I'd prefer that we have only one way to do things to ease the maintenance costs. Also, I don't have super high confidence in Bollard based on my past experience with it. I'm not strongly opposed, however. Hypothetically, if someone created a pull request for this, it made Toast feel a lot faster, and it passed all the integration tests, I might accept it (but then I'd prefer that it not be behind a flag and that we just switch to Bollard entirely). But I don't want to promise that and end up wasting anyone's time. I should also mention that the Bollard implementation would be incompatible with #362, but that's probably not a big deal.
Makes complete sense. I also just discovered #441, so that'd require additional work to support the remote engine use case. FWIW, I can't look into Bollard near-term myself, so feel free to close this issue. If I do take a crack at it, I'll definitely mention it here, though.
Sounds good. Thanks for the thorough bug report and discussion. I wish there were an easy fix to make remote execution faster. It's possible you're the only person who's tried doing that with Toast so far, so this is unexplored territory.
I had a crack at converting a few docker CLI calls to use bollard instead.

However, using bollard has some downsides. Alternatively, I considered using Docker's JSON API directly rather than going through bollard. I'm new to the async corners of Rust, so my proof-of-concept code isn't particularly efficient.

I basically changed a few of the CLI invocations to something like this:

```rust
tokio::runtime::Builder::new_multi_thread()
    .enable_all()
    .build()
    .unwrap()
    .block_on(async {
        let docker = bollard::Docker::connect_with_socket_defaults().unwrap();
        let options = Some(bollard::container::UploadToContainerOptions {
            path: "/",
            ..Default::default()
        });
        // todo kevin: actually stream rather than reading into memory. rust types are hard.
        let mut data = Vec::new();
        tar.read_to_end(&mut data).unwrap();
        docker
            .upload_to_container(container, options, data.into())
            .await
            .map(|_res| ())
            .map_err(|_err| Failure::System("kevin todo".to_string(), None))
    })
```

Anyway, just an FYI progress report. Let me know if you'd prefer me to move this to a separate issue or just work on my own fork.
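On the streaming `todo` above: a std-only sketch of the chunked-read shape, independent of bollard (wiring it into the actual upload would additionally need an async stream adapter, e.g. tokio-util's `ReaderStream`, which isn't shown here; the function name is made up for illustration):

```rust
use std::io::Read;

// Read `reader` in fixed-size chunks instead of buffering the whole
// archive with read_to_end; returns the total number of bytes seen.
fn stream_chunks(mut reader: impl Read) -> std::io::Result<usize> {
    let mut buf = [0u8; 8192];
    let mut total = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break;
        }
        // In a real upload, &buf[..n] would be handed to the HTTP body here
        // instead of merely being counted.
        total += n;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let tar_bytes = vec![0u8; 20_000]; // stand-in for the tar archive
    println!("{}", stream_chunks(&tar_bytes[..])?);
    Ok(())
}
```

This keeps memory bounded at the chunk size regardless of archive size, which is the point of the TODO.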
That speedup is awesome. Regarding binary size, I wouldn't worry about it at all. 5.7 MB is nothing.
First, let me say that Toast is really well done! I haven't done anything with Docker/containers before, but the Toast README is succinct and clear, and I'm impressed with how you've kept the scope focused. Nice job, and thanks for sharing it with us!
I'm investigating using Toast from a Mac to run tasks on a remote backend via the `DOCKER_HOST` env var. However, doing so introduces substantial latency beyond what I'd expect from network overhead. It could be that this is entirely in the Docker engine and out of scope for Toast, but I figured it'd be worth raising here in case anyone has a similar use case and possible suggestions for how to improve the situation.

I'm testing by running a minimal `toast.yml`, and all of the numbers below refer to Docker version 20.10.13, build a224086 on an EC2 ARM machine with Toast 0.45.5. The only difference is whether I invoke toast from a terminal on the EC2 machine or from my Mac. Ping to the EC2 machine is 20--30 ms.
SSH has a 2s login overhead, so it's out of the question; I'm using TCP instead, invoking toast from my Mac. (The `ts` in my command pipeline adds a microsecond timestamp to each printed log line.)
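As a rough sketch of this setup (the hostname and port below are placeholders, not the actual values; `ts` comes from moreutils), pointing the Docker CLI at a remote engine over TCP looks something like:

```shell
# Placeholder host/port; 2376 is the conventional Docker TLS port.
export DOCKER_HOST=tcp://ec2-host.example.com:2376
# With this set, every `docker` invocation that Toast spawns dials the
# remote engine instead of the local Unix socket. Timestamped logs can
# be produced by piping, e.g.: toast 2>&1 | ts "%H:%M:%.S"
echo "$DOCKER_HOST"
```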
Here's a comparison between execution from the Mac and from a terminal on the server. Note that the remote execution via `DOCKER_HOST` takes about 1 s longer than a local execution on the EC2 server. Again, I have no experience with docker, but I can field a few guesses (among them: per-command TCP connection setup, the wire protocol, and DNS resolution).
Do you have any suggestions for things to investigate to improve the latency here? Thanks!