
Remote execution via DOCKER_HOST incurs 50x ping latency. #440

Closed
lynaghk opened this issue Jul 26, 2022 · 7 comments

Labels
bug Something isn't working

Comments

@lynaghk
lynaghk commented Jul 26, 2022

First, let me say that Toast is really well done! I haven't done anything with Docker/containers before, but the Toast README is succinct and clear, and I'm impressed with how you've kept the scope focused. Nice job, and thanks for sharing it with us!

I'm investigating using Toast from a Mac to run tasks on a remote backend via the DOCKER_HOST env var. However, doing so introduces substantial latency beyond what I'd expect from network overhead. It could be that this is entirely in the Docker engine and out of scope for Toast, but I figured it'd be worth raising here in case anyone has a similar use case and possible suggestion for how to improve the situation.

I'm testing by running this minimal toast.yml:

image: ubuntu:20.04@sha256:c01a10535d9ea3e4065dee4abdb288821b2918929f1e1584287a03e731b9176c
command_prefix: set -euo pipefail
tasks:
  hello:
    cache: false
    command: echo "hello world" > my_output
    output_paths:
      - my_output

All of the numbers below refer to Docker version 20.10.13, build a224086, on an EC2 ARM machine with Toast 0.45.5.
The only difference between the two cases is whether I invoke toast from a terminal on the EC2 machine or from my Mac. Ping to the EC2 machine is 20–30 ms.

SSH has a 2s login overhead, so it's out of the question. I'm using TCP instead; here's the command I'm running from my Mac:

LOG_LEVEL=debug DOCKER_HOST=tcp://containertron:2375 toast hello 2>&1 | ts "[%.S]"

(The ts command adds a microsecond timestamp to each printed log line.)

Here's a comparison between execution from the Mac and from a terminal on the server. Note that the remote execution via DOCKER_HOST takes about 1s longer than a local execution on the EC2 server:

[Screenshot (2022-07-25): timing comparison between running toast on the Mac via DOCKER_HOST and running it directly on the EC2 server]

Again, I have no experience with Docker, but I can hazard a few guesses:

  • TCP connection overhead
  • a bunch of 25 ms round-trips between the Docker CLI and the engine
  • protocol overhead (if Docker speaks a different protocol over TCP than over a UNIX socket?)
  • DNS resolution overhead across Toast's multiple invocations of docker

Do you have any suggestions for things to investigate to improve the latency here? Thanks!
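
In case it helps narrow things down, here's the kind of measurement I'm planning to try next: timing a bare docker CLI call against each endpoint, to separate per-invocation overhead (process spawn, connection setup) from anything Toast-specific. Just a rough sketch; the tcp://containertron host is from my setup above:

    use std::process::Command;
    use std::time::Instant;

    // Time one trivial Docker CLI call against the given DOCKER_HOST to isolate
    // per-invocation overhead (process spawn + connection setup + one API request).
    fn time_docker(host: &str) {
        let start = Instant::now();
        let status = Command::new("docker")
            .env("DOCKER_HOST", host)
            .args(["version", "--format", "{{.Server.Version}}"])
            .status()
            .expect("failed to spawn docker");
        println!("{host}: {:?} ({status})", start.elapsed());
    }

    fn main() {
        // e.g. DOCKER_HOST=tcp://containertron:2375 from the Mac, or
        // DOCKER_HOST=unix:///var/run/docker.sock on the EC2 box itself.
        let host = std::env::var("DOCKER_HOST")
            .unwrap_or_else(|_| "unix:///var/run/docker.sock".to_string());
        time_docker(&host);
    }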

lynaghk added the bug label on Jul 26, 2022
@stepchowfun
Owner

stepchowfun commented Jul 26, 2022

Hi @lynaghk, thanks for filing this issue and for all the detailed information!

I would guess the latency is coming from a combination of your first two guesses. From your data, it looks like most of the Docker commands are experiencing about 200 ms of extra overhead in the Mac <> EC2 case (the words "remote" and "local" seem to have swapped meaning in the issue description compared to the table, so I will try to avoid those terms to prevent confusion). That amounts to roughly 8 round trips given your ping estimates. TCP itself takes a few round trips to set up a connection, send a packet of data, and terminate the connection, and some of the messages are probably broken into multiple packets, which might add a bit more latency (especially due to TCP slow start).

All told, my guess is that the root cause here is likely just the fact that Docker is setting up a fresh TCP connection for every request, resulting in extra round trips for the handshake and suboptimal behavior due to resetting the congestion control parameters. I am not a networking expert, however, so I could be mistaken.

Regarding the protocol itself, my understanding is that the Docker CLI talks to the daemon over Docker's HTTP/JSON Engine API. An HTTP client would normally try to reuse TCP connections, but in this case the Docker CLI is being spawned as a separate process by Toast for every interaction, since that's how the CLI works. So it's a new connection every time.

I doubt DNS resolution is the culprit, since I would expect it to be cached after the first query.

I think the only possible solution to this would be to switch Toast to using the Docker API directly rather than the Docker CLI, e.g., by using bollard. However, a while back there was an effort to switch a related project, Docuum, to bollard, and it had to be reverted due to some difficult-to-debug issues we found with both bollard and Docker (1, 2). So I'm hesitant to attempt that for Toast.
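
For context, using the API directly would mean holding a single client connection for the whole run instead of spawning a docker process (and a fresh TCP connection) per command. A minimal sketch of what that would look like with bollard, assuming its connect_with_http_defaults constructor (which reads DOCKER_HOST):

    use bollard::Docker;

    #[tokio::main]
    async fn main() -> Result<(), bollard::errors::Error> {
        // connect_with_http_defaults() honors DOCKER_HOST (e.g. tcp://host:2375);
        // connect_with_local_defaults() would use the local socket instead.
        let docker = Docker::connect_with_http_defaults()?;

        // One reusable connection for the whole session, rather than a fresh
        // docker process and TCP handshake per command.
        let version = docker.version().await?;
        println!("{version:?}");
        Ok(())
    }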

I'll leave this issue open for a little while in case there is anything to add to the discussion, then I'll close it if it seems like a lost cause.

(Fixed a couple typos.)

@lynaghk
Author

lynaghk commented Jul 26, 2022

Thanks for the quick response @stepchowfun!

You're correct, the spreadsheet labels are swapped; that's what I get for debugging instead of going to sleep =D

Using bollard to connect directly is an interesting idea. The benefits are that it'd:

  • eliminate the latency due to TCP overhead (in the remote case)
  • eliminate the latency of spawning the docker binary in both the remote and local cases (even with a warm cache, time docker --version takes ~90 ms on my M1 MacBook Air)
  • eliminate the need to have a docker binary installed at all (in the remote case)

The downside is that, yeah, it'd be a pretty substantial change to the Toast internals to reduce a few hundred milliseconds of latency that most users may not even notice.

How would you feel about supporting the bollard interface hidden behind a flag? That would make it easy to compare performance and semantics between the docker command-line driver and bollard. Implementation-wise, my first instinct would be to refactor Toast's docker module functions into a trait, then implement it for cli and bollard structs.
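
Roughly this shape (just a sketch; the names and signatures here are placeholders, not Toast's actual internals):

    // Placeholder error type standing in for Toast's real Failure type.
    type EngineError = String;

    // Each operation Toast currently performs by shelling out to `docker` would
    // become a method on this trait.
    trait ContainerEngine {
        fn image_exists(&self, image: &str) -> Result<bool, EngineError>;
        fn create_container(&self, image: &str, command: &str) -> Result<String, EngineError>;
        fn copy_from_container(&self, container: &str, path: &str) -> Result<Vec<u8>, EngineError>;
        fn delete_container(&self, container: &str) -> Result<(), EngineError>;
    }

    // One implementation shells out to the docker binary (what Toast does today)...
    struct CliEngine;

    // ...and the other talks to the Docker Engine API directly via bollard.
    struct BollardEngine;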

Whether the effort is worth it probably depends on:

  • how far we can push Docker itself into a low-latency regime
  • if we can, whether low-latency execution (remote or local) unlocks compelling use cases

E.g., I'd love it if toast "felt" as fast as ripgrep, but I have no idea whether that's feasible.

@stepchowfun
Owner

I'd love for Toast to be snappier; however, I'm concerned about the maintenance burden of having to support both implementations (e.g., testing with new Docker releases). Even if it's behind a flag, I wouldn't want to neglect the code and let it bitrot, so I'd prefer to have only one way of doing things to keep maintenance costs down. Also, I don't have super high confidence in Bollard based on my past experience with it.

I'm not strongly opposed, however. Hypothetically, if someone created a pull request for this that made Toast feel a lot faster and passed all the integration tests, I might accept it (but then I'd prefer that it not be behind a flag and that we just switch to Bollard entirely). But I don't want to promise that and end up wasting anyone's time.

I should also mention that the Bollard implementation would be incompatible with #362, but that's probably not a big deal.

@lynaghk
Author

lynaghk commented Jul 26, 2022

Makes complete sense. I also just discovered #441, which would require additional work to support the remote engine use case.

FWIW, I can't look into Bollard near-term myself, so feel free to close this issue. If I do take a crack at it, I'll definitely mention it here, though.

@stepchowfun
Owner

Sounds good. Thanks for the thorough bug report and discussion. I wish there was an easy fix to make remote execution faster. It's possible you're the only person who's tried doing that with Toast so far, so this is unexplored territory.

@lynaghk
Author

lynaghk commented Jul 30, 2022

I had a crack at converting a few docker CLI calls to use bollard instead. There's a significant speedup from eliminating the overhead of invoking the 65 MB docker binary: the run takes about a fifth of the time with a local DOCKER_HOST, and I expect the speedup will be even more substantial with a remote DOCKER_HOST.

$ LOG_LEVEL=debug rtss toast hello
   4.5ms    4.5ms # [DEBUG] Attempting to load configuration file `/Users/dev/Library/Application Support/toast/toast.yml`…
   4.5ms   58.1μs # [DEBUG] Configuration file not found. Using the default configuration.
   4.7ms    0.2ms # [INFO] Ready to run 1 task: `hello`.
   4.7ms          # [INFO] Running task `hello`…
   5.2ms    0.4ms # [INFO] Reading files…
  25.4ms   20.2ms # [DEBUG] Checking existence of image `toast:toast-caa7db7eed79c4b0f97a1a66e4b7a1e508e73b91e02c05aacc17546db9f9104f`…
  25.4ms   46.5μs # [INFO] Checking existence of image…
 136.3ms  110.9ms # [DEBUG] Creating container from image `toast:toast-caa7db7eed79c4b0f97a1a66e4b7a1e508e73b91e02c05aacc17546db9f9104f`…
 136.4ms   79.2μs # [INFO] Creating container…
 245.0ms  108.6ms # [DEBUG] Copying `out` from container `0a76504cd464c4ec530e7872bdd8bb4c824a37f89e332d348a49d68ddc1a51f3`…
 245.2ms    0.2ms # [INFO] Copying files from the container…
 459.6ms  214.4ms # [DEBUG] Deleting container `0a76504cd464c4ec530e7872bdd8bb4c824a37f89e332d348a49d68ddc1a51f3`…
 459.6ms   42.8μs # [INFO] Deleting container…
 576.8ms    exit status: 0

$ LOG_LEVEL=debug rtss ~/software/toast/target/release/toast hello
   3.8ms    3.8ms # [DEBUG] Attempting to load configuration file `/Users/dev/Library/Application Support/toast/toast.yml`…
   3.8ms   28.6μs # [DEBUG] Configuration file not found. Using the default configuration.
   4.0ms    0.2ms # [INFO] Ready to run 1 task: `hello`.
   4.0ms          # [INFO] Running task `hello`…
   4.4ms    0.4ms # [INFO] Reading files…
  22.5ms   18.1ms # [DEBUG] Checking existence of image `toast:toast-caa7db7eed79c4b0f97a1a66e4b7a1e508e73b91e02c05aacc17546db9f9104f`…
  29.1ms    6.6ms # [DEBUG] Creating container from image `toast:toast-caa7db7eed79c4b0f97a1a66e4b7a1e508e73b91e02c05aacc17546db9f9104f`…
  63.7ms   34.5ms # [DEBUG] Copying `out` from container `fbcb8fe7a2a1aeb8ab635167d062839899de8f401fd931f05dc241d810575783`…
 105.3ms   41.7ms # [DEBUG] Deleting container `fbcb8fe7a2a1aeb8ab635167d062839899de8f401fd931f05dc241d810575783`…
 114.3ms    exit status: 0

However, using bollard roughly doubles the binary size, from 2.6 MB to 5.7 MB (1.6 MB to 3.1 MB with debug symbols stripped and link-time optimization enabled).

Alternatively, I considered using Docker's JSON API directly rather than going through bollard, but it's not clear how much code size or complexity that would actually save, given the need to stream files in/out of containers, configure ports, etc.

I'm new to the async corners of Rust, so my proof-of-concept code isn't particularly efficient. Off the top of my head, things that still need to be sorted out:

  • share a single tokio runtime rather than building one for every docker API invocation (sketched after the snippet below)
  • don't read tarballs fully into memory; actually stream them (bridging futures::stream::Stream with std::io::Read)
  • translate error/result types into what Toast already uses

I basically changed a few run_quiet_stdin callsites to something like:

    // Build a throwaway runtime for this one call (eventually this should be a
    // single shared runtime; see the list above).
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(async {
            // `tar`, `container`, and `Failure` come from the surrounding Toast code.
            let docker = bollard::Docker::connect_with_socket_defaults().unwrap();

            // Upload the tarball to the root of the container's filesystem.
            let options = Some(bollard::container::UploadToContainerOptions {
                path: "/",
                ..Default::default()
            });

            // TODO(kevin): actually stream rather than reading into memory. Rust types are hard.
            let mut data = Vec::new();
            tar.read_to_end(&mut data).unwrap();

            docker
                .upload_to_container(container, options, data.into())
                .await
                .map(|_res| ())
                .map_err(|_err| Failure::System("kevin todo".to_string(), None))
        })
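
For the shared-runtime item above, I'm imagining a lazily initialized global runtime along these lines (just a sketch; it assumes a once_cell dependency, which Toast may or may not want to take on):

    use once_cell::sync::Lazy;
    use tokio::runtime::Runtime;

    // Build the runtime once and reuse it for every Docker API call, instead of
    // constructing a fresh runtime inside each call site.
    static RUNTIME: Lazy<Runtime> = Lazy::new(|| {
        tokio::runtime::Builder::new_multi_thread()
            .enable_all()
            .build()
            .expect("failed to build tokio runtime")
    });

    // Call sites then wrap their async bollard calls with this helper.
    fn block_on_docker<F: std::future::Future>(future: F) -> F::Output {
        RUNTIME.block_on(future)
    }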

Anyway, just an FYI progress report. Let me know if you'd prefer me to move this to a separate issue or just work on my own fork.

@stepchowfun
Owner

That speedup is awesome.

Regarding binary size, I wouldn't worry about it at all. 5.7 MB is nothing.
