-
Notifications
You must be signed in to change notification settings - Fork 2.1k
cli: set a timeout for the initial connection ping #3663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d51817c to
36a3b0b
Compare
Codecov Report
@@ Coverage Diff @@
## master #3663 +/- ##
==========================================
+ Coverage 59.02% 59.11% +0.08%
==========================================
Files 289 289
Lines 24626 24684 +58
==========================================
+ Hits 14535 14591 +56
- Misses 9218 9220 +2
Partials 873 873 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments from me.
I think it would be ideal to make cli.Initialize accept context.Context from caller (which could be with a timeout) which would be more idiomatic. However since this is public API, we probably shouldn't change it because it would be a breaking change.
cli/command/cli_test.go
Outdated
| dir := fs.NewDir(t, t.Name()) | ||
| defer dir.Remove() | ||
|
|
||
| socket := dir.Join("my.sock") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing.T has a nice TempDir function for creating a temporary directory:
| dir := fs.NewDir(t, t.Name()) | |
| defer dir.Remove() | |
| socket := dir.Join("my.sock") | |
| dir := t.TempDir() | |
| socket := path.join(dir, "my.sock") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
cli/command/cli_test.go
Outdated
| cli := &DockerCli{client: apiClient} | ||
| cli.initTimeout = time.Millisecond | ||
| cli.Initialize(flags.NewClientOptions()) | ||
| assert.Assert(t, received) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This triggers a race condition detector (go test -race) because received = true above is done in other goroutine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed!
cli/command/cli_test.go
Outdated
|
|
||
| cli := &DockerCli{client: apiClient} | ||
| cli.initTimeout = time.Millisecond | ||
| cli.Initialize(flags.NewClientOptions()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the Initialize would hang in this situation (in case some future regression happens and the timeout behaviour breaks), then the test will also hang. We should probably add some timeout here - if the Initialize doesn't return in expected time then fail the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
cli/command/cli_test.go
Outdated
| assert.NilError(t, err) | ||
|
|
||
| cli := &DockerCli{client: apiClient} | ||
| cli.initTimeout = time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Instead of writing to a private field, you could also make an InitializeOpt function that would set this, that could be passed directly to the cli.Initialize:
cli.Initialize(flags.NewClientOptions(), WithTimeout(time.Millisecond))This would also make it possible to specify different timeout for the external caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
a7ca454 to
26a1841
Compare
|
@vvoland is there anything left you need from me on this? (i don't have permission to merge the PR) |
|
Looks good to me, but maybe @thaJeztah would like to take a look? |
| // time="2020-04-10T10:16:26Z" level=warning msg="commandConn.CloseWrite: commandconn: failed to wait: signal: killed" | ||
| var cancel func() | ||
| ctx, cancel = context.WithTimeout(ctx, 2*time.Second) | ||
| ctx, cancel = context.WithTimeout(ctx, cli.getInitTimeout()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminds me that I still wanted to look into this TODO as well;
Lines 135 to 138 in d0df532
| func (cli *DockerCli) ServerInfo() ServerInfo { | |
| // TODO(thaJeztah) make ServerInfo() lazily load the info (ping only when needed) | |
| return cli.serverInfo | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the unix:// addition is the fix. The configurability wasn't in the original pr, it's something @vvoland asked for in review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for those details (I probably should've read back the whole discussion above)
So some thoughts;
First of all, "the fix"; I think this was an incorrect assumption on my side. I added this check in b397391 / #2424, wich the assumption that a local socket would not really need a timeout (unless something was horribly wrong), and of course "horribly wrong" now appears to not be theoretical (per your ticket) 😂
I assume the same could also happen on a named pipe (npipe://) on Windows or fd:// (default on Linux for systemd created socket)? If so, we could be explicit;
if !strings.HasPrefix(cli.DockerEndpoint().Host, "ssh://") {We should verify though if cli.DockerEndpoint().Host is guaranteed to always have a scheme (I know there's some "fuzzy" / "relaxed" handling of some values in our code-base, and we should verify that the values is at least normalized (scheme added if missing) early on). At a quick glance, I think that's the case though; calls getServerHost(), which does normalising (lol, we need to get rid of some abstraction here and there);
Line 324 in a32cd16
| host, err := getServerHost(opts.Hosts, opts.TLSOptions) |
As to making the timeout configurable; in general, I'm not against adding configurable options; that said, once added, it's pretty hard to (deprecate and) remove an option, so I also try to be conservative where possible. I realize the 2 seconds was just some arbitrary value ("2 seconds should be enough?"); for your use-case, are you planning on adjusting that default, or would the "fix" be enough (only needed to prevent a full hang, but default is "fine" otherwise)?
What I'm thinking is, that we could; split the fix from the feature; the fix (one line change, and appears to be low risk) could be candidate for backporting to 20.10 if we think that's good. For the feature (make the timeout configurable), we could keep that in draft (for now), and still decide to take it if it turns out there's good use-cases for overriding. Ultimately (also see my other comment above), I think it would be great if we didn't bother with pinging at all, unless we need it (in most cases that would be "the moment we need to connect to the API anyway").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I backed out the WithInitializationTimeout setting. I also changed it to check "not ssh://".
ya, i'm pretty confident that this codepath always normalizes the Host to have a scheme:
Line 71 in a32cd16
| func parseDockerDaemonHost(addr string) (string, error) { |
speaking to some of the more high-level comments:
I still don't know how the user's Docker engine got into the state where it was hanging on the ping, but I think it's a good engineering value to be paranoid.
I really like the general Go convention of splitting "Connection" from "Client", where the Connection handles pings/healthchecks and the Client handles API types. grpc or kubernetes/client-go both separate them out this way. i think the Ping() here is a symptom of that more high-level problem of the client trying to do both in a single object. but appreciate that's a larger change.
|
I did a quick squash and rebase of this PR in #3722 (I'll rebase this one to restore the |
Note that this does not fully fix the referenced issue, but at least makes sure that API clients don't hang forever on the initialization step. See: docker#3652 Signed-off-by: Nick Santos <nick.santos@docker.com>
|
squashed this one! ya, the WithInitializeTimeout() bit isn't hard to add, and i'm not attached to it. |
|
got merged in #3722 |
Note that this does not fully fix the referenced issue, but
at least makes sure that API clients don't hang forever on
the initialization step. (now it hangs on a later request)
See: #3652
- What I did
see attached issue
- How I did it
see attached issue
- How to verify it
The
docker psin the attached issue still hangs, but now hangs at a later step. You can see this by runningkill -s ABRTon the hung process to see where it's hanging- Description for the changelog
Add timeout to initial docker client initialization
- A picture of a cute animal (not mandatory but encouraged)
