Skip to content

Conversation

@nicks
Copy link
Contributor

@nicks nicks commented Jun 6, 2022

Note that this does not fully fix the referenced issue, but
at least makes sure that API clients don't hang forever on
the initialization step. (now it hangs on a later request)

See: #3652

- What I did
see attached issue

- How I did it
see attached issue

- How to verify it
The docker ps in the attached issue still hangs, but now hangs at a later step. You can see this by running kill -s ABRT on the hung process to see where it's hanging

- Description for the changelog
Add timeout to initial docker client initialization

- A picture of a cute animal (not mandatory but encouraged)
IMG_20200502_114336

@nicks nicks force-pushed the nicks/issue3652 branch 2 times, most recently from d51817c to 36a3b0b Compare June 6, 2022 16:30
@codecov-commenter
Copy link

codecov-commenter commented Jun 6, 2022

Codecov Report

Merging #3663 (26a1841) into master (b75c262) will increase coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3663      +/-   ##
==========================================
+ Coverage   59.02%   59.11%   +0.08%     
==========================================
  Files         289      289              
  Lines       24626    24684      +58     
==========================================
+ Hits        14535    14591      +56     
- Misses       9218     9220       +2     
  Partials      873      873              

Copy link
Collaborator

@vvoland vvoland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments from me.
I think it would be ideal to make cli.Initialize accept context.Context from caller (which could be with a timeout) which would be more idiomatic. However since this is public API, we probably shouldn't change it because it would be a breaking change.

Comment on lines 173 to 176
dir := fs.NewDir(t, t.Name())
defer dir.Remove()

socket := dir.Join("my.sock")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing.T has a nice TempDir function for creating a temporary directory:

Suggested change
dir := fs.NewDir(t, t.Name())
defer dir.Remove()
socket := dir.Join("my.sock")
dir := t.TempDir()
socket := path.join(dir, "my.sock")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

cli := &DockerCli{client: apiClient}
cli.initTimeout = time.Millisecond
cli.Initialize(flags.NewClientOptions())
assert.Assert(t, received)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This triggers a race condition detector (go test -race) because received = true above is done in other goroutine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!


cli := &DockerCli{client: apiClient}
cli.initTimeout = time.Millisecond
cli.Initialize(flags.NewClientOptions())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Initialize would hang in this situation (in case some future regression happens and the timeout behaviour breaks), then the test will also hang. We should probably add some timeout here - if the Initialize doesn't return in expected time then fail the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

assert.NilError(t, err)

cli := &DockerCli{client: apiClient}
cli.initTimeout = time.Millisecond
Copy link
Collaborator

@vvoland vvoland Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Instead of writing to a private field, you could also make an InitializeOpt function that would set this, that could be passed directly to the cli.Initialize:

cli.Initialize(flags.NewClientOptions(), WithTimeout(time.Millisecond))

This would also make it possible to specify different timeout for the external caller.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@nicks nicks force-pushed the nicks/issue3652 branch 2 times, most recently from a7ca454 to 26a1841 Compare July 8, 2022 22:50
@nicks nicks requested a review from vvoland July 8, 2022 23:01
@nicks
Copy link
Contributor Author

nicks commented Jul 13, 2022

@vvoland is there anything left you need from me on this? (i don't have permission to merge the PR)

@vvoland
Copy link
Collaborator

vvoland commented Jul 14, 2022

Looks good to me, but maybe @thaJeztah would like to take a look?

// time="2020-04-10T10:16:26Z" level=warning msg="commandConn.CloseWrite: commandconn: failed to wait: signal: killed"
var cancel func()
ctx, cancel = context.WithTimeout(ctx, 2*time.Second)
ctx, cancel = context.WithTimeout(ctx, cli.getInitTimeout())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminds me that I still wanted to look into this TODO as well;

cli/cli/command/cli.go

Lines 135 to 138 in d0df532

func (cli *DockerCli) ServerInfo() ServerInfo {
// TODO(thaJeztah) make ServerInfo() lazily load the info (ping only when needed)
return cli.serverInfo
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious though; #3652 describes that the CLI hangs indefinitely; but there's already a 2 second timeout. I understand that this PR makes that timeout configurable, but wouldn't #3652 already timeout after 2 seconds? Or is the unix:// addition the actual fix here?

Copy link
Contributor Author

@nicks nicks Jul 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the unix:// addition is the fix. The configurability wasn't in the original pr, it's something @vvoland asked for in review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for those details (I probably should've read back the whole discussion above)

So some thoughts;

First of all, "the fix"; I think this was an incorrect assumption on my side. I added this check in b397391 / #2424, wich the assumption that a local socket would not really need a timeout (unless something was horribly wrong), and of course "horribly wrong" now appears to not be theoretical (per your ticket) 😂

I assume the same could also happen on a named pipe (npipe://) on Windows or fd:// (default on Linux for systemd created socket)? If so, we could be explicit;

if !strings.HasPrefix(cli.DockerEndpoint().Host, "ssh://") {

We should verify though if cli.DockerEndpoint().Host is guaranteed to always have a scheme (I know there's some "fuzzy" / "relaxed" handling of some values in our code-base, and we should verify that the values is at least normalized (scheme added if missing) early on). At a quick glance, I think that's the case though; calls getServerHost(), which does normalising (lol, we need to get rid of some abstraction here and there);

host, err := getServerHost(opts.Hosts, opts.TLSOptions)

As to making the timeout configurable; in general, I'm not against adding configurable options; that said, once added, it's pretty hard to (deprecate and) remove an option, so I also try to be conservative where possible. I realize the 2 seconds was just some arbitrary value ("2 seconds should be enough?"); for your use-case, are you planning on adjusting that default, or would the "fix" be enough (only needed to prevent a full hang, but default is "fine" otherwise)?

What I'm thinking is, that we could; split the fix from the feature; the fix (one line change, and appears to be low risk) could be candidate for backporting to 20.10 if we think that's good. For the feature (make the timeout configurable), we could keep that in draft (for now), and still decide to take it if it turns out there's good use-cases for overriding. Ultimately (also see my other comment above), I think it would be great if we didn't bother with pinging at all, unless we need it (in most cases that would be "the moment we need to connect to the API anyway").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

I backed out the WithInitializationTimeout setting. I also changed it to check "not ssh://".

ya, i'm pretty confident that this codepath always normalizes the Host to have a scheme:

func parseDockerDaemonHost(addr string) (string, error) {

speaking to some of the more high-level comments:

I still don't know how the user's Docker engine got into the state where it was hanging on the ping, but I think it's a good engineering value to be paranoid.

I really like the general Go convention of splitting "Connection" from "Client", where the Connection handles pings/healthchecks and the Client handles API types. grpc or kubernetes/client-go both separate them out this way. i think the Ping() here is a symptom of that more high-level problem of the client trying to do both in a single object. but appreciate that's a larger change.

@thaJeztah
Copy link
Member

I did a quick squash and rebase of this PR in #3722

(I'll rebase this one to restore the WithInitializeTimeout() for further discussion 👍

Note that this does not fully fix the referenced issue, but
at least makes sure that API clients don't hang forever on
the initialization step.

See: docker#3652
Signed-off-by: Nick Santos <nick.santos@docker.com>
@nicks nicks force-pushed the nicks/issue3652 branch from 4de1b74 to eff737f Compare July 29, 2022 20:58
@nicks
Copy link
Contributor Author

nicks commented Jul 29, 2022

squashed this one!

ya, the WithInitializeTimeout() bit isn't hard to add, and i'm not attached to it.

@nicks nicks closed this Aug 27, 2022
@nicks
Copy link
Contributor Author

nicks commented Aug 27, 2022

got merged in #3722

@nicks nicks deleted the nicks/issue3652 branch August 27, 2022 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants