Description
Environment:
macOS Mojave 10.14.6
Client: Docker Engine - Community
Version: 19.03.1
API version: 1.40
Go version: go1.12.5
Git commit: 74b1e89
Built: Thu Jul 25 21:18:17 2019
OS/Arch: darwin/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.1
API version: 1.40 (minimum version 1.12)
Go version: go1.12.5
Git commit: 74b1e89
Built: Thu Jul 25 21:17:52 2019
OS/Arch: linux/amd64
Experimental: true
containerd:
Version: v1.2.6
GitCommit: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc:
Version: 1.0.0-rc8
GitCommit: 425e105d5a03fabd737a126ad93d62a9eeede87f
docker-init:
Version: 0.18.0
GitCommit: fec3683
It appears the SSH forwarding system fails to release file descriptors as SSH agent connections are created and dropped.
Running a command such as
docker buildx build --ssh=default=$SSH_AUTH_SOCK --progress=plain --tag=outdoorsy/scotty .
to build a Dockerfile that contains the step
RUN --mount=type=ssh go mod download
causes the SSH agent on the host machine to lock up, because file descriptors are never released.
There appear to be two issues.
First, we appear to be leaking connections on this line: https://github.com/moby/buildkit/blob/master/session/sshforward/copy.go#L20. This should probably be changed to a defer conn.Close().
Second, this line will hang forever, even after the SSH client has terminated: https://github.com/moby/buildkit/blob/master/session/sshforward/copy.go#L42. I can't figure out why. I would have assumed that when the SSH client terminated it would close its connection, producing an EOF on this read, which would then return from the function. Combined with the conn.Close change, this would let everything shut down cleanly.
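To make the expected shutdown behaviour concrete, here is a minimal sketch of a two-way copy that closes both ends as soon as either direction stops. It is not buildkit's Copy routine: it uses plain net.Pipe connections in place of the gRPC stream, and the names proxy and clientExitReleasesAgent are hypothetical. It only illustrates the assumption described above, that a terminating client should surface as an EOF and release the agent-side connection.

```go
package main

import (
	"io"
	"net"
	"sync"
	"time"
)

// proxy copies bytes in both directions between a and b. As soon as either
// direction stops (EOF or error), both ends are closed so the goroutine
// blocked in Read on the other end wakes up and returns.
func proxy(a, b io.ReadWriteCloser) {
	var once sync.Once
	closeBoth := func() {
		once.Do(func() {
			a.Close()
			b.Close()
		})
	}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		defer closeBoth()
		io.Copy(b, a) // a -> b
	}()
	go func() {
		defer wg.Done()
		defer closeBoth()
		io.Copy(a, b) // b -> a
	}()
	wg.Wait()
}

// clientExitReleasesAgent is a hypothetical self-check: it wires a stand-in
// client and a stand-in agent through proxy, lets the client exit, and
// reports whether the proxy returned and the agent side observed EOF.
func clientExitReleasesAgent() bool {
	client, proxyClientEnd := net.Pipe()
	proxyAgentEnd, agent := net.Pipe()

	done := make(chan struct{})
	go func() {
		proxy(proxyClientEnd, proxyAgentEnd)
		close(done)
	}()

	// Stand-in agent: echo one 4-byte request back.
	go func() {
		buf := make([]byte, 4)
		if _, err := io.ReadFull(agent, buf); err == nil {
			agent.Write(buf)
		}
	}()

	client.Write([]byte("ping"))
	reply := make([]byte, 4)
	io.ReadFull(client, reply)
	client.Close() // the SSH client terminates

	select {
	case <-done:
	case <-time.After(2 * time.Second):
		return false // proxy is still holding its connections
	}

	// The agent end now sees EOF because the proxy closed its side.
	_, err := agent.Read(make([]byte, 1))
	return string(reply) == "ping" && err == io.EOF
}
```

With plain connections the client's Close is enough to unwind both goroutines. If the real Copy routine behaves differently, the difference presumably lies in how the stream side signals that the peer has gone away.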
To replicate this issue, you need to run a command that generates a huge number of git requests in a single RUN.
To see the issue with conn.Read never returning EOF, you can run:
RUN --mount=type=ssh mkdir -p -m 0600 ~/.ssh && \
ssh-keyscan github.com >> ~/.ssh/known_hosts && \
ssh -T git@github.com ; sleep 10 ; exit 0
This spins up an SSH client, which uses SSH_AUTH_SOCK, then sleeps for 10 seconds to give the FD a chance to be released, and finally calls exit 0 so the build won't fail. The FD is never released. I put a bunch of fmt.Println calls in the Copy routine to see what it was doing. It does not exit until the RUN is done, at which point the first loop receives a context cancelled error and returns.
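As an alternative to scattering fmt.Println calls, a small counting wrapper can show whether connections are ever closed. This is only a sketch: trackedConn, track, openCount, and nopConn are hypothetical names, and nothing like this exists in buildkit as far as I know. The idea is to wrap each connection handed to the copy routine and watch the counter while the RUN step is still executing.

```go
package main

import (
	"io"
	"sync/atomic"
)

// openConns counts wrapped connections that have not been closed yet.
var openConns int64

// trackedConn wraps a connection and decrements openConns on the first Close.
type trackedConn struct {
	io.ReadWriteCloser
	closed int32
}

// track registers c as open and returns the wrapper to use in its place.
func track(c io.ReadWriteCloser) *trackedConn {
	atomic.AddInt64(&openConns, 1)
	return &trackedConn{ReadWriteCloser: c}
}

// Close decrements the counter once, even if Close is called repeatedly,
// and then closes the underlying connection.
func (t *trackedConn) Close() error {
	if atomic.CompareAndSwapInt32(&t.closed, 0, 1) {
		atomic.AddInt64(&openConns, -1)
	}
	return t.ReadWriteCloser.Close()
}

// openCount reports how many tracked connections are still open.
func openCount() int64 {
	return atomic.LoadInt64(&openConns)
}

// nopConn is a stand-in connection used only to exercise the wrapper.
type nopConn struct{}

func (nopConn) Read(p []byte) (int, error)  { return 0, io.EOF }
func (nopConn) Write(p []byte) (int, error) { return len(p), nil }
func (nopConn) Close() error                { return nil }
```

If the counter keeps climbing during a single RUN and only drops when the step ends, that matches the behaviour described here.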
So, to summarize, it appears the Copy routine does not return until the RUN command finishes (triggering a context cancelled error). Because of this, the SSH FDs are held open, and an operation that uses SSH heavily causes the host ssh-agent to lock up.