-
Notifications
You must be signed in to change notification settings - Fork 319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-59764] Connection timeout is not honored #755
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked at the code a little. It looks like this is more "the tip of the iceberg" than a full solution to the problem...
It would appear that, although PortUtils.ConnectionCheckSSH
has a sshTimeoutMillis
field, nothing ever calls withSSHTimeout
to set it, so it always stays with the default value of 2 seconds.
This is bad ... but it does mean that any solution to the problems in this area can be free to make changes as nothing's going to be affected by them...
The user-facing configuration is from DockerComputerSSHConnector
and this has launchTimeoutSeconds
, maxNumRetries
and retryWaitTime
as user-controlled configuration for connection-related matters, but these are really intended for hudson.plugin.sshslaves.SSHLauncher
rather than for use by the docker-plugin
code itself.
The docker-plugin
code doesn't have any timeout durations specified by the user for it to use for itself; everything it has is intended for the SSHLauncher
.
So that means that the docker-plugin
either has to use the SSHLauncher
configuration fields or use defaults ... and it's the fact that it's passing in zero as the connectTimeout
argument to sshConnection.connect
that's the cause of this issue.
However, I believe that simply swapping 0, sshTimeoutMillis
to sshTimeoutMillis, 0
will merely move the problem and not solve it.
The arguments to the connect
method here are connect(ServerHostKeyVerifier verifier, int connectTimeout, int readTimeout, int kexTimeout)
The timeout arguments get passed to com.trilead.ssh2.transport.TransportManager.establishConnection(ProxyData proxyData, int connectTimeout, int readTimeout)
In here, the connectTimeout
is used purely when establishing the initial TCP connect to the SSH daemon, and the readTimeout
to the socket once open.
So, if we simply swap the two values over, we're then setting a readTimeout
of zero, and that means that in the event that the SSH daemon accepts incoming connections but then does not respond to any further traffic (which is possible when things are failing, so it's a failure mode this code must handle) then the docker plugin code would wait indefinately, and that'd be bad...
We need to ensure that there are no circumstances under which this code can hang indefinately. If cloud/slave provisioning hangs then "Bad Things" happen to Jenkins that manifests itself as all sorts of other problems.
I therefore think that the correct fix would be sshConnection.connect(null, sshTimeoutMillis, sshTimeoutMillis, sshTimeoutMillis)
, i.e. to pass in our one-and-only timeout field for every timeout field the underlying Connection
class requires.
ALSO if we're trying to fix JENKINS-59764 then we should also improve the error message from DockerComputerSSHConnector.createLauncher so that we provide seconds and milliseconds rather than just seconds ... and we should also add text to the exception message saying something like "Try increasing the number of retries (currently "+maxNumRetriesOrNull+") and/or the retry wait time (currently "+retryWaitTimeOrNull+") to allow for containers taking longer to start."
so that anyone encountering this error message will know that the answer is to increase the retries and/or wait time ... especially if either of them are null/zero.
src/main/java/com/nirima/jenkins/plugins/docker/utils/PortUtils.java
Outdated
Show resolved
Hide resolved
I did not dig too much on it but on my test, the default value for sshTimeoutMillis was 30 seconds, and it changes if you modified the value in the advanced SSH options on the template. After set the connection timeout and remove the time read as we have in the ssh-slaves-plugin it works as expected no connection is lost https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L832-L834
the read timeout is set to 0 due JENKINS-48618, I have to check if with the latest changes on trilead-ssh2 this is fixed I think so but I did not test yet.
I've tested for 2 hours launching 10 docker agents every 3 minutes, no issues related to the connection. |
I think you're confusing what's set in the ssh-slaves-plugin with what's set in the docker-plugin.
It's set to zero in the ssh-slaves-plugin ... and JENKINS-48618 states that this is not correct. I don't see why the docker-plugin should make the same error. Overall, regardless of the ssh-slaves-plugin's issues, the docker-plugin isn't the ssh-slaves-plugin and the code under scrutiny in this PR has a much simpler job and slightly different requirements - they're different bits of code doing different things for different reasons, so there's no reason for them to be identical.
That only shows everything is working fine under ideal conditions and under minimal load - demonstrating that everything continues to work as it should when containers are misbehaving and/or things are under load is much more difficult. Personally I've found (in real world usage where I work) that a docker daemon can lock up, and that docker containers can lock up, and networks can stop passing traffic, and these scenarios all result in TCP/IP connectivity halting, and code can only handle that if it has read timeouts active on all network sockets that it uses. i.e. I agree that setting a connection timeout is a good thing, but I strongly believe that it is important that the docker-plugin does not the readTimeout to zero when it's polling to see if a container's SSH daemon is running. |
yep, you are right I saw the correct values before that call, I did not inspect in that point, in this snippet the values are correct https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L248-L255 but as you said those values are not sent to the connection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PortUtil changes are good 👍
The DockerComputerSSHConnector changes are going in the right direction but need some changes.
@@ -253,8 +254,12 @@ protected ComputerLauncher createLauncher(DockerAPI api, String workdir, Inspect | |||
if ( retryWaitTimeOrNull!=null ) { | |||
connectionCheck.withEveryRetryWaitFor( retryWaitTimeOrNull, TimeUnit.SECONDS ); | |||
} | |||
final Integer launchTimeoutSeconds = getLaunchTimeoutSeconds(); | |||
if( launchTimeoutSeconds != null) { | |||
connectionCheckSSH = connectionCheckSSH.withSSHTimeout(launchTimeoutSeconds, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I'm in favor of something calling withSSHTimeout
, the launchTimeoutSeconds
value was never intended to be used for this purpose.
It would be wrong for the code to use the SSHLauncher's launchTimeoutSeconds
field to also be the SSH timeout value as that's not as-documented and not as-intended; it would cause confusion if we reuse this field without documenting it as such, and it's going to be difficult to explain WTF the field is if we try to document it (and, in my experience, when code's behavior is difficult to justify in the documention, that means the code is wrong!).
Unfortunately there is no other existing configuration field that is suitable here, so we either have to add a new configuration field explicitly for this (which we could do, but would be more work), or we just leave it with its default value (which hasn't been a problem in the last N years, so it's unlikely to be a big issue now).
i.e. don't use this field, either add a new one or just don't set it at all.
...alternatively: Promise me that, after the ssh-slaves-plugin introduces a new field for the readTimeout, you submit another PR for this plugin that adds a field of the same name as SSHLauncher's readTimeout field that pass the field to the SSHLauncher when that's create and also use it at this point in the code for the withSSHTimeout
value 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I'm in favor of something calling withSSHTimeout, the launchTimeoutSeconds value was never intended to be used for this purpose.
It is the value passed from UI and used everywhere, this value is changed when you changeConnection Timeout in Seconds
setting in the UI, so I guess it is the correct value
https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/com/nirima/jenkins/plugins/docker/launcher/DockerComputerSSHLauncher.java#L30
https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L359-L362
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I agree that the launchTimeoutSeconds
field is the one described as Connection Timeout in Seconds
in the UI and that the configuration logic for all of that is working as expected.
👎 What I don't agree with is using that field for the withSSHTimeout
value.
The "Connection Timeout in Seconds"/launchTimeoutSeconds
field (in the DockerComputerSSHLauncher) is equivalent to (and used as) the field of the same name in the SSHLauncher class. That field is documented (by the ssh-slaves-plugin) as:
Set the timeout value for ssh agent launch in seconds. If empty, it will be reset to default value.
This will only set the timeout for agent launching; once launched, the timeout will not apply.
i.e. the launchTimeoutSeconds
field holds the timeout that the SSHLauncher uses to set a limit on the time it takes for the already-logged-in SSH connection to result in an active running slave.jar process before the SSHLauncher declares a failure; it is not the time limit for the SSH daemon to respond to a login, so it'd be wrong to use it as that.
The SSHLauncher (currently) has no "timeout while waiting for progress during the SSH login process" field (and this is why the DockerComputerSSHLauncher has no equivalent). The launchTimeoutSeconds
field (that it does have) is not intended for (or described as) that purpose, so we should not use it for that purpose.
TL;DR: Our withSSHTimeout
method sets a different kind of connection timeout to the one described by the "Connection Timeout in Seconds" field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. the launchTimeoutSeconds field holds the timeout that the SSHLauncher uses to set a limit on the time it takes for the already-logged-in SSH connection to result in an active running slave.jar process before the SSHLauncher declares a failure; it is not the time limit for the SSH daemon to respond to a login, so it'd be wrong to use it as that.
in the ssh-slaves-plugin, launchTimeoutSeconds
is used as timeout for the SSH connection, the "Connection Timeout in Seconds" in the ssh-slaves-plugin is the same field than "Connection Timeout in Seconds" in the DockerComputerSSHLauncher, in this PR the launchTimeoutSeconds
is used as timeout for the SSH connect (same ssh-slaves), read, and connection(same ssh-slaves). ConnectionCheckSSH
checks if it is possible to connect to the SSH agent, for that uses trilead-ssh2 connect method with the same parameter it would use for the Launcher, I do not see any need to define a timeout while waiting for progress during the SSH login process
because ConnectionCheckSSH
is checking the connection and it uses the parameter defined for the SSHLauncher, for me it is correct, by adding another parameter to the docker-plugin only for the SSH would be overhead more the UI, making difficult to tune the plugin even more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've got to the bottom of this at long last...
TL;DR: My statements used to be correct but, in the as-yet-unreleased version of the ssh-slaves-plugin, what you've said will be true.
My reasoning...
Both plugins are using com.trilead.ssh2.Connection
's method connect(ServerHostKeyVerifier verifier, int connectTimeout, int readTimeout, int kexTimeout)
. This is something they have in common and that hasn't changed.
Our underlying requirement for the docker-plugin's PortUtils code is that it must test "is that container ready to receive a SSH connection from the ssh-slaves-plugin code?", and we need it to do that in a way that can not deadlock.
In the latest release of the docker-plugin, the connectTimeout
value was hard-coded as zero with the other two coming from the sshTimeoutMillis
field, which was always left at the default of 2 seconds.
With this PR, all 3 parameters are sshTimeoutMillis
and that'll be set to the "Connection Timeout in Seconds" value.
In the latest (1.31.0) release of the ssh-slaves-plugin code, the connectTimeout
value comes from the "Connection Timeout in Seconds" field (aka launchTimeoutSeconds
), the readTimeout parameter passed to com.trilead.ssh2.Connection
's connect
method is a hard-coded value of 0 /*read timeout - JENKINS-48618*/
and the kexTimeout
is the same as connectTimeout
... but it's all run inside a newSingleThreadExecutor
with an overall timeout set to "Connection Timeout in Seconds".
However, in the bleeding-edge code (the current master branch), it's changed: kexTimeout
is now 5 seconds (hard-coded) longer and the newSingleThreadExecutor
is not given an overall timeout set to "Connection Timeout in Seconds".
It's the changes in the ssh-slaves-plugin code that confused me - you were looking at the code changes you were making and I was looking at the code in the last release, which is why we were both seeing different things 😁
In summary: If that's the direction that the ssh-slaves-plugin is going then it's ok to use the "Connection Timeout in Seconds" field the same way.
src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java
Show resolved
Hide resolved
src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java
Outdated
Show resolved
Hide resolved
Code changes went into released of docker-plugin version 1.1.9 |
See JENKINS-59764.
I've found that the connector used by the docker-plugin passes the connection timeout in the wrong argument to the connect method.
It does not require a test because of the connection method would not be tested here.
Submitter checklist
@pjdarton