
[JENKINS-59764] Connection timeout is not honored #755

Merged — 3 commits merged into jenkinsci:master on Nov 18, 2019

Conversation

kuisathaverat
Contributor

@kuisathaverat kuisathaverat commented Oct 13, 2019

See JENKINS-59764.

I've found that the connector used by the docker-plugin passes the connection timeout in the wrong argument to the connect method.
It does not require a test because the connection method would not be tested here.

Submitter checklist

  • JIRA issue is well described
  • Appropriate autotests or explanation to why this change has no tests

@pjdarton

Member

@pjdarton pjdarton left a comment


I've looked at the code a little. It looks like this is more "the tip of the iceberg" than a full solution to the problem...

It would appear that, although PortUtils.ConnectionCheckSSH has a sshTimeoutMillis field, nothing ever calls withSSHTimeout to set it, so it always stays with the default value of 2 seconds.
This is bad ... but it does mean that any solution to the problems in this area can be free to make changes as nothing's going to be affected by them...
The user-facing configuration is from DockerComputerSSHConnector and this has launchTimeoutSeconds, maxNumRetries and retryWaitTime as user-controlled configuration for connection-related matters, but these are really intended for hudson.plugin.sshslaves.SSHLauncher rather than for use by the docker-plugin code itself.
The docker-plugin code doesn't have any timeout durations specified by the user for it to use for itself; everything it has is intended for the SSHLauncher.

So that means that the docker-plugin either has to use the SSHLauncher configuration fields or use defaults ... and it's the fact that it's passing in zero as the connectTimeout argument to sshConnection.connect that's the cause of this issue.
However, I believe that simply swapping 0, sshTimeoutMillis to sshTimeoutMillis, 0 will merely move the problem and not solve it.
The arguments to the connect method here are connect(ServerHostKeyVerifier verifier, int connectTimeout, int readTimeout, int kexTimeout)
The timeout arguments get passed to com.trilead.ssh2.transport.TransportManager.establishConnection(ProxyData proxyData, int connectTimeout, int readTimeout)
In here, the connectTimeout is used purely when establishing the initial TCP connect to the SSH daemon, and the readTimeout to the socket once open.
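The distinction between the two timeouts can be illustrated with plain `java.net.Socket` (a sketch of the same semantics, not the trilead code itself; trilead's `TransportManager` applies `readTimeout` via `Socket.setSoTimeout` internally). The local `ServerSocket` below stands in for a container whose SSH daemon accepts connections but never responds:

```java
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        // A local "SSH daemon" that accepts TCP connections but never writes,
        // simulating a container that is up but unresponsive.
        try (ServerSocket silentServer = new ServerSocket(0);
             Socket socket = new Socket()) {
            // connectTimeout: bounds only the initial TCP handshake.
            socket.connect(
                new InetSocketAddress("127.0.0.1", silentServer.getLocalPort()), 2000);
            // readTimeout: bounds each read on the open socket (Socket.setSoTimeout).
            // With setSoTimeout(0) the read below would block forever.
            socket.setSoTimeout(2000);
            try (InputStream in = socket.getInputStream()) {
                in.read(); // server never writes, so this times out
                System.out.println("unexpected: read returned");
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out as expected");
            }
        }
    }
}
```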

So, if we simply swap the two values over, we're then setting a readTimeout of zero, and that means that in the event that the SSH daemon accepts incoming connections but then does not respond to any further traffic (which is possible when things are failing, so it's a failure mode this code must handle) then the docker plugin code would wait indefinitely, and that'd be bad...
We need to ensure that there are no circumstances under which this code can hang indefinitely. If cloud/slave provisioning hangs then "Bad Things" happen to Jenkins that manifest themselves as all sorts of other problems.

I therefore think that the correct fix would be sshConnection.connect(null, sshTimeoutMillis, sshTimeoutMillis, sshTimeoutMillis), i.e. to pass in our one-and-only timeout field for every timeout field the underlying Connection class requires.

ALSO if we're trying to fix JENKINS-59764 then we should also improve the error message from DockerComputerSSHConnector.createLauncher so that we provide seconds and milliseconds rather than just seconds ... and we should also add text to the exception message saying something like "Try increasing the number of retries (currently "+maxNumRetriesOrNull+") and/or the retry wait time (currently "+retryWaitTimeOrNull+") to allow for containers taking longer to start." so that anyone encountering this error message will know that the answer is to increase the retries and/or wait time ... especially if either of them are null/zero.
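The suggested message could be built along these lines (a sketch only; `buildTimeoutMessage` is a hypothetical helper, not an actual docker-plugin method, though `maxNumRetriesOrNull` and `retryWaitTimeOrNull` are the names used above):

```java
// Hypothetical helper sketching the improved error message suggested above.
public class TimeoutMessageDemo {
    static String buildTimeoutMessage(long timeoutMillis,
                                      Integer maxNumRetriesOrNull,
                                      Integer retryWaitTimeOrNull) {
        // Report both seconds and milliseconds, and tell the user which
        // settings to turn up when containers are slow to start.
        return "Timed out after " + (timeoutMillis / 1000) + " seconds ("
                + timeoutMillis + " ms). "
                + "Try increasing the number of retries (currently "
                + maxNumRetriesOrNull + ") and/or the retry wait time (currently "
                + retryWaitTimeOrNull
                + ") to allow for containers taking longer to start.";
    }

    public static void main(String[] args) {
        System.out.println(buildTimeoutMessage(65000L, 30, 2));
    }
}
```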

@kuisathaverat
Contributor Author

It would appear that, although PortUtils.ConnectionCheckSSH has a sshTimeoutMillis field, nothing ever calls withSSHTimeout to set it, so it always stays with the default value of 2 seconds.
This is bad ... but it does mean that any solution to the problems in this area can be free to make changes as nothing's going to be affected by them...

I did not dig too deeply into it, but in my test the default value for sshTimeoutMillis was 30 seconds, and it changed when I modified the value in the advanced SSH options on the template. After setting the connection timeout and removing the read timeout, as we do in the ssh-slaves-plugin, it works as expected and no connection is lost: https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L832-L834

I therefore think that the correct fix would be sshConnection.connect(null, sshTimeoutMillis, sshTimeoutMillis, sshTimeoutMillis), i.e. to pass in our one-and-only timeout field for every timeout field the underlying Connection class requires.

the read timeout is set to 0 due to JENKINS-48618. I have to check whether the latest changes in trilead-ssh2 fix this; I think so, but I have not tested it yet.

So, if we simply swap the two values over, we're then setting a readTimeout of zero, and that means that in the event that the SSH daemon accepts incoming connections but then does not respond to any further traffic (which is possible when things are failing, so it's a failure mode this code must handle) then the docker plugin code would wait indefinitely, and that'd be bad...
We need to ensure that there are no circumstances under which this code can hang indefinitely. If cloud/slave provisioning hangs then "Bad Things" happen to Jenkins that manifest themselves as all sorts of other problems.

I've tested for 2 hours launching 10 docker agents every 3 minutes, no issues related to the connection.

@pjdarton
Member

on my test, the default value for sshTimeoutMillis was 30 seconds

I think you're confusing what's set in the ssh-slaves-plugin with what's set in the docker-plugin.
The default value for PortUtils.ConnectionCheckSSH.sshTimeoutMillis is SECONDS.toMillis(2) as set here and the only place it's changed is via the withSSHTimeout method, and that isn't called anywhere except for the unit test code, PortUtilsTest.java.
The only timeout field that the docker-plugin code has available (in this region of the code) is the retryWaitTime field in DockerComputerSSHConnector (which defaults to 2 in the code), and that's passed to SSHLauncher's constructor's retryWaitTime parameter and to PortUtils.ConnectionCheck.withEveryRetryWaitFor.
As far as I am aware, nothing in the docker-plugin is set to 30 seconds. The default value for maxNumRetries is 30 times, but that's the only "30" I found in the docker-plugin code.

read timeout is set to 0 due JENKINS-48618

It's set to zero in the ssh-slaves-plugin ... and JENKINS-48618 states that this is not correct. I don't see why the docker-plugin should make the same error.
According to the code that the docker-plugin is calling, readTimeout is passed to java.net.Socket.setSoTimeout on the TCP socket between the master and the SSH daemon on the container, and it's important (to prevent the docker-plugin from locking up when docker containers misbehave) that the docker-plugin retain this setting.
It may also be a good idea for the ssh-slaves-plugin to use a non-zero setting as well, if that plugin doesn't have any other blockage-detecting logic and normal usage guarantees regular traffic so that the timeout does not trigger inadvertently. However, I believe the default master-slave communications include regular polling, so that if communications between the master and slave halted, this polling would detect it and initiate a disconnection; I therefore suspect it's less of a problem for the ssh-slaves-plugin, and that if a user turns off that polling, the readTimeout might trigger and cause a disconnection when that isn't desired. Personally I'd be wary about making such low-level changes to such a high-profile plugin - there's a lot of hidden complexity here.

Overall, regardless of the ssh-slaves-plugin's issues, the docker-plugin isn't the ssh-slaves-plugin and the code under scrutiny in this PR has a much simpler job and slightly different requirements - they're different bits of code doing different things for different reasons, so there's no reason for them to be identical.
i.e. the docker-plugin code needs to "get it right" on its own basis and not merely follow the lead of the ssh-slaves-plugin.

I've tested for 2 hours launching 10 docker agents every 3 minutes, no issues related to the connection.

That only shows everything is working fine under ideal conditions and under minimal load - demonstrating that everything continues to work as it should when containers are misbehaving and/or things are under load is much more difficult.
e.g. you've not proven that you haven't accidentally broken behavior that only happens when docker containers misbehave.
Unfortunately the existing docker-plugin unit tests don't test for that either (ideally they should, but that's difficult to code).

Personally I've found (in real world usage where I work) that a docker daemon can lock up, and that docker containers can lock up, and networks can stop passing traffic, and these scenarios all result in TCP/IP connectivity halting, and code can only handle that if it has read timeouts active on all network sockets that it uses.
That's why I believe that the docker-plugin MUST continue to set readTimeout to a non-zero number in order to ensure that it cannot deadlock.
(FYI it was these lockups that got me involved in the docker-plugin to start with, several years ago now)
The only reason I could see why the docker-plugin would set a readTimeout of zero would be to work around a bug in the trilead code, but that'd be better fixed in the trilead code, and Jenkins should stick to a non-buggy version until it's fixed.

i.e. I agree that setting a connection timeout is a good thing, but I strongly believe that it is important that the docker-plugin does not set the readTimeout to zero when it's polling to see if a container's SSH daemon is running.

@kuisathaverat
Contributor Author

I think you're confusing what's set in the ssh-slaves-plugin with what's set in the docker-plugin.

yep, you are right. I saw the correct values before that call; I did not inspect at that point. In this snippet the values are correct https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L248-L255 but, as you said, those values are not sent to the connection.

Member

@pjdarton pjdarton left a comment


The PortUtil changes are good 👍
The DockerComputerSSHConnector changes are going in the right direction but need some changes.

```diff
@@ -253,8 +254,12 @@ protected ComputerLauncher createLauncher(DockerAPI api, String workdir, Inspect
         if ( retryWaitTimeOrNull!=null ) {
             connectionCheck.withEveryRetryWaitFor( retryWaitTimeOrNull, TimeUnit.SECONDS );
         }
+        final Integer launchTimeoutSeconds = getLaunchTimeoutSeconds();
+        if( launchTimeoutSeconds != null) {
+            connectionCheckSSH = connectionCheckSSH.withSSHTimeout(launchTimeoutSeconds, TimeUnit.SECONDS);
```
@pjdarton (Member)


While I'm in favor of something calling withSSHTimeout, the launchTimeoutSeconds value was never intended to be used for this purpose.
It would be wrong for the code to reuse the SSHLauncher's launchTimeoutSeconds field as the SSH timeout value, as that's not as-documented and not as-intended; it would cause confusion if we reuse this field without documenting it as such, and it's going to be difficult to explain WTF the field is if we try to document it (and, in my experience, when code's behavior is difficult to justify in the documentation, that means the code is wrong!).
Unfortunately there is no other existing configuration field that is suitable here, so we either have to add a new configuration field explicitly for this (which we could do, but would be more work), or we just leave it with its default value (which hasn't been a problem in the last N years, so it's unlikely to be a big issue now).
i.e. don't use this field, either add a new one or just don't set it at all.

...alternatively: Promise me that, after the ssh-slaves-plugin introduces a new field for the readTimeout, you'll submit another PR for this plugin that adds a field of the same name as SSHLauncher's readTimeout field, passes that field to the SSHLauncher when that's created, and also uses it at this point in the code for the withSSHTimeout value 😀

@kuisathaverat (Contributor, Author)


While I'm in favor of something calling withSSHTimeout, the launchTimeoutSeconds value was never intended to be used for this purpose.

It is the value passed from the UI and used everywhere; this value changes when you modify the "Connection Timeout in Seconds" setting in the UI, so I guess it is the correct value.

[Screenshot: the "Connection Timeout in Seconds" setting in the UI, 2019-10-15]

https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/com/nirima/jenkins/plugins/docker/launcher/DockerComputerSSHLauncher.java#L30
https://github.com/jenkinsci/docker-plugin/blob/master/src/main/java/io/jenkins/docker/connector/DockerComputerSSHConnector.java#L359-L362

@pjdarton (Member)


👍 I agree that the launchTimeoutSeconds field is the one described as Connection Timeout in Seconds in the UI and that the configuration logic for all of that is working as expected.
👎 What I don't agree with is using that field for the withSSHTimeout value.

The "Connection Timeout in Seconds"/launchTimeoutSeconds field (in the DockerComputerSSHLauncher) is equivalent to (and used as) the field of the same name in the SSHLauncher class. That field is documented (by the ssh-slaves-plugin) as:

Set the timeout value for ssh agent launch in seconds. If empty, it will be reset to default value.
This will only set the timeout for agent launching; once launched, the timeout will not apply.

i.e. the launchTimeoutSeconds field holds the timeout that the SSHLauncher uses to set a limit on the time it takes for the already-logged-in SSH connection to result in an active running slave.jar process before the SSHLauncher declares a failure; it is not the time limit for the SSH daemon to respond to a login, so it'd be wrong to use it as that.

The SSHLauncher (currently) has no "timeout while waiting for progress during the SSH login process" field (and this is why the DockerComputerSSHLauncher has no equivalent). The launchTimeoutSeconds field (that it does have) is not intended for (or described as) that purpose, so we should not use it for that purpose.

TL;DR: Our withSSHTimeout method sets a different kind of connection timeout to the one described by the "Connection Timeout in Seconds" field.

@kuisathaverat (Contributor, Author)


i.e. the launchTimeoutSeconds field holds the timeout that the SSHLauncher uses to set a limit on the time it takes for the already-logged-in SSH connection to result in an active running slave.jar process before the SSHLauncher declares a failure; it is not the time limit for the SSH daemon to respond to a login, so it'd be wrong to use it as that.

in the ssh-slaves-plugin, launchTimeoutSeconds is used as the timeout for the SSH connection; the "Connection Timeout in Seconds" in the ssh-slaves-plugin is the same field as "Connection Timeout in Seconds" in the DockerComputerSSHLauncher. In this PR, launchTimeoutSeconds is used as the timeout for the SSH connect (same as ssh-slaves), read, and connection (same as ssh-slaves). ConnectionCheckSSH checks whether it is possible to connect to the SSH agent; for that it uses the trilead-ssh2 connect method with the same parameter it would use for the Launcher. I do not see any need to define a "timeout while waiting for progress during the SSH login process" because ConnectionCheckSSH is checking the connection and it uses the parameter defined for the SSHLauncher; for me that is correct. Adding another parameter to the docker-plugin only for SSH would overload the UI even more, making it harder to tune the plugin.

@pjdarton (Member)


I think I've got to the bottom of this at long last...
TL;DR: My statements used to be correct but, in the as-yet-unreleased version of the ssh-slaves-plugin, what you've said will be true.

My reasoning...

Both plugins are using com.trilead.ssh2.Connection's method connect(ServerHostKeyVerifier verifier, int connectTimeout, int readTimeout, int kexTimeout). This is something they have in common and that hasn't changed.

Our underlying requirement for the docker-plugin's PortUtils code is that it must test "is that container ready to receive a SSH connection from the ssh-slaves-plugin code?", and we need it to do that in a way that can not deadlock.
In the latest release of the docker-plugin, the connectTimeout value was hard-coded as zero with the other two coming from the sshTimeoutMillis field, which was always left at the default of 2 seconds.
With this PR, all 3 parameters are sshTimeoutMillis and that'll be set to the "Connection Timeout in Seconds" value.

In the latest (1.31.0) release of the ssh-slaves-plugin code, the connectTimeout value comes from the "Connection Timeout in Seconds" field (aka launchTimeoutSeconds), the readTimeout parameter passed to com.trilead.ssh2.Connection's connect method is a hard-coded value of 0 /*read timeout - JENKINS-48618*/ and the kexTimeout is the same as connectTimeout ... but it's all run inside a newSingleThreadExecutor with an overall timeout set to "Connection Timeout in Seconds".
However, in the bleeding-edge code (the current master branch), it's changed: kexTimeout is now 5 seconds (hard-coded) longer and the newSingleThreadExecutor is not given an overall timeout set to "Connection Timeout in Seconds".
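The 1.31.0 pattern described above - running the connect attempt inside a single-thread executor with an overall deadline - can be sketched with only JDK classes (a simplified illustration, not the actual SSHLauncher code; the `Thread.sleep` stands in for a hung `Connection.connect(...)`, and the 1-second bound stands in for "Connection Timeout in Seconds"):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class OverallTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService launcherExecutor = Executors.newSingleThreadExecutor();
        // Simulates a connect attempt that hangs longer than the overall timeout.
        Future<String> connectAttempt = launcherExecutor.submit(() -> {
            Thread.sleep(10_000); // stand-in for a hung Connection.connect(...)
            return "connected";
        });
        try {
            // Overall deadline applied from outside the connect attempt.
            connectAttempt.get(1, TimeUnit.SECONDS);
            System.out.println("unexpected: connect completed");
        } catch (TimeoutException e) {
            System.out.println("overall timeout fired; connect attempt abandoned");
            connectAttempt.cancel(true); // interrupt the hung attempt
        } finally {
            launcherExecutor.shutdownNow();
        }
    }
}
```

The key property is that the deadline is enforced even if the underlying socket has no read timeout of its own - which is why removing that outer timeout (as the bleeding-edge ssh-slaves code does) changes which of the `connect` arguments actually bounds the wait.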

It's the changes in the ssh-slaves-plugin code that confused me - you were looking at the code changes you were making and I was looking at the code in the last release, which is why we were both seeing different things 😁

In summary: If that's the direction that the ssh-slaves-plugin is going then it's ok to use the "Connection Timeout in Seconds" field the same way.

@pjdarton pjdarton merged commit 02b6dbf into jenkinsci:master Nov 18, 2019
@pjdarton
Member

Code changes went into the release of docker-plugin version 1.1.9.
