Fix false-alarm timeout exception when waiting for key exchange to complete #912
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
As described in #909, the SSHJ client may hit a timeout exception almost as soon as the main thread starts waiting for the key exchange to complete, even though the timeout has not been reached.
Why isn't this issue more widespread? I don't know :) I started hitting it when I enabled the heartbeater by setting the keepalive interval just before starting a connection.
Diagnosis
This issue is an existing race condition that doesn't actually require the heartbeater to be running, but apparently the presence of the heartbeater thread (at least on my production Ubuntu 20.04 linux systems) creates the conditions for this particular interleaving to be more likely to occur.
Based on the SSHJ logs and code, the issue can happen during the following sequence of events between the main thread and the reader thread:
What happens under the covers is:
SSH_MSG_KEXINIT
from the server before the main thread has a chance to initiate that exchange.done.clear()
inKeyExchanger.startKex
, the main thread also callsKeyExchanger.startKex
, skips the key exchange initiation (because it's already ongoing) and callsdone.await
with a timeout to wait for the exchange to complete.done.await
callsPromise.retrieve
which callsPromise.tryRetrieve
, which waits for the associated condition viacond.await
.done.clear()
, which delivers a value ofnull
to thedone
promise.cond.await
to wake up the main thread, andPromise.tryRetrieve
returns thisnull
value.Promise.retrieve
throws a timeout exception because the retrieved value isnull
, even though there wasn't a timeout.I can consistently reproduce this issue by attaching a debugger in order to force this particular thread interleaving. I set a breakpoint at the entrance of
KeyExchanger.startKex
, let both the main thread and the reader hit it, resume the reader thread (so it will initiate the exchange) but stop it right beforedone.clear
, resume the main thread which will pause atPromise.retrieve
, and then resume the reader so it callsdone.clear
, which delivers anull
toPromise.retrieve
on the main thread and thus triggers the exception.Solution
One option would be to add extra synchronization to
KeyExchanger.startKex
to ensure thatwaitForDone
, which callsdone.await
, is not called beforedone.clear
.The other option I see, which avoids extra synchronization and its drawbacks, is to think of
Promise.clear
as just the act of clearing the value tonull
, without actually "delivering" thatnull
value to anyone waiting on that promise. This is what I am proposing in the current diff.The only concern with doing the latter could be any side effects it may have on other uses of
Promise.clear
. I inspected the code and found one other use ofclear
, inUserAuthImpl.authenticate
. In that method,clear
is called beforeretrieve
, so the call toretrieve
will not be affected by this change inclear
. Presumably, all calls toauthenticate
should happen in the same thread, but if somehow an application calls it from different threads, then this change fixes the same issue that we target in this PR for the key exchange; parallel calls toauthenticate
will no longer throw a spurious timeout exception in such an unlikely use case. The only other place wherePromise.clear
is called isEvent.clear
, and an inspection of the uses ofEvent.clear
results in similar conclusions: calls toclear
cannot occur while the event is waiting on its promise.Testing
I repeated the same procedure that reproduces the bug. With this fix applied, the main thread at
Promise.retrieve
no longer receives anull
fromdone.clear
from the reader thread, and instead is awakened later when the reader actually delivers a non-null value viadone.set
, thus avoiding the exception.