DRIVERS-1707: Preemptively cancel in progress operations when SDAM heartbeats timeout. #1170
Conversation
 * No connections may be checked out or created in this pool until ready() is called again.
 */
clear(): void;
clear(closeInUseConnections: Boolean): void;
Optional with default false?
I think closeInUseConnections can be optional, especially for drivers without a background thread.
@patrickfreed what do you think?
I agree this should default to false and SDAM should require that it only be used in the event of network timeouts. Definitely want to document the default here.
Regarding the API of whether it's optional or not, that's more up to individual driver implementations. I don't think this should be optional functionality, if that's what you were asking though.
done
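For illustration, a minimal sketch of where this discussion landed: an optional flag defaulting to false. Only the `closeInUseConnections` name comes from the spec change above; the surrounding interface and doc comment are assumptions, not spec text.

```typescript
// Sketch only; everything except the closeInUseConnections flag is illustrative.
interface ConnectionPool {
  ready(): void;

  /**
   * Marks the pool "paused" and increments its generation.
   * When closeInUseConnections is true (expected only after a network
   * timeout during an SDAM heartbeat), connections that are currently
   * checked out are closed as well. Defaults to false.
   */
  clear(closeInUseConnections?: boolean): void;

  close(): void;
}
```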
@@ -782,6 +789,9 @@ thread SHOULD
Timeout: Background Connection Pooling
<../client-side-operations-timeout/client-side-operations-timeout.rst#background-connection-pooling>`__.

A pool SHOULD allow ability to force run next maintenance iteration to remove perished connections including "in use" connections.
In this case, A pool MAY skip populating connections.
Not sure whether this step is necessary. It looks like minPoolSize can be handled as before, even though the connections will only become available after the pool becomes ready.
- Maybe "In this case, A pool MAY skip populating connections." can be omitted, as it's part of pausable behaviour: "Connections are not created in the background to satisfy minPoolSize"
- "force run next maintenance iteration" ==> "Schedule the next Background Thread Run to run as soon as possible" or similar
done
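A rough sketch of what "schedule the next Background Thread Run to run as soon as possible" could look like in a driver with a background thread. The wakeup mechanism and all names here are assumptions made for illustration, not spec requirements.

```typescript
// Illustrative only: clear() wakes the maintenance loop instead of waiting
// for the normal interval to elapse.
class MaintenanceScheduler {
  private wake: (() => void) | null = null;

  // Called from clear(): schedule the next run as soon as possible.
  requestImmediateRun(): void {
    this.wake?.();
  }

  async run(intervalMs: number, prune: () => void): Promise<void> {
    for (;;) {
      // Wait for the interval to elapse, or for clear() to request a run.
      await new Promise<void>((resolve) => {
        this.wake = resolve;
        setTimeout(resolve, intervalMs);
      });
      this.wake = null;
      // Remove perished connections, including "in use" ones when the clear
      // requested it; populating to minPoolSize is skipped while the pool is
      // paused, per the pausable-pool behaviour discussed above.
      prune();
    }
  }
}
```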
@@ -1185,3 +1188,4 @@ Changelog
.. _Client Side Operations Timeout Spec: /source/client-side-operations-timeout/client-side-operations-timeout.rst
.. _timeoutMS: /source/client-side-operations-timeout/client-side-operations-timeout.rst#timeoutMS
.. _t-digest algorithm: https://github.com/tdunning/t-digest
.. _Why do we need force running prune maintenance call in Clear logic?: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#Why-do-we-need-force-running-prune-maintenance-call-in-Clear-logic?
this link doesn't work, will fix it in the next commit
does this link work now?
yep
Should the Closing a Connection Pool section also be updated to manually close in-use connections?
Since we only want to cancel in-progress operations when an SDAM heartbeat fails with a timeout (and not for other errors that clear the pool), I think this PR needs to change the SDAM spec as well.
Edit: I now see that this PR already updates the SDAM spec.
@@ -697,7 +697,7 @@ The event API here is assumed to be like the standard `Python Event
topology.onServerDescriptionChanged(description, connection pool for server)
if description.error != Null:
# Clear the connection pool only after the server description is set to Unknown.
clear connection pool for server
clear(closeInUseConnections: isNetworkError(description.error)) connection pool for server
isNetworkError -> isNetworkTimeout
isNetworkTimeout is also used in source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst
done
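For readers following along, a sketch of the resulting SDAM-side call in TypeScript-style pseudocode. It mirrors the spec pseudocode above rather than any concrete driver API; the helper declarations are assumptions added only to make the snippet self-contained.

```typescript
interface ServerDescription { error: Error | null; }
interface ConnectionPool { clear(closeInUseConnections?: boolean): void; }

// Assumed helper: true only for network timeout errors, matching the
// isNetworkTimeout check referenced in the SDAM spec.
declare function isNetworkTimeout(error: Error): boolean;

function onServerDescriptionChanged(description: ServerDescription, pool: ConnectionPool): void {
  if (description.error !== null) {
    // Clear the connection pool only after the ServerDescription is set to
    // Unknown; interrupt in-use connections only on a heartbeat timeout.
    pool.clear(isNetworkTimeout(description.error));
  }
}
```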
changed pool generation. As part of this request, a pool SHOULD inform maintenance
thread whether "in use" connections should be closed as well. A pool SHOULD be
informed about it via closeInUseConnections parameter in Clear method.
Probably "a pool SHOULD force the next maintenance step..." needs to be clarified. Does this mean that the next step should be scheduled to run as soon as possible?
Maybe something like: The next Background Thread Run SHOULD be scheduled as soon as possible. Next pruning iteration MUST close "in use" perished connections if requested by closeInUseConnections flag...
done
Maybe we should omit the last "requested by closeInUseConnections flag" part? It might be interpreted as meaning that scheduling the next run sooner only happens in the closeInUseConnections: true case.
If we do that, can we bump it to its own paragraph below this one? That way the closeInUseConnections and background thread scheduling parts will be in separate paragraphs.
I did not notice the "A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed." sentence. I am fine with the current wording.
nit: remove the dot from connections.requested
done
@@ -728,6 +730,13 @@ eagerly so that any operations waiting on `Connections <#connection>`_ can retry
as soon as possible. The pool MUST NOT rely on WaitQueueTimeoutMS to clear
requests from the WaitQueue.

The clearing method MUST provide the option to close any in-use connections as part
I am wondering whether we should clarify that in-use connection removal should be limited to the generation at the moment of the clear, to prevent the following scheduling (steps 3-5 happen fast):
- pool.ready() (gen1)
- pool.clear(inUse: true) (gen2)
- pool.ready() (gen2)
- pool.clear(inUse: false) (gen3)
- pool.ready() (gen3)
- prune runs and closes in-use connections with gen <= gen2 (the correct behaviour is gen <= gen1)
Yeah this is a really good point. Maybe add the following line:
The pool MUST only close in use connections whose generation is less than or equal to the generation of the pool at the moment of the clear (before the increment) that used the closeInUseConnections flag.
Unfortunately, I don't think there's a non-racy way to unit test this that I can think of.
done
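A sketch of the generation bookkeeping being discussed, in illustrative TypeScript (not spec text; the field and method names are assumptions): the generation is captured before the increment, and a later prune only interrupts in-use connections at or below that captured value.

```typescript
// Illustrative only: capture the generation *before* incrementing it, so a
// later prune never interrupts in-use connections created under a newer
// generation whose clear did not request closeInUseConnections.
interface PooledConnection { generation: number; inUse: boolean; close(): void; }

class Pool {
  private generation = 0;
  // Highest generation whose in-use connections may be interrupted.
  private interruptInUseUpTo: number | null = null;
  private connections: PooledConnection[] = [];

  clear(closeInUseConnections = false): void {
    if (closeInUseConnections) {
      this.interruptInUseUpTo = this.generation; // generation at the moment of this clear
    }
    this.generation += 1;
    // ...mark the pool paused and request an immediate background run...
  }

  // Prune step executed by the background thread.
  prunePerished(): void {
    for (const conn of this.connections) {
      const perished = conn.generation < this.generation;
      const mayInterrupt =
        this.interruptInUseUpTo !== null && conn.generation <= this.interruptInUseUpTo;
      if (perished && (!conn.inUse || mayInterrupt)) {
        conn.close(); // removal from the list omitted for brevity
      }
    }
    this.interruptInUseUpTo = null;
  }
}
```

Replaying the schedule above, the clear(inUse: true) at gen1 records 1 as the cutoff, so the prune that eventually runs at gen3 still only interrupts in-use connections with gen <= gen1.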
…-and-pooling.rst Co-authored-by: Patrick Freed <patrick.freed@mongodb.com>
$where : sleep(2000) || true
expectError:
isError: true
- name: waitForEvent
It looks like this test is a bit flaky. The previous runOnThread ensures that the thread has started, but not that the underlying operation has actually been launched, so there was a chance that the failpoint was triggered before the find started.
> So there was a chance that the failpoint was triggered before the find started.
Would increasing "times" from 2 to 4 help here?
The failpoint being triggered before the find has started should be okay, so long as the monitor check doesn't time out before the find begins. In the quickest case, the monitor can take 1000ms to time out after the failpoint is enabled (a check happens to start immediately after the failpoint is enabled), and the find should certainly take less than 1000ms to begin executing, considering the pool has already been populated with a connection. If we wait for the find to start before enabling the failpoint, we instead are concerned with the find completing before the monitor times out (in longest case, monitor could take 1500ms after the failpoint is enabled to time out, which should still be okay assuming it takes less than 500ms to enable the failpoint).
Can you elaborate more on the events you were seeing when it was failing?
> Would increasing "times" from 2 to 4 help here?
This could possibly be the cause, though I imagine it would be pretty rare. If the RTT monitor hits the failpoint, times out, and then hits it again, it could prevent the server monitor from ever triggering it if the monitor happened to start a check right before the failpoint was enabled.
> Can you elaborate more on the events you were seeing when it was failing?
I see that in some rare cases no error happens at all.
In that case, @ShaneHarvey's suggestion of upping to times: 4 may help. Did you see any difference after adding the waitForEvent?
> Would increasing "times" from 2 to 4 help here?
yep, it works too, done
Test looks good, but the comment explaining why we're using 4 seems inaccurate. As I alluded to above, there shouldn't be any problem if a heartbeat triggers the failpoint before the find starts, so long as the heartbeat doesn't time out before the find starts. I'd suggest rewording the comment to something more simple like the following (borrowed from the SDAM tests):
# Use "times: 4" to increase the probability that the Monitor check triggers
# the failpoint, since the RTT hello may trigger this failpoint one or many
# times as well.
done
Content looks good! One last test comment suggestion but otherwise LGTM
Noticed a SHOULD/MUST mismatch, but otherwise all looks good to me.
Co-authored-by: Patrick Freed <patrick.freed@mongodb.com>
LGTM!
Noticed one formatting issue. Otherwise LGTM
LGTM (minor comment)
@@ -1115,6 +1140,26 @@ clear the pool again. This situation is possible if the pool is cleared by the
background thread after it encounters an error establishing a connection, but
the ServerDescription for the endpoint was not updated accordingly yet.

Why does the pool need to support interrupting in use connections as part of its clear logic?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
minor: the tilde and the previous line should have the same length?
done
DRIVERS-1707