DRIVERS-1707: Preemptively cancel in progress operations when SDAM heartbeats timeout. #1170

Merged · 42 commits · Sep 16, 2022

Conversation

@DmitryLukyanov commented Apr 5, 2022

* No connections may be checked out or created in this pool until ready() is called again.
*/
clear(): void;
clear(closeInUseConnections: Boolean): void;
Author:

Optional with default false?

Contributor:

I think closeInUseConnections can be optional, especially for drivers without a background thread.
@patrickfreed what do you think?

Contributor:

I agree this should default to false and SDAM should require that it only be used in the event of network timeouts. Definitely want to document the default here.

Regarding the API of whether it's optional or not, that's more up to individual driver implementations. I don't think this should be optional functionality, if that's what you were asking though.
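For illustration only, here is a minimal sketch (hypothetical names, not the spec's normative interface) of how a driver might expose the flag with a documented default of false:

```typescript
// Sketch only: hypothetical names, with the flag defaulting to false so existing
// call sites keep today's behaviour.
interface ClearOptions {
  /** When true, connections currently checked out are also closed. Default: false. */
  closeInUseConnections?: boolean;
}

class ConnectionPool {
  private generation = 0;
  private closeInUseRequested = false;

  clear(options: ClearOptions = {}): void {
    const closeInUse = options.closeInUseConnections ?? false;
    this.generation += 1;
    // ...pause the pool, fail pending wait-queue requests, emit PoolClearedEvent...
    if (closeInUse) {
      // Picked up by the next maintenance run, which then also closes
      // checked-out perished connections.
      this.closeInUseRequested = true;
    }
  }
}

// pool.clear() leaves in-use connections alone;
// pool.clear({ closeInUseConnections: true }) asks for them to be closed as well.
```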

Author:

done

@@ -782,6 +789,9 @@ thread SHOULD
Timeout: Background Connection Pooling
<../client-side-operations-timeout/client-side-operations-timeout.rst#background-connection-pooling>`__.

A pool SHOULD allow ability to force run next maintenance iteration to remove perished connections including "in use" connections.
In this case, A pool MAY skip populating connections.
Author:

Not sure whether this step is necessary? It looks like minPoolSize can be handled as before, even though the connections will only become available after the pool becomes ready.

Contributor:

  1. Maybe "In this case, A pool MAY skip populating connections." can be omitted, as it's part of pausable behaviour: "Connections are not created in the background to satisfy minPoolSize"
  2. "force run next maintenance iteration" ==> "Schedule the next Background Thread Run to run as soon as possible" or similar

Author:

done

@DmitryLukyanov marked this pull request as ready for review April 5, 2022 15:06
@DmitryLukyanov requested review from a team as code owners April 5, 2022 15:06
@DmitryLukyanov requested review from nbbeeken and BorisDog and removed request for a team April 5, 2022 15:06
@@ -1185,3 +1188,4 @@ Changelog
.. _Client Side Operations Timeout Spec: /source/client-side-operations-timeout/client-side-operations-timeout.rst
.. _timeoutMS: /source/client-side-operations-timeout/client-side-operations-timeout.rst#timeoutMS
.. _t-digest algorithm: https://github.com/tdunning/t-digest
.. _Why do we need force running prune maintenance call in Clear logic?: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#Why-do-we-need-force-running-prune-maintenance-call-in-Clear-logic?
Author:

this link doesn't work, will fix it in the next commit

Contributor:

does this link work now?

Author:

yep

@DmitryLukyanov (Author):

Should the Closing a Connection Pool section be updated to close in-use connections as well?

@ShaneHarvey (Member) left a comment:

Since we only want to cancel in-progress operations when an SDAM heartbeat fails with a timeout (and not for other errors that clear the pool), I think this PR needs to change the SDAM spec as well.

Edit: I now see that this PR already updates the SDAM spec.

@@ -697,7 +697,7 @@ The event API here is assumed to be like the standard `Python Event
topology.onServerDescriptionChanged(description, connection pool for server)
if description.error != Null:
# Clear the connection pool only after the server description is set to Unknown.
clear connection pool for server
clear(closeInUseConnections: isNetworkError(description.error)) connection pool for server
Member:

isNetworkError -> isNetworkTimeout

isNetworkTimeout is also used in source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst
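As a rough sketch (assumed types, with a hypothetical isNetworkTimeoutError predicate standing in for the spec's isNetworkTimeout check), the handler shape being discussed looks roughly like this:

```typescript
// Sketch of the handler shape under discussion. isNetworkTimeoutError is a
// hypothetical stand-in for the spec's isNetworkTimeout check, not a real driver API.
interface ServerDescription {
  error: Error | null;
}

interface ConnectionPool {
  clear(options?: { closeInUseConnections?: boolean }): void;
}

function isNetworkTimeoutError(error: Error): boolean {
  // Placeholder check; a real driver would inspect its own error hierarchy here.
  return error.name === "NetworkTimeoutError";
}

function onServerDescriptionChanged(description: ServerDescription, pool: ConnectionPool): void {
  if (description.error !== null) {
    // Clear the pool only after the server description is set to Unknown, and close
    // in-use connections only when the heartbeat failed with a network timeout.
    pool.clear({ closeInUseConnections: isNetworkTimeoutError(description.error) });
  }
}
```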

Author:

done

@DmitryLukyanov changed the title from "DRIVERS-4026: Preemptively cancel in progress operations when SDAM heartbeats timeout." to "DRIVERS-1707: Preemptively cancel in progress operations when SDAM heartbeats timeout." Apr 5, 2022
changed pool generation. As part of this request, a pool SHOULD inform maintenance
thread whether "in use" connections should be closed as well. A pool SHOULD be
informed about it via closeInUseConnections parameter in Clear method.

Contributor:

Probably "a pool SHOULD force the next maintenance step..." needs to be clarified. Does this mean that the next step should be scheduled to run as soon as possible?
Maybe something like: The next Background Thread Run SHOULD be scheduled as soon as possible. Next pruning iteration MUST close "in use" perished connections if requested by closeInUseConnections flag...
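To make the "scheduled to run as soon as possible" idea concrete, here is a minimal sketch (assumed names, Node-style timers) of a maintenance loop that clear() can wake up immediately instead of waiting out the rest of its interval:

```typescript
// Sketch of an interruptible maintenance loop: clear() can call requestImmediateRun()
// so the next prune happens right away instead of after the remaining interval.
class MaintenanceLoop {
  private wake: (() => void) | null = null;
  private stopped = false;

  requestImmediateRun(): void {
    this.wake?.();
  }

  stop(): void {
    this.stopped = true;
    this.wake?.();
  }

  private sleep(intervalMs: number): Promise<void> {
    return new Promise((resolve) => {
      const timer = setTimeout(() => {
        this.wake = null;
        resolve();
      }, intervalMs);
      this.wake = () => {
        clearTimeout(timer);
        this.wake = null;
        resolve();
      };
    });
  }

  async run(pruneOnce: () => void, intervalMs: number): Promise<void> {
    while (!this.stopped) {
      await this.sleep(intervalMs);
      if (!this.stopped) {
        // Removes perished connections, including in-use ones when that was requested.
        pruneOnce();
      }
    }
  }
}
```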

Author:

done

Contributor:

Maybe we should omit the last "requested by closeInUseConnections flag" part? As written, it might be interpreted to mean that scheduling the next run sooner is only done in the closeInUseConnections: true case.

Contributor:

If we do that, can we bump it to its own paragraph below this one? That way the closeInUseConnections and background thread scheduling parts will be in separate paragraphs.

Contributor:

I did not notice the "A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed." sentence. I am fine with the current wording.

nit: remove the dot from "connections.requested"

Author:

done

@DmitryLukyanov requested a review from BorisDog April 6, 2022 00:51
@DmitryLukyanov force-pushed the DRIVERS-4026 branch 2 times, most recently from 8e5ad8f to 0af13aa on April 13, 2022 15:59
@DmitryLukyanov force-pushed the DRIVERS-4026 branch 2 times, most recently from 84cdbe3 to 182a117 on April 19, 2022 20:12
@@ -728,6 +730,13 @@ eagerly so that any operations waiting on `Connections <#connection>`_ can retry
as soon as possible. The pool MUST NOT rely on WaitQueueTimeoutMS to clear
requests from the WaitQueue.

The clearing method MUST provide the option to close any in-use connections as part
Contributor:

I am wondering whether we should clarify that in-use connection removal should be limited to connections whose generation is at most the pool generation at the moment of the clear, to prevent the following scheduling (steps 3-5 happen fast):

  1. pool.ready() (gen1)
  2. pool.clear(inUse:true) (gen2)
  3. pool.ready() (gen2)
  4. pool.clear(inUse:false) (gen3)
  5. pool.ready() (gen3)
  6. prune runs and closes in-use connections with gen <= gen2 (the correct behaviour is gen <= gen1)

cc @patrickfreed

Contributor:

Yeah this is a really good point. Maybe add the following line:

The pool MUST only close in use connections whose generation is less than or equal to the generation of the pool at the moment of the clear (before the increment) that used the closeInUseConnections flag.

Unfortunately, I can't think of a non-racy way to unit test this.
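A small sketch of the generation rule suggested above (assumed field and method names, not spec text): the pool remembers the pre-increment generation when closeInUseConnections is requested, and the prune pass only closes checked-out connections at or below that generation:

```typescript
// Sketch of the generation bound: remember the pre-increment generation when
// closeInUseConnections is requested, and let the prune pass close checked-out
// connections only at or below that generation.
interface PooledConnection {
  generation: number;
  inUse: boolean;
  close(): void;
}

class Pool {
  private generation = 0;
  private closeInUseUpToGeneration: number | null = null;
  private connections: PooledConnection[] = [];

  clear(closeInUseConnections: boolean): void {
    if (closeInUseConnections) {
      // Capture the generation before the increment; this is the ceiling for
      // closing in-use connections during the next prune.
      this.closeInUseUpToGeneration = this.generation;
    }
    this.generation += 1;
    // ...pause the pool and schedule the maintenance loop to run as soon as possible...
  }

  // Called from the maintenance loop.
  prune(): void {
    const bound = this.closeInUseUpToGeneration;
    this.closeInUseUpToGeneration = null;
    this.connections = this.connections.filter((conn) => {
      const perished = conn.generation < this.generation;
      const mayCloseInUse = bound !== null && conn.generation <= bound;
      if (perished && (!conn.inUse || mayCloseInUse)) {
        conn.close();
        return false; // drop the closed connection from the pool
      }
      return true;
    });
  }
}
```

In the sequence above, a prune after steps 1-5 would then only close in-use connections with gen <= gen1, even though later clears bumped the pool generation further.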

Author:

done

$where : sleep(2000) || true
expectError:
isError: true
- name: waitForEvent
Author:

It looks like this test is a bit flaky: the previous runOnThread ensures that the thread has started, but not that the underlying operation has actually been launched, so there was a chance that the failpoint was triggered before the find started.

Member:

So there was a chance that the failpoint was triggered before the find started

Would increasing "times" from 2 to 4 help here?

Contributor:

The failpoint being triggered before the find has started should be okay, so long as the monitor check doesn't time out before the find begins. In the quickest case, the monitor can take 1000ms to time out after the failpoint is enabled (a check happens to start immediately after the failpoint is enabled), and the find should certainly take less than 1000ms to begin executing, considering the pool has already been populated with a connection. If we wait for the find to start before enabling the failpoint, we instead are concerned with the find completing before the monitor times out (in the longest case, the monitor could take 1500ms after the failpoint is enabled to time out, which should still be okay assuming it takes less than 500ms to enable the failpoint).

Can you elaborate more on the events you were seeing when it was failing?

Would increasing "times" from 2 to 4 help here?

This could possibly be the cause, though I imagine it would be pretty rare. If the RTT monitor hits the failpoint, times out, and then hits it again, it could prevent the server monitor from ever triggering it if the monitor happened to start a check right before the failpoint was enabled.

Author:

Can you elaborate more on the events you were seeing when it was failing?

I see that in some rare cases no error happens at all.

Contributor:

In that case, @ShaneHarvey's suggestion of upping to times: 4 may help. Did you see any difference after adding the waitForEvent?

Author:

Would increasing "times" from 2 to 4 help here?

yep, it works too, done

Contributor:

Test looks good, but the comment explaining why we're using 4 seems inaccurate. As I alluded to above, there shouldn't be any problem if a heartbeat triggers the failpoint before the find starts, so long as the heartbeat doesn't time out before the find starts. I'd suggest rewording the comment to something more simple like the following (borrowed from the SDAM tests):

# Use "times: 4" to increase the probability that the Monitor check triggers
# the failpoint, since the RTT hello may trigger this failpoint one or many 
# times as well.

Author:

done

@patrickfreed (Contributor) left a comment:

Content looks good! One last test comment suggestion but otherwise LGTM

@patrickfreed (Contributor) left a comment:

Noticed a SHOULD/MUST mismatch, but otherwise all looks good to me.

Co-authored-by: Patrick Freed <patrick.freed@mongodb.com>
@patrickfreed (Contributor) left a comment:

LGTM!

@ShaneHarvey (Member) left a comment:

Noticed one formatting issue. Otherwise LGTM

@BorisDog (Contributor) left a comment:

LGTM (minor comment)

@@ -1115,6 +1140,26 @@ clear the pool again. This situation is possible if the pool is cleared by the
background thread after it encounters an error establishing a connection, but
the ServerDescription for the endpoint was not updated accordingly yet.

Why does the pool need to support interrupting in use connections as part of its clear logic?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contributor:

minor: shouldn't the tilde underline have the same length as the heading line above it?

Author:

done
