DRIVERS-1707: Preemptively cancel in progress operations when SDAM heartbeats timeout. #1170

Merged · 42 commits · Sep 16, 2022

Conversation

@DmitryLukyanov commented Apr 5, 2022

* No connections may be checked out or created in this pool until ready() is called again.
*/
clear(): void;
clear(closeInUseConnections: Boolean): void;
Author:

Optional with default false?

Contributor:

I think closeInUseConnections can be optional, especially for drivers without a background thread.
@patrickfreed what do you think?

Contributor:

I agree this should default to false and SDAM should require that it only be used in the event of network timeouts. Definitely want to document the default here.

Regarding the API of whether it's optional or not, that's more up to individual driver implementations. I don't think this should be optional functionality, if that's what you were asking though.
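For illustration only, here is a minimal sketch (hypothetical names, not the spec's normative interface) of how a driver might expose the flag with a documented default of false:

```typescript
// Sketch only: hypothetical names, with the flag defaulting to false so existing
// call sites keep today's behaviour.
interface ClearOptions {
  /** When true, connections currently checked out are also closed. Default: false. */
  closeInUseConnections?: boolean;
}

class ConnectionPool {
  private generation = 0;
  private closeInUseRequested = false;

  clear(options: ClearOptions = {}): void {
    const closeInUse = options.closeInUseConnections ?? false;
    this.generation += 1;
    // ...pause the pool, fail pending wait-queue requests, emit PoolClearedEvent...
    if (closeInUse) {
      // Picked up by the next maintenance run, which then also closes
      // checked-out perished connections.
      this.closeInUseRequested = true;
    }
  }
}

// pool.clear() leaves in-use connections alone;
// pool.clear({ closeInUseConnections: true }) asks for them to be closed as well.
```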

Author:

done

@@ -782,6 +789,9 @@ thread SHOULD
Timeout: Background Connection Pooling
<../client-side-operations-timeout/client-side-operations-timeout.rst#background-connection-pooling>`__.

A pool SHOULD allow ability to force run next maintenance iteration to remove perished connections including "in use" connections.
In this case, A pool MAY skip populating connections.
Author:

Not sure whether this step is necessary? It looks like minPoolSize can be handled as before, even though the connections will only become available after the pool becomes ready.

Contributor:

  1. Maybe "In this case, A pool MAY skip populating connections." can be omitted, as it's part of pausable behaviour: "Connections are not created in the background to satisfy minPoolSize"
  2. "force run next maintenance iteration" ==> "Schedule the next Background Thread Run to run as soon as possible" or similar

Author:

done

@DmitryLukyanov marked this pull request as ready for review April 5, 2022 15:06
@DmitryLukyanov requested review from a team as code owners April 5, 2022 15:06
@DmitryLukyanov requested review from nbbeeken and BorisDog and removed request for a team April 5, 2022 15:06
@@ -1185,3 +1188,4 @@ Changelog
.. _Client Side Operations Timeout Spec: /source/client-side-operations-timeout/client-side-operations-timeout.rst
.. _timeoutMS: /source/client-side-operations-timeout/client-side-operations-timeout.rst#timeoutMS
.. _t-digest algorithm: https://github.com/tdunning/t-digest
.. _Why do we need force running prune maintenance call in Clear logic?: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#Why-do-we-need-force-running-prune-maintenance-call-in-Clear-logic?
Author:

this link doesn't work, will fix it in the next commit

Contributor:

does this link work now?

Author:

yep

@DmitryLukyanov (Author):

Should the Closing a Connection Pool section be updated to close in-use connections as well?

@ShaneHarvey (Member) left a comment:

Since we only want to cancel in-progress operations when an SDAM heartbeat fails with a timeout (and not for other errors that clear the pool), I think this PR needs to change the SDAM spec as well.

Edit: I now see that this PR already updates the SDAM spec.

@@ -697,7 +697,7 @@ The event API here is assumed to be like the standard `Python Event
topology.onServerDescriptionChanged(description, connection pool for server)
if description.error != Null:
# Clear the connection pool only after the server description is set to Unknown.
clear connection pool for server
clear(closeInUseConnections: isNetworkError(description.error)) connection pool for server
Member:

isNetworkError -> isNetworkTimeout

isNetworkTimeout is also used in source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst
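As a rough sketch (assumed types, with a hypothetical isNetworkTimeoutError predicate standing in for the spec's isNetworkTimeout check), the handler shape being discussed looks roughly like this:

```typescript
// Sketch of the handler shape under discussion. isNetworkTimeoutError is a
// hypothetical stand-in for the spec's isNetworkTimeout check, not a real driver API.
interface ServerDescription {
  error: Error | null;
}

interface ConnectionPool {
  clear(options?: { closeInUseConnections?: boolean }): void;
}

function isNetworkTimeoutError(error: Error): boolean {
  // Placeholder check; a real driver would inspect its own error hierarchy here.
  return error.name === "NetworkTimeoutError";
}

function onServerDescriptionChanged(description: ServerDescription, pool: ConnectionPool): void {
  if (description.error !== null) {
    // Clear the pool only after the server description is set to Unknown, and close
    // in-use connections only when the heartbeat failed with a network timeout.
    pool.clear({ closeInUseConnections: isNetworkTimeoutError(description.error) });
  }
}
```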

Author:

done

@DmitryLukyanov changed the title from "DRIVERS-4026: Preemptively cancel in progress operations when SDAM heartbeats timeout." to "DRIVERS-1707: Preemptively cancel in progress operations when SDAM heartbeats timeout." Apr 5, 2022
changed pool generation. As part of this request, a pool SHOULD inform maintenance
thread whether "in use" connections should be closed as well. A pool SHOULD be
informed about it via closeInUseConnections parameter in Clear method.

Contributor:

Probably "a pool SHOULD force the next maintenance step..." needs to be clarified. Does this mean that the next step should be scheduled to run as soon as possible?
Maybe something like: The next Background Thread Run SHOULD be scheduled as soon as possible. Next pruning iteration MUST close "in use" perished connections if requested by closeInUseConnections flag...
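To make the "scheduled to run as soon as possible" idea concrete, here is a minimal sketch (assumed names, Node-style timers) of a maintenance loop that clear() can wake up immediately instead of waiting out the rest of its interval:

```typescript
// Sketch of an interruptible maintenance loop: clear() can call requestImmediateRun()
// so the next prune happens right away instead of after the remaining interval.
class MaintenanceLoop {
  private wake: (() => void) | null = null;
  private stopped = false;

  requestImmediateRun(): void {
    this.wake?.();
  }

  stop(): void {
    this.stopped = true;
    this.wake?.();
  }

  private sleep(intervalMs: number): Promise<void> {
    return new Promise((resolve) => {
      const timer = setTimeout(() => {
        this.wake = null;
        resolve();
      }, intervalMs);
      this.wake = () => {
        clearTimeout(timer);
        this.wake = null;
        resolve();
      };
    });
  }

  async run(pruneOnce: () => void, intervalMs: number): Promise<void> {
    while (!this.stopped) {
      await this.sleep(intervalMs);
      if (!this.stopped) {
        // Removes perished connections, including in-use ones when that was requested.
        pruneOnce();
      }
    }
  }
}
```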

Author:

done

Contributor:

Maybe we should omit the last "requested by closeInUseConnections flag" part? As written, it might be interpreted to mean that scheduling the next run sooner is only done in the closeInUseConnections: true case.

Contributor:

If we do that, can we bump it to its own paragraph below this one? That way the closeInUseConnections and background thread scheduling parts will be in separate paragraphs.

Contributor:

I did not notice the "A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed." sentence. I am fine with the current wording.

nit: remove the dot from "connections.requested"

Author:

done

@DmitryLukyanov requested a review from BorisDog April 6, 2022 00:51
@DmitryLukyanov force-pushed the DRIVERS-4026 branch 2 times, most recently from 8e5ad8f to 0af13aa on April 13, 2022 15:59
@DmitryLukyanov force-pushed the DRIVERS-4026 branch 2 times, most recently from 84cdbe3 to 182a117 on April 19, 2022 20:12
@@ -728,6 +730,13 @@ eagerly so that any operations waiting on `Connections <#connection>`_ can retry
as soon as possible. The pool MUST NOT rely on WaitQueueTimeoutMS to clear
requests from the WaitQueue.

The clearing method MUST provide the option to close any in-use connections as part
Contributor:

I am wondering whether we should clarify that in-use connection removal should be limited to connections whose generation is at most the pool generation at the moment of the clear, to prevent the following scheduling (steps 3-5 happen fast):

  1. pool.ready() (gen1)
  2. pool.clear(inUse:true) (gen2)
  3. pool.ready() (gen2)
  4. pool.clear(inUse:false) (gen3)
  5. pool.ready() (gen3)
  6. prune runs and closes in-use connections with gen <= gen2 (the correct behaviour is gen <= gen1)

cc @patrickfreed

Contributor:

Yeah this is a really good point. Maybe add the following line:

The pool MUST only close in use connections whose generation is less than or equal to the generation of the pool at the moment of the clear (before the increment) that used the closeInUseConnections flag.

Unfortunately, I can't think of a non-racy way to unit test this.
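A small sketch of the generation rule suggested above (assumed field and method names, not spec text): the pool remembers the pre-increment generation when closeInUseConnections is requested, and the prune pass only closes checked-out connections at or below that generation:

```typescript
// Sketch of the generation bound: remember the pre-increment generation when
// closeInUseConnections is requested, and let the prune pass close checked-out
// connections only at or below that generation.
interface PooledConnection {
  generation: number;
  inUse: boolean;
  close(): void;
}

class Pool {
  private generation = 0;
  private closeInUseUpToGeneration: number | null = null;
  private connections: PooledConnection[] = [];

  clear(closeInUseConnections: boolean): void {
    if (closeInUseConnections) {
      // Capture the generation before the increment; this is the ceiling for
      // closing in-use connections during the next prune.
      this.closeInUseUpToGeneration = this.generation;
    }
    this.generation += 1;
    // ...pause the pool and schedule the maintenance loop to run as soon as possible...
  }

  // Called from the maintenance loop.
  prune(): void {
    const bound = this.closeInUseUpToGeneration;
    this.closeInUseUpToGeneration = null;
    this.connections = this.connections.filter((conn) => {
      const perished = conn.generation < this.generation;
      const mayCloseInUse = bound !== null && conn.generation <= bound;
      if (perished && (!conn.inUse || mayCloseInUse)) {
        conn.close();
        return false; // drop the closed connection from the pool
      }
      return true;
    });
  }
}
```

In the sequence above, a prune after steps 1-5 would then only close in-use connections with gen <= gen1, even though later clears bumped the pool generation further.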

Author:

done

$where : sleep(2000) || true
expectError:
isError: true
- name: waitForEvent
Author:

It looks like this test is a bit flaky: the previous runOnThread ensures that the thread has started, but not that the underlying operation has actually been launched, so there was a chance that the failpoint was triggered before the find started.

Member:

So there was a chance that the failpoint was triggered before the find started

Would increasing "times" from 2 to 4 help here?

Contributor:

The failpoint being triggered before the find has started should be okay, so long as the monitor check doesn't time out before the find begins. In the quickest case, the monitor can take 1000ms to time out after the failpoint is enabled (a check happens to start immediately after the failpoint is enabled), and the find should certainly take less than 1000ms to begin executing, considering the pool has already been populated with a connection. If we wait for the find to start before enabling the failpoint, we instead are concerned with the find completing before the monitor times out (in the longest case, the monitor could take 1500ms after the failpoint is enabled to time out, which should still be okay assuming it takes less than 500ms to enable the failpoint).

Can you elaborate more on the events you were seeing when it was failing?

Would increasing "times" from 2 to 4 help here?

This could possibly be the cause, though I imagine it would be pretty rare. If the RTT monitor hits the failpoint, times out, and then hits it again, it could prevent the server monitor from ever triggering it if the monitor happened to start a check right before the failpoint was enabled.

Author:

Can you elaborate more on the events you were seeing when it was failing?

I see that in some rare cases no error happens at all.

Contributor:

In that case, @ShaneHarvey's suggestion of upping to times: 4 may help. Did you see any difference after adding the waitForEvent?

Author:

Would increasing "times" from 2 to 4 help here?

yep, it works too, done

Contributor:

Test looks good, but the comment explaining why we're using 4 seems inaccurate. As I alluded to above, there shouldn't be any problem if a heartbeat triggers the failpoint before the find starts, so long as the heartbeat doesn't time out before the find starts. I'd suggest rewording the comment to something more simple like the following (borrowed from the SDAM tests):

# Use "times: 4" to increase the probability that the Monitor check triggers
# the failpoint, since the RTT hello may trigger this failpoint one or many 
# times as well.

Author:

done

@patrickfreed (Contributor) left a comment:

Content looks good! One last test comment suggestion but otherwise LGTM

@patrickfreed (Contributor) left a comment:

Noticed a SHOULD/MUST mismatch, but otherwise all looks good to me.

Co-authored-by: Patrick Freed <patrick.freed@mongodb.com>
@patrickfreed (Contributor) left a comment:

LGTM!

@ShaneHarvey (Member) left a comment:

Noticed one formatting issue. Otherwise LGTM

@BorisDog (Contributor) left a comment:

LGTM (minor comment)

@@ -1115,6 +1140,26 @@ clear the pool again. This situation is possible if the pool is cleared by the
background thread after it encounters an error establishing a connection, but
the ServerDescription for the endpoint was not updated accordingly yet.

Why does the pool need to support interrupting in use connections as part of its clear logic?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contributor:

minor: shouldn't the tilde underline have the same length as the heading line above it?

Author:

done
