Skip to content

DRIVERS-1707: Preemptively cancel in progress operations when SDAM heartbeats timeout. #1170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 42 commits into from
Sep 16, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
d0c2c66
DRIVERS-4026: Preemptively cancel in progress operations when SDAM he…
DmitryLukyanov Apr 5, 2022
ae50b04
Update unified test format spec.
DmitryLukyanov Sep 2, 2022
3dea134
Update change log.
DmitryLukyanov Sep 2, 2022
dc0342f
Update link.
DmitryLukyanov Sep 2, 2022
80cfea9
Update unified test format wording
DmitryLukyanov Sep 2, 2022
82ce69a
Fix yml structure
DmitryLukyanov Sep 2, 2022
aeff456
Fix yml structure.
DmitryLukyanov Sep 2, 2022
8dcce40
Code review.
DmitryLukyanov Sep 2, 2022
ca6affd
tests simplification
DmitryLukyanov Sep 2, 2022
5e79c4b
update sdam yml file
DmitryLukyanov Sep 2, 2022
0b4c229
Update source/unified-test-format/unified-test-format.rst
DmitryLukyanov Sep 6, 2022
a068be9
Update source/unified-test-format/unified-test-format.rst
DmitryLukyanov Sep 6, 2022
d8b53ba
Update source/unified-test-format/unified-test-format.rst
DmitryLukyanov Sep 6, 2022
4a93643
Add invalid unified test
DmitryLukyanov Sep 6, 2022
44965f1
add pool integration test.
DmitryLukyanov Sep 6, 2022
21b47f2
Add yml comments
DmitryLukyanov Sep 6, 2022
d53c15e
change yml formatting
DmitryLukyanov Sep 6, 2022
e9f930b
Add retry write test
DmitryLukyanov Sep 6, 2022
f727192
Fix yml formatting.
DmitryLukyanov Sep 6, 2022
994884b
Remove wait in pool-clear-closing-pending-connections.
DmitryLukyanov Sep 6, 2022
b793955
Fix event ordering. Don't call standalone tests.
DmitryLukyanov Sep 7, 2022
a0b35e9
Code review.
DmitryLukyanov Sep 9, 2022
cccc06e
Rename tests.
DmitryLukyanov Sep 12, 2022
bfd75b2
Rename tests.
DmitryLukyanov Sep 12, 2022
3b06c4a
Renaming
DmitryLukyanov Sep 12, 2022
7711a1f
Define "interrupt" meaning.
DmitryLukyanov Sep 12, 2022
c299046
Cherrypick origin changes.
DmitryLukyanov Sep 12, 2022
f9e3297
Formatting.
DmitryLukyanov Sep 12, 2022
d7f57e1
Fix typo.
DmitryLukyanov Sep 12, 2022
383464f
Update test.
DmitryLukyanov Sep 15, 2022
988ca92
Update source/connection-monitoring-and-pooling/connection-monitoring…
DmitryLukyanov Sep 15, 2022
5494c63
Code review
DmitryLukyanov Sep 15, 2022
10b7ae8
yml formatting.
DmitryLukyanov Sep 15, 2022
33c57e0
Merge branch 'DRIVERS-4026' of https://github.com/DmitryLukyanov/spec…
DmitryLukyanov Sep 15, 2022
88fb3a0
remove `pool-clear-interrupt-in-use`
DmitryLukyanov Sep 15, 2022
e8a34c9
Code review
DmitryLukyanov Sep 16, 2022
6f72375
Formatting.
DmitryLukyanov Sep 16, 2022
a318540
Rewording.
DmitryLukyanov Sep 16, 2022
a8dbd02
Update source/server-discovery-and-monitoring/server-monitoring.rst
DmitryLukyanov Sep 16, 2022
3704c24
Code review.
DmitryLukyanov Sep 16, 2022
370dd7a
Merge branch 'DRIVERS-4026' of https://github.com/DmitryLukyanov/spec…
DmitryLukyanov Sep 16, 2022
ede9d8d
Code review.
DmitryLukyanov Sep 16, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -354,8 +354,10 @@ has the following properties:
/**
* Mark all current Connections as stale, clear the WaitQueue, and mark the pool as "paused".
* No connections may be checked out or created in this pool until ready() is called again.
* interruptInUseConnections specifies whether the pool will force interrupt "in use" connections as part of the clear.
* Default false.
*/
clear(): void;
clear(interruptInUseConnections: Optional<Boolean>): void;

/**
* Mark the pool as "ready", allowing checkOuts to resume and connections to be created in the background.
Expand Down Expand Up @@ -728,8 +730,24 @@ eagerly so that any operations waiting on `Connections <#connection>`_ can retry
as soon as possible. The pool MUST NOT rely on WaitQueueTimeoutMS to clear
requests from the WaitQueue.

Clearning a load balanced pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The clearing method MUST provide the option to interrupt any in-use connections as part
of the clearing (henceforth referred to as the interruptInUseConnections flag in this
specification). "Interrupting a Connection" is defined as canceling whatever task the
Connection is currently performing and marking the Connection as perished (e.g. by closing
its underlying socket). The interrupting of these Connections MUST be performed as soon as possible
but MUST NOT block the pool or prevent it from processing further requests. If the pool has a background
thread, and it is responsible for interrupting in-use connections, its next run MUST be scheduled as soon as
possible.

The pool MUST only interrupt in-use Connections whose generation is less than or equal
to the generation of the pool at the moment of the clear (before the increment)
that used the interruptInUseConnections flag. Any operations that have their Connections
interrupted in this way MUST fail with a retryable error. If possible, the error SHOULD
be a PoolClearedError with the following message: "Connection to <pool address> interrupted
due to server monitor timeout".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably "a pool SHOULD force the next maintenance step..." needs to be clarified. Does this mean that the next step should be scheduled to run as soon as possible?
Maybe something like: The next Background Thread Run SHOULD be scheduled as soon as possible. Next pruning iteration MUST close "in use" perished connections if requested by closeInUseConnections flag...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should omit the last "requested by closeInUseConnections flag." part? As it might be interpreted as scheduling the next run sooner will be done only in closeInUseConnections:true case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, can we bump it to its own paragraph below this one? That way the closeInUseConnections and background thread scheduling parts will be in separate paragraphs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not notice the
A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed. sentence. I am fine with current wording.

nit: remove the dot from connections.requested

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Clearing a load balanced pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Pool MUST also have a method of clearing all `Connections <#connection>`_ for
a specific ``serviceId`` for use when in load balancer mode. This method
Expand Down Expand Up @@ -777,11 +795,13 @@ monitoring the state of all available `Connections <#connection>`_. This backgro
thread SHOULD

- Populate `Connections <#connection>`_ to ensure that the pool always satisfies minPoolSize.
- Remove and close perished available `Connections <#connection>`_.
- Remove and close perished available `Connections <#connection>`_ including "in use" connections if `interruptInUseConnections` option was set to true in the most recent pool clear.
- Apply timeouts to connection establishment per `Client Side Operations
Timeout: Background Connection Pooling
<../client-side-operations-timeout/client-side-operations-timeout.rst#background-connection-pooling>`__.

A pool SHOULD allow immediate scheduling of the next background thread iteration after a clear is performed.

Conceptually, the aforementioned activities are organized into sequential Background Thread Runs.
A Run MUST do as much work as readily available and then end instead of waiting for more work.
For example, instead of waiting for pendingConnectionCount to become less than maxConnecting when satisfying minPoolSize,
Expand Down Expand Up @@ -849,7 +869,12 @@ See the `Load Balancer Specification <../load-balancers/load-balancers.rst#event
* The service id for which the pool was cleared for in load balancing mode.
* See load balancer specification for more information about this field.
*/
serviceId: Optional<ObjectId>
serviceId: Optional<ObjectId>;

/**
* A flag whether the pool forced interrupting "in use" connections as part of the clear.
*/
interruptInUseConnections: Optional<Boolean>;
}

/**
Expand Down Expand Up @@ -1115,6 +1140,26 @@ clear the pool again. This situation is possible if the pool is cleared by the
background thread after it encounters an error establishing a connection, but
the ServerDescription for the endpoint was not updated accordingly yet.

Why does the pool need to support interrupting in use connections as part of its clear logic?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If a SDAM monitor has observed a network timeout, we assume that all connections
including "in use" connections are no longer healthy. In some cases connections
will fail to detect the network timeout fast enough. For example, a server request
can hang at the OS level in TCP retry loop up for 17 minutes before failing. Therefore
these connections MUST be proactively interrupted in the case of a server monitor network timeout.
Requesting an immediate backround thread run will speed up this process.

Why don't we configure TCP_USER_TIMEOUT?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ideally, a reasonable TCP_USER_TIMEOUT can help with detecting stale connections as an
alternative to `interruptInUseConnections` in Clear.
Unfortunately this approach is platform dependent and not each driver allows easily configuring it.
For example, C# driver can configure this socket option on linux only with target frameworks
higher or equal to .net 5.0. On macOS, there is no straight equavalent for this option,
it's possible that we can find some equavalent configuration, but this configuration will also
require target frameworks higher than or equal to .net 5.0. The advantage of using Background Thread to
manage perished connections is that it will work regardless of environment setup.

Backwards Compatibility
=======================

Expand Down Expand Up @@ -1175,6 +1220,8 @@ Change log

:2021-11-08: Make maxConnecting configurable.

:2022-04-05: Preemptively cancel in progress operations when SDAM heartbeats timeout.

.. Section for links.

.. _Application Errors: /source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#application-errors
Expand Down
3 changes: 3 additions & 0 deletions source/connection-monitoring-and-pooling/tests/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,9 @@ Valid Unit Test Operations are the following:
- ``connection``: A string label identifying which connection to check in. Should be a label that was previously set with ``checkOut``

- ``pool.clear()``: call ``clear`` on Pool

- ``interruptInUseConnections``: Determines whether "in use" connections should be also interrupted

- ``pool.close()``: call ``close`` on Pool
- ``pool.ready()``: call ``ready`` on Pool

Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
{
"version": 1,
"style": "unit",
"description": "Connections MUST be interrupted as soon as possible (interruptInUseConnections=true)",
"poolOptions": {
"backgroundThreadIntervalMS": 10000
},
"operations": [
{
"name": "ready"
},
{
"name": "checkOut"
},
{
"name": "checkOut",
"label": "conn"
},
{
"name": "clear",
"interruptInUseConnections": true
},
{
"name": "waitForEvent",
"event": "ConnectionPoolCleared",
"count": 1,
"timeout": 1000
},
{
"name": "waitForEvent",
"event": "ConnectionClosed",
"count": 2,
"timeout": 1000
},
{
"name": "close"
}
],
"events": [
{
"type": "ConnectionCheckedOut",
"connectionId": 1,
"address": 42
},
{
"type": "ConnectionCheckedOut",
"connectionId": 2,
"address": 42
},
{
"type": "ConnectionPoolCleared",
"interruptInUseConnections": true
},
{
"type": "ConnectionClosed",
"reason": "stale",
"address": 42
},
{
"type": "ConnectionClosed",
"reason": "stale",
"address": 42
},
{
"type": "ConnectionPoolClosed",
"address": 42
}
],
"ignore": [
"ConnectionCreated",
"ConnectionPoolReady",
"ConnectionReady",
"ConnectionCheckOutStarted",
"ConnectionPoolCreated",
"ConnectionCheckedIn"
]
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
version: 1
style: unit
description: Connections MUST be interrupted as soon as possible (interruptInUseConnections=true)
poolOptions:
# ensure it's not involved by default
backgroundThreadIntervalMS: 10000
operations:
- name: ready
- name: checkOut
- name: checkOut
label: conn
- name: clear
interruptInUseConnections: true
- name: waitForEvent
event: ConnectionPoolCleared
count: 1
timeout: 1000
- name: waitForEvent
event: ConnectionClosed
count: 2
timeout: 1000
- name: close
events:
- type: ConnectionCheckedOut
connectionId: 1
address: 42
- type: ConnectionCheckedOut
connectionId: 2
address: 42
- type: ConnectionPoolCleared
interruptInUseConnections: true
- type: ConnectionClosed
reason: stale
address: 42
- type: ConnectionClosed
reason: stale
address: 42
- type: ConnectionPoolClosed
address: 42
ignore:
- ConnectionCreated
- ConnectionPoolReady
- ConnectionReady
- ConnectionCheckOutStarted
- ConnectionPoolCreated
- ConnectionCheckedIn
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
{
"version": 1,
"style": "integration",
"description": "clear with interruptInUseConnections = true closes pending connections",
"runOn": [
{
"minServerVersion": "4.9.0"
}
],
"failPoint": {
"configureFailPoint": "failCommand",
"mode": "alwaysOn",
"data": {
"failCommands": [
"isMaster",
"hello"
],
"closeConnection": false,
"blockConnection": true,
"blockTimeMS": 1000
}
},
"poolOptions": {
"minPoolSize": 0
},
"operations": [
{
"name": "ready"
},
{
"name": "start",
"target": "thread1"
},
{
"name": "checkOut",
"thread": "thread1"
},
{
"name": "waitForEvent",
"event": "ConnectionCreated",
"count": 1
},
{
"name": "clear",
"interruptInUseConnections": true
},
{
"name": "waitForEvent",
"event": "ConnectionCheckOutFailed",
"count": 1
}
],
"events": [
{
"type": "ConnectionCheckOutStarted"
},
{
"type": "ConnectionCreated"
},
{
"type": "ConnectionPoolCleared",
"interruptInUseConnections": true
},
{
"type": "ConnectionClosed"
},
{
"type": "ConnectionCheckOutFailed"
}
],
"ignore": [
"ConnectionCheckedIn",
"ConnectionCheckedOut",
"ConnectionPoolCreated",
"ConnectionPoolReady"
]
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
version: 1
style: integration
description: clear with interruptInUseConnections = true closes pending connections
runOn:
-
minServerVersion: "4.9.0"
failPoint:
configureFailPoint: failCommand
mode: "alwaysOn"
data:
failCommands: ["isMaster","hello"]
closeConnection: false
blockConnection: true
blockTimeMS: 1000
poolOptions:
minPoolSize: 0
operations:
- name: ready
- name: start
target: thread1
- name: checkOut
thread: thread1
- name: waitForEvent
event: ConnectionCreated
count: 1
- name: clear
interruptInUseConnections: true
- name: waitForEvent
event: ConnectionCheckOutFailed
count: 1
events:
- type: ConnectionCheckOutStarted
- type: ConnectionCreated
- type: ConnectionPoolCleared
interruptInUseConnections: true
- type: ConnectionClosed
- type: ConnectionCheckOutFailed
ignore:
- ConnectionCheckedIn
- ConnectionCheckedOut
- ConnectionPoolCreated
- ConnectionPoolReady
Loading