-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-2469 Add durations to connection pool events #1448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
DRIVERS-2469
@@ -12,6 +12,8 @@ Abstract | |||
|
|||
Drivers currently support a variety of options that allow users to configure connection pooling behavior. Users are confused by drivers supporting different subsets of these options. Additionally, drivers implement their connection pools differently, making it difficult to design cross-driver pool functionality. By unifying and codifying pooling options and behavior across all drivers, we will increase user comprehension and code base maintainability. | |||
|
|||
This specification does not apply to drivers that do not support multitasking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fleshes out the current state of affairs. Not understanding this part costed me some time when working on this PR, which is why the change is relevant.
* The maximum amount of time a thread can wait for a connection | ||
* to become available. | ||
* The maximum amount of time a thread can wait for either an available non-perished connection (limited by `maxPoolSize`), | ||
* or a pending connection (limited by `maxConnecting`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This clarifies the definition to match the reality. Needs to be done as part of this PR because this PR relies on the definition of waitQueueTimeoutMS
.
|
||
# If there is no background thread, the pool MUST ensure that | ||
# there are at least minPoolSize total connections. | ||
do asynchronously: | ||
while totalConnectionCount < minPoolSize: | ||
populate the pool with a connection | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Moved the part of the
checkOut
logic that is concerned with populating the pool at the end of thecheckOut
logic because the previous position was weird: between creating a new connection that is about to be checked out and establishing it. - Replaced
This MUST be done in a non-blocking manner
with thedo asynchronously
block, because it's the only interpretation of "non-blocking manner" I can see here, and I find it clearer. Expressing the requirement as pseudocode is also more explicit as opposed to being expressed in a comment.
@@ -1288,9 +1338,13 @@ In addition to the common fields defined above, this message MUST contain the fo | |||
- If ``reason`` is ``ConnectionError``, the associated error. The type and format of this value is flexible; see the | |||
`logging specification <../logging/logging.rst#representing-errors-in-log-messages>`_ for details on representing errors in log messages. | |||
|
|||
* - waitedMS | |||
- Int64 | |||
- ``ConnectionCheckOutFailedEvent.waited`` converted to milliseconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose milliseconds because this unit is used when logging CommandSucceededEvent/CommandFailedEvent.duration
. However, at least when logging ConnectionCheckedOutEvent
/ConnectionCheckOutFailedEvent.waited
, nanoseconds is definitely a more appropriate unit because those wait times can be quite small. Fortunately, an application can still access the value using a smaller unit by listening to ConnectionCheckedOutEvent
/ConnectionCheckOutFailedEvent
.
For reviewers: should nanoseconds be used when logging ConnectionCheckedOutEvent
/ConnectionCheckOutFailedEvent.waited
?
DRIVERS-2469
@@ -131,8 +133,8 @@ Additionally, Drivers that implement a Connection Pool MUST support the followin | |||
/** | |||
* NOTE: This option has been deprecated in favor of timeoutMS. | |||
* | |||
* The maximum amount of time a thread can wait for a connection | |||
* to become available. | |||
* The maximum amount of time a thread can wait for either an available non-perished connection (limited by `maxPoolSize`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Since this is in a code block, line wrapping is preserved, resulting in a very long line. Consider wrapping this long line to 80-90 characters to make it easier to read in the Github RST viewer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* The driver MUST document this behavior as well as explicitly warn users that the behavior may change in the future. | ||
* | ||
* A driver MAY choose the type idiomatic to the driver. | ||
* If the type chosen does not convey units, e.g., `int64`, then the driver MAY include units in the name, e.g., `durationMS`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Since this is in a code block, line wrapping is preserved, resulting in very long lines. Consider wrapping these long lines to 80-90 characters to make it easier to read in the Github RST viewer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* The driver MUST document this behavior as well as explicitly warn users that the behavior may change in the future. | ||
* | ||
* A driver MAY choose the type idiomatic to the driver. | ||
* If the type chosen does not convey units, e.g., `int64`, then the driver MAY include units in the name, e.g., `waitedMS`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Since this is in a code block, line wrapping is preserved, resulting in very long lines. Consider wrapping these long lines to 80-90 characters to make it easier to read in the Github RST viewer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* The time elapsed waiting for conditions specified by `ConnectionPoolOptions.waitQueueTimeoutMS`. | ||
* More specifically, the time elapsed between the `ConnectionCheckOutStartedEvent` and: | ||
* - the `ConnectionCreatedEvent` emitted by the same checking out, if this checking out emitted `ConnectionCreatedEvent`; | ||
* - this event, otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that excluding the connection establishment duration will make using the check-out duration values difficult. If the check-out waited for a connection to be created and established, we should include that time in the check-out duration measurement. Not including it may lead to very confusing results.
Consider the example:
A running application using a MongoDB driver has a "maxPoolSize" of 10, a full connection pool, and rarely creates new connections. The average reported check-out duration is 50ms, which is mostly spent waiting on another connection to become idle. The server starts closing connections for some reason, so the driver starts creating more connections, which take about 500ms to establish. Check-outs that create a new connection start reporting a near-zero duration because they immediately request a new connection instead of waiting on an idle one. As a result, the average check-out duration actually goes down, despite the check-outs actually taking longer because of the time spent creating connections.
Instead, we should always report the entire time between when a check-out starts and when it delivers a connection to the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a result, the average check-out duration actually goes down
The important thing to note here is that ConnectionCheckedOutEvent
/ConnectionCheckOutFailedEvent.waited
is not a "check-out duration", and it is not documented as such.
Instead, we should always report the entire time between when a check-out starts and when it delivers a connection to the caller.
The description of https://jira.mongodb.org/browse/DRIVERS-2469 requests the waited time and time to establish, which is what I added.
If the checkout time appears to be also needed, we may
- either replace the waited time with the checkout time,
- or add the checkout time to be the third reported duration.
Is my understanding correct that you prefer the first option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's the phrasing from the ticket:
- The amount of time a request to check out a connection waited for an idle or new connection.
- The total amount of time a new connection took to establish (i.e. the time between requesting a new connection and the new connection being ready to send application operations on).
I agree with Matt here, for the reasons that Matt described. But also because I interpret the request in the drivers ticket to deliver the following new metrics:
- the whole time it took for any checkout to succeed or fail (the time between
ConnectionCheckoutStarted
andConnectionCheckoutSucceeded
orConnectionCheckoutFailed
) - the time it takes for any connection to establish (the time between
ConnectionCreated
andConnectionReady
)
As it currently stands, a user would not be able to interpret the duration
property on the ConnectionCheckoutSucceededEvent
without monitoring for connection establishment and correlating the events. I'd argue that this makes the duration
properties meaningless as its currently specified.
If we make the duration
on the ConnectionCheckoutSucceededEvent
the time between ConnectionCheckoutStarted
and the successful connection checkout, users might see spikes in this value when new connections are created and they can confirm this corresponds to new connections being established if they wish, but the duration can be interpreted by itself without monitoring for other events.
edit: I think I used duration
instead of waited
in my response here. But I meant to say the reference waited
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, I think we should replace "waited" time with the full checkout duration. My intent in DRIVERS-2469 was not to distinguish time waited for an idle connection and time waited for a new connection, but to add two mostly orthogonal metrics (check-out duration and connection establishment duration).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
DRIVERS-2469
…ited` with `duration` DRIVERS-2469
DRIVERS-2469
DRIVERS-2469
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good 👍
DRIVERS-2469
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned about testing. First, we're missing tests for ConnectionCheckoutFailed.duration
with the existing tests. Second, the tests currently provide no guarantees about what the fields actually mean. Lastly, the testing strategy relies on logging having been implemented by a driver to test these changes.
Did you consider either trying to add tests to the CMAP unit/integration tests somehow or adding prose tests?
--
Also - can we update the title of the PR to include the driver's ticket? That's the usual convention in the specs repo (look at other open PRs and commits).
source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst
Outdated
Show resolved
Hide resolved
source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst
Show resolved
Hide resolved
source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst
Show resolved
Hide resolved
…t is emitted not only when checking out DRIVERS-2469
A response to #1448 (review) (@baileympearson)
I did. I looked at |
This spec change is implemented in mongodb/mongo-java-driver#1166 for the MongoDB Java Driver.
There are no unified tests for
CommandSucceededEvent
/CommandFailedEvent.duration
(see https://github.com/mongodb/specifications/blob/master/source/command-logging-and-monitoring/command-logging-and-monitoring.rst), similarly, there are no unified tests in this PR except those for logging.DRIVERS-2469
Please complete the following before merging: