Skip to content

DRIVERS-2469 Add durations to connection pool events #1448

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Aug 9, 2023

Conversation

stIncMale
Copy link
Member

@stIncMale stIncMale commented Aug 4, 2023

This spec change is implemented in mongodb/mongo-java-driver#1166 for the MongoDB Java Driver.

There are no unified tests for CommandSucceededEvent/CommandFailedEvent.duration (see https://github.com/mongodb/specifications/blob/master/source/command-logging-and-monitoring/command-logging-and-monitoring.rst), similarly, there are no unified tests in this PR except those for logging.

DRIVERS-2469

Please complete the following before merging:

  • Update changelog.
  • Make sure there are generated JSON files from the YAML test files.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

@stIncMale stIncMale self-assigned this Aug 4, 2023
@stIncMale stIncMale requested a review from a team as a code owner August 4, 2023 16:40
@stIncMale stIncMale requested review from patrickfreed and baileympearson and removed request for a team August 4, 2023 16:40
@@ -12,6 +12,8 @@ Abstract

Drivers currently support a variety of options that allow users to configure connection pooling behavior. Users are confused by drivers supporting different subsets of these options. Additionally, drivers implement their connection pools differently, making it difficult to design cross-driver pool functionality. By unifying and codifying pooling options and behavior across all drivers, we will increase user comprehension and code base maintainability.

This specification does not apply to drivers that do not support multitasking.
Copy link
Member Author

@stIncMale stIncMale Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fleshes out the current state of affairs. Not understanding this part costed me some time when working on this PR, which is why the change is relevant.

* The maximum amount of time a thread can wait for a connection
* to become available.
* The maximum amount of time a thread can wait for either an available non-perished connection (limited by `maxPoolSize`),
* or a pending connection (limited by `maxConnecting`).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clarifies the definition to match the reality. Needs to be done as part of this PR because this PR relies on the definition of waitQueueTimeoutMS.

Comment on lines +662 to +668

# If there is no background thread, the pool MUST ensure that
# there are at least minPoolSize total connections.
do asynchronously:
while totalConnectionCount < minPoolSize:
populate the pool with a connection

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Moved the part of the checkOut logic that is concerned with populating the pool at the end of the checkOut logic because the previous position was weird: between creating a new connection that is about to be checked out and establishing it.
  2. Replaced This MUST be done in a non-blocking manner with the do asynchronously block, because it's the only interpretation of "non-blocking manner" I can see here, and I find it clearer. Expressing the requirement as pseudocode is also more explicit as opposed to being expressed in a comment.

@@ -1288,9 +1338,13 @@ In addition to the common fields defined above, this message MUST contain the fo
- If ``reason`` is ``ConnectionError``, the associated error. The type and format of this value is flexible; see the
`logging specification <../logging/logging.rst#representing-errors-in-log-messages>`_ for details on representing errors in log messages.

* - waitedMS
- Int64
- ``ConnectionCheckOutFailedEvent.waited`` converted to milliseconds.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose milliseconds because this unit is used when logging CommandSucceededEvent/CommandFailedEvent.duration. However, at least when logging ConnectionCheckedOutEvent/ConnectionCheckOutFailedEvent.waited, nanoseconds is definitely a more appropriate unit because those wait times can be quite small. Fortunately, an application can still access the value using a smaller unit by listening to ConnectionCheckedOutEvent/ConnectionCheckOutFailedEvent.

For reviewers: should nanoseconds be used when logging ConnectionCheckedOutEvent/ConnectionCheckOutFailedEvent.waited?

DRIVERS-2469
@stIncMale stIncMale requested a review from matthewdale August 4, 2023 21:56
@@ -131,8 +133,8 @@ Additionally, Drivers that implement a Connection Pool MUST support the followin
/**
* NOTE: This option has been deprecated in favor of timeoutMS.
*
* The maximum amount of time a thread can wait for a connection
* to become available.
* The maximum amount of time a thread can wait for either an available non-perished connection (limited by `maxPoolSize`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Since this is in a code block, line wrapping is preserved, resulting in a very long line. Consider wrapping this long line to 80-90 characters to make it easier to read in the Github RST viewer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* The driver MUST document this behavior as well as explicitly warn users that the behavior may change in the future.
*
* A driver MAY choose the type idiomatic to the driver.
* If the type chosen does not convey units, e.g., `int64`, then the driver MAY include units in the name, e.g., `durationMS`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Since this is in a code block, line wrapping is preserved, resulting in very long lines. Consider wrapping these long lines to 80-90 characters to make it easier to read in the Github RST viewer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* The driver MUST document this behavior as well as explicitly warn users that the behavior may change in the future.
*
* A driver MAY choose the type idiomatic to the driver.
* If the type chosen does not convey units, e.g., `int64`, then the driver MAY include units in the name, e.g., `waitedMS`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Since this is in a code block, line wrapping is preserved, resulting in very long lines. Consider wrapping these long lines to 80-90 characters to make it easier to read in the Github RST viewer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 1012 to 1015
* The time elapsed waiting for conditions specified by `ConnectionPoolOptions.waitQueueTimeoutMS`.
* More specifically, the time elapsed between the `ConnectionCheckOutStartedEvent` and:
* - the `ConnectionCreatedEvent` emitted by the same checking out, if this checking out emitted `ConnectionCreatedEvent`;
* - this event, otherwise.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that excluding the connection establishment duration will make using the check-out duration values difficult. If the check-out waited for a connection to be created and established, we should include that time in the check-out duration measurement. Not including it may lead to very confusing results.

Consider the example:
A running application using a MongoDB driver has a "maxPoolSize" of 10, a full connection pool, and rarely creates new connections. The average reported check-out duration is 50ms, which is mostly spent waiting on another connection to become idle. The server starts closing connections for some reason, so the driver starts creating more connections, which take about 500ms to establish. Check-outs that create a new connection start reporting a near-zero duration because they immediately request a new connection instead of waiting on an idle one. As a result, the average check-out duration actually goes down, despite the check-outs actually taking longer because of the time spent creating connections.

Instead, we should always report the entire time between when a check-out starts and when it delivers a connection to the caller.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a result, the average check-out duration actually goes down

The important thing to note here is that ConnectionCheckedOutEvent/ConnectionCheckOutFailedEvent.waited is not a "check-out duration", and it is not documented as such.

Instead, we should always report the entire time between when a check-out starts and when it delivers a connection to the caller.

The description of https://jira.mongodb.org/browse/DRIVERS-2469 requests the waited time and time to establish, which is what I added.

If the checkout time appears to be also needed, we may

  1. either replace the waited time with the checkout time,
  2. or add the checkout time to be the third reported duration.

Is my understanding correct that you prefer the first option?

Copy link
Contributor

@baileympearson baileympearson Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the phrasing from the ticket:

  • The amount of time a request to check out a connection waited for an idle or new connection.
  • The total amount of time a new connection took to establish (i.e. the time between requesting a new connection and the new connection being ready to send application operations on).

I agree with Matt here, for the reasons that Matt described. But also because I interpret the request in the drivers ticket to deliver the following new metrics:

  • the whole time it took for any checkout to succeed or fail (the time between ConnectionCheckoutStarted and ConnectionCheckoutSucceeded or ConnectionCheckoutFailed)
  • the time it takes for any connection to establish (the time between ConnectionCreated and ConnectionReady)

As it currently stands, a user would not be able to interpret the duration property on the ConnectionCheckoutSucceededEvent without monitoring for connection establishment and correlating the events. I'd argue that this makes the duration properties meaningless as its currently specified.

If we make the duration on the ConnectionCheckoutSucceededEvent the time between ConnectionCheckoutStarted and the successful connection checkout, users might see spikes in this value when new connections are created and they can confirm this corresponds to new connections being established if they wish, but the duration can be interpreted by itself without monitoring for other events.

edit: I think I used duration instead of waited in my response here. But I meant to say the reference waited field.

Copy link
Contributor

@matthewdale matthewdale Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, I think we should replace "waited" time with the full checkout duration. My intent in DRIVERS-2469 was not to distinguish time waited for an idle connection and time waited for a new connection, but to add two mostly orthogonal metrics (check-out duration and connection establishment duration).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@stIncMale stIncMale requested a review from matthewdale August 8, 2023 20:10
DRIVERS-2469
DRIVERS-2469
Copy link
Contributor

@matthewdale matthewdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

Copy link
Contributor

@baileympearson baileympearson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about testing. First, we're missing tests for ConnectionCheckoutFailed.duration with the existing tests. Second, the tests currently provide no guarantees about what the fields actually mean. Lastly, the testing strategy relies on logging having been implemented by a driver to test these changes.

Did you consider either trying to add tests to the CMAP unit/integration tests somehow or adding prose tests?

--

Also - can we update the title of the PR to include the driver's ticket? That's the usual convention in the specs repo (look at other open PRs and commits).

…t is emitted not only when checking out

DRIVERS-2469
@stIncMale stIncMale changed the title Add durations to connection pool events DRIVERS-2469 Add durations to connection pool events Aug 9, 2023
@stIncMale
Copy link
Member Author

A response to #1448 (review) (@baileympearson)

First, we're missing tests for ConnectionCheckoutFailed.duration with the existing tests.

It's here: https://github.com/stIncMale/specifications/blob/DRIVERS-2469/source/connection-monitoring-and-pooling/tests/logging/connection-logging.yml#L216-L224.

Second, the tests currently provide no guarantees about what the fields actually mean. Lastly, the testing strategy relies on logging having been implemented by a driver to test these changes.

Did you consider either trying to add tests to the CMAP unit/integration tests somehow or adding prose tests?

I did. I looked at CommandSucceededEvent/CommandFailedEvent.duration, observed that the only tests for that field is logging tests, and used the same approach, which I also mentioned in the description of the PR. Do you request prose tests for the fields added in this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants