Skip to content

DRIVERS-2707 Relax the CMAP documentation requirement for durations in events #1455

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 10, 2023

Conversation

stIncMale
Copy link
Member

@stIncMale stIncMale commented Aug 25, 2023

DRIVERS-2707

@baileympearson, @matthewdale I added you as reviewers because you approved the PR where I introduced the wording that this PR changes.

Please complete the following before merging:

  • Update changelog.
  • Make sure there are generated JSON files from the YAML test files.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded clusters, and serverless).

@stIncMale stIncMale self-assigned this Aug 25, 2023
@stIncMale stIncMale requested a review from a team as a code owner August 25, 2023 21:15
@stIncMale stIncMale requested review from BorisDog and removed request for a team August 25, 2023 21:15
@stIncMale stIncMale changed the title Relax the CMAP documentation requirement for durations in events DRIVERS-2707 Relax the CMAP documentation requirement for durations in events Aug 25, 2023
DRIVERS-2707
Copy link
Contributor

@matthewdale matthewdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

… events

Move the rationale to the "Design Rationale" section.

DRIVERS-2707
@jyemin jyemin removed their request for review August 30, 2023 18:12
@stIncMale
Copy link
Member Author

stIncMale commented Sep 11, 2023

There were conversations in Slack related to this PR, but no consensus was achieved. @baileympearson, we need to move forward here somehow.

I am fine with either excluding from duration the time to synchronously deliver ConnectionCheckOutStartedEvent/ConnectionCreatedEvent, which is the current behavior, or including it. But I recognize that duration and waitQueueTimeoutMS are related: if we don't start counting time towards them at the same instant, we expose ourselves to behaviors that are surprising to users. Following are two examples with large synchronous delivery time spans to make the surprising behaviors apparent.

1.a. EXCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent from duration. Start counting time towards waitQueueTimeoutMS BEFORE emitting ConnectionCheckOutStartedEvent.

  1. Start counting time towards waitQueueTimeoutMS, which is 5 ms.
  2. Emit and synchronously deliver ConnectionCheckOutStartedEvent, which takes 100 ms.
  3. Start counting time towards ConnectionCheckOutFailedEvent.duration.
  4. Fail the checkout because of the timeout (100 ms > 5 ms).
  5. Measure duration since step 3, which is 1 ms. Emit ConnectionCheckOutFailedEvent with duration 1 ms.
  6. User observes a checkout failed due to waitQueueTimeoutMS (5 ms), but ConnectionCheckOutFailedEvent.duration (1ms) tells a user that there were plenty of time left to wait for an available connection (5 ms - 1 ms = 4 ms).

2.a. INCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent to duration. Start counting time towards waitQueueTimeoutMS AFTER emitting ConnectionCheckOutStartedEvent.

  1. Start counting time towards ConnectionCheckedOutEvent.duration.
  2. Emit and synchronously deliver ConnectionCheckOutStartedEvent, which takes 100 ms.
  3. Start counting time towards waitQueueTimeoutMS, which is 5 ms.
  4. Discover that there is an available connection.
  5. Measure duration since step 3, which is 1 ms. Emit ConnectionCheckedOutEvent with duration 100 ms + 1 ms = 101 ms.
  6. User observes a checkout succeeded with duration 101 ms despite waitQueueTimeoutMS being 5ms, and no new connections having been established. From this observation the user makes a conclusion that a driver did not respect waitQueueTimeoutMS.

While it is true that a driver does not provide hard real-time guarantees, the driver is still supposed to do its best to try and meet the specified timeouts. In these examples the driver failed to do its best, because there are ways to implement the functionality such that the user is spared from observing confusing data in executions as specified above.

The below options neither have the issues as in the examples above, nor do they have any new issues compared to those examples. Obviously, a driver still does not provide hard real-time guarantees, and a noticeable delay may happen at any point in execution, not only in user code when we synchronously deliver events. But a driver still must do its best, and the options below are better than the options chosen in the examples above:

1.b. EXCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent from duration. Start counting time towards waitQueueTimeoutMS AFTER emitting ConnectionCheckOutStartedEvent (this is the current state of this PR).

2.b. INCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent to duration. Start counting time towards waitQueueTimeoutMS BEFORE emitting ConnectionCheckOutStartedEvent.

Both of these options allow a driver to start counting time towards duration and waitQueueTimeoutMS at the same instant, and the spec will require this behavior if we choose either option 1.b or 2.b.

@baileympearson
Copy link
Contributor

@stIncMale

As a side note - I don't think it's correct that duration and waitQueueTimeoutMS are related and that they must be aligned. It's already the case that no matter what we do with event delivery, waitQueueTimeoutMS and duration can have very different values. waitQueueTimeoutMS should not govern connection establishment, and consequently in drivers that establish connections synchronously when performing connection checkout ConnectionCheckoutSucceeded.duration could be significantly greater than waitQueueTimeoutMS. This is irrespective of synchronous / asynchronous event processing.


To answer your question from slack (which is in reference to my suggestion that duration be the difference between construction times of ConnectionCheckoutStartedEvents and ConnectionCheckoutSucceeded/failedEvents):

The above means that the duration of synchronous delivery of ConnectionCheckoutStarted is counted towards ConnectionCheckoutSucceeded.duration (option 1). Could you elaborate how the above avoids the discussion around waitQueueTimeoutMS?

I interpret the intention behind DRIVERS-2469 as: users have requested a way to determine how long connection checkout takes so they can optimize their application's consumption of driver connection pools. In drivers where events are emitted synchronously, this can impact pool performance. So, any synchronous event process definitely should be included in the new measurement.

Standardizing and mandating how drivers handle synchronous emissions of events is bad because:

  • If it's required behavior for certain drivers, it should probably be tested. But this behavior is really challenging to test even in just Node, I can't think of a way to test this across drivers.
  • It's overly complicated and requires unnecessary spec changes for drivers that don't already handle waitQueueTimeoutMS and ConnectionCheckoutStartedEvents in line with whatever this PR defines as the "correct" ordering

The simplest way to measure that duration for all drivers is to take the difference between the when drivers construct ConnectionCheckoutStartedEvent and when drivers construct ConnectionCheckoutSucceededEvent.

This doesn't clarify in the spec whether or not the durations of synchronous event emissions should be included or not. However, it does

  • not mandate drivers document a pool internal behavior (the original objection that caused this PR)
  • provide a simple, uniform definition that works regardless of connection pool implementations
  • codify synchronous event emissions in such a way that drivers can change and improve the performance of their connection pool
  • not mandate a behavior on driver connection pools that would be difficult to test reliably (speaking from Node's perspective)

This also closely matches my intuition of "matching" corresponding checkout started and checkout succeeded events and subtracting timestamps, which I think is how many users might view the new "duration" property as well. And specifically for Node, we actually have a timestamp on every pool event, so a user might actually be confused as to why the duration property is significantly different from ConnectionCheckoutSucceeded - ConnectionCheckoutStarted.


Given the options outlined in proceeding comment, I strongly lean towards option 1.b. If we do commit to behavior here, we should test it (probably with an optional prose test for drivers that emit events synchronously).

1.b. EXCLUDE the time to synchronously deliver ConnectionCheckOutStartedEvent from duration. Start counting time towards waitQueueTimeoutMS AFTER emitting ConnectionCheckOutStartedEvent (this is the current state of this PR).

Copy link
Contributor

@matthewdale matthewdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍

@stIncMale stIncMale requested review from baileympearson and removed request for baileympearson October 4, 2023 16:35
Comment on lines +73 to +74
Conceptually, event emission is instantaneous, i.e., one may talk about the instant an event is emitted,
and represents the start of an activity of delivering the event to a subscribed user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed clarifying that durations specifically apply to the point in time directly before the event emission. Can we mention that here instead? Whether or not the event emission is instantaneous, slow, async or sync - it makes no matter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two terms: emission, delivery. The first one is already used in the spec, the second one is not. When we talked, I said that I am concerned that an implementor may interpret emission as delivery. Interestingly, this is exactly what happened to you when you wrote

Whether or not the event emission is instantaneous, slow, async or sync - it makes no matter.

It's the delivery that may be slow/sync/async (and, as we discussed, I mentioned neither of those words), and can't be instantaneous, because it's an activity. This activity has the beginning and the end. The beginning of the delivery activity is emission.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay - but what we discussed was specifying that we start measuring the duration immediately before emitting an event because we wanted consistency in when drivers started timing checkout durations relative to event emissions. That is not captured in this PR.

Copy link
Member Author

@stIncMale stIncMale Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what we discussed was specifying that we start measuring the duration immediately before emitting an event because we wanted consistency in when drivers started timing checkout durations relative to event emissions

It may be that each of us was hearing/meaning a different thing when we discussed this. At the moment I do not understand what ambiguity (or a different issue) is in the PR text.

Durations are defined as the time elapsed between emitting event X and event Y. Emission is defined as the start of the delivery activity; it is instantaneous, i.e., we can talk about event X being emitted at instant tx (note how the word "instantaneous" is defined in the spec right away, leaving no ambiguity of what it means). Thus, the duration is ty - tx.

Could you tell what undesirable implementation you envision that is allowed by the spec as it currently is in this PR, and what wording added to the spec will forbid that implementation? If you will use the word "immediately" or its derivative, could you also define it?1


1 It is usually fine to use the word "immediately" colloquially (we did use it when we talked), but it is not necessary the right way to express an idea precisely. "immediately" is a useful word when we talk about ordering relations (e.g., "in the total order of natural numbers 1 is followed immediately by 2"), but is much less useful when it is used in the temporal meaning, because the latter is vague and seems to be suitable only for informal usage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're saying the same thing, I just didn't understand the phrasing. I also wouldn't be surprised if others don't have the same precise understandings of emission and delivery. When I say immediately, I don't really care about the exact moment temporally. I am okay with the wording, now that it has been clarified but would probably prefer something that is a little less dependent on the exact definitions of phrases and words (of which we don't define in the specification).

Perhaps, to avoid ambiguity and misunderstandings that stem from language, we could just update the pseudocode?

connection = Null
record checkout start time
emit ConnectionCheckOutStartedEvent and equivalent log message
try:
  enter WaitQueue
  wait until at top of wait queue
  # Note that in a lock-based implementation of the wait queue would
  # only allow one thread in the following block at a time
  while connection is Null:
    if a connection is available:
      while connection is Null and a connection is available:
        connection = next available connection
        if connection is perished:
          close connection
          connection = Null
    else if totalConnectionCount < maxPoolSize:
      if pendingConnectionCount < maxConnecting:
        connection = create connection
      else:
        # this waiting MUST NOT prevent other threads from checking Connections
        # back in to the pool.
        wait until pendingConnectionCount < maxConnecting or a connection is available
        continue

except pool is "closed":
  calculate checkout duration
  emit ConnectionCheckOutFailedEvent(reason="poolClosed", duration=duration) and equivalent log message
  throw PoolClosedError
except pool is "paused":
  calculate checkout duration
  emit ConnectionCheckOutFailedEvent(reason="connectionError", duration=duration) and equivalent log message
  throw PoolClearedError
except timeout:
  calculate checkout duration
  emit ConnectionCheckOutFailedEvent(reason="timeout", duration=duration) and equivalent log message
  throw WaitQueueTimeoutError
finally:
  # This must be done in all drivers
  leave wait queue

# If the Connection has not been established yet (TCP, TLS,
# handshake, compression, and auth), it must be established
# before it is returned.
# This MUST NOT block other threads from acquiring connections.
if connection state is "pending":
  try:
    establish connection
  except connection establishment error:
    calculate checkout duration
    emit ConnectionCheckOutFailedEvent(reason="connectionError", duration=duration) and equivalent log message
    decrement totalConnectionCount
    throw
  finally:
    decrement pendingConnectionCount
else:
    decrement availableConnectionCount
set connection state to "in use"

# If there is no background thread, the pool MUST ensure that
# there are at least minPoolSize total connections.
do asynchronously:
  while totalConnectionCount < minPoolSize:
    populate the pool with a connection

calculate checkout duration
emit ConnectionCheckedOutEvent(duration=duration) and equivalent log message
return connection

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Updated the pseudocode to reflect the added durations. Some lines are quite long. I'll split them if you think the code will be more readable that way.

stIncMale and others added 2 commits October 4, 2023 12:12
…-and-pooling.rst

Co-authored-by: Bailey Pearson <bailey.pearson@gmail.com>
@stIncMale stIncMale merged commit c6621d0 into mongodb:master Oct 10, 2023
@stIncMale stIncMale deleted the DRIVERS-2707 branch October 10, 2023 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants