Skip to content

Conversation

@baileympearson
Copy link
Contributor

@baileympearson baileympearson commented Dec 2, 2025

DRIVERS-3239

Overview

This PR adds support for a new class of errors (SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.

Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).

Python will be second implementer.
Node implementation: mongodb/node-mongodb-native#4806

Testing

The testing strategy is two-fold:

  • Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:

    • operations are retried using the new SystemOverloadedError label
    • operations are retried no more than 5 (current MAX_ATTEMPTS, as defined in the spec) times
  • Following Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.

  • Update changelog.

  • Test changes in at least one language driver.

  • Test these changes against all server versions and topologies (including standalone, replica set, and sharded
    clusters).

> - If the value is "stderr" (case-insensitive), log to stderr.
> - Else, if direct logging to files is supported, log to a file at the specified path. If the file already exists, it
> MUST be appended to.
> MUST be appended to.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is failing lint on main. I'll fix separately and rebase this PR.

@baileympearson baileympearson marked this pull request as ready for review December 2, 2025 18:59
@baileympearson baileympearson requested review from a team as code owners December 2, 2025 18:59
@baileympearson baileympearson requested review from jmikola and jyemin and removed request for a team December 2, 2025 18:59
@blink1073
Copy link
Member

It looks like you also need to bump the schema version:

source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3

@blink1073
Copy link
Member

WIP Python implementation: mongodb/mongo-python-driver#2635

@blink1073
Copy link
Member

blink1073 commented Dec 3, 2025

All unified and prose tests are passing in the Python implementation.

Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating...

Edit 2: we're all good now

Copy link
Contributor

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.

- This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the
timeout to avoid retry storms.
5. If the previous error includes the `SystemOverloadedError` label, the client MUST apply exponential backoff according
to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`


This specification expands the driver's retry ability to all commands, including those not currently considered
retryable such as updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys
the following rules:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the rules include all the deposits into the token bucket, consider adding withdrawals as well.


## Q&A

TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything to add here, or just remove?


## Changelog

- 2025-XX-XX: Initial version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how we handle the date... Is there an automation for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I know of. Usually the spec author fills it out before merging

I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.

Comment on lines 53 to 55
```python
assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1)
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts:
Could this stick to being javascript from top-to-bottom?
Can we also do BIG_TIME - SMALL_TIME >= 2.1?

  • To me that's more human-readable.
  • It maintains that BIG_TIME must always be bigger (removing the need for absolute_value).
  • It captures the 1 second variation whilst still being rigid to the 3.1 second window and stays minimally invasive.

the following rules:

1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.
- The value is 0.1 and non-configurable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the concerns for golang, I thought updating these values by a scalar of 10 in those cases was fine?
cc: @matthewdale

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to identify clients which do and do not support backpressure. Currently, this flag is unused but in the future the
server may offer different rate limiting behavior for clients that do not support backpressure.

##### Implementation notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
##### Implementation notes
#### Implementation notes

#### Goodput

The throughput of positive, useful output. In the context of drivers, this refers to the number of non-error results
that the driver processes per unit of time.
Copy link
Member

@stIncMale stIncMale Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the number of non-error results that the driver processes per unit of time" is neither throughput, nor the "good throughput" ("goodput"). Throughput is the characteristic of a system (the combination of the application, the driver, the DBMS, their configuration, the network connecting them, the hardware, etc.), which is a constant for a given system, and tells about system capacity at its peak. SPECjbb2012: Updated Metrics for a Business Benchmark explains nicely what a throughput is, and how it may be measured.

"the number of non-error results
that the driver processes per unit of time"
is not a characteristic of a system, but rather a metric whose value may vary. Trivially, if an application does not request any operations via the driver, then "the number of non-error results
that the driver processes per unit of time"
is zero, but the throughput is still not.

Copy link
Member

@stIncMale stIncMale Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If, however, we want to define "throughput"/"goodput" the way it is currently proposed, then when we use the term in

  • "negatively affect goodput"
  • "stable but lowered throughput"

we have to say "max goodput" / "max throughput", or something like that, instead of just "goodput"/"throughput".

client:
id: &client client
useMultipleMongoses: false
observeEvents: [ 'commandStartedEvent', 'commandSucceededEvent', 'commandFailedEvent' ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to quote these. The formatting is also a bit inconssitent with the array in runOnRequirements[].topologies.

database:
id: &database database
client: *client
databaseName: &database_name retryable-writes-tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use *database_name here if it's the same namespace as the utilDb target.

collection:
id: &collection collection
database: *database
collectionName: &collection_name coll
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use *collection_name. I don't think the linter catches this, but we really shouldn't be redefining YAML anchors.

operations:
-
object: *utilCollection
name: deleteMany
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we starting with deleteMany after initialData populates the collection under test? Wouldn't it make more sense to just specify initialData with an empty array (assuming you just need the collection to exist and be empty)?


-
client:
id: &failPointClient failPointClient
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an odd name for the entity since it's used for more than just configuring the fail point. The main difference seems to be that this client doesn't have monitoring enabled.

Would be nice if the names between the entities were more consistent. Perhaps client_internal and similar for the "utilDb" and "utilCollection" names (database_internal, collection_internal) to ensure consistency. The mixing of camelCase and snack_case for these values is throwing me.

configureFailPoint: 'failCommand',
mode: 'alwaysOn',
data: {
failCommands: ['insert'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation of the data object's fields is off here.

}
```

3. Execute the following command. Expect that the command errors. Measure the duration of the command execution.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "following command"? I assume it's an insertOne operation, but it'd be more clear to present that without the JS block, or explicitly suggest readers consider the following example code.


## Changelog

- 2025-XX-XX: Initial version.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a TODO item you'll update before merging. I think most files just create an empty changelog for new copies.


CLIENT_BULK_WRITE_ARGUMENTS = '''models:
- insertOne:
namespace: retryable-writes-tests.coll
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this utilize a YAML reference so the names live all in once place (in the template)? See prior art in clientBulkWrite tests, as I don't believe they duplicated namespace lines like this.

_yamlAnchors would be the top-level key to use for this.


3. Assert that for every handshake document intercepted:

1. the document has a field `backpressure` whose value is `true`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. the document has a field `backpressure` whose value is `true`.
1. The document has a field `backpressure` whose value is `true`.

raise error
if exc.has_error_label("SystemOverloadedError"):
jitter = random.random() # Random float between [0.0, 1.0).
backoff = jitter * min(BASE_BACKOFF * (2 ** attempt), MAX_BACKOFF)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specification says "i is the retry attempt (starting with 0 for the first retry)." The pseudocode deviates from it by using the next attempt number (attempt) here instead of using the next retry attempt number (all attempts but the first one are retry attempts). The minimal attempt value at this point in execution is 1, making the minimal backoff equal to BASE_BACKOFF * (2^1) = 200 ms (before accounting for jitter), instead of the expected 100 ms.

To correctly illustrate the specification, the pseudocode should be

backoff = jitter * min(BASE_BACKOFF * (2^(attempt - 1)), MAX_BACKOFF)

is safely retryable regardless of the type of operation, its metadata, or any of its arguments.

Note that for the initial draft of the spec, only errors that have both the RetryableError label and the
SystemOverloadedError label are eligible for the retry backoff loop.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "initial draft of the spec"? As far as I understand, once the changed proposed in this PR are approved and merged, they become part of the driver specifications, and not "initial draft".


#### Overload retry policy

This specification expands the driver's retry ability to all commands if the error indicates that is both an overload
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"it" is missing

Suggested change
This specification expands the driver's retry ability to all commands if the error indicates that is both an overload
This specification expands the driver's retry ability to all commands if the error indicates that it is both an overload

except PyMongoError as exc:
# if a retry fails with a non-System overloaded error, deposit 1 token
if attempt > 0 and not exc.has_error_label("SystemOverloadedError"):
tokens += 1
Copy link
Member

@stIncMale stIncMale Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specification says

  1. "If a retry attempt fails with an error that does not include SystemOverloadedError label, drivers MUST deposit 1 token."
  2. "
    A retry attempt will only be permitted if the error includes the RetryableError label, the error has a SystemOverloadedError label ..."

Therefore, at this point we know that there will be no retry attempt, and tokens += 1 will have no effect because token_bucket.deposit(tokens) won't be called, as it is not called in the except PyMongoError as exc: block. Thus, this pseudocode is not correct.

P.S. Even if there were a retry attempt, it would have been incorrect to rely on it to deposit our +1 token when it succeeds.

- The value of `MAX_ATTEMPTS` is 5 and non-configurable.
- This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the
timeout to avoid retry storms.
- Note: Future work will add support for RetryableErrors to regular retryability logic (see the future work section).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add an anchor link to that section.

7. If the previous error contained the `SystemOverloadedError` error label, the node will be added to the set of
deprioritized servers.
7. If the request is eligible for retry (as outlined in step 4), the client MUST add the server's address to the list of
deprioritized server address for server selection. This behavior is the same as existing behavior for retryable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

address -> addresses

deprioritized server address for server selection. This behavior is the same as existing behavior for retryable
reads and writes.

Note: drivers MUST share deprioritized servers between retries used for the exponential backoff loop and regular
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this note part of item 7 above.

if attempt >= MAX_ATTEMPTS:
raise

# Raise if the error if non retryable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Raise if the error if non retryable.
# Raise if the error is non-retryable.

backoff = jitter * min(BASE_BACKOFF * (2 ** attempt), MAX_BACKOFF)
# Raise if the error if non retryable.
if exc.has_error_label("RetryableError") and exc.has_error_label("SystemOverloadedError"):
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we continue preparing the next attempt instead of raising if both labels are present?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants