-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- add prose test - add assertions on the number of retries for maxAttempts tests - don't run clientBulkWrite tests on <8.0 servers
| > - If the value is "stderr" (case-insensitive), log to stderr. | ||
| > - Else, if direct logging to files is supported, log to a file at the specified path. If the file already exists, it | ||
| > MUST be appended to. | ||
| > MUST be appended to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is failing lint on main. I'll fix separately and rebase this PR.
|
It looks like you also need to bump the schema version: |
|
WIP Python implementation: mongodb/mongo-python-driver#2635 |
|
All unified and prose tests are passing in the Python implementation. Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating... Edit 2: we're all good now |
jyemin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.
| - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the | ||
| timeout to avoid retry storms. | ||
| 5. If the previous error includes the `SystemOverloadedError` label, the client MUST apply exponential backoff according | ||
| to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)` | |
| to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)` |
|
|
||
| This specification expands the driver's retry ability to all commands, including those not currently considered | ||
| retryable such as updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys | ||
| the following rules: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the rules include all the deposits into the token bucket, consider adding withdrawals as well.
|
|
||
| ## Q&A | ||
|
|
||
| TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything to add here, or just remove?
|
|
||
| ## Changelog | ||
|
|
||
| - 2025-XX-XX: Initial version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how we handle the date... Is there an automation for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I know of. Usually the spec author fills it out before merging
I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.
| ```python | ||
| assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1) | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts:
Could this stick to being javascript from top-to-bottom?
Can we also do BIG_TIME - SMALL_TIME >= 2.1?
- To me that's more human-readable.
- It maintains that
BIG_TIMEmust always be bigger (removing the need for absolute_value). - It captures the 1 second variation whilst still being rigid to the 3.1 second window and stays minimally invasive.
| the following rules: | ||
|
|
||
| 1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens. | ||
| - The value is 0.1 and non-configurable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the concerns for golang, I thought updating these values by a scalar of 10 in those cases was fine?
cc: @matthewdale
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the outcome was the opposite: https://docs.google.com/document/d/1teqNgeWbW6dpRQOALrJTEBRoO6sYcrpq9T3_NJE0QfU/edit?disco=AAABtSVfUxI
| to identify clients which do and do not support backpressure. Currently, this flag is unused but in the future the | ||
| server may offer different rate limiting behavior for clients that do not support backpressure. | ||
|
|
||
| ##### Implementation notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ##### Implementation notes | |
| #### Implementation notes |
| #### Goodput | ||
|
|
||
| The throughput of positive, useful output. In the context of drivers, this refers to the number of non-error results | ||
| that the driver processes per unit of time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the number of non-error results that the driver processes per unit of time" is neither throughput, nor the "good throughput" ("goodput"). Throughput is the characteristic of a system (the combination of the application, the driver, the DBMS, their configuration, the network connecting them, the hardware, etc.), which is a constant for a given system, and tells about system capacity at its peak. SPECjbb2012: Updated Metrics for a Business Benchmark explains nicely what a throughput is, and how it may be measured.
"the number of non-error results
that the driver processes per unit of time" is not a characteristic of a system, but rather a metric whose value may vary. Trivially, if an application does not request any operations via the driver, then "the number of non-error results
that the driver processes per unit of time" is zero, but the throughput is still not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If, however, we want to define "throughput"/"goodput" the way it is currently proposed, then when we use the term in
- "negatively affect goodput"
- "stable but lowered throughput"
we have to say "max goodput" / "max throughput", or something like that, instead of just "goodput"/"throughput".
…re handled separately
| client: | ||
| id: &client client | ||
| useMultipleMongoses: false | ||
| observeEvents: [ 'commandStartedEvent', 'commandSucceededEvent', 'commandFailedEvent' ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to quote these. The formatting is also a bit inconssitent with the array in runOnRequirements[].topologies.
| database: | ||
| id: &database database | ||
| client: *client | ||
| databaseName: &database_name retryable-writes-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use *database_name here if it's the same namespace as the utilDb target.
| collection: | ||
| id: &collection collection | ||
| database: *database | ||
| collectionName: &collection_name coll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use *collection_name. I don't think the linter catches this, but we really shouldn't be redefining YAML anchors.
| operations: | ||
| - | ||
| object: *utilCollection | ||
| name: deleteMany |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we starting with deleteMany after initialData populates the collection under test? Wouldn't it make more sense to just specify initialData with an empty array (assuming you just need the collection to exist and be empty)?
|
|
||
| - | ||
| client: | ||
| id: &failPointClient failPointClient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an odd name for the entity since it's used for more than just configuring the fail point. The main difference seems to be that this client doesn't have monitoring enabled.
Would be nice if the names between the entities were more consistent. Perhaps client_internal and similar for the "utilDb" and "utilCollection" names (database_internal, collection_internal) to ensure consistency. The mixing of camelCase and snack_case for these values is throwing me.
| configureFailPoint: 'failCommand', | ||
| mode: 'alwaysOn', | ||
| data: { | ||
| failCommands: ['insert'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation of the data object's fields is off here.
| } | ||
| ``` | ||
|
|
||
| 3. Execute the following command. Expect that the command errors. Measure the duration of the command execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the "following command"? I assume it's an insertOne operation, but it'd be more clear to present that without the JS block, or explicitly suggest readers consider the following example code.
|
|
||
| ## Changelog | ||
|
|
||
| - 2025-XX-XX: Initial version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a TODO item you'll update before merging. I think most files just create an empty changelog for new copies.
|
|
||
| CLIENT_BULK_WRITE_ARGUMENTS = '''models: | ||
| - insertOne: | ||
| namespace: retryable-writes-tests.coll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this utilize a YAML reference so the names live all in once place (in the template)? See prior art in clientBulkWrite tests, as I don't believe they duplicated namespace lines like this.
_yamlAnchors would be the top-level key to use for this.
|
|
||
| 3. Assert that for every handshake document intercepted: | ||
|
|
||
| 1. the document has a field `backpressure` whose value is `true`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| 1. the document has a field `backpressure` whose value is `true`. | |
| 1. The document has a field `backpressure` whose value is `true`. |
| raise error | ||
| if exc.has_error_label("SystemOverloadedError"): | ||
| jitter = random.random() # Random float between [0.0, 1.0). | ||
| backoff = jitter * min(BASE_BACKOFF * (2 ** attempt), MAX_BACKOFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification says "i is the retry attempt (starting with 0 for the first retry)." The pseudocode deviates from it by using the next attempt number (attempt) here instead of using the next retry attempt number (all attempts but the first one are retry attempts). The minimal attempt value at this point in execution is 1, making the minimal backoff equal to BASE_BACKOFF * (2^1) = 200 ms (before accounting for jitter), instead of the expected 100 ms.
To correctly illustrate the specification, the pseudocode should be
backoff = jitter * min(BASE_BACKOFF * (2^(attempt - 1)), MAX_BACKOFF)
| is safely retryable regardless of the type of operation, its metadata, or any of its arguments. | ||
|
|
||
| Note that for the initial draft of the spec, only errors that have both the RetryableError label and the | ||
| SystemOverloadedError label are eligible for the retry backoff loop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "initial draft of the spec"? As far as I understand, once the changed proposed in this PR are approved and merged, they become part of the driver specifications, and not "initial draft".
|
|
||
| #### Overload retry policy | ||
|
|
||
| This specification expands the driver's retry ability to all commands if the error indicates that is both an overload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"it" is missing
| This specification expands the driver's retry ability to all commands if the error indicates that is both an overload | |
| This specification expands the driver's retry ability to all commands if the error indicates that it is both an overload |
| except PyMongoError as exc: | ||
| # if a retry fails with a non-System overloaded error, deposit 1 token | ||
| if attempt > 0 and not exc.has_error_label("SystemOverloadedError"): | ||
| tokens += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification says
- "If a retry attempt fails with an error that does not include SystemOverloadedError label, drivers MUST deposit 1 token."
- "
A retry attempt will only be permitted if the error includes the RetryableError label, the error has a SystemOverloadedError label ..."
Therefore, at this point we know that there will be no retry attempt, and tokens += 1 will have no effect because token_bucket.deposit(tokens) won't be called, as it is not called in the except PyMongoError as exc: block. Thus, this pseudocode is not correct.
P.S. Even if there were a retry attempt, it would have been incorrect to rely on it to deposit our +1 token when it succeeds.
| - The value of `MAX_ATTEMPTS` is 5 and non-configurable. | ||
| - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the | ||
| timeout to avoid retry storms. | ||
| - Note: Future work will add support for RetryableErrors to regular retryability logic (see the future work section). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add an anchor link to that section.
| 7. If the previous error contained the `SystemOverloadedError` error label, the node will be added to the set of | ||
| deprioritized servers. | ||
| 7. If the request is eligible for retry (as outlined in step 4), the client MUST add the server's address to the list of | ||
| deprioritized server address for server selection. This behavior is the same as existing behavior for retryable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
address -> addresses
| deprioritized server address for server selection. This behavior is the same as existing behavior for retryable | ||
| reads and writes. | ||
|
|
||
| Note: drivers MUST share deprioritized servers between retries used for the exponential backoff loop and regular |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this note part of item 7 above.
| if attempt >= MAX_ATTEMPTS: | ||
| raise | ||
|
|
||
| # Raise if the error if non retryable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # Raise if the error if non retryable. | |
| # Raise if the error is non-retryable. |
| backoff = jitter * min(BASE_BACKOFF * (2 ** attempt), MAX_BACKOFF) | ||
| # Raise if the error if non retryable. | ||
| if exc.has_error_label("RetryableError") and exc.has_error_label("SystemOverloadedError"): | ||
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we continue preparing the next attempt instead of raising if both labels are present?
DRIVERS-3239
Overview
This PR adds support for a new class of errors (
SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).
Python will be second implementer.
Node implementation: mongodb/node-mongodb-native#4806
Testing
The testing strategy is two-fold:
Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:
SystemOverloadedErrorlabelFollowing Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.
Update changelog.
Test changes in at least one language driver.
Test these changes against all server versions and topologies (including standalone, replica set, and sharded
clusters).