DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862

baileympearson · 2025-12-02T17:15:50Z

Overview

This PR adds support for a new class of errors (SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.

Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).

Python will be second implementer.
Node implementation: mongodb/node-mongodb-native#4806

Testing

The testing strategy is two-fold:

Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:
- operations are retried using the new SystemOverloadedError label
- operations are retried no more than 5 (current MAX_ATTEMPTS, as defined in the spec) times
Following Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.
Update changelog.
Test changes in at least one language driver.
Test these changes against all server versions and topologies (including standalone, replica set, and sharded
clusters).

- add prose test - add assertions on the number of retries for maxAttempts tests - don't run clientBulkWrite tests on <8.0 servers

source/logging/logging.md

source/client-backpressure/client-backpressure.md

source/retryable-reads/retryable-reads.md

source/retryable-writes/retryable-writes.md

source/client-backpressure/client-backpressure.md

blink1073 · 2025-12-03T16:31:19Z

It looks like you also need to bump the schema version:

source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3

blink1073 · 2025-12-03T21:05:18Z

WIP Python implementation: mongodb/mongo-python-driver#2635

blink1073 · 2025-12-03T23:11:38Z

All unified and prose tests are passing in the Python implementation.

Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating...

Edit 2: we're all good now

jyemin

I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.

source/client-backpressure/client-backpressure.md

jyemin · 2025-12-10T19:36:28Z

source/client-backpressure/client-backpressure.md

+
+## Q&A
+
+TODO


Anything to add here, or just remove?

Nothing yet .. I'll remove it for now. It can always be added back if there are items worth mentioning here.

jyemin · 2025-12-10T19:37:07Z

source/client-backpressure/client-backpressure.md

+
+## Changelog
+
+- 2025-XX-XX: Initial version.


Not sure how we handle the date... Is there an automation for this?

Not that I know of. Usually the spec author fills it out before merging

I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.

source/client-backpressure/tests/README.md

source/client-backpressure/client-backpressure.md

source/retryable-writes/retryable-writes.md

baileympearson

submitting comments

source/client-backpressure/tests/README.md

baileympearson · 2025-12-18T22:25:45Z

source/client-backpressure/tests/backpressure-retry-loop.yml.template

+    operations:
+      -
+        object: *utilCollection
+        name: deleteMany


Sorry, missed this one. I like your suggestion - done.

source/client-backpressure/client-backpressure.md

baileympearson · 2025-12-18T22:47:43Z

source/client-backpressure/client-backpressure.md

+The following pseudocode describes the overload retry policy:
+
+```python
+BASE_BACKOFF = 0.1


I chose to leave it as-is; as the pseudocode is written in python and follows python conventions. Let me know if the clarified pseudocde works for you

source/client-backpressure/client-backpressure.md

baileympearson · 2025-12-18T22:48:29Z

source/client-backpressure/tests/README.md

+
+## Changelog
+
+- 2025-XX-XX: Initial version.


I removed the changelog

baileympearson · 2025-12-18T22:49:36Z

source/client-backpressure/tests/README.md

+            }
+        ```
+
+    3. Execute the following command. Expect that the command errors. Measure the duration of the command execution.


I've reworked the prose to be explicit about what the command is.

matthewdale

Looks good! 👍

sanych-sun · 2025-12-19T00:21:44Z

source/client-backpressure/client-backpressure.md

+5. A retry attempt consumes 1 token from the token bucket.
+6. If the request is eligible for retry (as outlined in step 4), the client MUST apply exponential backoff according to
+    the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
+    - `i` is the retry attempt number (starting with 0 for the first retry).


Why we decided to have retry number start we 0? It makes the requirements confusing. Why don't we start with 1 and in fact we will have the same formula as for withTransaction:

delayMS = j * min(maxBackoff, baseBackoff * 2^(i-1))

With the only difference here we have 2 as the base for pow function, in convinientTransaction API we have 1.5

Here is the formula from withTransaction:
jitter * min(BACKOFF_INITIAL * 1.5 ** (transactionAttempt - 1), BACKOFF_MAX)

Where transactionAttempt started with 0 and is being incremented AFTER the delay, but before executing the callback attempt. Which is also confusing... but in C# implementation we wait AFTER the attempt so it's more natural.

The phrasing, including "Retries start at 0", was just taken from the design. There's no need to keep it this way if it causes confusion.

I can adjust the phrasing to more closely align with the transaction spec, if that's preferable?

Can we use the following formula?
delayMS = j * min(maxBackoff, baseBackoff * 2^(i-1))

or

we can keep the formula as is, but adjust the baseBackoff :
delayMS = j * min(maxBackoff, baseBackoff * 2^i)
where baseBackoff is 50 instead of 100.

It produces the same results, but starts i with 1.

I think it's fine to use any of those three formulas in individual driver implementations if it increases readability so long as they result in the same outputs. As far as reducing confusion, I can say it didn't substantially change my understanding of the formula.

+1 to Bailey changing the phrasing, which I think would suffice.

@sanych-sun Can this thread be resolved?

…RS-3239

source/client-backpressure/client-backpressure.md

stIncMale · 2025-12-25T00:01:24Z

source/client-backpressure/client-backpressure.md

+# Note: the values below have been scaled down by a factor of 1000 because
+# Python's sleep API takes a duration in seconds, not milliseconds.
+BASE_BACKOFF = 0.1 # 100ms
+MAX_BACKOFF = 10   # 10s


The 100ms comment is helpful, because one can use it to find the same value in the prose part of the specification (though absolutely nothing prevents us from simply writing BASE_BACKOFF = 100ms).

However, the 10s comment is not helpful, because the prose part says 10000ms, and a reader won't find 10000ms in the pseudocode.

Suggested change

MAX_BACKOFF = 10 # 10s

MAX_BACKOFF = 10 # 10000ms

stIncMale · 2025-12-26T19:34:14Z

source/client-backpressure/client-backpressure.md

+        retry budget tracking, this counts as a success.
+4. A retry attempt will only be permitted if:
+    1. The error has both the `SystemOverloadedError` and the `RetryableError` label.
+    2. We have not reached `MAX_ATTEMPTS`.


Judging both from the name of MAX_ATTEMPTS, and this sentence, a reader develops understanding that MAX_ATTEMPTS is the maximum number of attempts, which, includes the first attempt that is not a retry attempt. However, below, the specification says "Any retryable error is retried at most MAX_ATTEMPTS (default=5) times", which means that MAX_ATTEMPTS is the maximum number of retry attempts, implying that the maximum number of attempts is actually 6.

If MAX_ATTEMPTS is meant to specify the maximum number of retry attempts, let's name it accordingly: MAX_RETRY_ATTEMPTS.

stIncMale · 2025-12-26T19:49:10Z

source/client-backpressure/client-backpressure.md

+    1. The error has both the `SystemOverloadedError` and the `RetryableError` label.
+    2. We have not reached `MAX_ATTEMPTS`.
+    3. (CSOT-only): `timeoutMS` has not expired.
+    4. (`SystemOverloadedError` errors only) a token can be acquired from the token bucket.


Item 4.1 requires the SystemOverloadedError label to be present. The highlighted item 4.4 restricts itself to SystemOverloadedError errors only, but no other errors can possibly be considered because of item 4.1¹. Given this, why does item 4.4 has this restriction?

No reason. Removed

stIncMale · 2025-12-26T19:50:51Z

source/client-backpressure/client-backpressure.md

+    3. (CSOT-only): `timeoutMS` has not expired.
+    4. (`SystemOverloadedError` errors only) a token can be acquired from the token bucket.
+    - The value of `MAX_ATTEMPTS` is 5 and non-configurable.
+    - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the


Shouldn't these two sub-items about MAX_ATTEMPTS be under item 4.2, which talks about MAX_ATTEMPTS, rather than being under item 4.4, which talks about acquiring a token?

Sure; moved them to under 4.2.

stIncMale · 2025-12-27T19:48:15Z

source/client-backpressure/client-backpressure.md

+error and that it is retryable, including those not currently considered retryable such as updateMany, create
+collection, getMore, and generic runCommand. The new command execution method obeys the following rules:
+
+1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.


The specification currently uses various terms to refer to the same thing¹ when it talks about retry tokens (hopefully, I identified them all, but the author should do the exercise on his own to verify):

return, deposit, consume

deposit, return

We should pick a single term for each meaning and use it consistently. Having clear and consistent terminology is a necessity for clear communications, it also allows makes the specification searchable: if one wants to find all the places that talk about consuming retry tokens, one can search the page for a specific term. Currently, that is impossible.

¹ Our specifications in general, as well as the official documentation the company produces, suffer from this significant issue.

Good point - I've aligned the language in the spec on deposit and consume.

stIncMale · 2025-12-27T19:55:48Z

source/client-backpressure/client-backpressure.md

+
+#### Interaction with Existing Retry Behavior
+
+The retryability API defined in this specification is separate from the existing retryability behaviors defined in the


This sentence is the only place in the specification that suggests the specification defines "retryability API". Given that the specification does not define "retryability API", but it defines the "overload retry policy", which is what this sentence is talking about.

Interaction with Existing Retry Behavior
...
retryability behaviors

This specification already calls that behavior "policy", which means that the sentence should use the same term. It would have also been useful if we also named the existing policy, for example, the "read/write retry policy" (or, if we want to separate them, the "read retry policy" and the "write retry policy").

Sure. I've adjusted the phrasing here to use policy in place of behavior or api.

stIncMale · 2025-12-27T20:07:52Z

source/client-backpressure/client-backpressure.md

+- Only retryable errors with the `SystemOverloadedError` label apply backoff and jitter.
+- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:
+    - Errors defined as retryable in the retryable reads specification.
+    - Errors defined as retryable in the retryable writes specification.


Let's make "retryable reads specification" and "retryable writes specification" hyperlinks.

stIncMale · 2025-12-27T20:20:56Z

source/client-backpressure/client-backpressure.md

+- Only retryable errors with the `SystemOverloadedError` label apply backoff and jitter.
+- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:


Don't these two sentences say the same? If yes, let's leave only one of them.

No, I don't think so. The first says that only errors with that label backoff (no other errors backoff); the second mandates that all errors with apply the backoff.

stIncMale · 2025-12-27T21:35:59Z

source/client-backpressure/client-backpressure.md

+- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:
+    - Errors defined as retryable in the retryable reads specification.
+    - Errors defined as retryable in the retryable writes specification.
+    - Errors with the `RetryableError` label.


Given that a command may be retried either because of the read/write retry policy, or because of the overload retry policy, what does the i mean in the formula for delayMS? Is it the 1-based retry attempt number among all the retry attempts of a command, or only among the subset caused by the overload retry policy?

The answer to this question should be in the "Interaction with Existing Retry Behavior" section, not in the "Overload retry policy" section.

Okay, I've clarified this. But I chose to put it in the "overload retry policy" section because it makes sense to me to put this information with the definition of i and the exponential backoff formula.

stIncMale · 2025-12-27T21:50:57Z

source/client-backpressure/client-backpressure.md

+- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:
+    - Errors defined as retryable in the retryable reads specification.
+    - Errors defined as retryable in the retryable writes specification.
+    - Errors with the `RetryableError` label.
+- Any retryable error is retried at most MAX_ATTEMPTS (default=5) times, if any attempts has failed with a
+    `SystemOverloadedError`.


The "RetryableError label" section above says "An error is considered retryable if it includes the "RetryableError" label". After reading that, it is only reasonable to understand that a "retryable errror" is an error with the RetryableError label. However, in the selected items the specification seemingly uses the same term "retryable error" with a different meaning. This needs to be fixed.

I don't think that sections "RetryableError label", "SystemOverloadedError label" defining the terms "retryable error", "overloaded error" help a reader, neither do the the terms. Instead I propose the specification to always talk about an error having the RetryableError, SystemOverloadedError labels, which it already does in many places.

#1862 (comment) is related, but different.

I don't fully agree about the definitions not having value. I'll keep them, unless you feel strongly, because it familiarizes readers with new terminology introduced by this specification.

I have clarified the definition of RetryableError label and the phrasing here to avoid ambiguity though. I believe this addresses the concern.

stIncMale · 2025-12-27T22:11:35Z

source/client-backpressure/client-backpressure.md

+- Any retryable error is retried at most MAX_ATTEMPTS (default=5) times, if any attempts has failed with a
+    `SystemOverloadedError`.


I fail to understand what this requirement means. For example, "Any retryable error is retried" is very confusing, since we are trying to execute a command, not trying to execute/produce an error. The current wording also implies that the retry attempts are counted per each specific error, which can't be true, as they are counter per, for example, an execution of a command.

Does the following express the intended meaning?

Once we encounter such a command execution error with the SystemOverloadedError label that it permits a retry attempt according to the overload retry policy, the number of retry attempts for the particular command execution becomes limited by MAX_ATTEMPTS regardless of which retry policy the previous or the future retry attempts are caused by.

If "yes", then let's change the wording.

Note that the above means a command may still be retried more than MAX_ATTEMPTS times in the presence of errors having the SystemOverloadedError label. For example, if a command was retried 6 times according to the read/write retry policy with CSOT, and the 6th retry attempt (it's a 1-based index) failed with an error having the SystemOverloadedError label.

That's correct - done.

Note that the above means a command may still be retried more than MAX_ATTEMPTS times in the presence of errors having the SystemOverloadedError label. For example, if a command was retried 6 times according to the read/write retry policy with CSOT, and the 6th retry attempt (it's a 1-based index) failed with an error having the SystemOverloadedError label.

Yes, that's correct.

stIncMale · 2025-12-27T22:40:37Z

source/client-backpressure/client-backpressure.md

+#### RetryableError label
+
+An error is considered retryable if it includes the "RetryableError" label. This error label indicates that an operation
+is safely retryable regardless of the type of operation, its metadata, or any of its arguments.


crud/bulk-write.md also talks about a "retryable error", but means some thing very different from what "retryable error" means here.

Let's not overload this term by re-defining it with a different meaning here, as overloaded terms introduce confusion and have no benefits.
1.1. If a term has to be overloaded nonetheless, which I doubt, let's at the very least make the changes to crud/bulk-write.md necessary to reduce confusion.

The term "retryable error" is also defined in transactions/transactions.md. Fortunately, the meaning there does not need to be clarified.

DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862 (comment) is related, but different.

DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862 (comment) is related, but different.

Overloading this term makes sense imo because we now have a very concrete definition of the phrase "retryable error" (i.e., the error has the RetryableError label attached). I was not aware of the phrasing in the bulk write spec - I will update this spec as well.

stIncMale · 2025-12-28T10:52:10Z

source/client-backpressure/client-backpressure.md

+This specification expands the driver's retry ability to all commands if the error indicates that it is both an overload
+error and that it is retryable, including those not currently considered retryable such as updateMany, create
+collection, getMore, and generic runCommand. The new command execution method obeys the following rules:


Intro

Our specifications, as well as the official documentation, do not use the terms "operation" and "command" clearly, and neither does the specification introduced by this PR: it uses the terms seemingly interchangeably, despite them having very different meanings¹:

CSOT made an attempt to say what an operation is, but missed at least the client bulk write operation. Whatever "operation" is in our specifications, it is what identified by the operationId field in CommandStartedEvent, CommandSucceededEvent, CommandFailedEvent, and execution of an operation consists of executions of potentially multiple commands, as well as other potentially retryable activities like server selection.

A command is whatever CommandStartedEvent, CommandSucceededEvent, CommandFailedEvent describe, and execution of a command consists of potentially multiple attempts.

The concern

The overload retry policy seems to be specified as applied at the command level: execution of any command consists of up to (MAX_ATTEMPTS + 1) attempts (and the "If a command fails backpressure retries MAX_ATTEMPTS times, it MUST not be retried again" sentence in transactions/transactions.md supports this interpretation). Given that execution of an operation may consist of many command executions, the above may result in a really large total number of retry attempts of different commands per execution of an operation. Is this indeed intended? I have doubts.

We can take a look at what execution of a client bulk write operation includes, as an example demonstrating how many executions of different commands may constitute execution of an operation:

potentially connection establishment

execution of a hello command

potentially authentication (let's assume it's a SASL authentication)

execution of a saslStart command

execution of a saslContinue command

potentially more executions of a saslContinue command

execution of a bulkWrite command

execution of a getMore command

potentially more executions of a getMore command

and/or potentially execution of a killCursors command

potentially more executions of a bulkWrite command, if there are more batches

potentially more executions of a getMore command, if there are more batches

potentially reauthentication

I suspect, the retry attempts should be counted not per execution of a command, but somehow else. This, however, may further complicate already convoluted retry policies and implementations. Anyway,

the specification should clarify the scope within which retry attempts caused by the overload retry policy are limited by MAX_ATTEMPTS;

the specification should use the terms "operation", "command" such that they are appropriate and match the meaning outlined above, unless those outlined meanings are incorrect.

¹ I am aware that when trying to outline what an operation and a command are, as well as to clarify their executions, I am saying things that are not explicitly stated in the specifications. However, without introducing a vocabulary it is impossible to communicate, so I have to do that, since the specifications failed to (or I failed to find where they do that, despite spending significant amount of time).

I agree that the definition of "operation" is nebulous; but command has a very distinct meaning that I do not believe is ambiguous. See https://www.mongodb.com/docs/manual/reference/command/.

I'll look over the spec and see if I can remove usages of "operation". But I do believe that the retry policy explicitly uses command and it is unambiguous.

Alright; I've removed all usages of operation except for one, which I think makes sense preserve.

stIncMale · 2025-12-28T11:15:25Z

source/client-backpressure/client-backpressure.md

+This specification expands the driver's retry ability to all commands if the error indicates that it is both an overload
+error and that it is retryable, including those not currently considered retryable such as updateMany, create
+collection, getMore, and generic runCommand. The new command execution method obeys the following rules:


The overload retry policy applies to "all commands", requiring drivers to inspect the result of a command attempt. On the other hand, the client bulk write specification says "The result returned from the killCursors command MAY be ignored.". Let's clarify whether the drivers must apply the overload retry policy to killCursors, or if execution of that command should never contain more than one attempt.

Previously @baileympearson said the following in Slack about this: "Operations that clean up resources could be retried but seem less important, because resources will be pruned by the server automatically after their timeout expires server-side.".

Good catch. I've updated the bulk write specification to specify that the result from killCursors MUST NOT be ignored.

stIncMale · 2025-12-28T16:47:17Z

source/transactions/transactions.md


+#### Backpressure Error
+
+An error considered retryable by the [Client Backpressure Specification](../client-backpressure/client-backpressure.md).


The client backpressure specification says that "an error is considered retryable if it includes the "RetryableError" label". So there, a command with the "RetryableError" label is called retryable error, but here, the same error is called "backpressure error". Not only we are proposed to have the "retryable error" term overloaded (see #1862 (comment)), but we are also proposed to have multiple terms ("retryable error", "backpressure error") referring to the same exact thing. We should fix this:

we should use a single term to refer to the same thing;

we should avoid overloading terms.

P.S. I won't be surprised if the intent here was to express that backpressure error is an error that has both the RetryableError and the SystemOverloadError labels. But that is definitely not what the specification currently expresses.

No; backpressure errors are distinct from retryable backpressure errors. The error labels might always exist together now but the server has intentionally chosen to keep these error labels separate and might choose to only apply one label or the other in the future.

I've clarified this spec to use the phrase "retryable backpressure error".

stIncMale · 2025-12-28T16:57:17Z

source/transactions/transactions.md

+enabled on the MongoClient, unless the server response has backpressure error labels applied.
+
+In addition, drivers MUST NOT add the RetryableWriteError label to any error that occurs during a write command within a
+transaction (excepting commitTransation and abortTransaction), even when retryWrites has been enabled on the
+MongoClient, unless the server response has backpressure error labels applied.


This specification defines the term "backpressure error", not "backpressure error labels". Shouldn't these sentences (there are two that use "backpressure error labels") just say something like "unless the command fails with a backpressure error", "unless the error is a backpressure error"?

Sure - done.

stIncMale · 2025-12-28T17:53:25Z

source/transactions/transactions.md

+the `abortTransaction` and `commitTransaction` commands, as well as any read or write commands attempted during the
+transaction.
+
+If a command fails with backpressure labels and it includes `startTransaction:true`, the retried command MUST also


"If a command fails with backpressure labels" - is it the same as saying that a command fails with a "backpressure error" defined above? If yes, then the specification should use that term.

Sure - done.

stIncMale · 2025-12-28T17:55:55Z

source/transactions/transactions.md

+All commands in a transaction are subject to the
+[Client Backpressure Specification](../client-backpressure/client-backpressure.md), and MUST be retried accordingly when
+the appropriate error labels are added by the server. This includes the initial command with `startTransaction:true`,


Let's remove the "when the appropriate error labels are added by the server" part. This is just one of multiple conditions that must be met for a command to be retried according to the overload retry policy, and it is weird to mention one of them, while avoiding the rest (I am not proposing to reiterate all of them here).

Suggested change

All commands in a transaction are subject to the

[Client Backpressure Specification](../client-backpressure/client-backpressure.md), and MUST be retried accordingly when

the appropriate error labels are added by the server. This includes the initial command with `startTransaction:true`,

All commands in a transaction are subject to the

[Client Backpressure Specification](../client-backpressure/client-backpressure.md), and MUST be retried accordingly. This includes the initial command with `startTransaction:true`,

I like that phrasing better - done.

stIncMale · 2025-12-28T18:11:45Z

source/transactions/transactions.md

+Drivers SHOULD apply a majority write concern when retrying commitTransaction to guard against a transaction being
 applied twice.

+Drivers SHOULD NOT modify the write concern on commit transaction commands when retrying a backpressure error, since we
+are sure that the transaction has not been applied.


I don't think we know whether the transaction has or has not been applied. If, for example, in the scenario below step 4 fails with a backpressure error, it is obviously false that the transaction has not been applied.

I think, this change should be reverted.

I believe this change makes sense but I've adjusted the phrasing. I'm not sure it makes sense as it was written. Hopefully it's clearer now.

stIncMale

Notes for myself:

The last reviewed commit is 1b7f6df.
I reviewed only the changes in .md files, and do not plan to review the tests.
I have not re-reviewed the pseudocode illustrating the overload retry policy. I will do that after us settling on the requirements expressed in the prose part of the specification.

baileympearson

First round of comments addressed.

baileympearson · 2026-01-05T20:24:15Z

source/client-backpressure/client-backpressure.md

+        retry budget tracking, this counts as a success.
+4. A retry attempt will only be permitted if:
+    1. The error has both the `SystemOverloadedError` and the `RetryableError` label.
+    2. We have not reached `MAX_ATTEMPTS`.


baileympearson · 2026-01-05T20:26:46Z

source/client-backpressure/client-backpressure.md

+    3. (CSOT-only): `timeoutMS` has not expired.
+    4. (`SystemOverloadedError` errors only) a token can be acquired from the token bucket.
+    - The value of `MAX_ATTEMPTS` is 5 and non-configurable.
+    - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the


Sure; moved them to under 4.2.

baileympearson · 2026-01-05T20:27:58Z

source/client-backpressure/client-backpressure.md

+
+#### Interaction with Existing Retry Behavior
+
+The retryability API defined in this specification is separate from the existing retryability behaviors defined in the


Sure. I've adjusted the phrasing here to use policy in place of behavior or api.

baileympearson · 2026-01-05T20:29:36Z

source/client-backpressure/client-backpressure.md

+#### Interaction with Existing Retry Behavior
+
+The retryability API defined in this specification is separate from the existing retryability behaviors defined in the
+retryable reads and retryable writes specifications. Drivers MUST ensure:


baileympearson · 2026-01-05T20:30:05Z

source/client-backpressure/client-backpressure.md

+retryable reads and retryable writes specifications. Drivers MUST ensure:
+
+- Only retryable errors with the `SystemOverloadedError` consume tokens from the token bucket before retrying.
+- Only retryable errors with the `SystemOverloadedError` label apply backoff and jitter.


Sure; done.

baileympearson · 2026-01-05T20:40:05Z

source/client-backpressure/client-backpressure.md

+- Any retryable error is retried at most MAX_ATTEMPTS (default=5) times, if any attempts has failed with a
+    `SystemOverloadedError`.


That's correct - done.

Note that the above means a command may still be retried more than MAX_ATTEMPTS times in the presence of errors having the SystemOverloadedError label. For example, if a command was retried 6 times according to the read/write retry policy with CSOT, and the 6th retry attempt (it's a 1-based index) failed with an error having the SystemOverloadedError label.

Yes, that's correct.

baileympearson · 2026-01-05T20:42:54Z

source/client-backpressure/client-backpressure.md

+error and that it is retryable, including those not currently considered retryable such as updateMany, create
+collection, getMore, and generic runCommand. The new command execution method obeys the following rules:
+
+1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.


Good point - I've aligned the language in the spec on deposit and consume.

baileympearson · 2026-01-05T20:45:53Z

source/client-backpressure/client-backpressure.md

+This specification expands the driver's retry ability to all commands if the error indicates that it is both an overload
+error and that it is retryable, including those not currently considered retryable such as updateMany, create
+collection, getMore, and generic runCommand. The new command execution method obeys the following rules:


I agree that the definition of "operation" is nebulous; but command has a very distinct meaning that I do not believe is ambiguous. See https://www.mongodb.com/docs/manual/reference/command/.

I'll look over the spec and see if I can remove usages of "operation". But I do believe that the retry policy explicitly uses command and it is unambiguous.

baileympearson · 2026-01-05T20:47:17Z

source/client-backpressure/client-backpressure.md

+#### RetryableError label
+
+An error is considered retryable if it includes the "RetryableError" label. This error label indicates that an operation
+is safely retryable regardless of the type of operation, its metadata, or any of its arguments.


Overloading this term makes sense imo because we now have a very concrete definition of the phrase "retryable error" (i.e., the error has the RetryableError label attached). I was not aware of the phrasing in the bulk write spec - I will update this spec as well.

baileympearson · 2026-01-05T20:48:51Z

source/client-backpressure/client-backpressure.md

+- Only retryable errors with the `SystemOverloadedError` label apply backoff and jitter.
+- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:
+    - Errors defined as retryable in the retryable reads specification.
+    - Errors defined as retryable in the retryable writes specification.


baileympearson added 8 commits December 1, 2025 13:44

initial commit

588f1f2

new files

e467f5b

add tests for handshake changes

d55fdb9

add generated tests

8e74b41

test fixes and add prose test

072b453

- add prose test - add assertions on the number of retries for maxAttempts tests - don't run clientBulkWrite tests on <8.0 servers

fix run on requirements

52e2a35

fix run on requirements?

391c951

fix CI

92501c0

baileympearson commented Dec 2, 2025

View reviewed changes

source/logging/logging.md Outdated Show resolved Hide resolved

baileympearson commented Dec 2, 2025

View reviewed changes

source/client-backpressure/client-backpressure.md Outdated Show resolved Hide resolved

baileympearson marked this pull request as ready for review December 2, 2025 18:59

baileympearson requested review from a team as code owners December 2, 2025 18:59

baileympearson requested review from jmikola and jyemin and removed request for a team December 2, 2025 18:59

blink1073 reviewed Dec 3, 2025

View reviewed changes

source/retryable-reads/retryable-reads.md Outdated Show resolved Hide resolved

source/retryable-writes/retryable-writes.md Outdated Show resolved Hide resolved

source/client-backpressure/client-backpressure.md Outdated Show resolved Hide resolved

baileympearson added 3 commits December 3, 2025 11:36

comments

0fdef39

Fix broken unified tests

82acab8

fix UTR linting failures

b3a7b6c

blink1073 mentioned this pull request Dec 3, 2025

PYTHON-5528 & PYTHON-5651 Add exponential backoff to operation retry loop for server overloaded errors mongodb/mongo-python-driver#2635

Draft

11 tasks

remove broken deleteMany() from unified tests

60a87b8

add backwards compat section

399a56b

jyemin requested changes Dec 10, 2025

View reviewed changes

Jibola requested changes Dec 10, 2025

View reviewed changes

source/client-backpressure/tests/README.md Show resolved Hide resolved

source/client-backpressure/client-backpressure.md Show resolved Hide resolved

source/client-backpressure/client-backpressure.md Outdated Show resolved Hide resolved

stIncMale reviewed Dec 11, 2025

View reviewed changes

source/retryable-writes/retryable-writes.md Show resolved Hide resolved

baileympearson requested a review from matthewdale December 18, 2025 22:50

baileympearson commented Dec 18, 2025

View reviewed changes

matthewdale approved these changes Dec 18, 2025

View reviewed changes

sanych-sun reviewed Dec 19, 2025

View reviewed changes

sanych-sun mentioned this pull request Dec 19, 2025

CSHARP-5712: withTransaction API retries too frequently mongodb/mongo-csharp-driver#1841

Open

blink1073 and others added 6 commits December 19, 2025 06:21

add runOnRequirement for getMore tests

747c18c

updated formula

be27bc0

Merge branch 'master' of github.com:mongodb/specifications into DRIVE…

92e479a

…RS-3239

lint

b674f13

address review

03065ad

fix link

1b7f6df

Jibola approved these changes Dec 23, 2025

View reviewed changes

source/client-backpressure/client-backpressure.md Show resolved Hide resolved

stIncMale reviewed Dec 25, 2025

View reviewed changes

stIncMale requested changes Dec 27, 2025

View reviewed changes

stIncMale reviewed Dec 27, 2025

View reviewed changes

stIncMale reviewed Dec 28, 2025

View reviewed changes

baileympearson commented Jan 5, 2026

View reviewed changes

first round of comments addressed

eac90fc

baileympearson requested a review from sanych-sun January 6, 2026 00:36

second round of comments addressed

0533acc

baileympearson requested a review from a team as a code owner January 6, 2026 00:59

baileympearson requested review from NoahStapp and stIncMale January 6, 2026 01:00


		#### Interaction with Existing Retry Behavior

		The retryability API defined in this specification is separate from the existing retryability behaviors defined in the

		- Only retryable errors with the `SystemOverloadedError` label apply backoff and jitter.
		- All retryable errors apply backoff if they also contain a `SystemOverloadedError` label. This includes:

		- Any retryable error is retried at most MAX_ATTEMPTS (default=5) times, if any attempts has failed with a
		`SystemOverloadedError`.


		#### Backpressure Error

		An error considered retryable by the [Client Backpressure Specification](../client-backpressure/client-backpressure.md).


		## Q&A

		TODO

DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862

Are you sure you want to change the base?

DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862

Uh oh!

Conversation

baileympearson commented Dec 2, 2025 • edited by blink1073 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Testing

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

blink1073 commented Dec 3, 2025

Uh oh!

blink1073 commented Dec 3, 2025

Uh oh!

blink1073 commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jyemin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

baileympearson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matthewdale left a comment

Choose a reason for hiding this comment

Uh oh!

sanych-sun Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sanych-sun Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sanych-sun Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

stIncMale Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

baileympearson commented Dec 2, 2025 •

edited by blink1073

Loading

blink1073 commented Dec 3, 2025 •

edited

Loading

sanych-sun Dec 19, 2025 •

edited

Loading

sanych-sun Dec 19, 2025 •

edited

Loading

sanych-sun Dec 19, 2025 •

edited

Loading

stIncMale Dec 25, 2025 •

edited

Loading