Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
64 changes: 64 additions & 0 deletions source/transactions-convenient-api/tests/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,8 +41,72 @@ If possible, drivers should implement these tests without requiring the test run
the retry timeout. This might be done by internally modifying the timeout value used by `withTransaction` with some
private API or using a mock timer.

### Retry Backoff is Enforced

Drivers should test that retries within `withTransaction` do not occur immediately.

1. let `client` be a `MongoClient`
2. let `coll` be a collection
3. Now, run transactions without backoff:
1. Configure the random number generator used for jitter to always return `0` -- this effectively disables backoff.

2. Configure a fail point that forces 13 retries like so:

```python
set_fail_point(
{
"configureFailPoint": "failCommand",
"mode": {
"times": 13
}, # sufficiently high enough such that the time effect of backoff is noticeable
"data": {
"failCommands": ["commitTransaction"],
"errorCode": 251,
},
}
)
```

> Note: errorCode 251 is NoSuchTransaction.

3. Define the callback for the transaction as follows:

```python
def callback(session):
coll.insert_one({}, session=session)
```

4. Let `no_backoff_time` be the duration of the withTransaction API call:

```python
start = time.monotonic()
with client.start_session() as s:
s.with_transaction(callback)
end = time.monotonic()
no_backoff_time = end - start
```
4. Now run the command with backoff:
1. Configure the random number generator used for jitter to always return `1`.
2. Configure a fail point that forces 13 retries like in step 3.2.
3. Use the same callback defined in 3.3.
4. Let `with_backoff_time` be the duration of the withTransaction API call:
```python
start = time.monotonic()
with client.start_session() as s:
s.with_transaction(callback)
end = time.monotonic()
no_backoff_time = end - start
```
5. Compare the two time between the two runs.
```python
assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 2.2 seconds)) < 1)
```
The sum of 13 backoffs is roughly 2.2 seconds. There is a 1-second window to account for potential variance between
the two runs.

## Changelog

- 2025-11-18: Added Backoff test.
- 2024-09-06: Migrated from reStructuredText to Markdown.
- 2024-02-08: Converted legacy tests to unified format.
- 2021-04-29: Remove text about write concern timeouts from prose test.
67 changes: 56 additions & 11 deletions source/transactions-convenient-api/transactions-convenient-api.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,8 @@ has not been exceeded, the driver MUST retry a transaction that fails with an er
"TransientTransactionError" label. Since retrying the entire transaction will entail invoking the callback again,
drivers MUST document that the callback may be invoked multiple times (i.e. one additional time per retry attempt) and
MUST document the risk of side effects from using a non-idempotent callback. If the retry timeout has been exceeded,
drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller.
drivers MUST NOT retry the transaction and allow `withTransaction` to propagate the error to its caller. When retrying,
drivers MUST implement an exponential backoff with jitter following the algorithm described below.

If an error bearing neither the UnknownTransactionCommitResult nor the TransientTransactionError label is encountered at
any point, the driver MUST NOT retry and MUST allow `withTransaction` to propagate the error to its caller.
Expand All @@ -113,7 +114,11 @@ needed (e.g. user data to pass as a parameter to the callback).

This method should perform the following sequence of actions:

1. Record the current monotonic time, which will be used to enforce the 120-second timeout before later retry attempts.
1. Define the following:
1. Record the current monotonic time, which will be used to enforce the 120-second / CSOT timeout before later retry
attempts.
2. Set `retry` to `0`. This will be used for backoff later in step 7.
3. Set `TIMEOUT_MS` to be `timeoutMS` if given, otherwise 120-seconds.
2. Invoke [startTransaction](../transactions/transactions.md#starttransaction) on the session. If TransactionOptions
were specified in the call to `withTransaction`, those MUST be used for `startTransaction`. Note that
`ClientSession.defaultTransactionOptions` will be used in the absence of any explicit TransactionOptions.
Expand All @@ -128,23 +133,35 @@ This method should perform the following sequence of actions:
6. If the callback reported an error:
1. If the ClientSession is in the "starting transaction" or "transaction in progress" state, invoke
[abortTransaction](../transactions/transactions.md#aborttransaction) on the session.

2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is
less than 120 seconds, jump back to step two.
less than TIMEOUT_MS, calculate the backoffMS to be `jitter * min(BACKOFF_INITIAL * (1.5**retry), BACKOFF_MAX)`
where:

1. jitter is a random float between \[0, 1)
2. retry is the variable defined in step 1.
3. `BACKOFF_INITIAL` is 5ms
4. `BACKOFF_MAX` is 500ms

If elapsed time + `backoffMS` > `TIMEOUT_MS`, then raise last known error. Otherwise, sleep for `backoffMS`,
increment `retry`, and jump back to step two.

3. If the callback's error includes a "UnknownTransactionCommitResult" label, the callback must have manually
committed a transaction, propagate the callback's error to the caller of `withTransaction` and return
immediately.

4. Otherwise, propagate the callback's error to the caller of `withTransaction` and return immediately.
7. If the ClientSession is in the "no transaction", "transaction aborted", or "transaction committed" state, assume the
callback intentionally aborted or committed the transaction and return immediately.
8. Invoke [commitTransaction](../transactions/transactions.md#committransaction) on the session.
9. If `commitTransaction` reported an error:
1. If the `commitTransaction` error includes a "UnknownTransactionCommitResult" label and the error is not
MaxTimeMSExpired and the elapsed time of `withTransaction` is less than 120 seconds, jump back to step eight.
We will trust `commitTransaction` to apply a majority write concern on retry attempts (see:
MaxTimeMSExpired and the elapsed time of `withTransaction` is less than TIMEOUT_MS, jump back to step eight. We
will trust `commitTransaction` to apply a majority write concern on retry attempts (see:
[Majority write concern is used when retrying commitTransaction](#majority-write-concern-is-used-when-retrying-committransaction)).

2. If the `commitTransaction` error includes a "TransientTransactionError" label and the elapsed time of
`withTransaction` is less than 120 seconds, jump back to step two.
`withTransaction` is less than TIMEOUT_MS, jump back to step two.

3. Otherwise, propagate the `commitTransaction` error to the caller of `withTransaction` and return immediately.
10. The transaction was committed successfully. Return immediately.
Expand All @@ -154,23 +171,39 @@ This method should perform the following sequence of actions:
This method can be expressed by the following pseudo-code:

```typescript
var BACKOFF_INITIAL = 5 // 5ms initial backoff
var BACKOFF_MAX = 500 // 500ms max backoff
withTransaction(callback, options) {
// Note: drivers SHOULD use a monotonic clock to determine elapsed time
var startTime = Date.now(); // milliseconds since Unix epoch
// See the CSOT specification for information on calculating timeoutMS for a convenient transaction API call.
var timeout = getCSOTTimeoutIfSet() ?? 120_000;
var retry = 0;

retryTransaction: while (true) {
if (retry > 0) {
var backoff = Math.random() * min(BACKOFF_INITIAL * (1.5**retry),
BACKOFF_MAX);

if (Date.now() + backoff - startTime >= timeout) {
throw last_error;
}
sleep(backoff);
}
retry += 1
this.startTransaction(options); // may throw on error

try {
callback(this);
} catch (error) {
var last_error = error;
if (this.transactionState == STARTING ||
this.transactionState == IN_PROGRESS) {
this.abortTransaction();
}

if (error.hasErrorLabel("TransientTransactionError") &&
Date.now() - startTime < 120000) {
Date.now() - startTime < timeout) {
continue retryTransaction;
}

Expand Down Expand Up @@ -198,12 +231,12 @@ withTransaction(callback, options) {
*/
if (!isMaxTimeMSExpiredError(error) &&
error.hasErrorLabel("UnknownTransactionCommitResult") &&
Date.now() - startTime < 120000) {
Date.now() - startTime < timeout) {
continue retryCommit;
}

if (error.hasErrorLabel("TransientTransactionError") &&
Date.now() - startTime < 120000) {
Date.now() - startTime < timeout) {
continue retryTransaction;
}

Expand Down Expand Up @@ -324,8 +357,8 @@ exceed the user's original intention for `maxTimeMS`.
The callback may be executed any number of times. Drivers are free to encourage their users to design idempotent
callbacks.

A previous design had no limits for retrying commits or entire transactions. The callback is always able indicate that
`withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly;
A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate
that `withTransaction` should return to its caller (without future retry attempts) by aborting the transaction directly;
however, that puts the onus on avoiding very long (or infinite) retry loops on the application. We expect the most
common cause of retry loops will be due to TransientTransactionErrors caused by write conflicts, as those can occur
regularly in a healthy application, as opposed to UnknownTransactionCommitResult, which would typically be caused by an
Expand All @@ -338,6 +371,16 @@ non-configurable default and is intentionally twice the value of MongoDB 4.0's d
parameter (60 seconds). Applications that desire longer retry periods may call `withTransaction` additional times as
needed. Applications that desire shorter retry periods should not use this method.

### Backoff Benefits

Previously, the driver would retry transactions immediately, which is fine for low levels of contention. But, as the
server load increases, immediate retries can result in retry storms, unnecessarily further overloading the server.

Exponential backoff is well-researched and accepted backoff strategy that is simple to implement. A low initial backoff
(1-millisecond) and growth value (1.25x) were chosen specifically to mitigate latency in low levels of contention.
Empirical evidence suggests that 500-millisecond max backoff ensured that a transaction did not wait so long as to
exceed the 120-second timeout and reduced load spikes.

## Backwards Compatibility

The specification introduces a new method on the ClientSession class and does not introduce any backward breaking
Expand All @@ -357,6 +400,8 @@ provides an implementation of a technique already described in the MongoDB 4.0 d

## Changelog

- 2025-11-20: withTransaction applies exponential backoff when retrying.

- 2024-09-06: Migrated from reStructuredText to Markdown.

- 2023-11-22: Document error handling inside the callback.
Expand Down
Loading