Skip to content

Conversation

@baileympearson
Copy link
Contributor

@baileympearson baileympearson commented Dec 2, 2025

DRIVERS-3239

Overview

This PR adds support for a new class of errors (SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.

Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).

Python will be second implementer.
Node implementation: mongodb/node-mongodb-native#4806

Testing

The testing strategy is two-fold:

  • Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:

    • operations are retried using the new SystemOverloadedError label
    • operations are retried no more than 5 (current MAX_ATTEMPTS, as defined in the spec) times
  • Following Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.

  • Update changelog.

  • Test changes in at least one language driver.

  • Test these changes against all server versions and topologies (including standalone, replica set, and sharded
    clusters).

@baileympearson baileympearson marked this pull request as ready for review December 2, 2025 18:59
@baileympearson baileympearson requested review from a team as code owners December 2, 2025 18:59
@baileympearson baileympearson requested review from jmikola and jyemin and removed request for a team December 2, 2025 18:59
@blink1073
Copy link
Member

It looks like you also need to bump the schema version:

source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3source/client-backpressure/tests/backpressure-retry-loop.yml invalid
[
  {
    instancePath: '/tests/0/operations/3/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3
source/client-backpressure/tests/backpressure-retry-max-attempts.yml invalid
[
  {
    instancePath: '/tests/0/operations/1/expectError',
    schemaPath: '#/definitions/expectedError/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]
 using schema v1.3

@blink1073
Copy link
Member

WIP Python implementation: mongodb/mongo-python-driver#2635

@blink1073
Copy link
Member

blink1073 commented Dec 3, 2025

All unified and prose tests are passing in the Python implementation.

Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating...

Edit 2: we're all good now

Copy link
Contributor

@jyemin jyemin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.

- This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the
timeout to avoid retry storms.
5. If the previous error includes the `SystemOverloadedError` label, the client MUST apply exponential backoff according
to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


This specification expands the driver's retry ability to all commands, including those not currently considered
retryable such as updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys
the following rules:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the rules include all the deposits into the token bucket, consider adding withdrawals as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


## Q&A

TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything to add here, or just remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing yet .. I'll remove it for now. It can always be added back if there are items worth mentioning here.


## Changelog

- 2025-XX-XX: Initial version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how we handle the date... Is there an automation for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I know of. Usually the spec author fills it out before merging

I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.

Comment on lines 53 to 55
```python
assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1)
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts:
Could this stick to being javascript from top-to-bottom?
Can we also do BIG_TIME - SMALL_TIME >= 2.1?

  • To me that's more human-readable.
  • It maintains that BIG_TIME must always be bigger (removing the need for absolute_value).
  • It captures the 1 second variation whilst still being rigid to the 3.1 second window and stays minimally invasive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure. This phrasing was taken from the withTransaction backoff tests. Would you want me to update it there as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jibola Anything remaining here?

the following rules:

1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.
- The value is 0.1 and non-configurable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the concerns for golang, I thought updating these values by a scalar of 10 in those cases was fine?
cc: @matthewdale

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to identify clients which do and do not support backpressure. Currently, this flag is unused but in the future the
server may offer different rate limiting behavior for clients that do not support backpressure.

##### Implementation notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
##### Implementation notes
#### Implementation notes

Copy link
Contributor Author

@baileympearson baileympearson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submitting comments


## Q&A

TODO
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing yet .. I'll remove it for now. It can always be added back if there are items worth mentioning here.

Comment on lines 53 to 55
```python
assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1)
```
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure. This phrasing was taken from the withTransaction backoff tests. Would you want me to update it there as well?

#### Goodput

The throughput of positive, useful output. In the context of drivers, this refers to the number of non-error results
that the driver processes per unit of time.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an alternative definition you think makes more sense here?


## Changelog

- 2025-XX-XX: Initial version.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intended as a TODO, yeah.

If I understand correctly: you're saying new specs generally just leave the changelog empty? That's fine with me as well if that's the usual practice with new specs and test readmes.


This specification expands the driver's retry ability to all commands if the error indicates that is both an overload
error and that it is retryable, including those not currently considered retryable such as updateMany, create
collection, getMore, and generic runCommand. The new command execution method obeys the following rules:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently (before back pressure), there seem to be no reason for a driver to inspect the server response to a command run via runCommand (unless we are talking about runCursorCommand, which I am not).

I thought that all drivers inspected the response to throw an exception if the response includes: ok: 0. Is this not the case in Java?

--

runCommand is mentioned here as if the overload retry policy is trivially applicable to it, but I am not sure that's the case
and
Retrying runCommand requires changing its implementation such that it uses whatever internal retry mechanisms a driver has.

I don't think the assumption is that it is trivial for drivers to implement for all commands, because we're explicitly adding support for commands that previously were not retryable in addition to runCommand (ex: getMore). Debating whether or not it is trivial to implement misses the point imo: this is a new feature that drivers are building, so it is understood that it requires code changes to implement.

r.e. including runCommand in this project: I'm going to keep it for now, unless there's a strong technical argument against including it. I don't think we can say that drivers handle backpressure if there is a user-facing API that doesn't include the backpressure retry logic.

if attempt > attempts:
raise

deprioritized_servers.append(server.address)
Copy link
Member

@vbabanin vbabanin Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There appears to be a potential inconsistency between the prose specification and the pseudocode regarding server deprioritization.

Step 7:

"If the request is eligible for retry (as outlined in step 4), the client MUST add the previously used server's address to the list of deprioritized server addresses"

Step 4 requires both SystemOverloadedError and RetryableError labels.

However, the pseudocode uses deprioritized_servers.append(server.address) unconditionally - which deprioritizes the server for all retries, including traditional retryable reads and retryable writes.

If the pseudocode in this specification is intended to show the unified retry behavior combining both the new overload retry policy and existing retryable reads/writes logic (as stated in retryable-writes.md:341-346), then we should note that the existing specifications (retryable-writes/reads) restrict server deprioritization to sharded clusters only.

Line 325-326 in retryable-writes.md:

"In a sharded cluster, the server on which the operation failed MUST be provided to the server selection mechanism as a deprioritized server."

My reading is that the unified behavior should:

  1. Deprioritize the server when the error has both SystemOverloadedError and RetryableError labels (matching Step 7), and
  2. Continue to deprioritize only mongos in sharded clusters for existing retryable reads/writes behavior.

Could you confirm if this understanding is correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that is incorrect. All topologies and all retry mechanisms now always deprioritize servers: 3e577e8

@NoahStapp We missed a reference specifying that sharded clusters behavior in our last PR - would you mind opening a follow up that removes the sentence that Slav pointed out?

Copy link
Contributor

@NoahStapp NoahStapp Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sentence was already fixed in the DRIVERS-3344 commit:

For each retry attempt, drivers MUST select a writable server. The server address on which the operation failed MUST be
provided to the server selection mechanism as a member of the deprioritized server address list.

I'd use the latest commit of master as a reference instead of Bailey's fork.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbabanin Anything else in this thread or can we resolve it?


#### Pseudocode

The following pseudocode describes the overload retry policy:
Copy link
Member

@vbabanin vbabanin Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryable-writes.md:341-346 states:

"[!NOTE]
The rules above and the pseudocode below only demonstrate the rules for retryable writes... the pseudocode was intentionally unmodified. For a pseudocode block that contains both retryable writes logic as defined in this specification and backoff retryabilitity as defined in the client backpressure specification, see the pseudocode in the Backpressure Specification."

So retryable-writes tells readers go to the backpressure spec for the unified pseudocode.

However, client-backpressure.md does not seem to say it's providing a unified pseudocode. It just:

  1. Says it's "separate from" existing behavior
  2. Says the pseudocode shows "overload retry policy", which reads as if it only describes the overload retry policy.

If the intent is for client-backpressure.md pseudo-code to present the unified pseudocode, I’d suggest making that explicit in the text and adding links back to the relevant specs, similar to retryable-writes.md:341-346. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pseudocode in the client backpressure specification is unified - it handles both existing retryable reads/writes, and the new client backpressure retryability. It doesn't include all of the same detail as the pseudocodes in each retryability spec though, but I intentionally omitted things like error handling for brevity.

ex:

            is_retryable = is_retryable_write() or is_retryable_read() or (exc.has_error_label("RetryableError") and exc.has_error_label("SystemOverloadedError"))

Is there something specific you feel is missing? I can add more detail to this pseudocode if there are things missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbabanin Just following up - any additional questions?

- The value of `MAX_ATTEMPTS` is 5 and non-configurable.
- This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the
timeout to avoid retry storms.
5. If a retry attempt is to be attempted, a token will be consumed from the token bucket.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording is difficult to understand. Consider a clearer sentence:

Suggested change
5. If a retry attempt is to be attempted, a token will be consumed from the token bucket.
5. A retry attempt consumes 1 token from the token bucket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done

Comment on lines +195 to +198
The overload retry policy introduces a per-client token bucket to limit SystemOverloaded retry attempts. Although the
server rejects excess operations as quickly as possible, doing so costs CPU and creates extra contention on the
connection pool which can eventually negatively affect goodput. To reduce this risk, the token bucket will limit retry
attempts during a prolonged overload.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to directly contradict the assertions in the CSOT spec section Why don't drivers use backoff/jitter between retry attempts?.

  1. Should we remove that section from the CSOT spec?
  2. Is there a reason not to apply backoff+jitter to all retry attempts, not just those with the SystemOverloadedError label?

P.S. I realize both of these are arguably out of the scope of this change, mostly wanted to know if there are existing answers to these questions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re 1. - I think the section is still accurate because we only apply backoff and jitter to a particular subset of errors (system overloaded errors). I could update the phrasing to mention the client backpressure specification if you'd like.
re 2. - Not that I can think of but I also don't know that it is necessary. Especially with Noah's server selection changes, because now there's no chance of selecting the same server again.

The exception would be really extreme server overload (what this feature addresses) or other extreme driver scenarios where requests to all nodes are failing, exhausting the deprioritized server list. The only such non-overload scenario I can think of is a complete network outage to all nodes but I don't think backoff and jitter would help much in this scenario at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the responses! Consider this resolved.

@blink1073 blink1073 requested a review from a team as a code owner December 17, 2025 12:41
@blink1073 blink1073 requested review from durran and removed request for a team December 17, 2025 12:41

An error considered retryable by the [Retryable Writes Specification](../retryable-writes/retryable-writes.md).

#### Backpressure Error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spec should also get a changelog entry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

@baileympearson baileympearson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submitting comments

Comment on lines 53 to 55
```python
assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1)
```
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jibola Anything remaining here?

operations:
-
object: *utilCollection
name: deleteMany
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed this one. I like your suggestion - done.

- This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the
timeout to avoid retry storms.
5. If the previous error includes the `SystemOverloadedError` label, the client MUST apply exponential backoff according
to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


This specification expands the driver's retry ability to all commands, including those not currently considered
retryable such as updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys
the following rules:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if attempt > attempts:
raise

deprioritized_servers.append(server.address)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vbabanin Anything else in this thread or can we resolve it?

Comment on lines 176 to 179
Some drivers might not have retryability implementations that allow easy separation of the existing retryable
reads/writes mechanisms from the exponential backoff and jitter retry algorithm. An example pseudocode is defined below
that demonstrates a combined retryable reads/writes implementation with the corresponding backpressure changes (adapted
from the Node driver's implementation):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this example implementation

The following pseudocode describes the overload retry policy:

```python
BASE_BACKOFF = 0.1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to leave it as-is; as the pseudocode is written in python and follows python conventions. Let me know if the clarified pseudocde works for you

backoff = jitter * min(BASE_BACKOFF * (2 ** attempt), MAX_BACKOFF)
# Raise if the error if non retryable.
if exc.has_error_label("RetryableError") and exc.has_error_label("SystemOverloadedError"):
raise
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


## Changelog

- 2025-XX-XX: Initial version.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the changelog

}
```

3. Execute the following command. Expect that the command errors. Measure the duration of the command execution.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reworked the prose to be explicit about what the command is.

Copy link
Contributor

@matthewdale matthewdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! 👍

5. A retry attempt consumes 1 token from the token bucket.
6. If the request is eligible for retry (as outlined in step 4), the client MUST apply exponential backoff according to
the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)`
- `i` is the retry attempt number (starting with 0 for the first retry).
Copy link
Member

@sanych-sun sanych-sun Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we decided to have retry number start we 0? It makes the requirements confusing. Why don't we start with 1 and in fact we will have the same formula as for withTransaction:

delayMS = j * min(maxBackoff, baseBackoff * 2^(i-1))

With the only difference here we have 2 as the base for pow function, in convinientTransaction API we have 1.5

Copy link
Member

@sanych-sun sanych-sun Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the formula from withTransaction:
jitter * min(BACKOFF_INITIAL * 1.5 ** (transactionAttempt - 1), BACKOFF_MAX)

Where transactionAttempt started with 0 and is being incremented AFTER the delay, but before executing the callback attempt. Which is also confusing... but in C# implementation we wait AFTER the attempt so it's more natural.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants