-
Notifications
You must be signed in to change notification settings - Fork 245
DRIVERS-3239: Add exponential backoff to operation retry loop for server overloaded errors #1862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
- add prose test - add assertions on the number of retries for maxAttempts tests - don't run clientBulkWrite tests on <8.0 servers
|
It looks like you also need to bump the schema version: |
|
WIP Python implementation: mongodb/mongo-python-driver#2635 |
|
All unified and prose tests are passing in the Python implementation. Edit: we're still failing one unified test, "client.clientBulkWrite retries using operation loop", investigating... Edit 2: we're all good now |
jyemin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only reviewed the specification changes, not the pseudocode or tests. Those are best reviewed by implementers.
| - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the | ||
| timeout to avoid retry storms. | ||
| 5. If the previous error includes the `SystemOverloadedError` label, the client MUST apply exponential backoff according | ||
| to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)` | |
| to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| This specification expands the driver's retry ability to all commands, including those not currently considered | ||
| retryable such as updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys | ||
| the following rules: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the rules include all the deposits into the token bucket, consider adding withdrawals as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| ## Q&A | ||
|
|
||
| TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything to add here, or just remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing yet .. I'll remove it for now. It can always be added back if there are items worth mentioning here.
|
|
||
| ## Changelog | ||
|
|
||
| - 2025-XX-XX: Initial version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how we handle the date... Is there an automation for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I know of. Usually the spec author fills it out before merging
I'll just leave this thread open to remind myself to add changelog dates before merging once all changes are completed.
| ```python | ||
| assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1) | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts:
Could this stick to being javascript from top-to-bottom?
Can we also do BIG_TIME - SMALL_TIME >= 2.1?
- To me that's more human-readable.
- It maintains that
BIG_TIMEmust always be bigger (removing the need for absolute_value). - It captures the 1 second variation whilst still being rigid to the 3.1 second window and stays minimally invasive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure. This phrasing was taken from the withTransaction backoff tests. Would you want me to update it there as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jibola Anything remaining here?
| the following rules: | ||
|
|
||
| 1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens. | ||
| - The value is 0.1 and non-configurable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the concerns for golang, I thought updating these values by a scalar of 10 in those cases was fine?
cc: @matthewdale
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the outcome was the opposite: https://docs.google.com/document/d/1teqNgeWbW6dpRQOALrJTEBRoO6sYcrpq9T3_NJE0QfU/edit?disco=AAABtSVfUxI
| to identify clients which do and do not support backpressure. Currently, this flag is unused but in the future the | ||
| server may offer different rate limiting behavior for clients that do not support backpressure. | ||
|
|
||
| ##### Implementation notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ##### Implementation notes | |
| #### Implementation notes |
baileympearson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
submitting comments
|
|
||
| ## Q&A | ||
|
|
||
| TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing yet .. I'll remove it for now. It can always be added back if there are items worth mentioning here.
| ```python | ||
| assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1) | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure. This phrasing was taken from the withTransaction backoff tests. Would you want me to update it there as well?
| #### Goodput | ||
|
|
||
| The throughput of positive, useful output. In the context of drivers, this refers to the number of non-error results | ||
| that the driver processes per unit of time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an alternative definition you think makes more sense here?
|
|
||
| ## Changelog | ||
|
|
||
| - 2025-XX-XX: Initial version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's intended as a TODO, yeah.
If I understand correctly: you're saying new specs generally just leave the changelog empty? That's fine with me as well if that's the usual practice with new specs and test readmes.
|
|
||
| This specification expands the driver's retry ability to all commands if the error indicates that is both an overload | ||
| error and that it is retryable, including those not currently considered retryable such as updateMany, create | ||
| collection, getMore, and generic runCommand. The new command execution method obeys the following rules: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently (before back pressure), there seem to be no reason for a driver to inspect the server response to a command run via runCommand (unless we are talking about runCursorCommand, which I am not).
I thought that all drivers inspected the response to throw an exception if the response includes: ok: 0. Is this not the case in Java?
--
runCommand is mentioned here as if the overload retry policy is trivially applicable to it, but I am not sure that's the case
and
Retrying runCommand requires changing its implementation such that it uses whatever internal retry mechanisms a driver has.
I don't think the assumption is that it is trivial for drivers to implement for all commands, because we're explicitly adding support for commands that previously were not retryable in addition to runCommand (ex: getMore). Debating whether or not it is trivial to implement misses the point imo: this is a new feature that drivers are building, so it is understood that it requires code changes to implement.
r.e. including runCommand in this project: I'm going to keep it for now, unless there's a strong technical argument against including it. I don't think we can say that drivers handle backpressure if there is a user-facing API that doesn't include the backpressure retry logic.
source/client-backpressure/tests/backpressure-retry-max-attempts.yml.template
Show resolved
Hide resolved
source/client-backpressure/tests/backpressure-retry-max-attempts.yml.template
Show resolved
Hide resolved
| if attempt > attempts: | ||
| raise | ||
|
|
||
| deprioritized_servers.append(server.address) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There appears to be a potential inconsistency between the prose specification and the pseudocode regarding server deprioritization.
Step 7:
"If the request is eligible for retry (as outlined in step 4), the client MUST add the previously used server's address to the list of deprioritized server addresses"
Step 4 requires both SystemOverloadedError and RetryableError labels.
However, the pseudocode uses deprioritized_servers.append(server.address) unconditionally - which deprioritizes the server for all retries, including traditional retryable reads and retryable writes.
If the pseudocode in this specification is intended to show the unified retry behavior combining both the new overload retry policy and existing retryable reads/writes logic (as stated in retryable-writes.md:341-346), then we should note that the existing specifications (retryable-writes/reads) restrict server deprioritization to sharded clusters only.
Line 325-326 in retryable-writes.md:
"In a sharded cluster, the server on which the operation failed MUST be provided to the server selection mechanism as a deprioritized server."
My reading is that the unified behavior should:
- Deprioritize the server when the error has both
SystemOverloadedErrorandRetryableErrorlabels (matching Step 7), and - Continue to deprioritize only mongos in sharded clusters for existing retryable reads/writes behavior.
Could you confirm if this understanding is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that is incorrect. All topologies and all retry mechanisms now always deprioritize servers: 3e577e8
@NoahStapp We missed a reference specifying that sharded clusters behavior in our last PR - would you mind opening a follow up that removes the sentence that Slav pointed out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sentence was already fixed in the DRIVERS-3344 commit:
specifications/source/retryable-writes/retryable-writes.md
Lines 320 to 321 in 3e577e8
| For each retry attempt, drivers MUST select a writable server. The server address on which the operation failed MUST be | |
| provided to the server selection mechanism as a member of the deprioritized server address list. |
I'd use the latest commit of master as a reference instead of Bailey's fork.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vbabanin Anything else in this thread or can we resolve it?
|
|
||
| #### Pseudocode | ||
|
|
||
| The following pseudocode describes the overload retry policy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
retryable-writes.md:341-346 states:
"[!NOTE]
The rules above and the pseudocode below only demonstrate the rules for retryable writes... the pseudocode was intentionally unmodified. For a pseudocode block that contains both retryable writes logic as defined in this specification and backoff retryabilitity as defined in the client backpressure specification, see the pseudocode in the Backpressure Specification."
So retryable-writes tells readers go to the backpressure spec for the unified pseudocode.
However, client-backpressure.md does not seem to say it's providing a unified pseudocode. It just:
- Says it's "separate from" existing behavior
- Says the pseudocode shows "overload retry policy", which reads as if it only describes the overload retry policy.
If the intent is for client-backpressure.md pseudo-code to present the unified pseudocode, I’d suggest making that explicit in the text and adding links back to the relevant specs, similar to retryable-writes.md:341-346. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pseudocode in the client backpressure specification is unified - it handles both existing retryable reads/writes, and the new client backpressure retryability. It doesn't include all of the same detail as the pseudocodes in each retryability spec though, but I intentionally omitted things like error handling for brevity.
ex:
is_retryable = is_retryable_write() or is_retryable_read() or (exc.has_error_label("RetryableError") and exc.has_error_label("SystemOverloadedError"))
Is there something specific you feel is missing? I can add more detail to this pseudocode if there are things missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vbabanin Just following up - any additional questions?
| - The value of `MAX_ATTEMPTS` is 5 and non-configurable. | ||
| - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the | ||
| timeout to avoid retry storms. | ||
| 5. If a retry attempt is to be attempted, a token will be consumed from the token bucket. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording is difficult to understand. Consider a clearer sentence:
| 5. If a retry attempt is to be attempted, a token will be consumed from the token bucket. | |
| 5. A retry attempt consumes 1 token from the token bucket. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done
| The overload retry policy introduces a per-client token bucket to limit SystemOverloaded retry attempts. Although the | ||
| server rejects excess operations as quickly as possible, doing so costs CPU and creates extra contention on the | ||
| connection pool which can eventually negatively affect goodput. To reduce this risk, the token bucket will limit retry | ||
| attempts during a prolonged overload. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to directly contradict the assertions in the CSOT spec section Why don't drivers use backoff/jitter between retry attempts?.
- Should we remove that section from the CSOT spec?
- Is there a reason not to apply backoff+jitter to all retry attempts, not just those with the
SystemOverloadedErrorlabel?
P.S. I realize both of these are arguably out of the scope of this change, mostly wanted to know if there are existing answers to these questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re 1. - I think the section is still accurate because we only apply backoff and jitter to a particular subset of errors (system overloaded errors). I could update the phrasing to mention the client backpressure specification if you'd like.
re 2. - Not that I can think of but I also don't know that it is necessary. Especially with Noah's server selection changes, because now there's no chance of selecting the same server again.
The exception would be really extreme server overload (what this feature addresses) or other extreme driver scenarios where requests to all nodes are failing, exhausting the deprioritized server list. The only such non-overload scenario I can think of is a complete network outage to all nodes but I don't think backoff and jitter would help much in this scenario at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the responses! Consider this resolved.
source/transactions/tests/unified/backpressure-retryable-commit.yml
Outdated
Show resolved
Hide resolved
source/transactions/tests/unified/backpressure-retryable-abort.yml
Outdated
Show resolved
Hide resolved
source/transactions/tests/unified/backpressure-retryable-reads.yml
Outdated
Show resolved
Hide resolved
source/transactions/tests/unified/backpressure-retryable-writes.yml
Outdated
Show resolved
Hide resolved
|
|
||
| An error considered retryable by the [Retryable Writes Specification](../retryable-writes/retryable-writes.md). | ||
|
|
||
| #### Backpressure Error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This spec should also get a changelog entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
baileympearson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
submitting comments
| ```python | ||
| assertTrue(absolute_value(with_backoff_time - (no_backoff_time + 3.1 seconds)) < 1) | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jibola Anything remaining here?
| operations: | ||
| - | ||
| object: *utilCollection | ||
| name: deleteMany |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missed this one. I like your suggestion - done.
| - This intentionally changes the behavior of CSOT which otherwise would retry an unlimited number of times within the | ||
| timeout to avoid retry storms. | ||
| 5. If the previous error includes the `SystemOverloadedError` label, the client MUST apply exponential backoff according | ||
| to according to the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| This specification expands the driver's retry ability to all commands, including those not currently considered | ||
| retryable such as updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys | ||
| the following rules: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| if attempt > attempts: | ||
| raise | ||
|
|
||
| deprioritized_servers.append(server.address) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vbabanin Anything else in this thread or can we resolve it?
| Some drivers might not have retryability implementations that allow easy separation of the existing retryable | ||
| reads/writes mechanisms from the exponential backoff and jitter retry algorithm. An example pseudocode is defined below | ||
| that demonstrates a combined retryable reads/writes implementation with the corresponding backpressure changes (adapted | ||
| from the Node driver's implementation): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this example implementation
| The following pseudocode describes the overload retry policy: | ||
|
|
||
| ```python | ||
| BASE_BACKOFF = 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to leave it as-is; as the pseudocode is written in python and follows python conventions. Let me know if the clarified pseudocde works for you
| backoff = jitter * min(BASE_BACKOFF * (2 ** attempt), MAX_BACKOFF) | ||
| # Raise if the error if non retryable. | ||
| if exc.has_error_label("RetryableError") and exc.has_error_label("SystemOverloadedError"): | ||
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
|
||
| ## Changelog | ||
|
|
||
| - 2025-XX-XX: Initial version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the changelog
| } | ||
| ``` | ||
|
|
||
| 3. Execute the following command. Expect that the command errors. Measure the duration of the command execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reworked the prose to be explicit about what the command is.
matthewdale
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! 👍
| 5. A retry attempt consumes 1 token from the token bucket. | ||
| 6. If the request is eligible for retry (as outlined in step 4), the client MUST apply exponential backoff according to | ||
| the following formula: `delayMS = j * min(maxBackoff, baseBackoff * 2^i)` | ||
| - `i` is the retry attempt number (starting with 0 for the first retry). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we decided to have retry number start we 0? It makes the requirements confusing. Why don't we start with 1 and in fact we will have the same formula as for withTransaction:
delayMS = j * min(maxBackoff, baseBackoff * 2^(i-1))
With the only difference here we have 2 as the base for pow function, in convinientTransaction API we have 1.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the formula from withTransaction:
jitter * min(BACKOFF_INITIAL * 1.5 ** (transactionAttempt - 1), BACKOFF_MAX)
Where transactionAttempt started with 0 and is being incremented AFTER the delay, but before executing the callback attempt. Which is also confusing... but in C# implementation we wait AFTER the attempt so it's more natural.
DRIVERS-3239
Overview
This PR adds support for a new class of errors (
SystemOverloadedError) to drivers' operation retry logic, as outlined in the design document.Additionally, it includes a new argument to the MongoDB handshake (also defined in the design document).
Python will be second implementer.
Node implementation: mongodb/node-mongodb-native#4806
Testing
The testing strategy is two-fold:
Building off of Ezra's work to generate unified tests for retryable handshake errors, this PR generates unified tests to confirm that:
SystemOverloadedErrorlabelFollowing Iris's work in DRIVERS-1934: withTransaction API retries too frequently #1851, this PR adds a prose test that ensures drivers apply exponential backoff in the retryability loop.
Update changelog.
Test changes in at least one language driver.
Test these changes against all server versions and topologies (including standalone, replica set, and sharded
clusters).