Per Partition Circuit Breaker #40302
Draft: tvaron3 wants to merge 81 commits into Azure:main from tvaron3:tvaron3/ppcb
…into tvaron3/readtimeout
Fixed the timeout logic
Fixed the timeout retry policy
…e-sdk-for-python into users/fabianm/tests
…into users/fabianm/tests
…into users/fabianm/tests
Problem
Some failures are hard to classify from the client side as either transient or terminal availability issues. Examples include network issues, partition upgrades, and partition migrations. For these failures, the SDK would retry the request in another region, but it would never mark the region as unavailable unless the failures also showed up in the SDK health check.
Goal
The per partition circuit breaker lowers the granularity of a failover to the partition level for 408 and 5xx status codes and for connection issues. The SDK should now not only fail over the request but also mark the partition as unavailable. This should prevent future requests from being attempted on the affected partition for a period of time.
Solution
Scope
Per partition circuit breaker is applicable for
New State
Partitions will now have 4 health states, tracked by a new class PartitionHealthTracker. The failure rate and the number of consecutive failures will be tracked per partition. These counters are kept for one minute and then reset. Once a partition reaches a threshold, it will be marked as unavailable. Requests will not be routed to a partition that is marked as unhealthy or unhealthy tentative for a region. The regions where the partition is unavailable will be appended to the excluded locations supplied by the user.
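The tracking described above can be sketched roughly as follows. This is a minimal illustration, not the class added in this PR: `PartitionHealthSketch`, its method names, and the `min_requests` floor for the failure rate check are all hypothetical, and "tolerated" is read here as "trip once the value is exceeded". The sketch only models marking a partition unavailable; it does not model the four health states or recovery.

```python
import time


class PartitionHealthSketch:
    """Hypothetical per-(partition, region) failure tracker; not the SDK's class."""

    def __init__(self, consecutive_tolerated=10, failure_percentage_tolerated=90,
                 min_requests=100, window_seconds=60, clock=time.monotonic):
        self._consecutive_tolerated = consecutive_tolerated
        self._failure_percentage_tolerated = failure_percentage_tolerated
        self._min_requests = min_requests  # assumption: avoid tripping on tiny samples
        self._window_seconds = window_seconds
        self._clock = clock
        self._windows = {}         # (partition_id, region) -> counters
        self._unavailable = set()  # (partition_id, region) pairs

    def _window(self, key):
        # Counters are kept for one window and then start over.
        now = self._clock()
        window = self._windows.get(key)
        if window is None or now - window["start"] >= self._window_seconds:
            window = {"start": now, "ok": 0, "failed": 0, "consecutive": 0}
            self._windows[key] = window
        return window

    def record_success(self, partition_id, region):
        window = self._window((partition_id, region))
        window["ok"] += 1
        window["consecutive"] = 0

    def record_failure(self, partition_id, region):
        key = (partition_id, region)
        window = self._window(key)
        window["failed"] += 1
        window["consecutive"] += 1
        total = window["ok"] + window["failed"]
        too_many_in_a_row = window["consecutive"] > self._consecutive_tolerated
        rate_too_high = (
            total >= self._min_requests
            and 100 * window["failed"] / total > self._failure_percentage_tolerated
        )
        if too_many_in_a_row or rate_too_high:
            self._unavailable.add(key)

    def is_unavailable(self, partition_id, region):
        return (partition_id, region) in self._unavailable

    def excluded_regions(self, partition_id, user_excluded):
        # Append the regions where this partition is unavailable to the user's list.
        extra = sorted(
            region for (pid, region) in self._unavailable
            if pid == partition_id and region not in user_excluded
        )
        return list(user_excluded) + extra
```

A caller would consult `excluded_regions` before routing a request, so the user's own exclusions always come first and the tracker only adds to them.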
Service request errors will be more aggressive in marking a partition as unavailable for a region. With a service request error, the request did not reach the service. Previously, the SDK would retry three times in the region and then mark the region as unavailable. Now the SDK will do three in-region retries and then mark the partition as unavailable for the region. Service request errors will still count toward failure tracking as well.
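As a rough sketch of the service request error path, assuming a hypothetical `send` callable and a hypothetical `mark_partition_unavailable` callback: the `ServiceRequestError` class below is a local stand-in so the example is self-contained, and none of these names are taken from the PR. After the in-region retries are exhausted, the sketch marks the partition and re-raises so that a caller could fail over to another region.

```python
class ServiceRequestError(Exception):
    """Local stand-in for an error where the request never reached the service."""


def send_with_in_region_retries(send, partition_id, region,
                                mark_partition_unavailable,
                                max_in_region_retries=3):
    """Try once, retry up to three times in the same region, then give up.

    On giving up, only the partition is marked unavailable for this region,
    not the whole region.
    """
    retries = 0
    while True:
        try:
            return send(region)
        except ServiceRequestError:
            if retries == max_in_region_retries:
                mark_partition_unavailable(partition_id, region)
                raise  # the caller decides where to fail over
            retries += 1
```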
New Environment Variables
"AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER": Default will be false.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ": Default will be 10 errors.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE": Default will be 5 errors.
"AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED": Default will be 90 percent.
"AZURE_COSMOS_STALE_PARTITION_UNAVAILABILITY_CHECK_IN_SECONDS": Default will be 120 seconds.
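For illustration, the variables and defaults listed above could be read like this. The variable names and default values come from the list; `CircuitBreakerSettings`, `load_settings`, and the rule that only the string "true" (case-insensitive) enables the feature are assumptions made for this sketch, not the SDK's parsing logic.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitBreakerSettings:
    """Hypothetical container; field defaults mirror the list above."""
    enabled: bool = False
    consecutive_errors_tolerated_for_read: int = 10
    consecutive_errors_tolerated_for_write: int = 5
    failure_percentage_tolerated: int = 90
    stale_partition_unavailability_check_seconds: int = 120


def load_settings(environ=None):
    """Read the settings from a mapping, defaulting to the process environment."""
    env = os.environ if environ is None else environ
    defaults = CircuitBreakerSettings()

    def as_int(name, default):
        return int(env.get(name, default))

    return CircuitBreakerSettings(
        # Assumption: only "true" (any casing) turns the feature on.
        enabled=env.get("AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER", "false").strip().lower() == "true",
        consecutive_errors_tolerated_for_read=as_int(
            "AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ",
            defaults.consecutive_errors_tolerated_for_read),
        consecutive_errors_tolerated_for_write=as_int(
            "AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE",
            defaults.consecutive_errors_tolerated_for_write),
        failure_percentage_tolerated=as_int(
            "AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED",
            defaults.failure_percentage_tolerated),
        stale_partition_unavailability_check_seconds=as_int(
            "AZURE_COSMOS_STALE_PARTITION_UNAVAILABILITY_CHECK_IN_SECONDS",
            defaults.stale_partition_unavailability_check_seconds),
    )
```

Passing a plain dict to `load_settings` instead of relying on `os.environ` keeps the parsing easy to exercise in tests.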
Other Implementations
Azure/azure-sdk-for-java#39265
Azure/azure-cosmos-dotnet-v3#5023
Still to be done
Relevant Issue
#39687