Per Partition Circuit Breaker #40302

Draft
wants to merge 81 commits into
base: main
@tvaron3 tvaron3 commented Mar 31, 2025

Problem

Some failures are hard to classify from the client side as either transient or terminal availability issues. Examples include network issues, partition upgrades, and partition migrations. For these failures, the SDK would retry the request in another region but would never mark the region as unavailable unless the failures were also seen in the SDK health check.

Goal

Per partition circuit breaker is meant to lower the granularity of a failover to the partition level for 408 and 5xx status codes and for connection issues. The SDK should now not only fail over the requests but also mark the partition as unavailable. This should keep future requests away from the affected partition for a period of time.
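As a rough illustration of which failures are in scope, the check below treats 408, any 5xx, and a missing response (a connection issue) as failures that count against a partition. The function name and the use of `None` for "no response received" are assumptions made for this sketch, not the SDK's API.

```python
from typing import Optional

REQUEST_TIMEOUT = 408


def counts_toward_circuit_breaker(status_code: Optional[int]) -> bool:
    """Return True when a failed request should be recorded against the partition.

    `status_code` is None when no response was received at all, which this
    sketch uses to represent a connection issue.
    """
    if status_code is None:
        return True
    return status_code == REQUEST_TIMEOUT or 500 <= status_code <= 599
```

Under these assumptions, a 404 or 429 would not move a partition toward the unavailable state.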

Solution

Scope

Per partition circuit breaker is applicable to:

  • any consistency level
  • document operations
  • single write region accounts with multiple read regions
  • multiple write region accounts

New State

Partitions will now have 4 health states tracked by a new class PartitionHealthTracker. The failure rate and the number of consecutive failures will be tracked per partition. These counters are kept for one minute and then reset. Once a partition reaches a threshold it will be marked as unavailable. Requests will not be routed to partitions marked as unhealthy or unhealthy tentative for a region. The unavailable regions will be appended to the excluded locations supplied by the user.
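The counter bookkeeping described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: `FailureWindow`, `min_requests`, and the injected `clock` are hypothetical, and the rule that either threshold alone trips the breaker (using a strict "greater than" comparison against the tolerated value) is a guess at the semantics.

```python
import time
from typing import Callable


class FailureWindow:
    """Hypothetical counters for one partition in one region (not the PR's class)."""

    def __init__(
        self,
        consecutive_tolerated: int = 10,           # read default listed in this PR
        failure_percentage_tolerated: float = 90.0,
        min_requests: int = 100,                   # assumption: avoid tripping on tiny samples
        window_seconds: float = 60.0,              # counters reset after one minute
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.consecutive_tolerated = consecutive_tolerated
        self.failure_percentage_tolerated = failure_percentage_tolerated
        self.min_requests = min_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._reset()

    def _reset(self) -> None:
        self.successes = 0
        self.failures = 0
        self.consecutive_failures = 0
        self.window_start = self.clock()

    def record(self, success: bool) -> bool:
        """Record one outcome and return True if a threshold is now exceeded."""
        if self.clock() - self.window_start >= self.window_seconds:
            self._reset()
        if success:
            self.successes += 1
            self.consecutive_failures = 0
        else:
            self.failures += 1
            self.consecutive_failures += 1
        return self.threshold_exceeded()

    def threshold_exceeded(self) -> bool:
        # Assumption: either threshold on its own is enough to trip.
        if self.consecutive_failures > self.consecutive_tolerated:
            return True
        total = self.successes + self.failures
        if total < self.min_requests:
            return False
        return 100.0 * self.failures / total > self.failure_percentage_tolerated
```

Passing the clock in as a parameter keeps the one-minute reset testable without sleeping.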

stateDiagram
   state "Healthy" as Healthy
   state "Unhealthy" as Unhealthy
   state "Healthy Tentative" as HealthyTentative
   state "Unhealthy Tentative" as UnhealthyTentative
   Start --> Healthy
   Healthy --> UnhealthyTentative : Failure breaking threshold
   HealthyTentative --> Healthy : Success
   HealthyTentative --> Unhealthy : Failure
   UnhealthyTentative --> HealthyTentative: After 60s (configurable)
   Unhealthy --> HealthyTentative : After 120s (configurable)
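The transitions in the diagram can be sketched as a small state machine. The class and method names here are hypothetical, and the timed transitions are evaluated lazily whenever the state is consulted, which is one possible design rather than what the PR necessarily does.

```python
import enum
import time
from typing import Callable


class Health(enum.Enum):
    HEALTHY = "Healthy"
    UNHEALTHY_TENTATIVE = "Unhealthy Tentative"
    HEALTHY_TENTATIVE = "Healthy Tentative"
    UNHEALTHY = "Unhealthy"


class PartitionRegionHealth:
    """Hypothetical health state for one partition in one region."""

    def __init__(
        self,
        unhealthy_tentative_seconds: float = 60.0,   # "After 60s (configurable)"
        unhealthy_seconds: float = 120.0,            # "After 120s (configurable)"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.unhealthy_tentative_seconds = unhealthy_tentative_seconds
        self.unhealthy_seconds = unhealthy_seconds
        self.clock = clock
        self.state = Health.HEALTHY
        self.since = clock()

    def _set(self, state: Health) -> None:
        self.state = state
        self.since = self.clock()

    def _advance(self) -> None:
        # Apply the two time-based transitions from the diagram.
        elapsed = self.clock() - self.since
        if self.state is Health.UNHEALTHY_TENTATIVE and elapsed >= self.unhealthy_tentative_seconds:
            self._set(Health.HEALTHY_TENTATIVE)
        elif self.state is Health.UNHEALTHY and elapsed >= self.unhealthy_seconds:
            self._set(Health.HEALTHY_TENTATIVE)

    def is_routable(self) -> bool:
        """Requests skip partitions that are Unhealthy or Unhealthy Tentative."""
        self._advance()
        return self.state in (Health.HEALTHY, Health.HEALTHY_TENTATIVE)

    def on_threshold_breached(self) -> None:
        self._advance()
        if self.state is Health.HEALTHY:
            self._set(Health.UNHEALTHY_TENTATIVE)

    def on_success(self) -> None:
        self._advance()
        if self.state is Health.HEALTHY_TENTATIVE:
            self._set(Health.HEALTHY)

    def on_failure(self) -> None:
        self._advance()
        if self.state is Health.HEALTHY_TENTATIVE:
            self._set(Health.UNHEALTHY)
```

In this sketch a single failure only matters in the Healthy Tentative state; in the Healthy state, failures are expected to flow through the threshold counters first.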

Service request errors will mark a partition as unavailable for a region more aggressively. A service request error means the request did not reach the service. Previously, the SDK would retry three times in region and then mark the region as unavailable. Now the SDK will do three in-region retries and then mark the partition as unavailable for the region. Service request errors will still count toward failure tracking as well.
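A minimal sketch of that retry flow follows. Everything in it is illustrative: the exception class is a local stand-in, `send_with_partition_failover` is a hypothetical helper, and it assumes that "three in-region retries" means one initial attempt plus three retries.

```python
from typing import Callable, List, Set, Tuple


class ServiceRequestError(Exception):
    """Local stand-in for an error where the request never reached the service."""


IN_REGION_RETRIES = 3


def send_with_partition_failover(
    send: Callable[[str], str],
    partition_id: str,
    regions: List[str],
    unavailable: Set[Tuple[str, str]],
) -> str:
    """Try each region in order, skipping (partition, region) pairs marked unavailable."""
    for region in regions:
        if (partition_id, region) in unavailable:
            continue
        # Assumption: one initial attempt plus three in-region retries.
        for _ in range(1 + IN_REGION_RETRIES):
            try:
                return send(region)
            except ServiceRequestError:
                pass
        # Only this partition is marked unavailable, not the whole region.
        unavailable.add((partition_id, region))
    raise ServiceRequestError("no region available for partition " + partition_id)
```

Once a pair is in `unavailable`, later requests for that partition go straight to the next region, while other partitions can still use the same region.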

New Environment Variables

"AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER": Default will be false.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ": Default will be 10 errors.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE": Default will be 5 errors.
"AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED": Default will be 90 percent.
"AZURE_COSMOS_STALE_PARTITION_UNAVAILABILITY_CHECK_IN_SECONDS": Default will be 120 seconds.
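For illustration, these variables and their defaults could be read as shown below. `CircuitBreakerConfig` and `load_config` are hypothetical names, and parsing the flag by comparing against the string "true" (case-insensitively) is an assumption about the accepted values.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CircuitBreakerConfig:
    enabled: bool
    consecutive_errors_tolerated_read: int
    consecutive_errors_tolerated_write: int
    failure_percentage_tolerated: int
    stale_partition_check_seconds: int


def load_config(env: Mapping[str, str] = os.environ) -> CircuitBreakerConfig:
    """Read the circuit breaker settings, falling back to the defaults listed above."""
    return CircuitBreakerConfig(
        # Assumption: only the string "true" (any case) enables the feature.
        enabled=env.get("AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER", "false").strip().lower() == "true",
        consecutive_errors_tolerated_read=int(
            env.get("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ", "10")),
        consecutive_errors_tolerated_write=int(
            env.get("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE", "5")),
        failure_percentage_tolerated=int(
            env.get("AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED", "90")),
        stale_partition_check_seconds=int(
            env.get("AZURE_COSMOS_STALE_PARTITION_UNAVAILABILITY_CHECK_IN_SECONDS", "120")),
    )
```

Accepting a mapping instead of reading `os.environ` directly makes the defaults easy to check in tests.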

Other Implementations

Azure/azure-sdk-for-java#39265
Azure/azure-cosmos-dotnet-v3#5023

Still to be done

  • live testing with multi write region account
  • fault injection testing
  • sync version
  • perf testing

Relevant Issue

#39687

tvaron3 and others added 28 commits February 5, 2025 19:03
Fixed the timeout retry policy

5 participants