Per Partition Circuit Breaker #40302

tvaron3 · 2025-03-31T22:51:15Z

Problem

Some failures are hard to classify from the client side as either transient or terminal availability issues. Examples include network issues, partition upgrades, and partition migrations. For these failures, the SDK would retry the request in another region, but it would never mark the region as unavailable unless the failures were also seen in the SDK health check.

Goal

Per partition circuit breaker is meant to lower the granularity of a failover to the partition level for 408 and 5xx status codes and for connection issues. The SDK should now not only fail over the requests but also mark the partition as unavailable. This should prevent future requests from being routed to the affected partition for a period of time.
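As a rough illustration of which failures the goal above covers, the check could look like the sketch below. The function name is hypothetical, and treating a missing status code as a connection issue is an assumption, not something this PR states.

```python
from typing import Optional

REQUEST_TIMEOUT = 408  # HTTP 408 Request Timeout


def counts_toward_circuit_breaker(status_code: Optional[int]) -> bool:
    """Return True if a failed request should count against the partition.

    Hypothetical helper. status_code is None when no HTTP response was
    received, which this sketch treats as a connection issue (assumption).
    """
    if status_code is None:
        return True
    # 408 and the whole 5xx range are in scope per the goal statement.
    return status_code == REQUEST_TIMEOUT or 500 <= status_code <= 599
```

Under this sketch, client errors such as 404 or 429 would not count against the partition.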

Solution

Scope

Per partition circuit breaker applies to:

any consistency level
document operations
single write region accounts with multiple read regions
multiple write region accounts

New State

Partitions will now have 4 health states tracked by a new class PartitionHealthTracker. The failure rate and the number of consecutive failures are tracked per partition over a one-minute window and then reset. Once a partition crosses a threshold, it is marked as unavailable. Requests are not routed to a partition that is marked Unhealthy or Unhealthy Tentative for a region. The regions where the partition is unavailable are appended to the excluded locations supplied by the user.

stateDiagram
   state "Healthy" as Healthy
   state "Unhealthy" as Unhealthy
   state "Healthy Tentative" as HealthyTentative
   state "Unhealthy Tentative" as UnhealthyTentative
   Start --> Healthy
   Healthy --> UnhealthyTentative : Failure breaking threshold
   HealthyTentative --> Healthy : Success
   HealthyTentative --> Unhealthy : Failure
   UnhealthyTentative --> HealthyTentative: After 60s (configurable)
   Unhealthy --> HealthyTentative : After 120s (configurable)
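
The transitions in the diagram above can be sketched as a small state machine. This is a minimal illustration, not the tracker class from this PR: the names `Health` and `PartitionHealthState` are hypothetical, and the `now` parameter is an assumption added so the timing can be exercised without waiting.

```python
import enum
import time
from typing import Optional


class Health(enum.Enum):
    HEALTHY = "Healthy"
    UNHEALTHY_TENTATIVE = "Unhealthy Tentative"
    HEALTHY_TENTATIVE = "Healthy Tentative"
    UNHEALTHY = "Unhealthy"


# Waits before a partition is probed again (both configurable per the diagram).
UNHEALTHY_TENTATIVE_WAIT_SECONDS = 60.0
UNHEALTHY_WAIT_SECONDS = 120.0


def _resolve(now: Optional[float]) -> float:
    """Use the injected clock value if given, else the monotonic clock."""
    return time.monotonic() if now is None else now


class PartitionHealthState:
    """Health of one partition in one region (hypothetical sketch)."""

    def __init__(self, now: Optional[float] = None) -> None:
        self.state = Health.HEALTHY
        self.changed_at = _resolve(now)

    def _move(self, state: Health, now: float) -> None:
        self.state = state
        self.changed_at = now

    def on_threshold_breached(self, now: Optional[float] = None) -> None:
        # Healthy --> Unhealthy Tentative
        if self.state is Health.HEALTHY:
            self._move(Health.UNHEALTHY_TENTATIVE, _resolve(now))

    def on_success(self, now: Optional[float] = None) -> None:
        # Healthy Tentative --> Healthy
        if self.state is Health.HEALTHY_TENTATIVE:
            self._move(Health.HEALTHY, _resolve(now))

    def on_failure(self, now: Optional[float] = None) -> None:
        # Healthy Tentative --> Unhealthy
        if self.state is Health.HEALTHY_TENTATIVE:
            self._move(Health.UNHEALTHY, _resolve(now))

    def refresh(self, now: Optional[float] = None) -> None:
        # Time-based transitions back to Healthy Tentative.
        current = _resolve(now)
        elapsed = current - self.changed_at
        if (self.state is Health.UNHEALTHY_TENTATIVE
                and elapsed >= UNHEALTHY_TENTATIVE_WAIT_SECONDS):
            self._move(Health.HEALTHY_TENTATIVE, current)
        elif self.state is Health.UNHEALTHY and elapsed >= UNHEALTHY_WAIT_SECONDS:
            self._move(Health.HEALTHY_TENTATIVE, current)

    def is_routable(self, now: Optional[float] = None) -> bool:
        # Requests skip partitions that are Unhealthy or Unhealthy Tentative.
        self.refresh(now)
        return self.state not in (Health.UNHEALTHY, Health.UNHEALTHY_TENTATIVE)
```

In this sketch a partition in Healthy Tentative accepts traffic again, and the outcome of the next request decides whether it returns to Healthy or drops to Unhealthy.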

Service request errors mark a partition as unavailable for a region more aggressively. A service request error means the request did not reach the service. Previously, the SDK would retry three times in the region and then mark the whole region as unavailable. Now the SDK does three in-region retries and then marks only the partition as unavailable for that region. Service request errors also still count toward failure tracking.
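
The failure tracking and the service request error rule could be combined roughly as in the sketch below. Everything here is illustrative. `FailureWindow` and its method names are hypothetical. `MIN_REQUESTS_FOR_RATE` is an assumption that does not appear in this PR description, added so that a single failed request does not count as a 100 percent failure rate. Tripping only when a tolerated value is exceeded (rather than reached) is also an assumption.

```python
WINDOW_SECONDS = 60.0  # counters are tracked for one minute, then reset
CONSECUTIVE_TOLERATED = {"read": 10, "write": 5}
FAILURE_PERCENTAGE_TOLERATED = 90.0
MIN_REQUESTS_FOR_RATE = 20  # assumption, not from the PR description


class FailureWindow:
    """Counters for one partition in one region (hypothetical helper)."""

    def __init__(self, now: float) -> None:
        self._reset(now)

    def _reset(self, now: float) -> None:
        self.started_at = now
        self.successes = 0
        self.failures = 0
        self.consecutive = {"read": 0, "write": 0}

    def _reset_if_expired(self, now: float) -> None:
        if now - self.started_at >= WINDOW_SECONDS:
            self._reset(now)

    def record_success(self, operation: str, now: float) -> None:
        self._reset_if_expired(now)
        self.successes += 1
        self.consecutive[operation] = 0

    def record_failure(
        self,
        operation: str,
        now: float,
        service_request_retries_exhausted: bool = False,
    ) -> bool:
        """Record a failure; return True if the partition should be marked unavailable."""
        self._reset_if_expired(now)
        # Service request errors still count toward failure tracking.
        self.failures += 1
        self.consecutive[operation] += 1
        # After the three in-region retries fail, trip immediately.
        if service_request_retries_exhausted:
            return True
        if self.consecutive[operation] > CONSECUTIVE_TOLERATED[operation]:
            return True
        total = self.successes + self.failures
        if total < MIN_REQUESTS_FOR_RATE:
            return False
        return 100.0 * self.failures / total > FAILURE_PERCENTAGE_TOLERATED
```

A caller would invoke `record_failure` for each qualifying error and, when it returns True, move the partition to Unhealthy Tentative for that region.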

New Environment Variables

"AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER": Default will be false.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ": Default will be 10 errors.
"AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE": Default will be 5 errors.
"AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED": Default will be 90 percent.
"AZURE_COSMOS_STALE_PARTITION_UNAVAILABILITY_CHECK_IN_SECONDS": Default will be 120 seconds.
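
For illustration, the variables above could be read with their defaults as in the sketch below. The `CircuitBreakerConfig` class and `load_config` function are hypothetical names, and accepting the string "true" in any letter case as the enabling value is an assumption about parsing.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CircuitBreakerConfig:
    enabled: bool
    consecutive_errors_tolerated_for_read: int
    consecutive_errors_tolerated_for_write: int
    failure_percentage_tolerated: int
    stale_partition_unavailability_check_seconds: int


def load_config(env: Mapping[str, str] = os.environ) -> CircuitBreakerConfig:
    """Read the circuit breaker settings, falling back to the listed defaults."""
    enabled_raw = env.get("AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER", "false")
    return CircuitBreakerConfig(
        # Assumption: only "true" (any case) enables the feature.
        enabled=enabled_raw.strip().lower() == "true",
        consecutive_errors_tolerated_for_read=int(
            env.get("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_READ", "10")
        ),
        consecutive_errors_tolerated_for_write=int(
            env.get("AZURE_COSMOS_CONSECUTIVE_ERROR_COUNT_TOLERATED_FOR_WRITE", "5")
        ),
        failure_percentage_tolerated=int(
            env.get("AZURE_COSMOS_FAILURE_PERCENTAGE_TOLERATED", "90")
        ),
        stale_partition_unavailability_check_seconds=int(
            env.get("AZURE_COSMOS_STALE_PARTITION_UNAVAILABILITY_CHECK_IN_SECONDS", "120")
        ),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in a test without touching the process environment.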

Other Implementations

Azure/azure-sdk-for-java#39265
Azure/azure-cosmos-dotnet-v3#5023

Still to be done

live testing with multi write region account
fault injection testing
sync version
perf testing

Relevant Issue

#39687

tvaron3 and others added 28 commits February 5, 2025 19:03

change default read timeout

ff20cf9

fix tests

40e43c4

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

faf6c27

…into tvaron3/readtimeout

Add read timeout tests for database account calls

aefe30b

fix timeout retry policy

9a234f8

Fixed the timeout logic

8859c9f

Merge pull request #2 from tvaron3/tvaron3/readTimeout

8b166fc

Fixed the timeout logic

Fixed the timeout retry policy

ac78da9

Merge pull request #3 from tvaron3/readtimeout

e8bc02e

Fixed the timeout retry policy

Mock tests for timeout and failover retry policy

09aac90

Merge branch 'tvaron3/readtimeout' of https://github.com/tvaron3/azur…

48a20fa

…e-sdk-for-python into users/fabianm/tests

Create test_dummy.py

f22e7d2

Update test_dummy.py

dd8a466

Update test_dummy.py

8ac11c5

Update test_dummy.py

b53e2e9

Iterating on fault injection tooling

973ec44

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

f25af53

…into users/fabianm/tests

Refactoring to have FaultInjectionTransport in its own file

5d72848

Update test_dummy.py

8c9aa4b

Reafctoring FaultInjectionTransport

7260e9d

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

bf3e60b

…into users/fabianm/tests

Iterating on tests

0705aeb

Prettifying tests

baf7aea

small refactoring

e90b722

Adding MM topology on Emulator

cb58896

Adding cross region retry tests

46ec31c

Add Excluded Locations Feature

f03f51f

initial ppcb changes

cf42098

github-actions bot added the Cosmos label Mar 31, 2025

github-project-automation bot added this to CosmosDB Python Eco-System Mar 31, 2025

tvaron3 and others added 30 commits April 3, 2025 11:12

fix test

93c2d7d

fix tests

345f390

fix async in test

fe74aa0

Added multi-region tests

5bb9f1f

Fix _AddParitionKey to pass options to sub methods

996217a

Added initial live tests

41fc917

Updated live-platform-matrix for multi-region tests

07b8f39

initial sync version of fault injection

1b09739

Merge branch 'users/fabianm/tests' of https://github.com/tvaron3/azur…

0f0a991

…e-sdk-for-python into users/fabianm/tests

add all sync tests

2fb3dc9

add new error and fix logs

7b81482

fix test

f355e30

Merge branch 'users/fabianm/tests' of https://github.com/FabianMeiswi…

3056787

…nkel/azure-sdk-for-python into tvaron3/ppcb

Add cosmosQuery mark to TestQuery

8495c51

Correct spelling

b29980c

Fixed live platform matrix syntax

5e79172

Changed Multi-regions

fd40cd7

first ppcb test

85e1206

merge with main

96124fe

fix test

34e3d82

refactor due to pk range wrapper needing io call and pylint

ce14666

Merge branch 'main' of https://github.com/Azure/azure-sdk-for-python …

7b939e8

…into tvaron3/ppcb

Merge branch 'user/allekim/feature/addExcludedLocations' of https://g…

b33cfb6

…ithub.com/allenkim0129/azure-sdk-for-python into tvaron3/ppcb

add test for failure_rate threshold

e98ab57

fix pylint and cspell

36407c6

fix pylint

1baf872

fix and add tests

739e090

add collection rid to batch

d5c380a

add partition key range id to partition key range to cache

e7f7265

address failures

38f8033
