Conversation

@kundadebdatta (Member) commented Oct 18, 2025

Description

Adds structs and traits defining the structure of the cross-regional retry logic, along with a basic ResourceThrottleRetryPolicy implementation to retry throttled requests.

Changes

  • Added handler module and defined the structure of RetryHandler.
  • Added an initial version of BackoffRetryHandler.
  • Added retry_policies module and implemented ResourceThrottleRetryPolicy.
  • Added comprehensive unit tests covering functionality and method chaining.
  • Updated documentation with usage examples.
  • Updated CHANGELOG.md.

All existing functionality remains unchanged and fully backward compatible.
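For context, the retry flow these changes introduce can be sketched as a loop that consults a policy after each failed attempt. This is a hedged, synchronous, std-only illustration (the real implementation is async and uses azure_core request types; `RetryDecision`, `send_with_retries`, and the injected `sleep` hook are illustrative names, not the crate's API):

```rust
use std::time::Duration;

/// Decision returned by the retry policy for one attempt.
pub enum RetryDecision {
    DoNotRetry,
    Retry { after: Duration },
}

/// Retry a fallible operation, consulting `decide` after each failure.
/// `sleep` is injected so tests can observe backoff instead of waiting.
pub fn send_with_retries<T, E>(
    mut send: impl FnMut() -> Result<T, E>,
    mut decide: impl FnMut(&E) -> RetryDecision,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    loop {
        match send() {
            Ok(v) => return Ok(v),
            Err(e) => match decide(&e) {
                RetryDecision::Retry { after } => sleep(after),
                RetryDecision::DoNotRetry => return Err(e),
            },
        }
    }
}
```

The injected `sleep` closure is only there to keep the sketch testable; the async version would await a timer instead.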


Additional instructions:

Make sure you structure your commits in a sensible way and include a concise but descriptive PR description. PREFER concise descriptions and code. The team knows how this code should work, they don't need explanations of it. The code should be largely self-explanatory.

Fixes #3170



@kundadebdatta kundadebdatta self-assigned this Oct 18, 2025
@kundadebdatta kundadebdatta added the Cosmos The azure_cosmos crate label Oct 18, 2025
github-actions bot commented Oct 18, 2025

API Change Check

APIView identified API level changes in this PR and created the following API reviews

azure_data_cosmos

@kundadebdatta kundadebdatta changed the title from "Cosmos: Add Basic Retry Policy for Throtteled Reuests" to "Cosmos: Add Basic Retry Policy for Throtteled Requests" Oct 18, 2025
@kundadebdatta kundadebdatta marked this pull request as ready for review October 20, 2025 22:32
@Copilot Copilot AI review requested due to automatic review settings October 20, 2025 22:32
@kundadebdatta kundadebdatta requested a review from a team as a code owner October 20, 2025 22:32
@kundadebdatta kundadebdatta enabled auto-merge (squash) October 20, 2025 22:32
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements a basic retry policy infrastructure for handling throttled (429 status) requests in Azure Cosmos DB, including an exponential backoff mechanism and comprehensive test coverage.

Key changes:

  • Added retry policy framework with RetryPolicy trait and ResourceThrottleRetryPolicy implementation
  • Implemented BackOffRetryHandler to wrap requests with automatic retry logic
  • Integrated retry handler into CosmosPipeline for transparent retry behavior

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Summary per file:

  • sdk/cosmos/azure_data_cosmos/src/retry_policies/mod.rs — Defines core retry policy infrastructure including the RetryPolicy trait, ShouldRetryResult, BaseRetryPolicy, and configuration
  • sdk/cosmos/azure_data_cosmos/src/retry_policies/resource_throttle_retry_policy.rs — Implements ResourceThrottleRetryPolicy for handling 429 errors with exponential backoff
  • sdk/cosmos/azure_data_cosmos/src/handler/retry_handler.rs — Defines the AbstractRetryHandler trait and BackOffRetryHandler implementation for wrapping requests with retry logic
  • sdk/cosmos/azure_data_cosmos/src/handler/mod.rs — Module definition for handler types
  • sdk/cosmos/azure_data_cosmos/src/pipeline/mod.rs — Integrates BackOffRetryHandler into the CosmosPipeline request flow
  • sdk/cosmos/azure_data_cosmos/tests/retry_policy_429_test.rs — Comprehensive unit tests validating retry behavior, backoff calculation, and limit enforcement
  • sdk/cosmos/azure_data_cosmos/src/routing/mod.rs — Adds type aliases for routing-related collections
  • sdk/cosmos/azure_data_cosmos/src/routing/location_cache.rs — Adds Debug derives and helper methods for endpoint management
  • sdk/cosmos/azure_data_cosmos/src/options/mod.rs — Adds Debug derive to CosmosClientOptions
  • sdk/cosmos/azure_data_cosmos/src/lib.rs — Exposes new public modules and reorganizes location_cache
  • sdk/cosmos/azure_data_cosmos/Cargo.toml — Adds tokio dependency with the time feature
  • sdk/cosmos/.dict.txt — Adds spelling dictionary entries for "backoff" and "pluggable"

@analogrelay (Member) left a comment

Definitely some things to look into. In general, there's a fair bit of "Object-Oriented" design in here that I don't think translates well into Rust. We should try to avoid translating C#/Java concepts into Rust and look at a Rust-first approach. Rust code typically avoids a lot of virtual dispatch and high-level abstractions.

#[allow(dead_code)]
#[cfg_attr(not(target_arch = "wasm32"), async_trait)]
#[cfg_attr(target_arch = "wasm32", async_trait(?Send))]
pub trait AbstractRetryHandler: Send + Sync {
Member:

Just RetryHandler would be the style here. Rust code generally doesn't refer to OOP terms like Abstract.

Member Author:

Agreed. Updated the trait naming.

Comment on lines 12 to 24
/// Result of a retry policy decision
///
/// This struct encapsulates the decision made by a retry policy about whether
/// an operation should be retried and how long to wait before the retry attempt.
///
/// # Fields
/// * `should_retry` - Whether the operation should be retried
/// * `backoff_time` - Duration to wait before retrying (meaningful only if `should_retry` is true)
#[derive(Debug, Clone)]
pub struct ShouldRetryResult {
pub should_retry: bool,
pub back_off_time: Duration,
}
Member:

The more idiomatic way to implement this in Rust is to use an enum. In Rust, enum variants can carry values, which you can use to implement the back-off time:

pub enum RetryResult {
    /// Indicates that the operation should not be retried.
    DoNotRetry,
    /// Indicates that the operation should be retried after waiting for the duration specified in `after`.
    Retry { after: Duration }
}

Member Author:

Agreed. Refactored the code.

Comment on lines 26 to 63
impl ShouldRetryResult {
/// Creates a result indicating the operation should not be retried
///
/// # Example
/// ```
/// use azure_data_cosmos::retry_policies::ShouldRetryResult;
///
/// let result = ShouldRetryResult::no_retry();
/// assert!(!result.should_retry);
/// ```
pub fn no_retry() -> Self {
Self {
should_retry: false,
back_off_time: Duration::ZERO,
}
}

/// Creates a result indicating the operation should be retried after a delay
///
/// # Arguments
/// * `backoff` - The duration to wait before retrying
///
/// # Example
/// ```
/// use azure_data_cosmos::retry_policies::ShouldRetryResult;
/// use std::time::Duration;
///
/// let result = ShouldRetryResult::retry_after(Duration::from_secs(5));
/// assert!(result.should_retry);
/// assert_eq!(result.back_off_time, Duration::from_secs(5));
/// ```
pub fn retry_after(backoff: Duration) -> Self {
Self {
should_retry: true,
back_off_time: backoff,
}
}
}
Member:

You wouldn't really need this if you're using an enum, you'd have it already in the standard construction syntax:

RetryResult::DoNotRetry
RetryResult::Retry { after: backoff }

Member Author:

Agreed. Refactored the code.

Ok(r)
// Clone pipeline and convert context to owned so the closure can be Fn
let pipeline = self.pipeline.clone();
let ctx_owned = ctx.with_value(resource_link).into_owned();
Member:

I think you could avoid this .into_owned() (which may force an allocation) by having the retry handler (and the sender closure) take it as a parameter, and flowing it down that way rather than closing over it.
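A generic sketch of that suggestion, with a `&str` standing in for the borrowed context (the function and parameter names here are hypothetical, purely to show the shape): the sender closure receives the context as a parameter each call, so the caller never needs an owned clone moved into the closure.

```rust
/// The sender takes the context as a parameter, so the caller keeps
/// borrowing its own context instead of converting it to an owned value
/// just to move it into the closure.
fn send_with<F>(ctx: &str, attempts: usize, mut sender: F) -> Result<String, String>
where
    F: FnMut(&str) -> Result<String, String>,
{
    let mut last = Err("no attempts made".to_string());
    for _ in 0..attempts {
        last = sender(ctx);
        if last.is_ok() {
            break;
        }
    }
    last
}
```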

Comment on lines 7 to 12
/// Type alias matching C# `ReadOnlyCollection<Uri>` semantics in a lightweight form.
/// We use `Arc<[Url]>` for cheap cloning and slice immutability.
pub type ReadOnlyUrlCollection = Arc<[Url]>;

/// Type alias representing `ReadOnlyDictionary<string, Uri>`.
pub type ReadOnlyLocationMap = Arc<HashMap<String, Url>>;
Member:

I don't think we need these. First off, we shouldn't just be porting C# concepts over to Rust since we're aiming to make an idiomatic Rust SDK. Second, the idea of separate "read-only collection" types in Rust isn't really necessary. Rust already enforces that only the owner of a collection, or the holder of a &mut reference, can modify it, so the only place you really need to enforce "read-only" behavior in a collection type is if you want to restrict even the owner from modifying the collection and that's not usually necessary.

Member Author:

Yes, you are right. These are not needed anymore. Removed.

Comment on lines +33 to +37
max_attempt_count: usize,
backoff_delay_factor: u32,
max_wait_time: Duration,
current_attempt_count: AtomicUsize,
cumulative_retry_delay: Arc<AtomicUsize>,
Member:

The state management here isn't quite right. The ResourceThrottleRetryPolicy is a singleton for the entire application, but most of this state is per-request.

Member:

Now that you're creating an instance of this per-request, I don't think these need to be Arcs or AtomicUsizes at all. Regular usize should be fine, since the instance is being used entirely within a single request "thread"

Member:

Ah, and because all the RetryPolicy methods take &self, you end up having to use atomic types. That's the Rust compiler giving you a hint that something is a little wrong about how you're handling mutability :).

If a method can mutate fields on self, it needs to take it in as &mut self. Rust will then guarantee that as long as the caller is using safe Rust (i.e. not inside an unsafe block), NO other running code is holding a mutable reference.

I suggest a quick review of References and Borrowing in the Rust Book! Getting the hang of these is a key part of understanding the magic of Rust :)
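The contrast described above, as a minimal illustration (struct and method names are made up for the example): with `&self`, mutation forces interior mutability via atomics; taking `&mut self` lets a plain `usize` work, and the borrow checker guarantees exclusive access.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// With `&self`, mutation requires interior mutability:
struct SharedCounter {
    attempts: AtomicUsize,
}
impl SharedCounter {
    fn record(&self) -> usize {
        // fetch_add returns the previous value, so add 1 for the new count.
        self.attempts.fetch_add(1, Ordering::SeqCst) + 1
    }
}

// With `&mut self`, a plain usize suffices; the compiler guarantees
// no other code holds a reference while we mutate:
struct PerRequestCounter {
    attempts: usize,
}
impl PerRequestCounter {
    fn record(&mut self) -> usize {
        self.attempts += 1;
        self.attempts
    }
}
```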

/// retry policy should be used for this specific request.
/// # Arguments
/// * `request` - The HTTP request to analyze
pub fn get_policy_for_request(&self, _request: &Request) -> Arc<dyn RetryPolicy> {
Member:

I think a RetryPolicy should be constructed for each request. That seems to me like the cleanest way to have clean per-request state. In Rust, most allocations will be done on the stack, and are thus very cheap, so don't be afraid to create new instances of structs. Heap allocations (which are what we generally try to avoid in managed languages like C#/Java) only happen when you're using wrappers like Box, Arc, or Vec, or for very specific types, like String, where heap-allocation is essential. Creating a new value of a struct containing only primitive (number) types and returning it is pretty cheap.

Member Author:

Yes, and you are right. I was not in favor of creating the retry policy per request, to avoid extra memory allocation. However, after reading further, I think this is safe to do.

Refactored the code to reflect the changes.
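A per-request version of the policy might reduce to something like this std-only sketch. Field names echo the ones in the diff, but the doubling backoff, the `backoff` field, and the method signature are illustrative assumptions, not the actual refactor:

```rust
use std::time::Duration;

pub enum RetryResult {
    DoNotRetry,
    Retry { after: Duration },
}

/// Created fresh per request, so plain fields (no Arc or AtomicUsize) suffice.
pub struct ResourceThrottleRetryPolicy {
    max_attempt_count: usize,
    max_wait_time: Duration,
    current_attempt_count: usize,
    cumulative_retry_delay: Duration,
    backoff: Duration, // next delay to apply; doubles after each retry
}

impl ResourceThrottleRetryPolicy {
    pub fn new(max_attempt_count: usize, initial_backoff: Duration, max_wait_time: Duration) -> Self {
        Self {
            max_attempt_count,
            max_wait_time,
            current_attempt_count: 0,
            cumulative_retry_delay: Duration::ZERO,
            backoff: initial_backoff,
        }
    }

    /// `&mut self`: per-request state is mutated without atomics.
    pub fn should_retry(&mut self, status: u16) -> RetryResult {
        if status != 429
            || self.current_attempt_count >= self.max_attempt_count
            || self.cumulative_retry_delay + self.backoff > self.max_wait_time
        {
            return RetryResult::DoNotRetry;
        }
        self.current_attempt_count += 1;
        let after = self.backoff;
        self.cumulative_retry_delay += after;
        self.backoff *= 2; // exponential backoff
        RetryResult::Retry { after }
    }
}
```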

@heaths (Member) left a comment

How is this different from the retry policy we already have in ClientOptions for all HTTP clients? It already handles 429, and clients can customize the behavior in several ways.

None,
);

CosmosPipeline {
Member:

How does this fit into the azure_core retry policies? We should also double-check these, as the behavior from core is not necessarily what we want. We have overridden some of the retry configurations from core in the Python SDK.

Member:

We intend to have the same conceptual design as other Azure SDK languages. If you can customize retry behavior in other languages, within reason we'd prefer to replicate that in core as well. Adding policies in clients should be relatively rare.

Member Author:

The reasons we wanted to keep the Cosmos retry policies separate from the azure_core general retry policies are:

  • We have a lot of cross-regional routing logic, based on different status codes.
  • We will need to inject our endpoint manager, location caches, and pk-range-based routing into the retry layer to route the request correctly.

I understand that a base retry policy already exists today in azure_core; let me explore a bit more to check whether it can be leveraged and overridden to modify the behavior.

Member:

Thing is, you can't remove the default retry policy. At best, you could have your client set it to none - overriding any policy options a user might set (or honor them as a copy, but then overwrite them) - and inject your own.

We have the same design as any other Azure SDK language, though. Did you need a custom retry policy for other languages? Are there gaps in our core design that necessitate some additional options perhaps?

@analogrelay (Member):

How is this different from the retry policy we already have in ClientOptions for all HTTP clients? It already handles 429, and clients can customize the behavior in several ways.

You're correct that the current behavior is essentially the same as the existing policy. Cosmos has a lot more retry logic that we'll need to add in the future though (for partition topology changes, cross-region retry, cross-region failover/circuit breakers, etc.), and right now it isn't really possible to re-use the ExponentialRetryPolicy (though I started an offline conversation to consider opening that up).

@kundadebdatta and I did chat about plugging our retry policy into the RetryMode::Custom hook to allow us to at least put our custom logic in the "right place" in the HTTP pipeline.

@analogrelay (Member) left a comment

Definitely better state management. We talked about seeing if we can rework this to using azure_core::http::policies::Policy. I'd still like you to look at that, but the rest looks good aside from some style/idiomatic suggestions.


// Helper trait to conditionally require Send on non-WASM targets
#[cfg(not(target_arch = "wasm32"))]
pub trait CosmosConditionalSend: Send {}
Member:

This should just be ConditionalSend. It's private to our crate, and even if it was a public API, we generally avoid prefixing types with the crate name (i.e. Cosmos) in Rust. There are exceptions (CosmosClient) but it's not like in .NET where we tend to prefix a lot of types with a product/package name like Cosmos

/// a pluggable retry policy system. It wraps HTTP requests with intelligent retry logic
/// that handles both transient network errors and HTTP error responses.
#[derive(Debug, Clone)]
pub struct BackOffRetryHandler {}
Member:

nit: It's generally idiomatic to just remove the {} entirely if you're writing a type with no fields:

Suggested change
pub struct BackOffRetryHandler {}
pub struct BackOffRetryHandler;

/// retry policy should be used for this specific request.
/// # Arguments
/// * `request` - The HTTP request to analyze
pub fn get_policy_for_request(&self, _request: &Request) -> Arc<dyn RetryPolicy> {
Member:

nit: policy_for_request. "Getters" are usually just unprefixed in Rust.

#[allow(dead_code)]
#[cfg_attr(not(target_arch = "wasm32"), async_trait)]
#[cfg_attr(target_arch = "wasm32", async_trait(?Send))]
pub trait RetryHandler: Send + Sync {
Member:

As mentioned offline, I think we can just have our retry handler be an azure_core::http::policies::Policy. Even if we don't want it to be a Policy for some reason, we generally avoid introducing traits until we actually have a significant need for them (it's not like in .NET/Java where interfaces/abstract classes get used often for "testability")


Comment on lines +256 to +259
let backoff1 = match result1 {
RetryResult::Retry { after } => after,
_ => panic!("Expected retry result"),
};
Member:

There are a few more compact ways to express "two-arm" matches (where one arm is just _). I'd suggest this:

let RetryResult::Retry { after: backoff1 } = result1 else {
    panic!("Expected retry result")
};

That will match on RetryResult::Retry and then bind the after field to the local variable backoff1. If the result doesn't match that pattern, the else block will be taken. It's the same code as you have above, just a slightly more compact and idiomatic way to write it.

There's also a similar inverted syntax:

let backoff1 = if let RetryResult::Retry { after } = value {
    after
} else {
    panic!("Expected retry result");
};

That one is just FYI, I think the first syntax is preferable in this case.
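Both forms above, made self-contained for reference (enum name per the earlier suggestion; `Debug`/`PartialEq` derives added here so results could also be asserted on directly, as suggested elsewhere in this review):

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum RetryResult {
    DoNotRetry,
    Retry { after: Duration },
}

/// Extracts the backoff with `let ... else`: binds `after`,
/// or diverges in the else block when the pattern doesn't match.
pub fn expect_backoff(result: RetryResult) -> Duration {
    let RetryResult::Retry { after } = result else {
        panic!("Expected retry result")
    };
    after
}
```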

Comment on lines +279 to +284
match result4 {
RetryResult::DoNotRetry => {
println!("Fourth attempt - should not retry");
}
RetryResult::Retry { .. } => panic!("Should NOT retry after exceeding max attempts"),
}
Member:

This can be an assert_eq! I think:

Suggested change
match result4 {
RetryResult::DoNotRetry => {
println!("Fourth attempt - should not retry");
}
RetryResult::Retry { .. } => panic!("Should NOT retry after exceeding max attempts"),
}
assert_eq!(result4, RetryResult::DoNotRetry);

Comment on lines +319 to +324
match result {
RetryResult::DoNotRetry => {
// Success - should not retry on client errors
}
RetryResult::Retry { .. } => panic!("Should NOT retry on {}", description),
}
Member:

Another place for an assert_eq!

Comment on lines +295 to +300
match result {
RetryResult::DoNotRetry => {
// Success - should not retry
}
RetryResult::Retry { .. } => panic!("Should NOT retry on successful response"),
}
Member:

Another place for an assert_eq!

)))
}
#[tokio::test]
async fn test_retry_policy_handles_429_status() {
Member:

nit (across all these tests): The test_ prefix on test functions isn't usually used in Rust.


Labels

Cosmos The azure_cosmos crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cosmos: Build RetryPolicy For Cross Regional Retry

4 participants