Closed
Labels
- area/cloud: Cloud provider integrations
- area/storage: Storage layer and backends
- priority/high: High priority
- size/S: Small, 1-2 days
- type/feature: New feature or functionality
Description
Summary
Implement robust retry logic with exponential backoff for transient cloud storage failures.
Parent Epic
- [Epic] Distributed Roboflow with Alibaba Cloud (OSS + ACK) #9 Distributed Roboflow with Alibaba Cloud
Dependencies
- Depends on: [Phase 2.1] Implement OSS/S3 backend using object_store #13 (OSS backend)
- Can be done in parallel with [Phase 2.2] Implement multipart upload for large files #12 (Multipart upload)
Tasks
- Create `src/storage/retry.rs`
- Define `RetryConfig` struct with:
  - `max_retries: u32` (default 5)
  - `initial_backoff_ms: u64` (default 100)
  - `max_backoff_ms: u64` (default 30000)
  - `backoff_multiplier: f64` (default 2.0)
  - `jitter_enabled: bool` (default true)
- Define `RetryableError` trait with method: `fn is_retryable(&self) -> bool`
- Implement `RetryableError` for `StorageError`:
  - Retryable: NetworkError, Timeout, and specific HTTP codes
  - Non-retryable: NotFound, PermissionDenied, InvalidPath
- Implement error classification for HTTP status codes:
  - 429 (Rate Limited): Retryable
  - 500, 502, 503, 504: Retryable
  - 400, 401, 403, 404: Non-retryable
- Implement `retry_with_backoff<T, F>()` function:
  - Generic over return type and closure
  - Calculate backoff with exponential increase
  - Add jitter (±15%) to prevent thundering herd
  - Log retry attempts with attempt number and backoff
- Create `RetryingStorage` wrapper struct:
  - Wraps any `Storage` implementation
  - Adds retry logic to all operations
  - Configurable retry settings
- Implement circuit breaker (optional):
  - Track failure rate
  - Open circuit after threshold
  - Half-open state for recovery testing
- Add logging for all retry attempts
- Write unit tests for retry behavior
- Write tests for backoff timing
- Write tests for jitter distribution
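To make the backoff tasks concrete, here is a minimal sketch of `RetryConfig`, the backoff calculation, and `retry_with_backoff`, using only the standard library. Several parts are assumptions rather than anything this issue specifies: the function is synchronous (the real storage layer may well be async), it takes an extra error type parameter `E`, the helpers `backoff_delay` and `jitter_sample` are hypothetical names, jitter is applied before the cap, and the `RandomState` trick stands in for a proper RNG crate.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::thread;
use std::time::Duration;

/// Retry settings; field names and defaults follow the task list.
#[derive(Debug, Clone)]
pub struct RetryConfig {
    pub max_retries: u32,
    pub initial_backoff_ms: u64,
    pub max_backoff_ms: u64,
    pub backoff_multiplier: f64,
    pub jitter_enabled: bool,
}

impl Default for RetryConfig {
    fn default() -> Self {
        Self {
            max_retries: 5,
            initial_backoff_ms: 100,
            max_backoff_ms: 30_000,
            backoff_multiplier: 2.0,
            jitter_enabled: true,
        }
    }
}

/// Trait named in the task list.
pub trait RetryableError {
    fn is_retryable(&self) -> bool;
}

/// Hypothetical helper: delay before retry number `attempt` (0-based).
/// `jitter_unit` is expected in [-1.0, 1.0] and scales the delay by up to ±15%.
/// Assumption: jitter is applied first, then the result is capped at `max_backoff_ms`.
pub fn backoff_delay(config: &RetryConfig, attempt: u32, jitter_unit: f64) -> Duration {
    let exponential =
        config.initial_backoff_ms as f64 * config.backoff_multiplier.powi(attempt as i32);
    let jittered = if config.jitter_enabled {
        exponential * (1.0 + 0.15 * jitter_unit.clamp(-1.0, 1.0))
    } else {
        exponential
    };
    let capped = jittered.min(config.max_backoff_ms as f64);
    Duration::from_millis(capped.round() as u64)
}

/// Hypothetical stand-in for an RNG: a value in [-1.0, 1.0] derived from the
/// randomly seeded std hasher. A real implementation would likely use an RNG crate.
pub fn jitter_sample() -> f64 {
    let bits = RandomState::new().build_hasher().finish();
    (bits as f64 / u64::MAX as f64) * 2.0 - 1.0
}

/// Synchronous sketch: runs `op`, retrying retryable errors up to `max_retries` times.
pub fn retry_with_backoff<T, E, F>(config: &RetryConfig, mut op: F) -> Result<T, E>
where
    E: RetryableError,
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < config.max_retries && err.is_retryable() => {
                let delay = backoff_delay(config, attempt, jitter_sample());
                // Log the attempt number and the chosen backoff.
                eprintln!("retry {}/{} after {:?}", attempt + 1, config.max_retries, delay);
                thread::sleep(delay);
                attempt += 1;
            }
            // Non-retryable error, or the retry budget is exhausted.
            Err(err) => return Err(err),
        }
    }
}
```

With the defaults and jitter disabled, this sketch waits 100 ms, 200 ms, 400 ms, 800 ms, and 1600 ms across the five retries. Taking the jitter value as a parameter of `backoff_delay` keeps the timing tests deterministic, which may help with the backoff and jitter test tasks.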
Error Classification
| Error Type | Retryable | Reason |
|---|---|---|
| `NetworkError` | Yes | Transient connectivity |
| `Timeout` | Yes | Temporary slowness |
| `IoError(ConnectionReset)` | Yes | Connection dropped |
| `NotFound` | No | Object doesn't exist |
| `PermissionDenied` | No | Auth issue |
| `InvalidPath` | No | Client error |
| `AlreadyExists` | No | Conflict |
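The table above and the HTTP status rules from the task list could translate into an `is_retryable` implementation along these lines. This is only a sketch: the real `StorageError` is not shown in this issue, so the variant shapes here (the `String` payload, `IoError` carrying an `io::ErrorKind`, and the `Http(u16)` variant) are assumptions, as is the helper name `is_retryable_status`.

```rust
use std::io;

/// Hypothetical shape of the error type; the real `StorageError` may differ.
#[derive(Debug)]
pub enum StorageError {
    NetworkError(String),
    Timeout,
    IoError(io::ErrorKind),
    /// Assumed variant carrying a raw HTTP status code.
    Http(u16),
    NotFound,
    PermissionDenied,
    InvalidPath,
    AlreadyExists,
}

/// Trait named in the task list.
pub trait RetryableError {
    fn is_retryable(&self) -> bool;
}

/// HTTP classification from the task list: 429 and 500/502/503/504 are retryable.
/// Everything else, including 400/401/403/404, is treated as non-retryable.
pub fn is_retryable_status(status: u16) -> bool {
    matches!(status, 429 | 500 | 502 | 503 | 504)
}

impl RetryableError for StorageError {
    fn is_retryable(&self) -> bool {
        match self {
            StorageError::NetworkError(_) | StorageError::Timeout => true,
            // Only a dropped connection is listed as retryable in the table.
            StorageError::IoError(kind) => matches!(kind, io::ErrorKind::ConnectionReset),
            StorageError::Http(status) => is_retryable_status(*status),
            StorageError::NotFound
            | StorageError::PermissionDenied
            | StorageError::InvalidPath
            | StorageError::AlreadyExists => false,
        }
    }
}
```

Listing every non-retryable variant explicitly, rather than using a `_` wildcard, means the compiler flags any variant added later until someone decides how to classify it.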
Acceptance Criteria
- `RetryConfig` struct defined with defaults
- `RetryableError` trait implemented
- `retry_with_backoff` function works correctly
- `RetryingStorage` wrapper implements `Storage`
- Exponential backoff calculated correctly
- Jitter prevents synchronized retries
- All unit tests pass
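For the criterion that `RetryingStorage` implements `Storage`, one possible shape is a generic wrapper that forwards each call to the inner backend and retries on retryable errors. The project's actual `Storage` trait is not shown in this issue, so the single-method trait, the two-variant error enum, and the `max_retries` field below are all hypothetical stand-ins. The backoff sleep is left out for brevity; a real wrapper would presumably delegate to `retry_with_backoff` and hold a full `RetryConfig`.

```rust
/// Hypothetical, simplified error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum StorageError {
    Timeout,
    NotFound,
}

impl StorageError {
    /// Mirrors the classification table: timeouts retry, missing objects do not.
    pub fn is_retryable(&self) -> bool {
        matches!(self, StorageError::Timeout)
    }
}

/// Hypothetical one-method stand-in for the project's `Storage` trait.
pub trait Storage {
    fn get(&self, path: &str) -> Result<Vec<u8>, StorageError>;
}

/// Wraps any `Storage` implementation and retries retryable failures.
pub struct RetryingStorage<S> {
    inner: S,
    max_retries: u32,
}

impl<S: Storage> RetryingStorage<S> {
    pub fn new(inner: S, max_retries: u32) -> Self {
        Self { inner, max_retries }
    }
}

impl<S: Storage> Storage for RetryingStorage<S> {
    fn get(&self, path: &str) -> Result<Vec<u8>, StorageError> {
        let mut attempt = 0;
        loop {
            match self.inner.get(path) {
                Err(err) if err.is_retryable() && attempt < self.max_retries => {
                    attempt += 1;
                    // Backoff sleep omitted in this sketch.
                    eprintln!("get({path}) failed with {err:?}, retry {attempt}");
                }
                // Success, a non-retryable error, or an exhausted retry budget.
                other => return other,
            }
        }
    }
}
```

Because the wrapper itself implements `Storage`, callers can swap it in wherever a backend is expected without changing their code.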
Files to Create
- `src/storage/retry.rs`