[Phase 2.3] Add retry logic and error handling for cloud operations #14

@zhexuany

Description

Summary

Implement robust retry logic with exponential backoff for transient cloud storage failures.

Parent Epic

Dependencies

Tasks

  1. Create src/storage/retry.rs
  2. Define RetryConfig struct with:
    • max_retries: u32 (default 5)
    • initial_backoff_ms: u64 (default 100)
    • max_backoff_ms: u64 (default 30000)
    • backoff_multiplier: f64 (default 2.0)
    • jitter_enabled: bool (default true)
  3. Define RetryableError trait with method:
    • fn is_retryable(&self) -> bool
  4. Implement RetryableError for StorageError:
    • Retryable: NetworkError, Timeout, IoError(ConnectionReset), and HTTP 429, 500, 502, 503, 504
    • Non-retryable: NotFound, PermissionDenied, InvalidPath, AlreadyExists
  5. Implement error classification for HTTP status codes:
    • 429 (Rate Limited): Retryable
    • 500, 502, 503, 504: Retryable
    • 400, 401, 403, 404: Non-retryable
  6. Implement retry_with_backoff<T, F>() function:
    • Generic over return type and closure
    • Calculate backoff with exponential increase
    • Add jitter (±15%) to prevent thundering herd
    • Log retry attempts with attempt number and backoff
  7. Create RetryingStorage wrapper struct:
    • Wraps any Storage implementation
    • Adds retry logic to all operations
    • Configurable retry settings
  8. Implement circuit breaker (optional):
    • Track failure rate
    • Open circuit after threshold
    • Half-open state for recovery testing
  9. Add logging for all retry attempts
  10. Write unit tests for retry behavior
  11. Write tests for backoff timing
  12. Write tests for jitter distribution
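
A minimal sketch of tasks 2, 3 and 6, for discussion rather than as the final implementation. Field names and defaults come from the task list. Everything else is an assumption: the synchronous `thread::sleep` (the real storage layer may be async), the extra `E` type parameter on `retry_with_backoff`, the helper names `backoff_ms` and `jitter_unit`, the clock-based jitter source (a real implementation would more likely use a proper RNG), and `eprintln!` standing in for whatever logging facade the project uses.

```rust
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Retry settings; field names and defaults follow task 2.
#[derive(Debug, Clone)]
pub struct RetryConfig {
    pub max_retries: u32,
    pub initial_backoff_ms: u64,
    pub max_backoff_ms: u64,
    pub backoff_multiplier: f64,
    pub jitter_enabled: bool,
}

impl Default for RetryConfig {
    fn default() -> Self {
        Self {
            max_retries: 5,
            initial_backoff_ms: 100,
            max_backoff_ms: 30_000,
            backoff_multiplier: 2.0,
            jitter_enabled: true,
        }
    }
}

/// Trait from task 3.
pub trait RetryableError {
    fn is_retryable(&self) -> bool;
}

/// Delay in milliseconds before retry number `attempt` (0-based).
/// `unit` is a sample in [0, 1) that is mapped onto a jitter factor in
/// [0.85, 1.15). The result never exceeds `max_backoff_ms`.
pub fn backoff_ms(cfg: &RetryConfig, attempt: u32, unit: f64) -> u64 {
    let cap = cfg.max_backoff_ms as f64;
    // Clamp the exponent so the cast to i32 cannot wrap.
    let exp = attempt.min(64) as i32;
    let base = cfg.initial_backoff_ms as f64 * cfg.backoff_multiplier.powi(exp);
    let factor = if cfg.jitter_enabled { 0.85 + 0.30 * unit } else { 1.0 };
    (base.min(cap) * factor).min(cap).round() as u64
}

/// Hypothetical jitter source: derives a value in [0, 1) from the clock.
/// This is a stand-in so the sketch needs no external crates.
fn jitter_unit() -> f64 {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.subsec_nanos())
        .unwrap_or(0);
    f64::from(nanos % 1_000) / 1_000.0
}

/// Runs `op`, retrying retryable errors up to `cfg.max_retries` times.
pub fn retry_with_backoff<T, E, F>(cfg: &RetryConfig, mut op: F) -> Result<T, E>
where
    E: RetryableError,
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt < cfg.max_retries && e.is_retryable() => {
                let delay = backoff_ms(cfg, attempt, jitter_unit());
                eprintln!("retry {}/{} after {} ms", attempt + 1, cfg.max_retries, delay);
                thread::sleep(Duration::from_millis(delay));
                attempt += 1;
            }
            // Non-retryable error, or retries exhausted.
            Err(e) => return Err(e),
        }
    }
}
```

Taking the jitter sample as a parameter keeps `backoff_ms` deterministic, which should make the backoff-timing and jitter-distribution tests (tasks 11 and 12) straightforward to write without mocking a random source.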

Error Classification

Error Type                Retryable  Reason
NetworkError              Yes        Transient connectivity
Timeout                   Yes        Temporary slowness
IoError(ConnectionReset)  Yes        Connection dropped
NotFound                  No         Object doesn't exist
PermissionDenied          No         Auth issue
InvalidPath               No         Client error
AlreadyExists             No         Conflict
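
The table and the HTTP status rules from task 5 could be encoded roughly as follows. The variant names mirror the table, but the payload types, the `Http(u16)` variant, and the helper `is_retryable_status` are assumptions, since the issue does not show the actual `StorageError` definition. Treating `IoError` kinds other than `ConnectionReset` as non-retryable is also an assumption; the table only covers that one kind.

```rust
use std::io;

/// Trait from task 3.
pub trait RetryableError {
    fn is_retryable(&self) -> bool;
}

/// Hypothetical shape of StorageError; payload types are illustrative.
#[derive(Debug)]
pub enum StorageError {
    NetworkError(String),
    Timeout,
    IoError(io::Error),
    Http(u16), // assumed variant carrying a raw HTTP status code
    NotFound,
    PermissionDenied,
    InvalidPath(String),
    AlreadyExists,
}

/// Task 5: 429 and 500/502/503/504 are retryable, everything else is not.
pub fn is_retryable_status(status: u16) -> bool {
    matches!(status, 429 | 500 | 502 | 503 | 504)
}

impl RetryableError for StorageError {
    fn is_retryable(&self) -> bool {
        match self {
            StorageError::NetworkError(_) | StorageError::Timeout => true,
            // Only a dropped connection is listed as retryable in the table.
            StorageError::IoError(e) => e.kind() == io::ErrorKind::ConnectionReset,
            StorageError::Http(status) => is_retryable_status(*status),
            StorageError::NotFound
            | StorageError::PermissionDenied
            | StorageError::InvalidPath(_)
            | StorageError::AlreadyExists => false,
        }
    }
}
```

Listing every variant explicitly instead of using a `_` catch-all means the compiler flags any new `StorageError` variant until someone decides whether it is retryable.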

Acceptance Criteria

  • RetryConfig struct defined with defaults
  • RetryableError trait implemented
  • retry_with_backoff function works correctly
  • RetryingStorage wrapper implements Storage
  • Exponential backoff calculated correctly
  • Jitter prevents synchronized retries
  • All unit tests pass
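
For the "RetryingStorage wrapper implements Storage" criterion, one possible shape is a generic wrapper that delegates every operation through a shared retry helper. The `Storage` trait below, its two methods, the trimmed-down `StorageError`, and the constructor arguments are all hypothetical stand-ins, since the real trait is not shown in this issue. Jitter and the backoff cap are left out to keep the sketch short.

```rust
use std::thread;
use std::time::Duration;

/// Trimmed-down stand-in for the real error type.
#[derive(Debug, PartialEq)]
pub enum StorageError {
    Timeout,
    NotFound,
}

impl StorageError {
    fn is_retryable(&self) -> bool {
        matches!(self, StorageError::Timeout)
    }
}

/// Hypothetical Storage trait; the real one likely has more operations.
pub trait Storage {
    fn read(&self, path: &str) -> Result<Vec<u8>, StorageError>;
    fn delete(&self, path: &str) -> Result<(), StorageError>;
}

/// Wraps any Storage implementation and retries retryable failures.
pub struct RetryingStorage<S: Storage> {
    inner: S,
    max_retries: u32,
    initial_backoff: Duration,
}

impl<S: Storage> RetryingStorage<S> {
    pub fn new(inner: S, max_retries: u32, initial_backoff: Duration) -> Self {
        Self { inner, max_retries, initial_backoff }
    }

    /// Shared retry loop: doubles the delay after each failed attempt.
    fn with_retry<T>(
        &self,
        op: impl Fn() -> Result<T, StorageError>,
    ) -> Result<T, StorageError> {
        let mut delay = self.initial_backoff;
        let mut attempt = 0;
        loop {
            match op() {
                Err(e) if e.is_retryable() && attempt < self.max_retries => {
                    thread::sleep(delay);
                    delay = delay.saturating_mul(2);
                    attempt += 1;
                }
                // Success, non-retryable error, or retries exhausted.
                other => return other,
            }
        }
    }
}

impl<S: Storage> Storage for RetryingStorage<S> {
    fn read(&self, path: &str) -> Result<Vec<u8>, StorageError> {
        self.with_retry(|| self.inner.read(path))
    }

    fn delete(&self, path: &str) -> Result<(), StorageError> {
        self.with_retry(|| self.inner.delete(path))
    }
}
```

Because the wrapper itself implements `Storage`, callers can swap it in wherever a backend is expected without changing their code.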

Files to Create

  • src/storage/retry.rs

Metadata

Assignees

No one assigned

Labels

  • area/cloud: Cloud provider integrations
  • area/storage: Storage layer and backends
  • priority/high: High priority
  • size/S: Small: 1-2 days
  • type/feature: New feature or functionality
