Advanced Retry Configuration

Cortex provides configurable retry logic with exponential backoff, jitter, and support for the Retry-After header.

Overview

The retry system automatically retries failed requests for transient errors (network issues, rate limits, temporary server errors) while respecting Retry-After headers and applying intelligent backoff strategies.

Key Features

  • Exponential Backoff: Progressively increase delay between retries
  • Jitter: Randomize delays to prevent thundering herd
  • Retry-After Support: Honor server-provided retry timing
  • Error Classification: Automatic detection of retryable vs non-retryable errors
  • Configurable Policies: Preset and custom retry configurations
  • Per-Provider Settings: Different retry behavior for different providers

Automatically Retried Status Codes

  • 408 - Request Timeout
  • 429 - Too Many Requests (respects Retry-After header)
  • 500 - Internal Server Error
  • 502 - Bad Gateway
  • 503 - Service Unavailable (respects Retry-After header)
  • 504 - Gateway Timeout
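
The behavior described in this overview can be sketched as a small loop. This is a minimal illustration, not Cortex's actual implementation: `call_with_retry`, the `send` callback, and the `(status, retry_after)` tuple are hypothetical names chosen for the example. It assumes `max_attempts` counts the initial request, uses exponential backoff with full jitter, and leaves out network-error handling for brevity.

```python
import random
import time
from typing import Callable, Optional, Tuple

# Status codes listed above as automatically retried.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}

# Hypothetical response shape: (status_code, retry_after_seconds or None).
Response = Tuple[int, Optional[float]]


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    multiplier: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, retry_after = send()
        if status not in RETRYABLE_STATUS or attempt == max_attempts:
            return status, retry_after
        # Exponential backoff, capped at max_delay.
        backoff = min(max_delay, base_delay * multiplier ** (attempt - 1))
        # Prefer the server's hint when present; otherwise apply full jitter.
        delay = retry_after if retry_after is not None else random.uniform(0, backoff)
        sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch testable: a test can record the requested delays instead of actually waiting.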

Quick Start

Using Preset Policies

oauth:
  retry:
    policy: "conservative"  # Recommended default

providers:
  anthropic:
    oauth:
      enabled: true
      retry:
        policy: "aggressive"  # Override for specific provider

Custom Configuration

providers:
  openai:
    oauth:
      enabled: true
      retry:
        policy: "custom"
        max_attempts: 5
        base_delay: 1s
        max_delay: 30s
        multiplier: 2.0
        backoff_strategy: "exponential"
        jitter_type: "full"
        respect_retry_after: true

Retry Policy Presets

Conservative (Recommended Default)

Best for production environments. Balances reliability with responsiveness.

retry:
  policy: "conservative"

Parameters:

  • Max attempts: 3
  • Base delay: 1s
  • Max delay: 30s
  • Multiplier: 2.0
  • Jitter: Full
  • Backoff: Exponential

Retry Timeline (3 attempts in total: the initial request plus 2 retries):

  1. Initial request fails
  2. Wait up to 1s (0s - 1s with full jitter)
  3. Retry 1 fails
  4. Wait up to 2s (0s - 2s with full jitter)
  5. Retry 2 (final attempt)

Aggressive

Use for unreliable networks or critical requests where retries are preferred.

retry:
  policy: "aggressive"

Parameters:

  • Max attempts: 5
  • Base delay: 500ms
  • Max delay: 30s
  • Multiplier: 2.0
  • Jitter: Full
  • Backoff: Exponential

Retry Timeline (5 attempts in total: the initial request plus 4 retries):

  1. Initial request fails
  2. Wait up to 500ms (0ms - 500ms with full jitter)
  3. Retry 1 fails
  4. Wait up to 1s (0s - 1s with full jitter)
  5. Retry 2 fails
  6. Wait up to 2s (0s - 2s with full jitter)
  7. Retry 3 fails
  8. Wait up to 4s (0s - 4s with full jitter)
  9. Retry 4 (final attempt)

None

Disable retries completely. Use for testing or fast-fail scenarios.

retry:
  policy: "none"

Parameters:

  • Max attempts: 1 (no retries)
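
When comparing presets, it can help to compute the longest total time a request might spend waiting between attempts. The sketch below assumes that `max_attempts` counts the initial request (consistent with the none preset, where 1 means no retries) and that full jitter never exceeds the calculated backoff. The `PRESETS` dictionary and `worst_case_wait` are hypothetical names for illustration, not part of Cortex.

```python
# Preset parameters as documented above; delays are in seconds.
PRESETS = {
    "conservative": {"max_attempts": 3, "base_delay": 1.0, "max_delay": 30.0, "multiplier": 2.0},
    "aggressive": {"max_attempts": 5, "base_delay": 0.5, "max_delay": 30.0, "multiplier": 2.0},
    "none": {"max_attempts": 1, "base_delay": 0.0, "max_delay": 0.0, "multiplier": 1.0},
}


def worst_case_wait(policy: str) -> float:
    """Upper bound on total waiting time, ignoring Retry-After overrides."""
    p = PRESETS[policy]
    waits = p["max_attempts"] - 1  # one wait before each retry
    return float(sum(
        min(p["max_delay"], p["base_delay"] * p["multiplier"] ** n)
        for n in range(waits)
    ))


for name in PRESETS:
    print(f"{name}: at most {worst_case_wait(name)}s of backoff")
```

Under these assumptions the conservative preset waits at most 3s in total and the aggressive preset at most 7.5s; a server-provided Retry-After value can exceed either bound.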

Custom Retry Configuration

For specific requirements, define your own retry policy.

Configuration Options

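| Option | Type | Description | Default |
|---|---|---|---|
| policy | string | Preset policy name (conservative, aggressive, none, or custom) | conservative |
| max_attempts | int | Maximum number of attempts, including the initial request | 3 |
| base_delay | duration | Initial delay between retries | 1s |
| max_delay | duration | Maximum delay between retries | 30s |
| multiplier | float | Backoff multiplier (≥ 1.0) | 2.0 |
| backoff_strategy | string | Backoff algorithm (exponential, linear, or constant) | exponential |
| jitter_type | string | Jitter algorithm (full, equal, decorrelated, or none) | full |
| respect_retry_after | bool | Honor Retry-After headers | true |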
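
As an illustration of how these options and their defaults fit together, here is a minimal sketch that loads a `retry` mapping and rejects out-of-range values. `RetryConfig`, `load_retry_config`, and `parse_duration` are hypothetical names, the supported duration units (`ms`, `s`, `m`) are an assumption, and the check that `base_delay` does not exceed `max_delay` is not stated in this document. Cortex's own parser may behave differently.

```python
import re
from dataclasses import dataclass

# Milliseconds per unit (assumed set of units).
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000}
STRATEGIES = {"exponential", "linear", "constant"}
JITTERS = {"full", "equal", "decorrelated", "none"}


def parse_duration(text: str) -> float:
    """Convert a string such as '500ms', '1s' or '2m' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(match.group(1)) * _UNIT_MS[match.group(2)] / 1000


@dataclass(frozen=True)
class RetryConfig:
    max_attempts: int
    base_delay: float  # seconds
    max_delay: float   # seconds
    multiplier: float
    backoff_strategy: str
    jitter_type: str
    respect_retry_after: bool


def load_retry_config(raw: dict) -> RetryConfig:
    """Apply the documented defaults, then validate the result."""
    cfg = RetryConfig(
        max_attempts=int(raw.get("max_attempts", 3)),
        base_delay=parse_duration(str(raw.get("base_delay", "1s"))),
        max_delay=parse_duration(str(raw.get("max_delay", "30s"))),
        multiplier=float(raw.get("multiplier", 2.0)),
        backoff_strategy=raw.get("backoff_strategy", "exponential"),
        jitter_type=raw.get("jitter_type", "full"),
        respect_retry_after=bool(raw.get("respect_retry_after", True)),
    )
    if cfg.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if cfg.multiplier < 1.0:
        raise ValueError("multiplier must be >= 1.0")
    if cfg.base_delay > cfg.max_delay:  # assumed sanity check
        raise ValueError("base_delay must not exceed max_delay")
    if cfg.backoff_strategy not in STRATEGIES:
        raise ValueError(f"unknown backoff_strategy: {cfg.backoff_strategy!r}")
    if cfg.jitter_type not in JITTERS:
        raise ValueError(f"unknown jitter_type: {cfg.jitter_type!r}")
    return cfg
```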

Example: High-Priority with Quick Retries

retry:
  policy: "custom"
  max_attempts: 7
  base_delay: 100ms
  max_delay: 10s
  multiplier: 1.5
  backoff_strategy: "exponential"
  jitter_type: "decorrelated"
  respect_retry_after: true

Example: Linear Backoff

retry:
  policy: "custom"
  max_attempts: 5
  base_delay: 2s
  max_delay: 10s
  multiplier: 1.0
  backoff_strategy: "linear"
  jitter_type: "equal"
  respect_retry_after: true

Configuration Levels

Retry configuration can be specified at multiple levels with the following precedence (highest to lowest):

  1. Provider OAuth Level - Most specific, applies only to that provider's OAuth token refresh
  2. Global OAuth Level - Applies to all providers using OAuth
  3. Default - Conservative policy if nothing is specified

Example: Multi-Level Configuration

# Global default for all OAuth providers
oauth:
  retry:
    policy: "conservative"

providers:
  # Uses global conservative policy
  anthropic:
    oauth:
      enabled: true

  # Overrides with aggressive policy
  openai:
    oauth:
      enabled: true
      retry:
        policy: "aggressive"

  # Completely custom configuration
  gemini:
    oauth:
      enabled: true
      retry:
        policy: "custom"
        max_attempts: 5
        base_delay: 1s
        max_delay: 60s
        multiplier: 2.0
        backoff_strategy: "exponential"
        jitter_type: "full"
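
The precedence rules above can be sketched as a lookup that falls through from the most specific level to the default. This is a hedged illustration: `resolve_retry` is a hypothetical helper, and it assumes a provider-level `retry` block replaces the global one as a whole rather than being merged field by field, which this document does not specify.

```python
from typing import Any, Dict

DEFAULT_RETRY: Dict[str, Any] = {"policy": "conservative"}


def resolve_retry(config: Dict[str, Any], provider: str) -> Dict[str, Any]:
    """Return the retry settings that apply to one provider."""
    # 1. Provider OAuth level (most specific).
    provider_retry = (
        config.get("providers", {}).get(provider, {}).get("oauth", {}).get("retry")
    )
    if provider_retry:
        return provider_retry
    # 2. Global OAuth level.
    global_retry = config.get("oauth", {}).get("retry")
    if global_retry:
        return global_retry
    # 3. Built-in default.
    return DEFAULT_RETRY


# Parsed form of a configuration like the one above.
config = {
    "oauth": {"retry": {"policy": "conservative"}},
    "providers": {
        "anthropic": {"oauth": {"enabled": True}},
        "openai": {"oauth": {"enabled": True, "retry": {"policy": "aggressive"}}},
    },
}
print(resolve_retry(config, "anthropic")["policy"])  # falls back to the global level
print(resolve_retry(config, "openai")["policy"])     # provider-level override
```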

Backoff Strategies

Exponential (Recommended)

The delay is multiplied by multiplier after each retry, so it doubles at the default of 2.0.

backoff_strategy: "exponential"
multiplier: 2.0  # Each retry waits 2x longer than previous

Delay Sequence (base=1s, multiplier=2.0):

  • Retry 1: 1s
  • Retry 2: 2s
  • Retry 3: 4s
  • Retry 4: 8s
  • Retry 5: 16s (later delays are capped at max_delay)

Linear

Delay increases by a constant amount.

backoff_strategy: "linear"
base_delay: 2s  # Increment for each retry

Delay Sequence (base=2s, max_delay=10s):

  • Retry 1: 2s
  • Retry 2: 4s
  • Retry 3: 6s
  • Retry 4: 8s
  • Retry 5: 10s (reaches max_delay; later retries stay at 10s)

Constant

Same delay for all retries.

backoff_strategy: "constant"
base_delay: 3s

Delay Sequence:

  • All retries: 3s
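
The three strategies differ only in how the pre-jitter delay is computed for a given retry. A minimal sketch that reproduces the sequences listed above, assuming retries are numbered from 1 and that `backoff_delay` is a hypothetical helper rather than a Cortex function:

```python
def backoff_delay(
    strategy: str,
    retry: int,
    base_delay: float,
    max_delay: float,
    multiplier: float = 2.0,
) -> float:
    """Return the delay in seconds before the given retry, before jitter."""
    if retry < 1:
        raise ValueError("retry numbers start at 1")
    if strategy == "exponential":
        delay = base_delay * multiplier ** (retry - 1)
    elif strategy == "linear":
        delay = base_delay * retry
    elif strategy == "constant":
        delay = base_delay
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Every strategy is capped at max_delay.
    return min(delay, max_delay)


exponential = [backoff_delay("exponential", n, 1.0, 30.0) for n in range(1, 7)]
linear = [backoff_delay("linear", n, 2.0, 10.0) for n in range(1, 7)]
print(exponential)  # the sixth value hits the 30s cap
print(linear)       # the fifth and sixth values sit at the 10s cap
```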

Jitter Types

Jitter randomizes delays to prevent synchronized retries across clients (thundering herd problem).

Full Jitter (Recommended)

Random delay between 0 and calculated backoff.

jitter_type: "full"

Example: If backoff = 4s, actual delay will be random between 0s and 4s.

Equal Jitter

Half the calculated backoff plus a random value of up to the other half.

jitter_type: "equal"

Example: If backoff = 4s, actual delay will be 2s + random(0, 2s) = between 2s and 4s.
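
Both definitions translate directly into code. The functions below are an illustrative sketch (the names `full_jitter` and `equal_jitter` are not taken from Cortex); the optional `rng` argument exists only so the randomness can be seeded in tests.

```python
import random


def full_jitter(backoff: float, rng=random) -> float:
    """Random delay anywhere between 0 and the calculated backoff."""
    return rng.uniform(0.0, backoff)


def equal_jitter(backoff: float, rng=random) -> float:
    """Half the backoff is fixed; the other half is randomized."""
    half = backoff / 2.0
    return half + rng.uniform(0.0, half)


rng = random.Random(0)  # seeded for reproducibility
full_samples = [full_jitter(4.0, rng) for _ in range(1000)]
equal_samples = [equal_jitter(4.0, rng) for _ in range(1000)]
print(min(full_samples), max(full_samples))    # stays within 0s to 4s
print(min(equal_samples), max(equal_samples))  # stays within 2s to 4s
```

Full jitter spreads clients more widely but can produce very short waits; equal jitter guarantees at least half the backoff at the cost of less spread.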

Decorrelated Jitter

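The delay is derived from the previous actual delay, not just the base backoff.

jitter_type: "decorrelated"

This is a more sophisticated algorithm: each delay is randomized over a range that grows from the previous actual delay rather than from the retry count alone.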
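
This document does not give the exact formula Cortex uses. One widely used formulation of decorrelated jitter draws each delay uniformly between the base delay and three times the previous delay, capped at the maximum. The sketch below implements that formulation as an assumption; `decorrelated_jitter` is a hypothetical name.

```python
import random


def decorrelated_jitter(
    previous_delay: float,
    base_delay: float,
    max_delay: float,
    rng=random,
) -> float:
    """Next delay: uniform in [base_delay, 3 * previous_delay], capped at max_delay."""
    return min(max_delay, rng.uniform(base_delay, previous_delay * 3.0))


rng = random.Random(42)  # seeded for reproducibility
delay = 0.1              # start from the base delay
delays = []
for _ in range(7):
    delay = decorrelated_jitter(delay, base_delay=0.1, max_delay=10.0, rng=rng)
    delays.append(delay)
print([round(d, 3) for d in delays])
```

Because each value feeds into the next, the caller has to keep the previous delay between retries, unlike full or equal jitter, which depend only on the calculated backoff.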

No Jitter

Use exact calculated backoff with no randomization.

jitter_type: "none"

Use only for testing or debugging; not recommended for production.

Retryable Errors

Automatically Retried

The following conditions trigger automatic retries:

  1. Network errors - Connection timeouts, DNS failures, etc.
  2. HTTP 408 - Request Timeout
  3. HTTP 429 - Too Many Requests (with Retry-After support)
  4. HTTP 500 - Internal Server Error
  5. HTTP 502 - Bad Gateway
  6. HTTP 503 - Service Unavailable (with Retry-After support)
  7. HTTP 504 - Gateway Timeout

Never Retried

The following errors are considered non-retryable:

  1. HTTP 4xx (except 408 and 429) - Client errors
  2. HTTP 400 - Bad Request
  3. HTTP 401 - Unauthorized
  4. HTTP 403 - Forbidden
  5. HTTP 404 - Not Found
  6. Validation errors - Invalid input, parse errors
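
The two lists above amount to a single classification step. A minimal sketch, assuming network failures surface as Python's built-in `TimeoutError`, `ConnectionError`, or `socket.gaierror` (DNS failure); `is_retryable` is a hypothetical helper, and the exception types Cortex actually inspects are not described here.

```python
import socket
from typing import Optional

RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})

# Assumed mapping of "network errors" onto Python exception types.
NETWORK_ERRORS = (TimeoutError, ConnectionError, socket.gaierror)


def is_retryable(
    status: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> bool:
    """Decide whether a failed attempt should be retried."""
    if error is not None:
        # Validation and parse errors (e.g. ValueError) are not network errors.
        return isinstance(error, NETWORK_ERRORS)
    return status in RETRYABLE_STATUS


print(is_retryable(status=503))                       # True
print(is_retryable(status=401))                       # False
print(is_retryable(error=ConnectionResetError()))     # True
print(is_retryable(error=ValueError("bad payload")))  # False
```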

Retry-After Header Support

When a server returns HTTP 429 or 503 with a Retry-After header, Cortex will:

  1. Parse the header value (seconds or HTTP date)
  2. Use the specified delay instead of calculated backoff
  3. Continue with normal retry logic after the specified delay

Example:

HTTP/1.1 429 Too Many Requests
Retry-After: 10

# Cortex will wait 10 seconds before retrying
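
Step 1 above covers two header formats: a number of seconds and an HTTP date. A minimal sketch of parsing both with the Python standard library follows; `parse_retry_after` is a hypothetical helper, and returning `None` for unparseable values (so the caller falls back to calculated backoff) is an assumption about the desired behavior.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds, or None if the header cannot be parsed."""
    value = value.strip()
    # Form 1: delay in whole seconds, e.g. "10".
    try:
        return max(0.0, float(int(value)))
    except ValueError:
        pass
    # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry immediately".
    return max(0.0, (when - now).total_seconds())


reference = datetime(2015, 10, 21, 7, 27, 50, tzinfo=timezone.utc)
print(parse_retry_after("10"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=reference))
print(parse_retry_after("soon"))
```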

Database Schema

Retry configuration is persisted in the database. Migration 009 adds the columns and keys shown below.

Providers Table

ALTER TABLE providers ADD COLUMN retry_policy TEXT DEFAULT 'conservative';
ALTER TABLE providers ADD COLUMN retry_max_attempts INTEGER DEFAULT 0;
ALTER TABLE providers ADD COLUMN retry_base_delay_ms INTEGER DEFAULT 0;
ALTER TABLE providers ADD COLUMN retry_max_delay_ms INTEGER DEFAULT 0;
ALTER TABLE providers ADD COLUMN retry_multiplier REAL DEFAULT 0.0;
ALTER TABLE providers ADD COLUMN retry_backoff_strategy TEXT DEFAULT '';
ALTER TABLE providers ADD COLUMN retry_jitter_type TEXT DEFAULT '';
ALTER TABLE providers ADD COLUMN retry_respect_retry_after BOOLEAN DEFAULT TRUE;

Provider OAuth Table

ALTER TABLE provider_oauth ADD COLUMN retry_policy TEXT DEFAULT 'conservative';
-- (same columns as providers table)

Server Config (Global Defaults)

INSERT INTO server_config (key, value) VALUES ('retry.policy', 'conservative');
-- (additional retry config keys)

Examples

Example 1: Production Setup

Conservative defaults with aggressive retry for critical provider:

oauth:
  retry:
    policy: "conservative"

providers:
  anthropic:
    oauth:
      enabled: true
      # Uses global conservative policy

  critical-service:
    oauth:
      enabled: true
      retry:
        policy: "aggressive"  # More retries for critical service

Example 2: Development/Testing

Fast-fail for quick feedback:

providers:
  all-providers:
    oauth:
      enabled: true
      retry:
        policy: "none"  # No retries during development

Example 3: Unreliable Network

Custom configuration optimized for poor connectivity:

providers:
  mobile-provider:
    oauth:
      enabled: true
      retry:
        policy: "custom"
        max_attempts: 7
        base_delay: 500ms
        max_delay: 60s
        multiplier: 1.5
        backoff_strategy: "exponential"
        jitter_type: "decorrelated"
        respect_retry_after: true

Example 4: Rate-Limited API

Respect rate limits with longer delays:

providers:
  rate-limited-api:
    oauth:
      enabled: true
      retry:
        policy: "custom"
        max_attempts: 3
        base_delay: 5s
        max_delay: 120s
        multiplier: 3.0
        backoff_strategy: "exponential"
        jitter_type: "full"
        respect_retry_after: true

Best Practices

  1. Use Conservative as Default: The conservative preset works well for most production scenarios
  2. Enable Retry-After: Always set respect_retry_after: true to honor server guidance
  3. Add Jitter in Production: Use full or equal jitter to prevent thundering herd
  4. Don't Over-Retry: More retries != better; 3-5 attempts is usually sufficient
  5. Monitor Retry Metrics: Track retry rates to identify problematic providers
  6. Test Both Paths: Test both success and retry scenarios
  7. Consider Costs: Each retry consumes resources; balance reliability with efficiency

Troubleshooting

Retries Not Working

  1. Check policy is not set to none
  2. Verify error is retryable (check status code)
  3. Ensure max_attempts > 1
  4. Check logs for retry attempts

Too Many Retries

  1. Reduce max_attempts
  2. Increase base_delay and max_delay
  3. Consider using the none policy for non-critical operations

Long Delays

  1. Reduce max_delay
  2. Use linear or constant backoff instead of exponential
  3. Reduce multiplier

Rate Limits Still Occurring

  1. Ensure respect_retry_after: true
  2. Increase base_delay
  3. Use larger multiplier (3.0 or higher)
  4. Consider reducing concurrent requests

Migration

Existing configurations without retry settings automatically use the conservative policy. No migration is required, but you can opt in to a different preset or a custom policy by adding a retry configuration.

Upgrading from Previous Versions

Migration 009 automatically adds the retry configuration columns with conservative defaults. Existing OAuth providers use the conservative retry policy unless explicitly configured otherwise.
