Open
Description
Background
On Jan 9-10, 2026, the BGE Server hit OpenAI API rate limits (429 errors) due to an event flood. While retry logic was added (18b197c), a circuit breaker would provide better protection and faster failure.
Problem
Currently, when OpenAI returns 429 errors:
- Each request retries up to 5 times with exponential backoff
- During a flood, hundreds of requests queue up, all retrying
- This creates a "thundering herd" when the rate limit clears
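A rough back-of-the-envelope sketch of why this hurts. The base delay (1 s), the doubling factor, the absence of jitter, and the figure of 300 queued requests are all assumptions for illustration; the actual values used in 18b197c are not stated here. It also assumes "retries up to 5 times" means 5 retries on top of the initial attempt.

```python
def backoff_delays(retries=5, base=1.0, factor=2.0):
    """Delay (seconds) before each retry under plain exponential backoff.

    base and factor are hypothetical; no jitter is applied.
    """
    return [base * factor ** i for i in range(retries)]


def worst_case_calls(queued_requests, retries=5):
    """Upper bound on API calls if every queued request exhausts its retries."""
    return queued_requests * (1 + retries)


if __name__ == "__main__":
    print(backoff_delays())       # [1.0, 2.0, 4.0, 8.0, 16.0]
    print(worst_case_calls(300))  # 1800
```

Without jitter, requests that failed at the same moment share the same delay schedule, so they also retry at the same moments. That synchronised burst is the thundering herd described above.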
Proposed Solution
Implement the circuit breaker pattern for OpenAI API calls:
States:
┌────────┐    failures > threshold     ┌────────┐
│ CLOSED │ ──────────────────────────▶ │  OPEN  │
└────────┘                             └────────┘
    ▲                                       │
    │ success                               │ timeout
    │         ┌─────────────┐               │
    └─────────│  HALF-OPEN  │◀──────────────┘
              └─────────────┘
CLOSED: Normal operation, requests go through
OPEN: All requests fail immediately (no API call); return a cached result or an error
HALF-OPEN: Allow one test request at a time; enough consecutive successes (see CIRCUIT_BREAKER_SUCCESS_THRESHOLD) → CLOSED, any failure → OPEN
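For discussion, here is a minimal sketch of the three states as a custom state machine. It is not the proposed final implementation. Assumptions: the class and exception names (`CircuitBreaker`, `CircuitOpenError`) are hypothetical, the breaker opens once consecutive failures reach the threshold (`>=`), every exception counts as a failure (a real version would probably count only 429s), and it is single-threaded with no locking. The constants mirror the values in the Configuration section.

```python
import time
from enum import Enum

# Values mirror the Configuration section of this issue.
CIRCUIT_BREAKER_FAILURE_THRESHOLD = 5
CIRCUIT_BREAKER_SUCCESS_THRESHOLD = 2
CIRCUIT_BREAKER_TIMEOUT = 60


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the API (hypothetical name)."""


class CircuitBreaker:
    def __init__(self,
                 failure_threshold=CIRCUIT_BREAKER_FAILURE_THRESHOLD,
                 success_threshold=CIRCUIT_BREAKER_SUCCESS_THRESHOLD,
                 timeout=CIRCUIT_BREAKER_TIMEOUT,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timeout
        self._clock = clock  # injectable so tests need not sleep
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: let a trial request through.
            self.state = State.HALF_OPEN
            self._successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # Any failure while half-open reopens the circuit immediately.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
            self._failures = 0

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = State.CLOSED
                self._failures = 0
        else:
            self._failures = 0  # only consecutive failures count
```

Usage would look like `breaker.call(fetch_embedding, text)`, where `fetch_embedding` stands in for whatever function currently calls OpenAI. Callers catch `CircuitOpenError` to fall back or skip.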
Configuration
CIRCUIT_BREAKER_FAILURE_THRESHOLD = 5 # failures before opening
CIRCUIT_BREAKER_SUCCESS_THRESHOLD = 2 # successes to close
CIRCUIT_BREAKER_TIMEOUT = 60 # seconds before half-open
Benefits
- Fast failure - Don't waste time on doomed requests
- Reduced load - Stop hammering rate-limited API
- Graceful degradation - Return cached embeddings or skip
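One possible shape for the graceful-degradation path, as a sketch only. The function name `embed_with_fallback`, the `circuit_is_open` callback, and the plain dict cache are all hypothetical; nothing here reflects how bge_server.py is currently structured.

```python
from typing import Callable, Dict, List, Optional

Embedding = List[float]


def embed_with_fallback(
    text: str,
    call_api: Callable[[str], Embedding],
    circuit_is_open: Callable[[], bool],
    cache: Dict[str, Embedding],
) -> Optional[Embedding]:
    """Return a fresh embedding if possible, else a cached one, else None (skip)."""
    if not circuit_is_open():
        try:
            vector = call_api(text)
        except Exception:
            # API call failed; fall through to the cache lookup below.
            pass
        else:
            cache[text] = vector
            return vector
    # Circuit open or call failed: serve the cached embedding, or None to skip.
    return cache.get(text)
```

Returning `None` leaves the skip decision to the caller, for example to re-queue the event once the circuit closes.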
Implementation Options
- pybreaker - Python circuit breaker library
- Custom implementation - Simple state machine
- tenacity - Already handles retries, can add circuit breaker
Files to Modify
- /opt/projects/koi-processor/src/core/bge_server.py
- Possibly event bridge if it makes direct API calls
Related
- BGE Server retry logic: 18b197c
- Rate limiting issue: Add rate limiting to KOI Coordinator #6
- Event flood detection: c3c0366
Labels
enhancement, resilience