fix(auth): return 429 instead of 503 when disabled accounts mix with cooldown accounts#3489
Conversation
…oldown mix When an auth pool contains a mix of permanently-blocked (disabled/expired) and temporarily-cooldown (rate-limited) accounts, the selector incorrectly returned 503 (auth_unavailable) instead of 429 (model_cooldown). The old logic required ALL auths to be in cooldown before returning 429. With disabled accounts mixed in, the cooldown count never equaled the total, causing a fatal 503 even though the cooldown accounts would recover. Fix: track permanently-blocked count separately from cooldown count. When all non-permanently-blocked auths are in cooldown, return 429 with the earliest retry time, since those accounts will recover after cooldown. Affected locations (same pattern in each): - selector.go: getAvailableAuths - conductor.go: availableAuthsForRouteModel - scheduler.go: modelScheduler.unavailableErrorLocked - scheduler.go: authScheduler.mixedUnavailableErrorLocked Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
This pull request targeted The base branch has been automatically changed to |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 70aa8dd00b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } else { | ||
| permanentlyBlocked++ |
There was a problem hiding this comment.
Avoid classifying transient blocks as permanent cooldown-exempt
This branch counts every non-blockReasonCooldown auth as permanentlyBlocked, but isAuthBlockedForModel returns blockReasonOther for temporary outages with a future retry (for example Unavailable=true with NextRetryAfter and Quota.Exceeded=false in selector.go). With a mixed pool (e.g., one 429 cooldown auth plus one transient 503 auth), recoverableCount becomes 1 and the code now returns model_cooldown (429) even though not all credentials are cooling down. That misclassifies the failure mode, emits an inaccurate "All credentials ... are cooling down" message, and can drive incorrect client retry/backoff behavior; only truly permanent states (like disabled) should be excluded from the cooldown quorum.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch, thanks! Fixed in c69dc41.
The updated logic now only excludes blockReasonDisabled / scheduledStateDisabled from the cooldown quorum. blockReasonOther (transient 503s, 401s with future retry) still counts toward the total, so they must genuinely transition to cooldown before the quorum triggers 429.
There was a problem hiding this comment.
Code Review
This pull request improves the error reporting logic for authentication availability by introducing a 'permanentlyBlocked' counter to distinguish between auths in temporary cooldown and those that are disabled or blocked. This change ensures that a 429 (model_cooldown) error is correctly returned when all recoverable auths are in cooldown, preventing incorrect 503 (auth_unavailable) errors in mixed-state pools. I have no feedback to provide as there were no review comments to evaluate.
…sient blocks Previously the fix treated all non-cooldown blocked states (blockReasonOther, scheduledStateBlocked) as permanently unavailable. But blockReasonOther includes temporary outages like 503 with a 1-minute NextRetryAfter — those auths will recover and should participate in the cooldown count. Now only blockReasonDisabled / scheduledStateDisabled is excluded from the cooldown quorum. Transient blocks still count toward total, so a pool with one 429-cooldown and one 503-transient correctly returns 503 (not all cooling down) rather than an inaccurate 429 "All credentials are cooling down". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Good catch, thanks! Fixed in c69dc41. The updated logic now only excludes Updated scenarios:
|
Problem
When an auth pool contains both permanently-blocked accounts (disabled or expired tokens) and temporarily-cooldown accounts (rate-limited), the selector returns 503
auth_unavailableinstead of 429model_cooldown.Root cause
The old condition in all 4 selection paths:
This only returns 429 when ALL auths are in cooldown. With disabled accounts in the pool,
cooldownCountcan never equallen(auths), so it always falls through to 503 — even though the cooldown accounts will recover after their cooldown expires.Real-world scenario
Fix
Track permanently-blocked count (disabled + non-cooldown blocked) separately from cooldown count. When all non-permanently-blocked accounts are in cooldown, return 429
model_cooldownbecause they will recover.Files changed
selector.gogetAvailableAuths+collectAvailableByPriorityconductor.goavailableAuthsForRouteModelscheduler.gounavailableErrorLocked+mixedUnavailableErrorLockedscheduler.goavailabilitySummaryLockedAll existing tests pass.
Test plan
go test ./sdk/cliproxy/auth/...— all passgo test ./internal/runtime/executor/...— all passgo test ./sdk/api/handlers/...— all pass🤖 Generated with Claude Code