Skip to content

fix(auth): return 429 instead of 503 when disabled accounts mix with cooldown accounts#3489

Open
ZHOUKAILIAN wants to merge 2 commits into
router-for-me:devfrom
ZHOUKAILIAN:fix/auth-unavailable-with-disabled-and-cooldown
Open

fix(auth): return 429 instead of 503 when disabled accounts mix with cooldown accounts#3489
ZHOUKAILIAN wants to merge 2 commits into
router-for-me:devfrom
ZHOUKAILIAN:fix/auth-unavailable-with-disabled-and-cooldown

Conversation

@ZHOUKAILIAN
Copy link
Copy Markdown

Problem

When an auth pool contains both permanently-blocked accounts (disabled or expired tokens) and temporarily-cooldown accounts (rate-limited), the selector returns 503 auth_unavailable instead of 429 model_cooldown.

Root cause

The old condition in all 4 selection paths:

if cooldownCount == len(auths) && !earliest.IsZero() {
    return 429 // model_cooldown
}
return 503 // auth_unavailable

This only returns 429 when ALL auths are in cooldown. With disabled accounts in the pool, cooldownCount can never equal len(auths), so it always falls through to 503 — even though the cooldown accounts will recover after their cooldown expires.

Real-world scenario

  • 18 accounts total: 12 disabled (expired tokens), 6 healthy but rate-limited (429)
  • Round-robin hits the 6 healthy accounts until they're all rate-limited
  • Cooldown represents "recoverable" state but the 503 is fatal
  • Client sees 503 and gives up, instead of 429 with Retry-After

Fix

Track permanently-blocked count (disabled + non-cooldown blocked) separately from cooldown count. When all non-permanently-blocked accounts are in cooldown, return 429 model_cooldown because they will recover.

recoverableCount := len(auths) - permanentlyBlocked
if cooldownCount > 0 && cooldownCount == recoverableCount && !earliest.IsZero() {
    return 429 // model_cooldown with Retry-After header
}
return 503 // auth_unavailable (only when truly no auth can recover)

Files changed

File Function Change
selector.go getAvailableAuths + collectAvailableByPriority Track permanentlyBlocked, use recoverableCount
conductor.go availableAuthsForRouteModel Same pattern
scheduler.go unavailableErrorLocked + mixedUnavailableErrorLocked Same pattern
scheduler.go availabilitySummaryLocked Count disabled+blocked states

All existing tests pass.

Test plan

  • go test ./sdk/cliproxy/auth/... — all pass
  • go test ./internal/runtime/executor/... — all pass
  • go test ./sdk/api/handlers/... — all pass

🤖 Generated with Claude Code

…oldown mix

When an auth pool contains a mix of permanently-blocked (disabled/expired)
and temporarily-cooldown (rate-limited) accounts, the selector incorrectly
returned 503 (auth_unavailable) instead of 429 (model_cooldown).

The old logic required ALL auths to be in cooldown before returning 429.
With disabled accounts mixed in, the cooldown count never equaled the total,
causing a fatal 503 even though the cooldown accounts would recover.

Fix: track permanently-blocked count separately from cooldown count. When
all non-permanently-blocked auths are in cooldown, return 429 with the
earliest retry time, since those accounts will recover after cooldown.

Affected locations (same pattern in each):
- selector.go: getAvailableAuths
- conductor.go: availableAuthsForRouteModel
- scheduler.go: modelScheduler.unavailableErrorLocked
- scheduler.go: authScheduler.mixedUnavailableErrorLocked

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions github-actions Bot changed the base branch from main to dev May 20, 2026 03:06
@github-actions
Copy link
Copy Markdown

This pull request targeted main.

The base branch has been automatically changed to dev.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 70aa8dd00b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread sdk/cliproxy/auth/selector.go Outdated
Comment on lines +215 to +216
} else {
permanentlyBlocked++
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid classifying transient blocks as permanent cooldown-exempt

This branch counts every non-blockReasonCooldown auth as permanentlyBlocked, but isAuthBlockedForModel returns blockReasonOther for temporary outages with a future retry (for example Unavailable=true with NextRetryAfter and Quota.Exceeded=false in selector.go). With a mixed pool (e.g., one 429 cooldown auth plus one transient 503 auth), recoverableCount becomes 1 and the code now returns model_cooldown (429) even though not all credentials are cooling down. That misclassifies the failure mode, emits an inaccurate "All credentials ... are cooling down" message, and can drive incorrect client retry/backoff behavior; only truly permanent states (like disabled) should be excluded from the cooldown quorum.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! Fixed in c69dc41.

The updated logic now only excludes blockReasonDisabled / scheduledStateDisabled from the cooldown quorum. blockReasonOther (transient 503s, 401s with future retry) still counts toward the total, so they must genuinely transition to cooldown before the quorum triggers 429.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves the error reporting logic for authentication availability by introducing a 'permanentlyBlocked' counter to distinguish between auths in temporary cooldown and those that are disabled or blocked. This change ensures that a 429 (model_cooldown) error is correctly returned when all recoverable auths are in cooldown, preventing incorrect 503 (auth_unavailable) errors in mixed-state pools. I have no feedback to provide as there were no review comments to evaluate.

…sient blocks

Previously the fix treated all non-cooldown blocked states (blockReasonOther,
scheduledStateBlocked) as permanently unavailable. But blockReasonOther includes
temporary outages like 503 with a 1-minute NextRetryAfter — those auths will
recover and should participate in the cooldown count.

Now only blockReasonDisabled / scheduledStateDisabled is excluded from the
cooldown quorum. Transient blocks still count toward total, so a pool with
one 429-cooldown and one 503-transient correctly returns 503 (not all cooling
down) rather than an inaccurate 429 "All credentials are cooling down".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ZHOUKAILIAN
Copy link
Copy Markdown
Author

Good catch, thanks! Fixed in c69dc41.

The updated logic now only excludes blockReasonDisabled / scheduledStateDisabled from the cooldown quorum. blockReasonOther (transient 503s, 401s with future retry) still counts toward the total, so they must genuinely transition to cooldown before the quorum triggers 429.

Updated scenarios:

Pool disabled cooldown transient-other recoverable Result
12 disabled + 6 429-cooldown 12 6 0 6 6==6 → 429
1 429-cooldown + 1 503-transient 0 1 1 18 1≠18 → 503

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant