claude: short-circuit probe requests & drop context-1m beta on OAuth#3567
claude: short-circuit probe requests & drop context-1m beta on OAuth#3567lyzgeorge wants to merge 5 commits into
Conversation
Clients (e.g. Claude Desktop App) periodically probe the messages endpoint with max_tokens=1 to check liveness/auth. Forwarding these to upstream wastes a token quota and surfaces upstream auth failures on otherwise idle UIs. Detect them in ClaudeMessages and synthesize a local message with stop_reason=max_tokens, supporting both JSON and SSE responses. Adds dedicated tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Anthropic OAuth subscription accounts (Max/Pro) reject requests carrying the context-1m-2025-08-07 beta with HTTP 429 "Usage credits are required for long context requests" unless pay-as-you-go credits are enabled. Cowork-style clients (Claude Desktop App) advertise the 1M beta unconditionally, breaking otherwise valid sub-200k requests. When the resolved auth is OAuth (no api_key attribute), filter the beta out so requests fall back to the standard 200k window. API-key upstreams keep the beta and full 1M capability. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Bare max_tokens=1 is a legitimate single-token classification setting, so the previous predicate over-matched. Restrict the short-circuit to the actual probe fingerprint sent by Claude Desktop App / Claude Code: - max_tokens == 1 - messages contain a pre-filled assistant turn (real classifier inputs are user-only) - no tools / tool_choice (tool-gated classifiers won't be probed) Adds tests for the false-positive shapes the reviewer raised: single-token user-only classifier and tool-gated classifier both pass through to upstream. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces logic to strip the context-1m-2025-08-07 beta header for OAuth upstreams to prevent 429 errors, and implements a short-circuit mechanism for Claude Desktop App/Claude Code liveness probes. However, the current implementation of shouldShortCircuitClaudeSample checks if any message in the array has the assistant role, which incorrectly flags legitimate multi-turn conversations where the last message is from the user. It is recommended to only check if the last message in the array is an assistant message, and to add a corresponding test case to prevent regressions.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e4bd686c5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| gjson.GetBytes(rawJSON, "messages").ForEach(func(_, msg gjson.Result) bool { | ||
| if msg.Get("role").String() == "assistant" { | ||
| hasAssistant = true |
There was a problem hiding this comment.
Require stricter probe fingerprint before short-circuiting
The current predicate treats any request with max_tokens == 1 plus an assistant message (and no tools) as a probe, but that shape is also valid for real Claude workloads (e.g., single-token classification or constrained continuation with prior assistant turns). In those cases this handler now returns a synthetic empty response locally instead of sending the request upstream, which can silently break legitimate production requests whenever callers intentionally use max_tokens=1 with multi-turn context.
Useful? React with 👍 / 👎.
The previous predicate scanned for any assistant turn in the history, which would incorrectly short-circuit multi-turn classifier requests that happen to set max_tokens=1 on a follow-up question (real conversations naturally accumulate assistant turns). Switch to gjson's messages.-1 to assert the last message is an assistant turn, which is the actual hallmark of the Claude Desktop App / Claude Code liveness probe (probes pre-fill an assistant turn to coax the model into emitting one token immediately). Adds a test for the multi-turn user-ending case. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3da6162448
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| lastMsg := gjson.GetBytes(rawJSON, "messages.-1") | ||
| return lastMsg.Exists() && lastMsg.Get("role").String() == "assistant" |
There was a problem hiding this comment.
Require a stricter probe fingerprint before short-circuiting
The short-circuit predicate still matches normal Claude requests that intentionally end with an assistant prefill and use max_tokens: 1, so those real requests are answered locally with a synthetic empty message instead of being sent upstream. Fresh evidence in this commit is TestShouldShortCircuitClaudeSample ("probe shape shorts") which explicitly marks a standard user -> assistant conversation as a probe, so this false-positive path is now codified and can silently break legitimate single-token continuation/classification workloads.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
applied in 3da6162. The predicate now uses messages.-1 and checks only the last message is assistant. Added the multi-turn-ending-on-user test case alongside it.
| if !useAPIKey && strings.Contains(baseBetas, "context-1m-2025-08-07") { | ||
| parts := strings.Split(baseBetas, ",") |
There was a problem hiding this comment.
Avoid stripping 1M context beta from all OAuth traffic
This removes context-1m-2025-08-07 for every non-api_key auth path, which forces all OAuth requests down to the 200k window. That means OAuth accounts that are actually eligible for long-context will now fail requests above 200k tokens because the required beta is always deleted, creating a functional regression for valid long-context workloads instead of only handling the failing subset.
Useful? React with 👍 / 👎.
Summary
api_key), stripcontext-1m-2025-08-07fromAnthropic-Beta. Anthropic returns429 Usage credits are required for long context requestsfor subscription accounts whose pay-as-you-go credits are not enabled; clients like the Claude Desktop App in cowork mode advertise the beta unconditionally and break sub-200k requests. API-key paths are unchanged and keep the full 1M window./v1/messageslocally when the request matches the actual Claude Desktop / Claude Code liveness-probe fingerprint (not baremax_tokens=1, which is a legitimate single-token classifier setting). The synthesized response carriesstop_reason=max_tokensfor both JSON and SSE. Probe predicate requires:max_tokens == 1messagescontain a pre-filledassistantturn (real classifier inputs are user-only)tools/tool_choice(tool-gated classifiers won't be probed)code_handlers_probe_test.gocovering the JSON path, SSE path, the positive predicate, and the false-positive shapes (single-token user-only classifier, tool-gated classifier viatoolsand viatool_choice).History
Continuation of #3566 (closed); same branch, no force-pushes. Resolved feedback already incorporated:
gemini-code-assist[bot]:message_deltausagenow carries onlyoutput_tokens(commite17b18d).chatgpt-codex-connector[bot]P1: probe predicate tightened to avoid false positives on real classification workloads (commit5e4bd68).Test plan
🤖 Generated with Claude Code