fix(llm): remove request timeout from LLM HTTP client#955
Merged
Claude and OpenAI providers shared the same 60s request timeout as other HTTP clients. LLM responses with tool use can take longer, causing a TLS CloseNotify and HTTP request failure after exactly 60s. Introduce llm_client() in zeph-llm::http with only a connect timeout (30s). Claude and OpenAI providers now use llm_client(); the unused default_client() in zeph-llm is removed (other crates use zeph_core::http::default_client()).
The previous 60s reqwest timeout caused request failures for long LLM responses (tool use, large context). Removing the timeout entirely would allow runaway requests to hang indefinitely. Set the request timeout to 600s as a safety backstop. The agent-level TimeoutConfig.llm_seconds (default 120s) fires first in normal operation; the 600s backstop only catches cases where the tokio-layer timeout is bypassed or llm_seconds is set unusually high.
Add TimeoutConfig.llm_request_timeout_secs (default 600s) as the configurable HTTP-layer backstop for LLM provider requests. Exposed via ZEPH_TIMEOUT_LLM_REQUEST env variable. llm_client() now accepts the timeout as a parameter. Bootstrap passes config.timeouts.llm_request_timeout_secs via with_client() to Claude and OpenAI providers; provider::new() retains a hardcoded 600s default for non-bootstrap construction paths (tests, orchestrator).
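Assuming the usual TOML layout for a `[timeouts]` section (the field names come from the PR text; the surrounding file structure is a guess), the new knob would sit alongside the agent-level one:

```toml
[timeouts]
# Agent-level tokio timeout around LLM calls; fires first in normal operation.
llm_seconds = 120
# HTTP-layer backstop for LLM provider requests.
# Overridable via the ZEPH_TIMEOUT_LLM_REQUEST environment variable.
llm_request_timeout_secs = 600
```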
Summarization paths (maybe_summarize_tool_pair, try_summarize_with_llm, summarize_tool_output) called .chat() directly without any timeout, causing hangs when Claude was slow. Wrap all summarization .chat() futures with tokio::time::timeout(llm_seconds). Add LlmError::Timeout variant used when the deadline is exceeded; summarization callers treat it the same as an LLM error (warn + skip or fallback to truncation).
Force-pushed from dba0ab7 to 9aca28e.
Summary
- Add `llm_client(request_timeout_secs)` in `zeph-llm::http` with a 30s connect timeout and a configurable request timeout
- Switch `ClaudeProvider` and `OpenAiProvider` from `default_client()` (60s) to `llm_client(600)`
- Remove the unused `default_client()` from `zeph-llm::http`
- Add `TimeoutConfig.llm_request_timeout_secs` (default 600s) to the `[timeouts]` config section, exposed via `ZEPH_TIMEOUT_LLM_REQUEST`
- Bootstrap wires the configured value through `with_client(llm_client(...))`

Root cause
`default_client()` applied a 60s request timeout. LLM calls with tool use routinely exceed 60s, triggering `CloseNotify` + `HTTP request failed` errors in ACP sessions.

Two-layer timeout design
- Agent layer: `TimeoutConfig.llm_seconds` (default 120s), a tokio-level timeout that fires first in normal operation
- HTTP layer: `TimeoutConfig.llm_request_timeout_secs` (default 600s), a reqwest-level backstop for cases where the tokio-layer timeout is bypassed or `llm_seconds` is set unusually high

Test plan
- `cargo +nightly fmt --check` — pass
- `cargo clippy --workspace -- -D warnings` — pass
- `cargo nextest run --workspace --lib --bins` — 2894 tests passed