
fix(llm): remove request timeout from LLM HTTP client #955

Merged

bug-ops merged 8 commits into main from fix/llm-client-no-request-timeout on Feb 26, 2026
Conversation

bug-ops (Owner) commented Feb 26, 2026

Summary

  • Introduce llm_client(request_timeout_secs) in zeph-llm::http with 30s connect timeout and configurable request timeout
  • Switch ClaudeProvider and OpenAiProvider from default_client() (60s) to llm_client(600)
  • Remove now-unused default_client() from zeph-llm::http
  • Add TimeoutConfig.llm_request_timeout_secs (default 600s) to [timeouts] config section, exposed via ZEPH_TIMEOUT_LLM_REQUEST
  • Bootstrap passes config value to providers via with_client(llm_client(...))

Root cause

default_client() applied a 60s request timeout. LLM calls with tool use routinely exceed 60s, triggering CloseNotify + HTTP request failed errors in ACP sessions.

Two-layer timeout design

| Layer | Mechanism | Default | Purpose |
| --- | --- | --- | --- |
| tokio | TimeoutConfig.llm_seconds | 120s | agent-level, user-configurable |
| reqwest | TimeoutConfig.llm_request_timeout_secs | 600s | HTTP backstop, fires only if the tokio layer is bypassed |
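The resulting config section would look roughly like this (values are the stated defaults; key names come from the PR summary):

```toml
[timeouts]
# Agent-level tokio timeout; fires first in normal operation.
llm_seconds = 120
# HTTP-layer reqwest backstop; also settable via ZEPH_TIMEOUT_LLM_REQUEST.
llm_request_timeout_secs = 600
```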

Test plan

  • cargo +nightly fmt --check — pass
  • cargo clippy --workspace -- -D warnings — pass
  • cargo nextest run --workspace --lib --bins — 2894 tests passed

Claude and OpenAI providers shared the same 60s request timeout as
other HTTP clients. LLM responses with tool use can take longer,
causing CloseNotify + request failure after exactly 60s.

Introduce llm_client() in zeph-llm::http with only a connect timeout
(30s). Claude and OpenAI providers now use llm_client(); the unused
default_client() in zeph-llm is removed (other crates use
zeph_core::http::default_client()).

The previous 60s reqwest timeout caused request failures for long
LLM responses (tool use, large context). Removing the timeout entirely
would allow runaway requests to hang indefinitely.

Set request timeout to 600s as a safety backstop. The agent-level
TimeoutConfig.llm_seconds (default 120s) fires first in normal
operation; 600s only catches cases where the tokio layer is bypassed
or llm_seconds is set unusually high.
Add TimeoutConfig.llm_request_timeout_secs (default 600s) as the
configurable HTTP-layer backstop for LLM provider requests. Exposed
via the ZEPH_TIMEOUT_LLM_REQUEST environment variable.

llm_client() now accepts the timeout as a parameter. Bootstrap passes
config.timeouts.llm_request_timeout_secs via with_client() to Claude
and OpenAI providers; provider::new() retains a hardcoded 600s default
for non-bootstrap construction paths (tests, orchestrator).

Summarization paths (maybe_summarize_tool_pair, try_summarize_with_llm,
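A minimal illustration of that construction split, using hypothetical stand-in types (the real ClaudeProvider holds a reqwest client, not a bare number):

```rust
// Hypothetical sketch of the bootstrap wiring described above; the
// field and struct shapes are assumptions based on the PR text.
struct ClaudeProvider {
    timeout_secs: u64,
}

impl ClaudeProvider {
    // new() keeps the hardcoded 600s backstop for non-bootstrap paths
    // (tests, orchestrator).
    fn new() -> Self {
        Self { timeout_secs: 600 }
    }

    // Bootstrap overrides the client with the configured timeout via
    // with_client(llm_client(...)); modeled here as a plain setter.
    fn with_client(mut self, timeout_secs: u64) -> Self {
        self.timeout_secs = timeout_secs;
        self
    }
}

fn main() {
    assert_eq!(ClaudeProvider::new().timeout_secs, 600);
    assert_eq!(ClaudeProvider::new().with_client(300).timeout_secs, 300);
}
```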
summarize_tool_output) called .chat() directly without any timeout,
causing hangs when Claude was slow. Wrap all summarization .chat()
futures with tokio::time::timeout(llm_seconds).

Add LlmError::Timeout variant used when the deadline is exceeded;
summarization callers treat it the same as an LLM error (warn + skip
or fallback to truncation).
@bug-ops bug-ops force-pushed the fix/llm-client-no-request-timeout branch from dba0ab7 to 9aca28e on February 26, 2026 at 14:16
@github-actions github-actions bot added the tests label Feb 26, 2026
@bug-ops bug-ops enabled auto-merge (squash) February 26, 2026 14:58
@bug-ops bug-ops merged commit 2262957 into main Feb 26, 2026
41 of 56 checks passed
@bug-ops bug-ops deleted the fix/llm-client-no-request-timeout branch February 26, 2026 15:08

Labels

bug (Something isn't working), core, dependencies, llm (LLM provider related), rust, size/L, tests
