Skip to content

Commit b2c118c

Browse files
dbschmigelskipgrayy
authored andcommitted
fix(mcp): protect connection on non-fatal client side timeout error (strands-agents#1231)
* fix(mcp): protect connection on non-fatal client side timeout error * remove empty _MCP_CLIENT.md * remove print statements * remove test
1 parent 7fce03e commit b2c118c

File tree

4 files changed

+234
-7
lines changed

4 files changed

+234
-7
lines changed

_MCP_CLIENT_ARCHITECTURE.md

Lines changed: 145 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,145 @@
1+
# MCP Client Architecture
2+
3+
## Overview
4+
5+
The MCPClient enables developers to use MCP tools in Strands agents without dealing with async complexity. Since MCP requires async operations but Strands aims for simple synchronous usage (`agent = Agent(); agent("Do something")`), the client uses a background thread with its own event loop to handle MCP communication. This creates challenges around thread synchronization, hanging prevention, and connection stability that this architecture addresses.
6+
7+
## Background Thread Flow
8+
9+
```mermaid
10+
sequenceDiagram
11+
participant Dev as Developer
12+
participant Main as Main Thread
13+
participant BG as Background Thread
14+
participant MCP as MCP Server
15+
16+
Dev->>Main: with MCPClient() as client:
17+
Main->>BG: start() - create thread
18+
BG->>BG: _background_task() - setup event loop
19+
BG->>BG: _async_background_thread() - establish transport
20+
BG->>MCP: ClientSession.initialize()
21+
MCP-->>BG: initialization response
22+
BG->>Main: _init_future.set_result() - signal ready
23+
Dev->>Main: client.call_tool_sync()
24+
Main->>BG: tool calls via _invoke_on_background_thread()
25+
BG->>MCP: tool requests
26+
MCP-->>BG: tool responses
27+
28+
alt Normal response
29+
BG-->>Main: tool response via Future.set_result()
30+
Main-->>Dev: return tool result
31+
else Fatal error in tool response
32+
BG->>BG: _handle_error_message() - raise exception
33+
BG->>BG: Background thread exits
34+
Note over BG: Connection collapses
35+
BG-->>Main: exception via Future.set_exception()
36+
Main-->>Dev: raise exception
37+
end
38+
39+
Note over MCP,BG: Separate flow - server can send unexpected messages anytime
40+
MCP-->>BG: orphaned response (unknown request id)
41+
BG->>BG: _handle_error_message() - log & continue
42+
Note over BG: Connection stays alive (non-fatal error)
43+
44+
Dev->>Main: exit context manager
45+
Main->>BG: stop() - signal close
46+
BG->>BG: _close_future.set_result() - cleanup
47+
```
48+
49+
## Thread Synchronization & Event Loop Management
50+
51+
### Why Two Different Future Types?
52+
53+
**The challenge is synchronizing between the main thread (no event loop) and background thread (with event loop).**
54+
55+
**Main Thread Problem**:
56+
```python
57+
self._init_future: futures.Future[None] = futures.Future()
58+
```
59+
When `MCPClient.__init__()` runs, no event loop exists yet. The background thread hasn't started, so we cannot use `asyncio.Future`. We must use `concurrent.futures.Future` which works without an event loop. This allows the main thread to block safely on `self._init_future.result(timeout=startup_timeout)` until the background thread signals readiness.
60+
61+
**Background Thread Solution**:
62+
```python
63+
self._close_future: asyncio.futures.Future[None] | None = None
64+
# Later in _async_background_thread:
65+
self._close_future = asyncio.Future() # Created inside event loop
66+
```
67+
Once the background thread's event loop is running, we can create `asyncio.Future` objects. The background thread needs to `await self._close_future` to stay alive because we want to keep the MCP connection running on this dedicated event loop. The session must remain active to handle incoming messages and process tool calls. We cannot use `concurrent.futures.Future` here because blocking on `.result()` would freeze the event loop, preventing it from processing MCP messages. Using `asyncio.Future` with `await` keeps the event loop responsive while waiting for the shutdown signal.
68+
69+
## Exception Handling, Hanging, & Connection Termination
70+
71+
### Hanging Scenarios & Defenses
72+
73+
**Hanging Scenario 1: Silent Exception Swallowing** ([PR #922](https://github.com/strands-agents/sdk-python/pull/922))
74+
75+
*Problem*: MCP SDK silently swallows server exceptions (HTTP timeouts, connection errors) without a message handler. Tool calls timeout on server side but client waits indefinitely for responses that never arrive.
76+
77+
*Defense*: `message_handler=self._handle_error_message` in ClientSession
78+
```python
79+
async with ClientSession(
80+
read_stream,
81+
write_stream,
82+
message_handler=self._handle_error_message, # Prevents hanging
83+
elicitation_callback=self._elicitation_callback,
84+
) as session:
85+
```
86+
87+
*How it works in Strands' threaded setup*:
88+
89+
1. **Main thread calls** `client.call_tool_sync()` and blocks on `invoke_future.result()`
90+
2. **Background thread** submits the tool request to MCP server via `asyncio.run_coroutine_threadsafe()`
91+
3. **Server times out** and sends an exception message back to the MCP client
92+
4. **Without message handler**: MCP SDK silently ignores the exception, never calls `Future.set_result()` or `Future.set_exception()`
93+
5. **Main thread hangs forever** waiting for `invoke_future.result()` that will never complete
94+
6. **With `_handle_error_message`**: Exception is raised in background thread, propagates to `Future.set_exception()`, unblocks main thread
95+
96+
The threading architecture makes this particularly problematic because the main thread has no way to detect that the background thread received an error - it can only wait for the Future to complete. Without the message handler, that Future never gets resolved.
97+
98+
**Hanging Scenario 2: 5xx Server Errors** ([PR #1169](https://github.com/strands-agents/sdk-python/pull/1169))
99+
100+
*Problem*: When MCP servers return 5xx errors, the underlying client raises an exception that cancels all TaskGroup tasks and tears down the entire asyncio background thread. Pending tool calls hang forever waiting for responses from a dead connection.
101+
102+
*Defense*: Session closure detection in `_invoke_on_background_thread`
103+
```python
104+
async def run_async() -> T:
105+
invoke_event = asyncio.create_task(coro)
106+
tasks = [invoke_event, close_future]
107+
done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
108+
109+
if done.pop() == close_future:
110+
raise RuntimeError("Connection to the MCP server was closed")
111+
else:
112+
return await invoke_event
113+
```
114+
115+
*How it works*: All tool calls race against `close_future`. When the background thread dies from 5xx errors, `close_future` completes and pending operations immediately fail with a clear error message instead of hanging.
116+
117+
### Defense Against Premature Connection Collapse
118+
119+
Before [PR #922](https://github.com/strands-agents/sdk-python/pull/922), the MCP client would never collapse connections because exceptions were silently ignored. After adding `_handle_error_message`, we introduced the risk of collapsing connections on client-side errors that should be recoverable. The challenge is ensuring we:
120+
121+
1. **DO collapse** when we want to (fatal server errors)
122+
2. **DO NOT collapse** when we don't want to (client-side errors, orphaned responses)
123+
3. **DO notify users** when collapse occurs ([PR #1169](https://github.com/strands-agents/sdk-python/pull/1169) detection)
124+
125+
**Non-Fatal Error Patterns**:
126+
```python
127+
# Errors that should NOT terminate the connection
128+
_NON_FATAL_ERROR_PATTERNS = ["unknown request id"]
129+
```
130+
131+
**Why "unknown request id" is Non-Fatal**:
132+
Client receives a response from server with an ID it doesn't recognize (orphaned response). This happens when responses arrive after their corresponding requests have timed out or been cancelled. More broadly, once a connection is established, the server can send whatever it wants - the client should generally remain stable and not collapse the connection over unexpected messages. "Unknown request id" is just one example of server behavior that shouldn't terminate an otherwise healthy connection.
133+
134+
**Connection Decision Flow**:
135+
1. MCP server sends error message to client
136+
2. `ClientSession` calls `message_handler=self._handle_error_message`
137+
3. **Decision point**: Is error in `_NON_FATAL_ERROR_PATTERNS`?
138+
- **Yes**: Log and continue (connection stays alive)
139+
- **No**: Raise exception (connection collapses)
140+
4. If collapsed: Exception propagates to `_async_background_thread`
141+
5. Background thread exits, `_close_exception` set for main thread
142+
6. Pending operations detect collapse via `close_future` and fail with clear error
143+
144+
**Why This Strategy Works**:
145+
We get the benefits of [PR #922](https://github.com/strands-agents/sdk-python/pull/922) (no hanging) while avoiding unnecessary connection collapse from recoverable client-side errors. When collapse does occur, [PR #1169](https://github.com/strands-agents/sdk-python/pull/1169) ensures users get clear error messages instead of hanging.

src/strands/tools/mcp/mcp_client.py

Lines changed: 19 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -75,6 +75,14 @@ class ToolFilters(TypedDict, total=False):
7575
"https://strandsagents.com/latest/user-guide/concepts/tools/mcp-tools/#mcpclientinitializationerror"
7676
)
7777

78+
# Non-fatal error patterns that should not cause connection collapse
79+
_NON_FATAL_ERROR_PATTERNS = [
80+
# Occurs when client receives response with unrecognized ID
81+
# Can occur after a client-side timeout
82+
# See: https://github.com/modelcontextprotocol/python-sdk/blob/c51936f61f35a15f0b1f8fb6887963e5baee1506/src/mcp/shared/session.py#L421
83+
"unknown request id",
84+
]
85+
7886

7987
class MCPClient(ToolProvider):
8088
"""Represents a connection to a Model Context Protocol (MCP) server.
@@ -558,13 +566,6 @@ def _handle_tool_result(self, tool_use_id: str, call_tool_result: MCPCallToolRes
558566

559567
return result
560568

561-
# Raise an exception if the underlying client raises an exception in a message
562-
# This happens when the underlying client has an http timeout error
563-
async def _handle_error_message(self, message: Exception | Any) -> None:
564-
if isinstance(message, Exception):
565-
raise message
566-
await anyio.lowlevel.checkpoint()
567-
568569
async def _async_background_thread(self) -> None:
569570
"""Asynchronous method that runs in the background thread to manage the MCP connection.
570571
@@ -616,6 +617,17 @@ async def _async_background_thread(self) -> None:
616617
"encountered exception on background thread after initialization %s", str(e)
617618
)
618619

620+
# Raise an exception if the underlying client raises an exception in a message
621+
# This happens when the underlying client has an http timeout error
622+
async def _handle_error_message(self, message: Exception | Any) -> None:
623+
if isinstance(message, Exception):
624+
error_msg = str(message).lower()
625+
if any(pattern in error_msg for pattern in _NON_FATAL_ERROR_PATTERNS):
626+
self._log_debug_with_thread("ignoring non-fatal MCP session error", message)
627+
else:
628+
raise message
629+
await anyio.lowlevel.checkpoint()
630+
619631
def _background_task(self) -> None:
620632
"""Sets up and runs the event loop in the background thread.
621633

tests/strands/tools/mcp/test_mcp_client.py

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -688,3 +688,38 @@ def __init__(self):
688688
mock_session.call_tool.assert_called_once_with("get_file_contents", {}, None)
689689
assert result["status"] == "success"
690690
assert len(result["content"]) == 0 # Unknown resource type should be dropped
691+
692+
693+
@pytest.mark.asyncio
694+
async def test_handle_error_message_non_fatal_error():
695+
"""Test that _handle_error_message ignores non-fatal errors and logs them."""
696+
client = MCPClient(MagicMock())
697+
698+
# Test the message handler directly with a non-fatal error
699+
with patch.object(client, "_log_debug_with_thread") as mock_log:
700+
# This should not raise an exception
701+
await client._handle_error_message(Exception("unknown request id: abc123"))
702+
703+
# Verify the non-fatal error was logged as ignored
704+
assert mock_log.called
705+
call_args = mock_log.call_args[0]
706+
assert "ignoring non-fatal MCP session error" in call_args[0]
707+
708+
709+
@pytest.mark.asyncio
710+
async def test_handle_error_message_fatal_error():
711+
"""Test that _handle_error_message raises fatal errors."""
712+
client = MCPClient(MagicMock())
713+
714+
# This should raise the exception
715+
with pytest.raises(Exception, match="connection timeout"):
716+
await client._handle_error_message(Exception("connection timeout"))
717+
718+
719+
@pytest.mark.asyncio
720+
async def test_handle_error_message_non_exception():
721+
"""Test that _handle_error_message handles non-exception messages."""
722+
client = MCPClient(MagicMock())
723+
724+
# This should not raise an exception
725+
await client._handle_error_message("normal message")

tests_integ/mcp/test_mcp_client.py

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -487,3 +487,38 @@ def transport_callback() -> MCPTransport:
487487

488488
assert result["status"] == "error"
489489
assert result["content"][0]["text"] == "Tool execution failed: Connection to the MCP server was closed"
490+
491+
492+
def test_mcp_client_connection_stability_with_client_timeout():
493+
"""Integration test to verify connection remains stable with very small timeouts."""
494+
from datetime import timedelta
495+
from unittest.mock import patch
496+
497+
stdio_mcp_client = MCPClient(
498+
lambda: stdio_client(StdioServerParameters(command="python", args=["tests_integ/mcp/echo_server.py"]))
499+
)
500+
501+
with stdio_mcp_client:
502+
# Spy on the logger to capture non-fatal error messages
503+
with patch.object(stdio_mcp_client, "_log_debug_with_thread") as mock_log:
504+
# Make multiple calls with very small timeout to trigger "unknown request id" errors
505+
for i in range(3):
506+
try:
507+
result = stdio_mcp_client.call_tool_sync(
508+
tool_use_id=f"test_{i}",
509+
name="echo",
510+
arguments={"to_echo": f"test_{i}"},
511+
read_timeout_seconds=timedelta(milliseconds=0), # Very small timeout
512+
)
513+
except Exception:
514+
pass # Ignore exceptions, we're testing connection stability
515+
516+
# Verify connection is still alive by making a successful call
517+
result = stdio_mcp_client.call_tool_sync(
518+
tool_use_id="final_test", name="echo", arguments={"to_echo": "connection_alive"}
519+
)
520+
assert result["status"] == "success"
521+
assert result["content"][0]["text"] == "connection_alive"
522+
523+
# Verify that non-fatal error messages were logged
524+
assert any("ignoring non-fatal MCP session error" in str(call) for call in mock_log.call_args_list)

0 commit comments

Comments
 (0)