Enable Session Roaming Across Multiple Server Instances

Problem

When deploying MCP servers across multiple instances (Kubernetes pods, Docker containers, worker processes), sessions are tied to the specific instance that created them. This requires sticky sessions at the load balancer level and prevents true horizontal scaling. Users are currently forced to choose between:

  1. Sticky sessions - Suboptimal load distribution, sessions lost on pod failure
  2. Single worker - Wastes resources, limits throughput
  3. Stateless mode - Loses session continuity and event replay

This limitation is documented in multiple issues: modelcontextprotocol#520 (multi-worker sessions), modelcontextprotocol#692 (session reuse across instances), modelcontextprotocol#880 (horizontal scalability), and modelcontextprotocol#1350 (sticky session problems).

Solution

This PR enables session roaming - allowing sessions to seamlessly move between server instances without requiring sticky sessions. The key insight is that EventStore already serves as proof of session existence.

When a request arrives with a session ID that is not in an instance's local memory and an EventStore is configured, the instance can safely do the following (a rough sketch follows the list):

  1. Create a transport for that session ID (session roaming)
  2. Let EventStore replay any missed events (continuity)
  3. Handle the request seamlessly
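
In pseudocode, the decision boils down to this. The sketch uses illustrative names only; it is not the SDK's actual _handle_stateful_request() implementation:

def resolve_session(session_id: str, local_transports: dict, event_store: object | None) -> str:
    """Illustrative only: how an instance decides what to do with an incoming session ID."""
    if session_id in local_transports:
        return "reuse local transport"        # fast path: the session was created on this instance
    if event_store is not None:
        return "create transport and roam"    # shared store can prove the session and replay events
    return "404 Not Found"                    # no EventStore: unknown session IDs are rejected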

What Changed

Modified streamable_http_manager.py (~50 lines):

  • Added session roaming logic in _handle_stateful_request()
  • When unknown session ID + EventStore exists → create transport (roaming!)
  • Extracted duplicate server task code into reusable methods
  • Updated docstrings to document session roaming capability

Added comprehensive tests (test_session_roaming.py, 510 lines):

  • Session roaming with EventStore
  • Rejection without EventStore
  • Concurrent request handling
  • Exception cleanup behavior
  • Fast path verification
  • Logging verification

Added production-ready example (simple-streamablehttp-roaming/, 13 files):

  • Complete working example with Redis EventStore
  • Multi-instance deployment support
  • Docker Compose configuration (3 instances + Redis + NGINX)
  • Kubernetes deployment example
  • Automated test script demonstrating roaming
  • Comprehensive documentation (README, QUICKSTART, implementation details)

Why This Approach

Previous Attempts

We explored two other approaches before arriving at this solution:

  1. Custom Session Store (outside SDK) - Created our own session validation in the application layer, but this didn't solve the core problem and required every user to implement their own solution.

  2. SessionStore ABC (in SDK) - Added a new SessionStore interface requiring both EventStore + SessionStore parameters. While functional, this approach required two separate storage backends and was more complex than necessary.

Current Approach: EventStore-Only

The key insight: EventStore already proves sessions existed. If events exist for a session ID, that session must have existed to create those events. No separate SessionStore needed.

Benefits:

  • ✅ One store instead of two (simpler)
  • ✅ Reuses existing EventStore interface (no new APIs)
  • ✅ Impossible to misconfigure (EventStore = both events + proof)
  • ✅ Aligns with SEP-1359 (sessions are conversation context, not auth)
  • ✅ Minimal code changes (~50 lines)
  • ✅ 100% backward compatible (behavior enhancement only)

Usage

Before (Requires Sticky Sessions)

# Without EventStore - sessions in memory only
manager = StreamableHTTPSessionManager(app=app)
# Deployment: requires sticky sessions for multi-instance

After (No Sticky Sessions Needed)

# With EventStore - sessions roam freely
event_store = RedisEventStore(redis_url="redis://redis:6379")
manager = StreamableHTTPSessionManager(
    app=app,
    event_store=event_store  # Enables session roaming!
)
# Deployment: load balancer can route freely, no sticky sessions
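
For context, each instance exposes the manager as an ASGI app in the usual way (this follows the Starlette wiring used in the SDK's streamable HTTP examples; the route path and variable names are illustrative, and `app` is the low-level MCP Server instance). Every replica runs the same code and shares the Redis EventStore:

import contextlib
from collections.abc import AsyncIterator

from starlette.applications import Starlette
from starlette.routing import Mount
from starlette.types import Receive, Scope, Send

# `manager` is the StreamableHTTPSessionManager created above.

async def handle_streamable_http(scope: Scope, receive: Receive, send: Send) -> None:
    await manager.handle_request(scope, receive, send)

@contextlib.asynccontextmanager
async def lifespan(_: Starlette) -> AsyncIterator[None]:
    async with manager.run():  # starts the manager's task group for this instance
        yield

starlette_app = Starlette(
    routes=[Mount("/mcp", app=handle_streamable_http)],
    lifespan=lifespan,
)
# Run one copy per instance, e.g.: uvicorn server:starlette_app --port 8000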

How It Works

Client → Instance 1 (creates session "abc123", stores events in Redis)
Client → Instance 2 (with session "abc123")
  ↓
Instance 2 checks memory → not found
Instance 2 sees EventStore exists
Instance 2 creates transport for "abc123" (roaming!)
EventStore replays events from Redis
Session continues seamlessly ✅
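
The example ships a Redis-backed EventStore; the snippet below is only a rough sketch of what such a store can look like. It assumes the EventStore interface from the SDK's streamable HTTP module (store_event() and replay_events_after(), as in the in-memory example store); the class name, key layout, and event-ID scheme are illustrative and not necessarily what the example uses:

import json
from uuid import uuid4

import redis.asyncio as redis

from mcp.server.streamable_http import EventCallback, EventId, EventMessage, EventStore, StreamId
from mcp.types import JSONRPCMessage


class RedisEventStore(EventStore):
    """Keep per-stream events in Redis so any instance can replay them."""

    def __init__(self, redis_url: str, key_prefix: str = "mcp:events:") -> None:
        self._redis = redis.from_url(redis_url, decode_responses=True)
        self._prefix = key_prefix

    async def store_event(self, stream_id: StreamId, message: JSONRPCMessage) -> EventId:
        # Embed the stream ID in the event ID so replay can find the right list.
        event_id = f"{stream_id}:{uuid4().hex}"
        entry = json.dumps(
            {
                "event_id": event_id,
                "message": message.model_dump(mode="json", by_alias=True, exclude_none=True),
            }
        )
        await self._redis.rpush(self._prefix + stream_id, entry)
        return event_id

    async def replay_events_after(self, last_event_id: EventId, send_callback: EventCallback) -> StreamId | None:
        stream_id = last_event_id.split(":", 1)[0]
        entries = await self._redis.lrange(self._prefix + stream_id, 0, -1)
        seen_last = False
        for raw in entries:
            entry = json.loads(raw)
            if seen_last:
                message = JSONRPCMessage.model_validate(entry["message"])
                await send_callback(EventMessage(message, entry["event_id"]))
            elif entry["event_id"] == last_event_id:
                seen_last = True
        return stream_id if seen_last else None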

Testing

All tests pass, including:

  • ✅ Existing test suite (no regressions)
  • ✅ 8 new tests for session roaming
  • ✅ Automated roaming test script in example
  • ✅ Type checking (pyright)
  • ✅ Linting (ruff)

Production Deployment

The included example demonstrates:

  • Multi-instance deployment with Docker Compose
  • Kubernetes manifests (3 replicas, no sessionAffinity needed)
  • NGINX load balancing without sticky sessions
  • Redis EventStore for shared state
  • Automated testing and verification

Breaking Changes

None. This is a pure behavior enhancement:

  • ✅ Existing code works unchanged
  • ✅ No API changes
  • ✅ No new required parameters
  • ✅ Backward compatible

Related Issues

Closes modelcontextprotocol#520, modelcontextprotocol#692, modelcontextprotocol#880, modelcontextprotocol#1350

This implementation addresses the core limitation described in all these issues: the inability to run stateful MCP servers across multiple instances without sticky sessions.

Add session roaming support to StreamableHTTPSessionManager, allowing
sessions to move freely between server instances without requiring
sticky sessions. This enables true horizontal scaling and high
availability for stateful MCP servers.

When a request arrives with a session ID not found in local memory,
the presence of an EventStore allows creating a transport for that
session. EventStore serves dual purposes: storing events (existing)
and proving session existence (new). This eliminates the need for
separate session validation storage.

Changes:
- Add session roaming logic in _handle_stateful_request()
- Extract duplicate server task code into reusable methods
- Update docstrings to document session roaming capability
- Add 8 comprehensive tests for session roaming scenarios
- Add production-ready example with Redis EventStore
- Include Kubernetes and Docker Compose deployment examples

Benefits:
- One store instead of two (EventStore serves both purposes)
- No new APIs or interfaces required
- Minimal code changes (~50 lines in manager)
- 100% backward compatible
- Enables multi-instance deployments without sticky sessions

Example usage:
  event_store = RedisEventStore(redis_url="redis://redis:6379")
  manager = StreamableHTTPSessionManager(
      app=app,
      event_store=event_store  # Enables session roaming
  )

Github-Issue: modelcontextprotocol#520
Github-Issue: modelcontextprotocol#692
Github-Issue: modelcontextprotocol#880
Github-Issue: modelcontextprotocol#1350

Change single quotes to double quotes to comply with prettier formatting requirements.

- Add language specifiers to all code blocks
- Fix heading hierarchy (bold text to proper headings)
- Add blank lines after headings for better readability
- Escape underscores in file paths (__init__.py -> \_\_init\_\_.py)

The transport could be removed from _server_instances by the cleanup task if
its server task crashed immediately after being started. This caused a
KeyError when the transport was then looked up in the dictionary.

Fixed by keeping a local reference to the transport instead of looking
it up again from the dictionary after starting the server task.
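
A toy illustration of the race (not the SDK's code; names are made up): once the server task has started, a cleanup task may remove the dictionary entry at any moment, so only the local reference is safe to use afterwards.

import anyio

async def main() -> None:
    server_instances: dict[str, str] = {}

    async def cleanup(session_id: str) -> None:
        server_instances.pop(session_id, None)  # simulates cleanup after a crashed server task

    transport = "transport-for-abc123"
    server_instances["abc123"] = transport

    async with anyio.create_task_group() as tg:
        tg.start_soon(cleanup, "abc123")   # may run before the dict is read again
        await anyio.sleep(0)               # yield so cleanup can run
        # server_instances["abc123"]       # would raise KeyError here
        assert transport == "transport-for-abc123"  # the local reference is still usable

anyio.run(main)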

Use @contextlib.asynccontextmanager decorator instead of manual
__aenter__/__aexit__ implementation for mock_connect functions.

Fixes test failures in:
- test_transport_server_task_cleanup_on_exception
- test_transport_server_task_no_cleanup_on_terminated

Add AsyncIterator import and use proper return type annotation for
mock_connect functions: AsyncIterator[tuple[AsyncMock, AsyncMock]]
instead of Any.
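
For reference, the mock shape these two fixes describe looks roughly like this (assumed shape, not the exact test code):

import contextlib
from collections.abc import AsyncIterator
from unittest.mock import AsyncMock

@contextlib.asynccontextmanager
async def mock_connect() -> AsyncIterator[tuple[AsyncMock, AsyncMock]]:
    # Yield a fake (read_stream, write_stream) pair, mirroring transport.connect().
    yield AsyncMock(), AsyncMock()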

The tests were failing because AsyncMock(return_value=None) caused
app.run to complete immediately, which closed the transport streams
and triggered cleanup that removed transports from _server_instances
before assertions could check for them.

Now using mock_app_run that calls anyio.sleep_forever() and blocks
until the test context cancels it. This keeps transports alive during
the test assertions.
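
Roughly (assumed shape, not the exact test code):

import anyio

async def mock_app_run(*args: object, **kwargs: object) -> None:
    # Block until cancelled by the surrounding scope, keeping the transport
    # registered in _server_instances while the test makes its assertions.
    await anyio.sleep_forever()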
3 participants