Enable Session Roaming Across Multiple Server Instances

Problem

When deploying MCP servers across multiple instances (Kubernetes pods, Docker containers, worker processes), sessions are tied to the specific instance that created them. This requires sticky sessions at the load balancer level and prevents true horizontal scaling. Users are currently forced to choose between:

  1. Sticky sessions - Suboptimal load distribution, sessions lost on pod failure
  2. Single worker - Wastes resources, limits throughput
  3. Stateless mode - Loses session continuity and event replay

This limitation is documented in multiple issues: modelcontextprotocol#520 (multi-worker sessions), modelcontextprotocol#692 (session reuse across instances), modelcontextprotocol#880 (horizontal scalability), and modelcontextprotocol#1350 (sticky session problems).

Solution

This PR enables session roaming - allowing sessions to seamlessly move between server instances without requiring sticky sessions. The key insight is that EventStore already serves as proof of session existence.

When a request arrives with a session ID that is not in an instance's local memory and an EventStore is configured, the instance can safely do the following (a rough sketch follows the list):

  1. Create a transport for that session ID (session roaming)
  2. Let EventStore replay any missed events (continuity)
  3. Handle the request seamlessly
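
In pseudocode, the decision boils down to this. The sketch uses illustrative names only; it is not the SDK's actual _handle_stateful_request() implementation:

def resolve_session(session_id: str, local_transports: dict, event_store: object | None) -> str:
    """Illustrative only: how an instance decides what to do with an incoming session ID."""
    if session_id in local_transports:
        return "reuse local transport"        # fast path: the session was created on this instance
    if event_store is not None:
        return "create transport and roam"    # shared store can prove the session and replay events
    return "404 Not Found"                    # no EventStore: unknown session IDs are rejected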

What Changed

Modified streamable_http_manager.py (~50 lines):

  • Added session roaming logic in _handle_stateful_request()
  • When unknown session ID + EventStore exists → create transport (roaming!)
  • Extracted duplicate server task code into reusable methods
  • Updated docstrings to document session roaming capability

Added comprehensive tests (test_session_roaming.py, 510 lines):

  • Session roaming with EventStore
  • Rejection without EventStore
  • Concurrent request handling
  • Exception cleanup behavior
  • Fast path verification
  • Logging verification

Added production-ready example (simple-streamablehttp-roaming/, 13 files):

  • Complete working example with Redis EventStore
  • Multi-instance deployment support
  • Docker Compose configuration (3 instances + Redis + NGINX)
  • Kubernetes deployment example
  • Automated test script demonstrating roaming
  • Comprehensive documentation (README, QUICKSTART, implementation details)

Why This Approach

Previous Attempts

We explored two other approaches before arriving at this solution:

  1. Custom Session Store (outside SDK) - Created our own session validation in the application layer, but this didn't solve the core problem and required every user to implement their own solution.

  2. SessionStore ABC (in SDK) - Added a new SessionStore interface requiring both EventStore + SessionStore parameters. While functional, this approach required two separate storage backends and was more complex than necessary.

Current Approach: EventStore-Only

The key insight: EventStore already proves sessions existed. If events exist for a session ID, that session must have existed to create those events. No separate SessionStore needed.

Benefits:

  • ✅ One store instead of two (simpler)
  • ✅ Reuses existing EventStore interface (no new APIs)
  • ✅ Impossible to misconfigure (EventStore = both events + proof)
  • ✅ Aligns with SEP-1359 (sessions are conversation context, not auth)
  • ✅ Minimal code changes (~50 lines)
  • ✅ 100% backward compatible (behavior enhancement only)

Usage

Before (Requires Sticky Sessions)

# Without EventStore - sessions in memory only
manager = StreamableHTTPSessionManager(app=app)
# Deployment: requires sticky sessions for multi-instance

After (No Sticky Sessions Needed)

# With EventStore - sessions roam freely
event_store = RedisEventStore(redis_url="redis://redis:6379")
manager = StreamableHTTPSessionManager(
    app=app,
    event_store=event_store  # Enables session roaming!
)
# Deployment: load balancer can route freely, no sticky sessions
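
For context, each instance exposes the manager as an ASGI app in the usual way (this follows the Starlette wiring used in the SDK's streamable HTTP examples; the route path and variable names are illustrative, and `app` is the low-level MCP Server instance). Every replica runs the same code and shares the Redis EventStore:

import contextlib
from collections.abc import AsyncIterator

from starlette.applications import Starlette
from starlette.routing import Mount
from starlette.types import Receive, Scope, Send

# `manager` is the StreamableHTTPSessionManager created above.

async def handle_streamable_http(scope: Scope, receive: Receive, send: Send) -> None:
    await manager.handle_request(scope, receive, send)

@contextlib.asynccontextmanager
async def lifespan(_: Starlette) -> AsyncIterator[None]:
    async with manager.run():  # starts the manager's task group for this instance
        yield

starlette_app = Starlette(
    routes=[Mount("/mcp", app=handle_streamable_http)],
    lifespan=lifespan,
)
# Run one copy per instance, e.g.: uvicorn server:starlette_app --port 8000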

How It Works

Client → Instance 1 (creates session "abc123", stores events in Redis)
Client → Instance 2 (with session "abc123")
  ↓
Instance 2 checks memory → not found
Instance 2 sees EventStore exists
Instance 2 creates transport for "abc123" (roaming!)
EventStore replays events from Redis
Session continues seamlessly ✅
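
The example ships a Redis-backed EventStore; the snippet below is only a rough sketch of what such a store can look like. It assumes the EventStore interface from the SDK's streamable HTTP module (store_event() and replay_events_after(), as in the in-memory example store); the class name, key layout, and event-ID scheme are illustrative and not necessarily what the example uses:

import json
from uuid import uuid4

import redis.asyncio as redis

from mcp.server.streamable_http import EventCallback, EventId, EventMessage, EventStore, StreamId
from mcp.types import JSONRPCMessage


class RedisEventStore(EventStore):
    """Keep per-stream events in Redis so any instance can replay them."""

    def __init__(self, redis_url: str, key_prefix: str = "mcp:events:") -> None:
        self._redis = redis.from_url(redis_url, decode_responses=True)
        self._prefix = key_prefix

    async def store_event(self, stream_id: StreamId, message: JSONRPCMessage) -> EventId:
        # Embed the stream ID in the event ID so replay can find the right list.
        event_id = f"{stream_id}:{uuid4().hex}"
        entry = json.dumps(
            {
                "event_id": event_id,
                "message": message.model_dump(mode="json", by_alias=True, exclude_none=True),
            }
        )
        await self._redis.rpush(self._prefix + stream_id, entry)
        return event_id

    async def replay_events_after(self, last_event_id: EventId, send_callback: EventCallback) -> StreamId | None:
        stream_id = last_event_id.split(":", 1)[0]
        entries = await self._redis.lrange(self._prefix + stream_id, 0, -1)
        seen_last = False
        for raw in entries:
            entry = json.loads(raw)
            if seen_last:
                message = JSONRPCMessage.model_validate(entry["message"])
                await send_callback(EventMessage(message, entry["event_id"]))
            elif entry["event_id"] == last_event_id:
                seen_last = True
        return stream_id if seen_last else None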

Testing

All tests pass, including:

  • ✅ Existing test suite (no regressions)
  • ✅ 8 new tests for session roaming
  • ✅ Automated roaming test script in example
  • ✅ Type checking (pyright)
  • ✅ Linting (ruff)

Production Deployment

The included example demonstrates:

  • Multi-instance deployment with Docker Compose
  • Kubernetes manifests (3 replicas, no sessionAffinity needed)
  • NGINX load balancing without sticky sessions
  • Redis EventStore for shared state
  • Automated testing and verification

Breaking Changes

None. This is a pure behavior enhancement:

  • ✅ Existing code works unchanged
  • ✅ No API changes
  • ✅ No new required parameters
  • ✅ Backward compatible

Related Issues

Closes modelcontextprotocol#520, modelcontextprotocol#692, modelcontextprotocol#880, modelcontextprotocol#1350

This implementation addresses the core limitation described in all these issues: the inability to run stateful MCP servers across multiple instances without sticky sessions.

Add session roaming support to StreamableHTTPSessionManager, allowing
sessions to move freely between server instances without requiring
sticky sessions. This enables true horizontal scaling and high
availability for stateful MCP servers.

When a request arrives with a session ID not found in local memory,
the presence of an EventStore allows creating a transport for that
session. EventStore serves dual purposes: storing events (existing)
and proving session existence (new). This eliminates the need for
separate session validation storage.

Changes:
- Add session roaming logic in _handle_stateful_request()
- Extract duplicate server task code into reusable methods
- Update docstrings to document session roaming capability
- Add 8 comprehensive tests for session roaming scenarios
- Add production-ready example with Redis EventStore
- Include Kubernetes and Docker Compose deployment examples

Benefits:
- One store instead of two (EventStore serves both purposes)
- No new APIs or interfaces required
- Minimal code changes (~50 lines in manager)
- 100% backward compatible
- Enables multi-instance deployments without sticky sessions

Example usage:
  event_store = RedisEventStore(redis_url="redis://redis:6379")
  manager = StreamableHTTPSessionManager(
      app=app,
      event_store=event_store  # Enables session roaming
  )

Github-Issue: modelcontextprotocol#520
Github-Issue: modelcontextprotocol#692
Github-Issue: modelcontextprotocol#880
Github-Issue: modelcontextprotocol#1350

Change single quotes to double quotes to comply with prettier formatting requirements.

- Add language specifiers to all code blocks
- Fix heading hierarchy (bold text to proper headings)
- Add blank lines after headings for better readability
- Escape underscores in file paths (__init__.py -> \_\_init\_\_.py)

The transport could be removed from _server_instances by the cleanup task if
its server task crashed immediately after being started. This caused a
KeyError when the transport was then looked up in the dictionary.

Fixed by keeping a local reference to the transport instead of looking
it up again from the dictionary after starting the server task.
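
A toy illustration of the race (not the SDK's code; names are made up): once the server task has started, a cleanup task may remove the dictionary entry at any moment, so only the local reference is safe to use afterwards.

import anyio

async def main() -> None:
    server_instances: dict[str, str] = {}

    async def cleanup(session_id: str) -> None:
        server_instances.pop(session_id, None)  # simulates cleanup after a crashed server task

    transport = "transport-for-abc123"
    server_instances["abc123"] = transport

    async with anyio.create_task_group() as tg:
        tg.start_soon(cleanup, "abc123")   # may run before the dict is read again
        await anyio.sleep(0)               # yield so cleanup can run
        # server_instances["abc123"]       # would raise KeyError here
        assert transport == "transport-for-abc123"  # the local reference is still usable

anyio.run(main)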

Use @contextlib.asynccontextmanager decorator instead of manual
__aenter__/__aexit__ implementation for mock_connect functions.

Fixes test failures in:
- test_transport_server_task_cleanup_on_exception
- test_transport_server_task_no_cleanup_on_terminated

Add AsyncIterator import and use proper return type annotation for
mock_connect functions: AsyncIterator[tuple[AsyncMock, AsyncMock]]
instead of Any.
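
For reference, the mock shape these two fixes describe looks roughly like this (assumed shape, not the exact test code):

import contextlib
from collections.abc import AsyncIterator
from unittest.mock import AsyncMock

@contextlib.asynccontextmanager
async def mock_connect() -> AsyncIterator[tuple[AsyncMock, AsyncMock]]:
    # Yield a fake (read_stream, write_stream) pair, mirroring transport.connect().
    yield AsyncMock(), AsyncMock()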

The tests were failing because AsyncMock(return_value=None) caused
app.run to complete immediately, which closed the transport streams
and triggered cleanup that removed transports from _server_instances
before assertions could check for them.

Now using mock_app_run that calls anyio.sleep_forever() and blocks
until the test context cancels it. This keeps transports alive during
the test assertions.
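
Roughly (assumed shape, not the exact test code):

import anyio

async def mock_app_run(*args: object, **kwargs: object) -> None:
    # Block until cancelled by the surrounding scope, keeping the transport
    # registered in _server_instances while the test makes its assertions.
    await anyio.sleep_forever()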
3 participants