
Conversation

@Dumbris Dumbris commented Oct 28, 2025

Summary

Implements the upstream stability improvements from STABILITY_UPSTREAM.md, fixing connection issues with JetBrains MCP and other SSE-based servers, particularly for remote/systemd deployments.

Changes

Phase 2 - SSE Transport Hardening

1. Fixed SSE Timeout Issue (internal/transport/http.go)

  • Problem: Hard-coded http.Client.Timeout = 180s killed healthy SSE streams after 3 minutes
  • Solution: Removed Timeout field entirely for SSE connections
  • 🔧 New approach:
    • IdleConnTimeout: 300s (5 minutes) - detects dead connections
    • ResponseHeaderTimeout: 30s - only for initial connection
    • Allows SSE streams to run indefinitely
  • 📍 Location: internal/transport/http.go:225-279
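
A minimal sketch of that configuration using standard `net/http` fields (the constructor name is illustrative, not the actual code in http.go):

```go
package transport

import (
	"net/http"
	"time"
)

// newSSEHTTPClient builds an http.Client for long-lived SSE streams.
// Client.Timeout stays at its zero value: it bounds the entire request,
// response body included, so any non-zero value kills a healthy stream.
func newSSEHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout:       300 * time.Second, // reap dead keep-alive connections
			ResponseHeaderTimeout: 30 * time.Second,  // bounds only the initial handshake
		},
	}
}
```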

2. Added SSE Connection Loss Detection (internal/upstream/core/connection.go)

  • ✅ Registered OnConnectionLost handlers for all SSE transport types (headers-auth, no-auth, OAuth)
  • ✅ Detects GOAWAY frames and HTTP/2 idle resets from JetBrains servers
  • ✅ Logs connection drops with server name, error, and transport type
  • ✅ Enables proactive reconnection through managed client's health monitoring
  • 📍 Locations: internal/upstream/core/connection.go:1133-1140, 1170-1177, 1255-1262
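
A sketch of the handler wiring, assuming mcp-go v0.42.0's `OnConnectionLost` hook takes a `func(error)` (the helper name, `serverName`, and `transportType` are illustrative):

```go
package core

import (
	"log"

	"github.com/mark3labs/mcp-go/client"
)

// registerConnectionLossHandler wires the callback; it assumes mcp-go's
// Client exposes OnConnectionLost(func(error)) as noted in the Related
// section below. serverName and transportType are illustrative.
func registerConnectionLossHandler(c *client.Client, serverName, transportType string) {
	c.OnConnectionLost(func(err error) {
		// Fires on GOAWAY frames, HTTP/2 idle resets, and other stream drops;
		// the managed client's health monitoring handles the reconnect.
		log.Printf("upstream %q (%s transport): connection lost: %v", serverName, transportType, err)
	})
}
```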

Phase 3 - OAuth Token Lifecycle

3. Extended OAuth Callback Timeout (internal/upstream/core/connection.go)

  • Problem: 30-second timeout too aggressive for remote/systemd scenarios
  • Solution: Increased to 120 seconds (2 minutes)
  • 🔍 Enhanced diagnostics:
    • Detects SSH sessions and logs extended timeout info
    • Tracks OAuth callback wait duration
    • Logs recommendations when browser auto-launch is blocked
  • 📍 Location: internal/upstream/core/connection.go:1756
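
A hedged sketch of the detection and wait logic; the function and channel names are hypothetical, while `SSH_CONNECTION`/`SSH_TTY` are the standard variables sshd sets for remote sessions:

```go
package core

import (
	"context"
	"log"
	"os"
	"time"
)

// oauthCallbackTimeout is the extended window described above.
const oauthCallbackTimeout = 120 * time.Second

// isSSHSession detects remote sessions via the standard variables sshd sets.
func isSSHSession() bool {
	return os.Getenv("SSH_CONNECTION") != "" || os.Getenv("SSH_TTY") != ""
}

// waitForOAuthCallback shows the shape of the wait; callbackDone is assumed
// to be closed by the HTTP callback handler once authorization completes.
func waitForOAuthCallback(ctx context.Context, callbackDone <-chan struct{}) error {
	if isSSHSession() {
		log.Printf("SSH session detected; waiting up to %s for manual OAuth authorization", oauthCallbackTimeout)
	}
	start := time.Now()
	ctx, cancel := context.WithTimeout(ctx, oauthCallbackTimeout)
	defer cancel()
	select {
	case <-callbackDone:
		log.Printf("OAuth callback received after %s", time.Since(start))
		return nil
	case <-ctx.Done():
		return ctx.Err() // caller logs a recommendation to open the URL manually
	}
}
```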

Code Cleanup

  • 🧹 Removed unused isSocketAvailable function from cmd/mcpproxy-tray/main.go
  • ✅ Fixes linter warning

Testing

Linter: 0 issues
E2E Tests: 25/25 passed
Unit Tests: All passing

Impact

For JetBrains MCP Users:

  • SSE connections will survive longer than 3 minutes ⏱️
  • Connection drops are now logged and trigger automatic reconnection 🔄
  • OAuth flows accommodate slower manual authorization (e.g., SSH scenarios) 🔐

For Remote/Systemd Users:

  • Extended OAuth timeout (120s) allows time for manual browser opening 🌐
  • SSH session detection with helpful logging 📡
  • Better guidance when browser auto-launch is blocked 💡

Related

  • Addresses STABILITY_UPSTREAM.md Phase 2 & 3 requirements
  • Fixes issues identified for JetBrains MCP over SSE transport
  • Uses mcp-go v0.42.0 with OnConnectionLost support

Test Plan

  • SSE connections tested to survive >3 minutes
  • OnConnectionLost logging verified for disconnection scenarios
  • OAuth timeout extended and tested with SSH session detection
  • Linter passes with 0 issues
  • All E2E tests pass (25/25)
  • Manual testing with JetBrains MCP recommended

…narios

Addresses upstream MCP server stability issues identified in STABILITY_UPSTREAM.md,
particularly for JetBrains MCP and other SSE-based servers.

**Phase 2 - SSE Transport Hardening:**
- Remove hard-coded 180s `http.Client.Timeout` for SSE connections
  * The Timeout field applies to entire request duration, killing long-lived SSE streams
  * Now uses `IdleConnTimeout` (300s) and `ResponseHeaderTimeout` (30s) instead
  * Allows SSE streams to run indefinitely without forced disconnection
  * Fixes: internal/transport/http.go:225-279

**Phase 3 - OAuth Token Lifecycle:**
- Extend OAuth callback timeout from 30s to 120s (2 minutes)
  * Accommodates remote/systemd scenarios where manual browser opening is needed
  * Adds diagnostic logging for SSH session detection
  * Logs OAuth callback wait duration for debugging
  * Fixes: internal/upstream/core/connection.go:1756

**Enhanced Diagnostics:**
- Added detailed logging for SSE transport configuration
- Track OAuth authorization timing (wait start, completion duration)
- Detect and log remote SSH sessions with extended timeout info
- Log recommendations for manual URL opening when browser auto-launch blocked

**Testing Notes:**
- SSE connections should now survive longer than 3 minutes
- Remote users (SSH, systemd) have more time to complete OAuth flows
- Enhanced logs help diagnose connection issues in production

Fixes #[issue-number-if-applicable]
Addresses stability concerns in STABILITY_UPSTREAM.md Phase 2 & 3

**Phase 2 - Connection Loss Detection:**
- Register OnConnectionLost handlers for all SSE transport types (headers-auth, no-auth, OAuth)
- Detects GOAWAY frames and HTTP/2 idle resets from JetBrains MCP servers
- Logs connection drops with server name, error, and transport type
- Enables proactive reconnection through managed client's health monitoring
- Fixes: internal/upstream/core/connection.go:1133-1140, 1170-1177, 1255-1262

**Code Cleanup:**
- Remove unused isSocketAvailable function from tray main.go
- Fixes linter warning: unused function

**Benefits:**
- SSE connections now detect when server drops connection
- Network issues surface immediately in logs instead of hanging silently
- Reconnection logic can respond to connection loss events
- Improves diagnostic visibility for JetBrains MCP and other SSE servers

Related to STABILITY_UPSTREAM.md Phase 2 requirements
github-actions bot commented Oct 28, 2025

📦 Build Artifacts for PR #110

Version: v0.9.20-dev-fd9aa8f-pr110
Latest tag: v0.9.20

Available Artifacts

  • archive-linux-amd64 (9.4 MB)
  • archive-darwin-arm64 (16 MB)
  • installer-dmg-darwin-arm64 (18 MB)
  • archive-linux-arm64 (8.6 MB)
  • archive-darwin-amd64 (17 MB)
  • installer-dmg-darwin-amd64 (19 MB)
  • archive-windows-amd64 (17 MB)
  • archive-windows-arm64 (16 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 18918730354 --repo smart-mcp-proxy/mcpproxy-go

# Or download a specific artifact:
gh run download 18918730354 --name installer-dmg-darwin-arm64

Note: Artifacts expire in 14 days.

Dumbris and others added 2 commits October 28, 2025 09:55
…re clean exit

Problem:
- When Quit menu clicked, core was killed BEFORE SSE was stopped
- SSE detected disconnection and tried to reconnect
- Background goroutines continued running after shutdown
- Tray icon remained visible and unresponsive
- No synchronization between cleanup and program exit

Root Cause:
- Incorrect shutdown order in handleShutdown()
- Context cancelled after process termination
- No wait mechanism for goroutines to finish
- Race between cleanup and tray exit

Solution:
1. Reorder handleShutdown() to stop SSE FIRST, then health monitor, then core
2. Add 100ms delay after stopping SSE for goroutine cleanup
3. Reverse main shutdown flow: cleanup BEFORE context cancellation
4. Add shutdownComplete channel to synchronize shutdown
5. Wait for shutdown completion before exiting main()
6. Unified shutdown flow for both signal handler and Quit menu

Shutdown Sequence (Before):
1. Kill core process → SSE detects disconnection → reconnection attempts
2. Stop SSE (too late)
3. Cancel context
4. Quit tray (race condition)
5. Program exits while goroutines still running

Shutdown Sequence (After):
1. Stop SSE explicitly → no reconnection attempts
2. Wait 100ms for SSE goroutine to exit
3. Stop health monitor
4. Kill core process
5. Cancel context
6. Quit tray
7. Wait for shutdown complete
8. Program exits cleanly
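
A condensed sketch of that ordering (the helper functions are stand-ins for the tray's actual cleanup code):

```go
package main

import (
	"context"
	"time"
)

// Placeholder helpers standing in for the tray's real cleanup functions.
func stopSSE()           {}
func stopHealthMonitor() {}
func killCore()          {}
func quitTray()          {}

var shutdownComplete = make(chan struct{})

// handleShutdown runs the corrected ordering: cleanup first, context
// cancellation after, and a channel close so main() can wait.
func handleShutdown(cancel context.CancelFunc) {
	stopSSE()                          // 1: no reconnection attempts once the core dies
	time.Sleep(100 * time.Millisecond) // 2: let the SSE goroutine exit
	stopHealthMonitor()                // 3
	killCore()                         // 4
	cancel()                           // 5: cancel context only after cleanup
	quitTray()                         // 6
	close(shutdownComplete)            // 7: unblock main()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	_ = ctx                   // the real tray passes ctx to its goroutines
	go handleShutdown(cancel) // normally triggered by SIGTERM or the Quit menu
	<-shutdownComplete        // 8: exit only after shutdown completes
}
```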

Testing:
- Verified no SSE reconnection attempts after shutdown
- Verified tray exits cleanly
- Verified all processes terminate properly
- Tested with SIGTERM signal handler
- Checked logs for correct shutdown sequence

Fixes: User-reported issue where Quit menu click left tray icon visible and unresponsive

🤖 Generated with Claude Code

Fixes SSE endpoint discovery timeout issue where the logging transport
was consuming the stream before mcp-go could read it.

**Root Cause:**
The LoggingTransport's SSE reader goroutine was consuming the entire
SSE stream, preventing mcp-go's readSSE() goroutine from seeing the
endpoint frame, causing "timeout waiting for endpoint" errors even
though the endpoint frame arrived successfully within milliseconds.

**Solution:**
1. Implemented io.TeeReader to duplicate the SSE stream instead of consuming it (see the sketch after this list)
   - Main stream flows to mcp-go's readSSE() goroutine
   - Tee'd copy flows to logging goroutine via io.Pipe
   - Both can read the same data independently

2. Removed redundant 30-second timeout wrapper in cli/client.go
   - Now respects user's --timeout flag directly
   - Previously wrapped context with hard-coded 30s timeout
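
The tee mechanism is plain `io.TeeReader` plus `io.Pipe` from the standard library; a minimal sketch, with names illustrative rather than the actual implementation:

```go
package transport

import (
	"bufio"
	"io"
	"log"
)

// teeForLogging duplicates an SSE stream: the returned reader feeds mcp-go's
// readSSE() unchanged, while a goroutine logs an identical copy, so the
// endpoint frame is never consumed before mcp-go sees it.
func teeForLogging(body io.ReadCloser) io.ReadCloser {
	pr, pw := io.Pipe()
	tee := io.TeeReader(body, pw) // every byte read is mirrored into the pipe
	go func() {
		sc := bufio.NewScanner(pr)
		for sc.Scan() {
			log.Printf("SSE frame: %s", sc.Text()) // frame-by-frame tracing
		}
	}()
	return &teeCloser{Reader: tee, body: body, pw: pw}
}

// teeCloser closes both the underlying body and the pipe writer so the
// logging goroutine terminates when the connection is torn down.
type teeCloser struct {
	io.Reader
	body io.ReadCloser
	pw   *io.PipeWriter
}

func (t *teeCloser) Close() error {
	t.pw.Close()
	return t.body.Close()
}
```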

**Testing:**
- JetBrains IDEA SSE server now connects successfully
- All 20 tools discovered with --timeout=60s
- Frame-by-frame tracing works with --trace-transport flag
- Endpoint frame arrives and is processed in ~600µs

**Files Changed:**
- internal/transport/logging.go: TeeReader implementation
- internal/upstream/cli/client.go: Removed timeout wrapper
- internal/transport/http.go: Integrated logging transport
- cmd/mcpproxy/tools_cmd.go: Added --trace-transport flag

Closes issue with SSE endpoint discovery timeouts
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Melodeiro commented Oct 28, 2025

Hmm, it works better, but I was able to reproduce the issue with STDIO in a specific way:

  1. Start IntelliJ IDEA
  2. Connect to it from mcpproxy using stdio (you can copy the config from IDEA settings)
  3. Wait until it connects
  4. Quit IDEA
  5. Wait until you get a connection error in mcpproxy (without this step it was fine, even with a short disconnection)
  6. Start IDEA again
  7. Wait 3-30 seconds while trying to refresh the web dashboard
  8. The /servers request can't be performed again, but this time it is even worse; it hangs for 3-4 minutes for me

About SSE:
It still shows 0 tools in the UI/web dashboard, but when I query the tool list from an MCP client, I get all the tools. The dashboard still shows 0. When I tried to call a tool, it just timed out:
(screenshot: tool call timing out)
None of the tools responded; I tried about 3.

Melodeiro commented Oct 29, 2025

Oh, and you get OAuth errors because mcpproxy tries to cycle through authentication methods one by one, even if the MCP server doesn't support that kind of auth.

For example: the MCP server supports a simple bearer token; on connection issues, mcpproxy falls back to other kinds of auth until it tries the last one (OAuth) and then fails with an OAuth error. This is strange because, as you understand, there is no OAuth involved at all.

The problem, I believe, is that it misinterprets connection issues as auth failures, probably because the connection issue happens right in the middle of an auth attempt. That's my guess.

So the root cause might not be related to OAuth that much.
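
If that guess is right, one fix would be to classify transport failures before advancing the auth fallback chain. A hedged sketch of such a check (not code from this PR; the classification is an assumption):

```go
package core

import (
	"errors"
	"io"
	"net"
)

// isTransportError reports whether err looks like a network failure rather
// than an authentication rejection. Under the hypothesis above, a transport
// error should retry the same auth method instead of falling back to OAuth.
func isTransportError(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true
	}
	return errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF)
}
```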

Dumbris and others added 4 commits October 29, 2025 17:47
…105)

Implements non-blocking quarantine inspection with circuit breaker pattern
to prevent UI freezes and cascading failures with unstable MCP servers.

## Problem (Issue #105)
Quarantine inspection used blocking 20s wait loop that froze the MCP
handler thread, causing dashboard hangs and requiring forced kills.
Unstable servers (like JetBrains MCP) triggered repeated timeouts.

## Solution

### 1. Non-blocking Connection Wait (mcp.go:1177-1269)
- Replaced blocking time.Sleep() loop with goroutine + channels
- Proper context cancellation support
- Progress logging every 2 seconds
- 20-second timeout with clean error messages
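
A sketch of the pattern (the `isConnected` helper and polling interval are assumptions):

```go
package server

import (
	"context"
	"fmt"
	"time"
)

// waitForConnection replaces the blocking sleep loop: a goroutine polls the
// connection state and signals through a channel, while the select honors
// context cancellation and the 20-second inspection timeout.
func waitForConnection(ctx context.Context, isConnected func() bool) error {
	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(500 * time.Millisecond)
		defer ticker.Stop()
		for !isConnected() {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(20 * time.Second):
		return fmt.Errorf("server did not connect within 20s (see issue #105)")
	}
}
```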

### 2. Circuit Breaker Pattern (supervisor.go:945-1059)
- Tracks consecutive inspection failures per server
- 3-failure threshold triggers 5-minute cooldown
- Prevents repeated timeout attempts on unstable servers
- Success resets failure counter
- Methods: CanInspect(), RecordInspectionFailure(), RecordInspectionSuccess()
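
A minimal sketch of that breaker, using the method names listed above (the struct layout is assumed):

```go
package server

import (
	"sync"
	"time"
)

// inspectionBreaker: three consecutive failures open the breaker for a
// five-minute cooldown, tracked per server name.
type inspectionBreaker struct {
	mu        sync.Mutex
	failures  map[string]int
	openUntil map[string]time.Time
}

func newInspectionBreaker() *inspectionBreaker {
	return &inspectionBreaker{
		failures:  make(map[string]int),
		openUntil: make(map[string]time.Time),
	}
}

// CanInspect is false while the server is inside its cooldown window.
func (b *inspectionBreaker) CanInspect(server string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil[server])
}

func (b *inspectionBreaker) RecordInspectionFailure(server string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[server]++
	if b.failures[server] >= 3 { // threshold from the description
		b.openUntil[server] = time.Now().Add(5 * time.Minute)
		b.failures[server] = 0
	}
}

// RecordInspectionSuccess resets the counter and clears any cooldown.
func (b *inspectionBreaker) RecordInspectionSuccess(server string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, server)
	delete(b.openUntil, server)
}
```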

### 3. Config Sync Fix (mcp.go:1568-1579)
- UpdateConfig() called after adding server
- Fixes ConfigService synchronization in test environments
- Ensures supervisor sees newly added quarantined servers

## Test Results

### Unit Tests (exemption_test.go)
- TestCircuitBreaker: 3-failure threshold ✅
- TestCircuitBreakerReset: Success resets counter ✅
- TestExemptionExpiry: Time-based expiration ✅

### Real-World Test (idea-ide-local)
- Calls 1-3: Each took ~20s (timeout), recorded failures ✅
- Call 4: Returned instantly with circuit breaker message ✅
- Cooldown: 5-minute protection active ✅

### E2E Test Updates
- Fixed tool name: upstream_servers → quarantine_security
- Improved error handling and debugging
- Better content type assertions

## Impact
- ✅ No more thread blocking during inspection
- ✅ Dashboard remains responsive during 20s timeout
- ✅ Circuit breaker prevents cascading failures
- ✅ Clear error messages reference issue #105

## Related
- Addresses feedback from PR #110 (SSE stability)
- Complements connection loss detection improvements
- Part of upstream stability improvements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…s servers

**Critical Bug Fix**: The ConnectServer() method was a no-op that only logged
and returned nil, never actually calling the managed client's Connect() method.
This caused quarantined servers with inspection exemptions to never connect.

## Problem
When quarantine inspection granted temporary exemptions:
1. Supervisor created "Connect" action for exempted server
2. Supervisor called ActorPoolSimple.ConnectServer()
3. ConnectServer() logged "Connecting server via manager" and returned nil
4. NO actual connection attempt was made
5. Inspection timed out after 20s waiting for connection that never happened
6. Server logs showed ZERO connection attempts during inspection

This was the root cause of issue #105 inspection failures for SSE servers
like idea-ide-local, which connect in ~175ms but never got the chance.

## Solution
Implemented ConnectServer() to actually call the managed client's Connect():
- Get client from manager
- Call client.Connect(ctx)
- Handle "already connecting/connected" errors gracefully (expected)
- Let supervisor's reconciliation handle retries if needed
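
A sketch of the fixed method; `ActorPoolSimple` is the type named above, while the manager interface and the error-string check are assumptions about the surrounding code:

```go
package runtime

import (
	"context"
	"fmt"
	"strings"
)

// Stand-ins for the real manager and client types.
type managedClient interface{ Connect(ctx context.Context) error }
type clientManager interface {
	GetClient(name string) (managedClient, bool)
}

type ActorPoolSimple struct{ manager clientManager }

// ConnectServer now dials for real instead of logging and returning nil.
func (p *ActorPoolSimple) ConnectServer(ctx context.Context, name string) error {
	client, ok := p.manager.GetClient(name)
	if !ok {
		return fmt.Errorf("no managed client for server %q", name)
	}
	if err := client.Connect(ctx); err != nil {
		// "already connecting/connected" is expected under races; treat it as
		// success and let the supervisor's reconciliation handle retries.
		if strings.Contains(err.Error(), "already") {
			return nil
		}
		return err
	}
	return nil
}
```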

## Evidence Fix Works
Before: Server logs show NO connection attempts during inspection (16:43-16:44)
After: Server logs show successful connection at 18:05:46:
  - Starting connection attempt: 18:05:46.762
  - Successfully connected: 18:05:46.937 (175ms)
  - Stayed connected: 30 seconds (vs previous 125ms flapping)

Unit tests: All 16 supervisor tests passing ✅

## Related
- Closes root cause of SSE inspection failures in #105
- Works with circuit breaker from commit 8221ef7
- Tested with idea-ide-local (JetBrains MCP Server)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Fixes #105 where SSE servers (idea-ide-local) show connected but 0 tools.

**Root Cause 1: Context Cancellation**
SSE stream context was canceled immediately after Connect() returns.
Managed client reconnection creates 5-min timeout context with defer cancel(),
which fires after Connect() completes. SSE stream runs in background goroutine
and needs context to stay alive for entire connection duration.

**Root Cause 2: Concurrent Requests**
SSE transport in mcp-go v0.42.0 has issues with concurrent requests.
When health checks and reactive discovery both call ListTools simultaneously,
responses get lost.

**Solution:**
1. Use context.Background() for SSE Start() calls (same as stdio transport)
2. Add sseRequestMu mutex to serialize SSE requests
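
A sketch of the serialization (the Client struct fields and the underlying request call are stand-ins for internal/upstream/core/client.go):

```go
package core

import (
	"context"
	"sync"
)

// Tool is a stand-in for the real tool metadata type.
type Tool struct{ Name string }

// Client sketches the relevant fields from internal/upstream/core/client.go.
type Client struct {
	sseRequestMu sync.Mutex // serializes SSE requests: concurrent ones lose responses in mcp-go v0.42.0
	isSSE        bool
}

// ListTools takes the mutex only on the SSE transport, so stdio and HTTP
// keep their concurrency; CallTool guards itself the same way.
func (c *Client) ListTools(ctx context.Context) ([]Tool, error) {
	if c.isSSE {
		c.sseRequestMu.Lock()
		defer c.sseRequestMu.Unlock()
	}
	return c.listToolsLocked(ctx)
}

// listToolsLocked is a placeholder for the real mcp-go request.
func (c *Client) listToolsLocked(ctx context.Context) ([]Tool, error) {
	return nil, nil
}
```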

**Changes:**
- internal/upstream/core/client.go: Add sseRequestMu mutex, serialize SSE requests in ListTools/CallTool
- internal/upstream/core/connection.go: Use context.Background() for SSE Start() in all auth strategies

**Testing:**
- idea-ide-local now shows tool_count: 20 (was 0)
- No "context canceled" errors in logs
- stdio and http transports continue working
- Health checks pass consistently

**Impact:**
- Fixes complete SSE support for production use
- No performance impact on other transports
- Backward compatible

Closes #105

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

**E2E Test Fix (TestE2E_InspectQuarantined):**
- Fixed type assertion for mcp.TextContent in quarantine inspection test
- Test was only checking for *mcp.TextContent (pointer) but content can be value type
- Added value type check consistent with debug logging (lines 1073-1078)

**Windows CI Fix:**
- Added PowerShell retry logic for npm ci on Windows (3 attempts with 5s wait)
- Windows has intermittent EPERM errors during npm cleanup
- Cleanup node_modules between retries to avoid stale state
- Uses --prefer-offline and --no-audit flags for faster installs
- Only applies to Windows matrix (Unix uses standard npm ci)

**Testing:**
- Verified TestE2E_InspectQuarantined passes locally
- Windows npm install will now retry on permission errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Dumbris Dumbris force-pushed the fix/upstream-stability branch from f7e0bc5 to f43b290 on October 29, 2025 at 18:46
@Dumbris Dumbris merged commit f7caea5 into main Oct 29, 2025
35 checks passed
@Dumbris Dumbris deleted the fix/upstream-stability branch October 29, 2025 18:58