fix(upstream): improve SSE stability and OAuth timeout for JetBrains MCP #110
Conversation
…narios

Addresses upstream MCP server stability issues identified in STABILITY_UPSTREAM.md, particularly for JetBrains MCP and other SSE-based servers.

**Phase 2 - SSE Transport Hardening:**
- Remove hard-coded 180s `http.Client.Timeout` for SSE connections (a configuration sketch follows this message)
  * The `Timeout` field applies to the entire request duration, killing long-lived SSE streams
  * Now uses `IdleConnTimeout` (300s) and `ResponseHeaderTimeout` (30s) instead
  * Allows SSE streams to run indefinitely without forced disconnection
  * Fixes: internal/transport/http.go:225-279

**Phase 3 - OAuth Token Lifecycle:**
- Extend OAuth callback timeout from 30s to 120s (2 minutes)
  * Accommodates remote/systemd scenarios where the browser must be opened manually
  * Adds diagnostic logging for SSH session detection
  * Logs OAuth callback wait duration for debugging
  * Fixes: internal/upstream/core/connection.go:1756

**Enhanced Diagnostics:**
- Added detailed logging for SSE transport configuration
- Track OAuth authorization timing (wait start, completion duration)
- Detect and log remote SSH sessions with extended timeout info
- Log a recommendation to open the URL manually when browser auto-launch is blocked

**Testing Notes:**
- SSE connections should now survive longer than 3 minutes
- Remote users (SSH, systemd) have more time to complete OAuth flows
- Enhanced logs help diagnose connection issues in production

Fixes #[issue-number-if-applicable]
Addresses stability concerns in STABILITY_UPSTREAM.md Phase 2 & 3
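For reference, a minimal sketch of the SSE-friendly client configuration described above. Only the `IdleConnTimeout` and `ResponseHeaderTimeout` values come from the commit message; the helper name `newSSEClient` is illustrative, not the actual code in internal/transport/http.go.

```go
package transport

import (
	"net/http"
	"time"
)

// newSSEClient builds an http.Client suitable for long-lived SSE streams.
// http.Client.Timeout is deliberately left at its zero value, because it
// bounds the *entire* request (headers + body), which would kill a healthy
// SSE stream once the limit elapsed.
func newSSEClient() *http.Client {
	return &http.Client{
		// Timeout: 0 — no overall deadline for the streaming body.
		Transport: &http.Transport{
			// Drop idle keep-alive connections after 5 minutes so stale
			// connections are eventually cleaned up.
			IdleConnTimeout: 300 * time.Second,
			// Bound only the wait for the initial response headers; once
			// the stream is open, this no longer applies.
			ResponseHeaderTimeout: 30 * time.Second,
		},
	}
}
```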
**Phase 2 - Connection Loss Detection:**
- Register `OnConnectionLost` handlers for all SSE transport types (headers-auth, no-auth, OAuth); a sketch follows this message
- Detects GOAWAY frames and HTTP/2 idle resets from JetBrains MCP servers
- Logs connection drops with server name, error, and transport type
- Enables proactive reconnection through the managed client's health monitoring
- Fixes: internal/upstream/core/connection.go:1133-1140, 1170-1177, 1255-1262

**Code Cleanup:**
- Remove unused `isSocketAvailable` function from tray main.go
- Fixes linter warning: unused function

**Benefits:**
- SSE connections now detect when the server drops the connection
- Network issues surface immediately in logs instead of hanging silently
- Reconnection logic can respond to connection loss events
- Improves diagnostic visibility for JetBrains MCP and other SSE servers

Related to STABILITY_UPSTREAM.md Phase 2 requirements
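A sketch of what such a registration might look like, assuming mcp-go's client exposes the `OnConnectionLost(func(error))` hook named in the commit; `wireConnectionLoss` and the logging details are illustrative, not the actual connection.go code.

```go
package core

import (
	"log"

	"github.com/mark3labs/mcp-go/client"
)

// wireConnectionLoss registers a connection-loss callback on an SSE-backed
// mcp-go client so that drops (e.g. HTTP/2 GOAWAY frames or idle resets)
// are logged instead of failing silently, and so reconnection logic in the
// managed client can react via onLost.
func wireConnectionLoss(c *client.Client, serverName, transportType string, onLost func(error)) {
	c.OnConnectionLost(func(err error) {
		log.Printf("SSE connection lost: server=%s transport=%s err=%v",
			serverName, transportType, err)
		onLost(err) // hand off to the managed client's health monitoring
	})
}
```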
📦 Build Artifacts for PR #110

How to Download

Option 1: GitHub Web UI (easiest)

Option 2: GitHub CLI

```
gh run download 18918730354 --repo smart-mcp-proxy/mcpproxy-go

# Or download a specific artifact:
gh run download 18918730354 --name dmg-darwin-arm64
```
…re clean exit

Problem:
- When the Quit menu was clicked, the core was killed BEFORE SSE was stopped
- SSE detected the disconnection and tried to reconnect
- Background goroutines continued running after shutdown
- The tray icon remained visible and unresponsive
- No synchronization between cleanup and program exit

Root Cause:
- Incorrect shutdown order in handleShutdown()
- Context cancelled after process termination
- No wait mechanism for goroutines to finish
- Race between cleanup and tray exit

Solution:
1. Reorder handleShutdown() to stop SSE FIRST, then the health monitor, then the core
2. Add a 100ms delay after stopping SSE for goroutine cleanup
3. Reverse the main shutdown flow: cleanup BEFORE context cancellation
4. Add a shutdownComplete channel to synchronize shutdown (see the sketch after this message)
5. Wait for shutdown completion before exiting main()
6. Unify the shutdown flow for the signal handler and the Quit menu

Shutdown Sequence (Before):
1. Kill core process → SSE detects disconnection → reconnection attempts
2. Stop SSE (too late)
3. Cancel context
4. Quit tray (race condition)
5. Program exits while goroutines are still running

Shutdown Sequence (After):
1. Stop SSE explicitly → no reconnection attempts
2. Wait 100ms for the SSE goroutine to exit
3. Stop health monitor
4. Kill core process
5. Cancel context
6. Quit tray
7. Wait for shutdown complete
8. Program exits cleanly

Testing:
- Verified no SSE reconnection attempts after shutdown
- Verified the tray exits cleanly
- Verified all processes terminate properly
- Tested with the SIGTERM signal handler
- Checked logs for the correct shutdown sequence

Fixes: user-reported issue where clicking the Quit menu left the tray icon visible and unresponsive

🤖 Generated with Claude Code
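A minimal sketch of the synchronized shutdown ordering; `stopSSE`, `stopHealth`, `killCore`, and `quitTray` are hypothetical stand-ins for the real cleanup functions in the tray binary.

```go
package tray

import (
	"context"
	"time"
)

// shutdownComplete signals that handleShutdown has finished, so main()
// can wait for cleanup before exiting (mirrors steps 4-5 above).
var shutdownComplete = make(chan struct{})

func handleShutdown(cancel context.CancelFunc, stopSSE, stopHealth, killCore, quitTray func()) {
	stopSSE()                          // 1. stop SSE first: no reconnection attempts
	time.Sleep(100 * time.Millisecond) // 2. give the SSE goroutine time to exit
	stopHealth()                       // 3. stop the health monitor
	killCore()                         // 4. only now terminate the core process
	cancel()                           // 5. cancel the shared context
	quitTray()                         // 6. tear down the tray icon
	close(shutdownComplete)            // 7. unblock the waiter below
}

// waitForShutdown blocks until cleanup completes, so the program never
// exits while goroutines are still running.
func waitForShutdown() {
	<-shutdownComplete
}
```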
Fixes the SSE endpoint discovery timeout where the logging transport was consuming the stream before mcp-go could read it.

**Root Cause:**
The LoggingTransport's SSE reader goroutine was consuming the entire SSE stream, preventing mcp-go's readSSE() goroutine from seeing the endpoint frame. This caused "timeout waiting for endpoint" errors even though the endpoint frame arrived successfully within milliseconds.

**Solution:**
1. Implemented io.TeeReader to duplicate the SSE stream instead of consuming it (sketch below)
   - The main stream flows to mcp-go's readSSE() goroutine
   - The tee'd copy flows to the logging goroutine via io.Pipe
   - Both can read the same data independently
2. Removed the redundant 30-second timeout wrapper in cli/client.go
   - Now respects the user's --timeout flag directly
   - Previously wrapped the context with a hard-coded 30s timeout

**Testing:**
- JetBrains IDEA SSE server now connects successfully
- All 20 tools discovered with --timeout=60s
- Frame-by-frame tracing works with the --trace-transport flag
- The endpoint frame arrives and is processed in ~600µs

**Files Changed:**
- internal/transport/logging.go: TeeReader implementation
- internal/upstream/cli/client.go: Removed timeout wrapper
- internal/transport/http.go: Integrated logging transport
- cmd/mcpproxy/tools_cmd.go: Added --trace-transport flag

Closes issue with SSE endpoint discovery timeouts

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
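A minimal sketch of the `io.TeeReader` + `io.Pipe` approach using only the standard library; `teeSSEBody` and `teeBody` are illustrative names, not the actual LoggingTransport code.

```go
package transport

import (
	"bufio"
	"io"
	"log"
)

// teeSSEBody duplicates an SSE response body: the returned reader goes to
// mcp-go's readSSE() loop untouched, while a pipe carries a copy to a
// logging goroutine. Neither reader starves the other, which is what
// previously caused "timeout waiting for endpoint".
func teeSSEBody(body io.ReadCloser) io.ReadCloser {
	pr, pw := io.Pipe()

	// Log each frame as the main consumer pulls it through the tee.
	go func() {
		scanner := bufio.NewScanner(pr)
		for scanner.Scan() {
			log.Printf("sse frame: %s", scanner.Text())
		}
	}()

	return &teeBody{Reader: io.TeeReader(body, pw), body: body, pw: pw}
}

type teeBody struct {
	io.Reader
	body io.ReadCloser
	pw   *io.PipeWriter
}

// Close releases the logging goroutine (EOF on the pipe reader) and then
// closes the underlying response body.
func (t *teeBody) Close() error {
	t.pw.Close()
	return t.body.Close()
}
```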
Oh, and you get OAuth errors because it cycles through authentication methods one by one, even if the MCP server doesn't support that kind of auth. For example: the MCP server supports a simple bearer token, but on connection issues mcpproxy falls back to other auth methods until it tries the last one (OAuth) and then fails with an OAuth error, which is weird because there is no OAuth there. The problem, I believe, is that it misinterprets connection issues as auth failures, probably because the connection issue happens right during an auth attempt. That's my guess, so the root cause might not be related to OAuth that much.
…105)

Implements non-blocking quarantine inspection with a circuit breaker pattern to prevent UI freezes and cascading failures with unstable MCP servers.

## Problem (Issue #105)
Quarantine inspection used a blocking 20s wait loop that froze the MCP handler thread, causing dashboard hangs and requiring forced kills. Unstable servers (like JetBrains MCP) triggered repeated timeouts.

## Solution

### 1. Non-blocking Connection Wait (mcp.go:1177-1269)
- Replaced the blocking time.Sleep() loop with a goroutine + channels
- Proper context cancellation support
- Progress logging every 2 seconds
- 20-second timeout with clean error messages

### 2. Circuit Breaker Pattern (supervisor.go:945-1059, see the sketch after this message)
- Tracks consecutive inspection failures per server
- 3-failure threshold triggers a 5-minute cooldown
- Prevents repeated timeout attempts on unstable servers
- Success resets the failure counter
- Methods: CanInspect(), RecordInspectionFailure(), RecordInspectionSuccess()

### 3. Config Sync Fix (mcp.go:1568-1579)
- UpdateConfig() called after adding a server
- Fixes ConfigService synchronization in test environments
- Ensures the supervisor sees newly added quarantined servers

## Test Results

### Unit Tests (exemption_test.go)
- TestCircuitBreaker: 3-failure threshold ✅
- TestCircuitBreakerReset: Success resets counter ✅
- TestExemptionExpiry: Time-based expiration ✅

### Real-World Test (idea-ide-local)
- Calls 1-3: Each took ~20s (timeout), recorded failures ✅
- Call 4: Returned instantly with the circuit breaker message ✅
- Cooldown: 5-minute protection active ✅

### E2E Test Updates
- Fixed tool name: upstream_servers → quarantine_security
- Improved error handling and debugging
- Better content type assertions

## Impact
- ✅ No more thread blocking during inspection
- ✅ Dashboard remains responsive during the 20s timeout
- ✅ Circuit breaker prevents cascading failures
- ✅ Clear error messages reference issue #105

## Related
- Addresses feedback from PR #110 (SSE stability)
- Complements connection loss detection improvements
- Part of upstream stability improvements

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
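A compact sketch of the circuit breaker described above; the 3-failure threshold and 5-minute cooldown come from the commit, while `inspectionBreaker` and its internals are illustrative rather than the actual supervisor.go code.

```go
package supervisor

import (
	"sync"
	"time"
)

// inspectionBreaker opens after three consecutive inspection failures for
// a server and stays open for a five-minute cooldown; a success resets it.
type inspectionBreaker struct {
	mu        sync.Mutex
	failures  map[string]int       // consecutive failures per server
	openUntil map[string]time.Time // cooldown expiry per server
}

func newInspectionBreaker() *inspectionBreaker {
	return &inspectionBreaker{
		failures:  make(map[string]int),
		openUntil: make(map[string]time.Time),
	}
}

// CanInspect reports whether the server is outside its cooldown window.
// Unknown servers have a zero expiry, so they are always inspectable.
func (b *inspectionBreaker) CanInspect(server string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil[server])
}

// RecordInspectionFailure counts a failure and opens the breaker once the
// threshold is reached.
func (b *inspectionBreaker) RecordInspectionFailure(server string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[server]++
	if b.failures[server] >= 3 {
		b.openUntil[server] = time.Now().Add(5 * time.Minute)
		b.failures[server] = 0
	}
}

// RecordInspectionSuccess resets the failure counter and closes the breaker.
func (b *inspectionBreaker) RecordInspectionSuccess(server string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[server] = 0
	delete(b.openUntil, server)
}
```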
…s servers

**Critical Bug Fix**: The ConnectServer() method was a no-op that only logged and returned nil, never actually calling the managed client's Connect() method. This caused quarantined servers with inspection exemptions to never connect.

## Problem
When quarantine inspection granted temporary exemptions:
1. The supervisor created a "Connect" action for the exempted server
2. The supervisor called ActorPoolSimple.ConnectServer()
3. ConnectServer() logged "Connecting server via manager" and returned nil
4. NO actual connection attempt was made
5. Inspection timed out after 20s waiting for a connection that never happened
6. Server logs showed ZERO connection attempts during inspection

This was the root cause of the issue #105 inspection failures for SSE servers like idea-ide-local, which connect in ~175ms but never got the chance.

## Solution
Implemented ConnectServer() to actually call the managed client's Connect() (see the sketch below):
- Get the client from the manager
- Call client.Connect(ctx)
- Handle "already connecting/connected" errors gracefully (expected)
- Let the supervisor's reconciliation handle retries if needed

## Evidence the Fix Works
Before: server logs show NO connection attempts during inspection (16:43-16:44)
After: server logs show a successful connection at 18:05:46:
- Starting connection attempt: 18:05:46.762
- Successfully connected: 18:05:46.937 (175ms)
- Stayed connected: 30 seconds (vs. the previous 125ms flapping)

Unit tests: All 16 supervisor tests passing ✅

## Related
- Closes the root cause of SSE inspection failures in #105
- Works with the circuit breaker from commit 8221ef7
- Tested with idea-ide-local (JetBrains MCP Server)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
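A sketch of the corrected ConnectServer logic under the description above; the `connector` and `clientLookup` interfaces are stand-ins for the real manager and managed-client types, which the commit does not spell out.

```go
package pool

import (
	"context"
	"fmt"
	"strings"
)

// connector abstracts the managed client; only Connect is needed here.
type connector interface {
	Connect(ctx context.Context) error
}

// clientLookup abstracts the manager's registry of managed clients.
type clientLookup interface {
	GetClient(name string) (connector, bool)
}

// ConnectServer performs a real connection attempt instead of only logging
// and returning nil (the no-op that caused issue #105's inspection timeouts).
func ConnectServer(ctx context.Context, manager clientLookup, name string) error {
	client, ok := manager.GetClient(name)
	if !ok {
		return fmt.Errorf("no managed client for server %q", name)
	}
	if err := client.Connect(ctx); err != nil {
		// "already connecting/connected" errors are expected when the
		// supervisor races a reconnection; treat them as success and let
		// reconciliation retry real failures.
		if strings.Contains(err.Error(), "already") {
			return nil
		}
		return err
	}
	return nil
}
```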
Fixes #105 where SSE servers (idea-ide-local) showed as connected but with 0 tools.

**Root Cause 1: Context Cancellation**
The SSE stream context was canceled immediately after Connect() returned. The managed client's reconnection path creates a 5-minute timeout context with defer cancel(), which fires once Connect() completes. The SSE stream runs in a background goroutine and needs its context to stay alive for the entire connection duration.

**Root Cause 2: Concurrent Requests**
The SSE transport in mcp-go v0.42.0 has issues with concurrent requests. When health checks and reactive discovery both call ListTools simultaneously, responses get lost.

**Solution:** (see the sketch below)
1. Use context.Background() for SSE Start() calls (same as the stdio transport)
2. Add an sseRequestMu mutex to serialize SSE requests

**Changes:**
- internal/upstream/core/client.go: Add sseRequestMu mutex, serialize SSE requests in ListTools/CallTool
- internal/upstream/core/connection.go: Use context.Background() for SSE Start() in all auth strategies

**Testing:**
- idea-ide-local now shows tool_count: 20 (was 0)
- No "context canceled" errors in logs
- stdio and http transports continue working
- Health checks pass consistently

**Impact:**
- Completes SSE support for production use
- No performance impact on other transports
- Backward compatible

Closes #105

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
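A sketch of the request serialization, with `sseGuard` as an illustrative stand-in for the mutex added to the core client; the actual code lives in internal/upstream/core/client.go.

```go
package core

import (
	"context"
	"sync"
)

// sseGuard serializes requests over a single SSE stream: mcp-go v0.42.0
// can lose responses when health checks and tool discovery issue
// concurrent requests (e.g. ListTools) on the same connection.
type sseGuard struct {
	mu    sync.Mutex
	isSSE bool
}

// do runs call, holding the mutex only for SSE transports so stdio and
// http keep their normal concurrency.
func (g *sseGuard) do(ctx context.Context, call func(context.Context) error) error {
	if g.isSSE {
		g.mu.Lock()
		defer g.mu.Unlock()
	}
	return call(ctx)
}

// Note on Root Cause 1: the SSE transport itself should be started with
// context.Background() rather than the caller's context, because the
// reconnection path's deferred cancel() would otherwise kill the
// background stream as soon as Connect() returns.
```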
**E2E Test Fix (TestE2E_InspectQuarantined):**
- Fixed the type assertion for mcp.TextContent in the quarantine inspection test
- The test only checked for *mcp.TextContent (pointer), but the content can be a value type
- Added a value-type check consistent with the debug logging (lines 1073-1078)

**Windows CI Fix:**
- Added PowerShell retry logic for npm ci on Windows (3 attempts with a 5s wait)
- Windows has intermittent EPERM errors during npm cleanup
- Clean node_modules between retries to avoid stale state
- Uses --prefer-offline and --no-audit flags for faster installs
- Only applies to the Windows matrix (Unix uses standard npm ci)

**Testing:**
- Verified TestE2E_InspectQuarantined passes locally
- Windows npm install will now retry on permission errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed from f7e0bc5 to f43b290

Summary
Implements upstream stability improvements from STABILITY_UPSTREAM.md to fix JetBrains MCP and other SSE-based server connection issues, particularly for remote/systemd deployments.
Changes
Phase 2 - SSE Transport Hardening
1. Fixed SSE Timeout Issue (internal/transport/http.go)
   - http.Client.Timeout = 180s killed healthy SSE streams after 3 minutes
   - Removed the Timeout field entirely for SSE connections
   - IdleConnTimeout: 300s (5 minutes) - detects dead connections
   - ResponseHeaderTimeout: 30s - only for the initial connection
2. Added SSE Connection Loss Detection (internal/upstream/core/connection.go)
   - OnConnectionLost handlers for all SSE transport types (headers-auth, no-auth, OAuth)
Phase 3 - OAuth Token Lifecycle
3. Extended OAuth Callback Timeout (internal/upstream/core/connection.go)
   - Extended from 30s to 120s (2 minutes) for remote/systemd scenarios
Code Cleanup
- Removed the unused isSocketAvailable function from cmd/mcpproxy-tray/main.go
Testing
✅ Linter: 0 issues
✅ E2E Tests: 25/25 passed
✅ Unit Tests: All passing
Impact
For JetBrains MCP Users: SSE connections now survive longer than 3 minutes, and connection drops are detected and logged instead of hanging silently.
For Remote/Systemd Users: the 120s OAuth callback window leaves time to open the authorization URL manually over SSH or under systemd.
Related
Test Plan