
Conversation

@Dumbris Dumbris commented Oct 28, 2025

Summary

Implements the upstream stability improvements from STABILITY_UPSTREAM.md, fixing connection issues with JetBrains MCP and other SSE-based servers, particularly for remote/systemd deployments.

Changes

Phase 2 - SSE Transport Hardening

1. Fixed SSE Timeout Issue (internal/transport/http.go)

  • Problem: Hard-coded http.Client.Timeout = 180s killed healthy SSE streams after 3 minutes
  • Solution: Removed Timeout field entirely for SSE connections
  • 🔧 New approach:
    • IdleConnTimeout: 300s (5 minutes) - detects dead connections
    • ResponseHeaderTimeout: 30s - only for initial connection
    • Allows SSE streams to run indefinitely
  • 📍 Location: internal/transport/http.go:225-279
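
A minimal sketch of that configuration using standard `net/http` fields (the constructor name is illustrative, not the actual code in http.go):

```go
package transport

import (
	"net/http"
	"time"
)

// newSSEHTTPClient builds an http.Client for long-lived SSE streams.
// Client.Timeout stays at its zero value: it bounds the entire request,
// response body included, so any non-zero value kills a healthy stream.
func newSSEHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout:       300 * time.Second, // reap dead keep-alive connections
			ResponseHeaderTimeout: 30 * time.Second,  // bounds only the initial handshake
		},
	}
}
```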

2. Added SSE Connection Loss Detection (internal/upstream/core/connection.go)

  • ✅ Registered OnConnectionLost handlers for all SSE transport types (headers-auth, no-auth, OAuth)
  • ✅ Detects GOAWAY frames and HTTP/2 idle resets from JetBrains servers
  • ✅ Logs connection drops with server name, error, and transport type
  • ✅ Enables proactive reconnection through managed client's health monitoring
  • 📍 Locations: internal/upstream/core/connection.go:1133-1140, 1170-1177, 1255-1262
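
A sketch of the handler wiring, assuming mcp-go v0.42.0's `OnConnectionLost` hook takes a `func(error)` (the helper name, `serverName`, and `transportType` are illustrative):

```go
package core

import (
	"log"

	"github.com/mark3labs/mcp-go/client"
)

// registerConnectionLossHandler wires the callback; it assumes mcp-go's
// Client exposes OnConnectionLost(func(error)) as noted in the Related
// section below. serverName and transportType are illustrative.
func registerConnectionLossHandler(c *client.Client, serverName, transportType string) {
	c.OnConnectionLost(func(err error) {
		// Fires on GOAWAY frames, HTTP/2 idle resets, and other stream drops;
		// the managed client's health monitoring handles the reconnect.
		log.Printf("upstream %q (%s transport): connection lost: %v", serverName, transportType, err)
	})
}
```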

Phase 3 - OAuth Token Lifecycle

3. Extended OAuth Callback Timeout (internal/upstream/core/connection.go)

  • Problem: 30-second timeout too aggressive for remote/systemd scenarios
  • Solution: Increased to 120 seconds (2 minutes)
  • 🔍 Enhanced diagnostics:
    • Detects SSH sessions and logs extended timeout info
    • Tracks OAuth callback wait duration
    • Logs recommendations when browser auto-launch is blocked
  • 📍 Location: internal/upstream/core/connection.go:1756
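
A hedged sketch of the detection and wait logic; the function and channel names are hypothetical, while `SSH_CONNECTION`/`SSH_TTY` are the standard variables sshd sets for remote sessions:

```go
package core

import (
	"context"
	"log"
	"os"
	"time"
)

// oauthCallbackTimeout is the extended window described above.
const oauthCallbackTimeout = 120 * time.Second

// isSSHSession detects remote sessions via the standard variables sshd sets.
func isSSHSession() bool {
	return os.Getenv("SSH_CONNECTION") != "" || os.Getenv("SSH_TTY") != ""
}

// waitForOAuthCallback shows the shape of the wait; callbackDone is assumed
// to be closed by the HTTP callback handler once authorization completes.
func waitForOAuthCallback(ctx context.Context, callbackDone <-chan struct{}) error {
	if isSSHSession() {
		log.Printf("SSH session detected; waiting up to %s for manual OAuth authorization", oauthCallbackTimeout)
	}
	start := time.Now()
	ctx, cancel := context.WithTimeout(ctx, oauthCallbackTimeout)
	defer cancel()
	select {
	case <-callbackDone:
		log.Printf("OAuth callback received after %s", time.Since(start))
		return nil
	case <-ctx.Done():
		return ctx.Err() // caller logs a recommendation to open the URL manually
	}
}
```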

Code Cleanup

  • 🧹 Removed unused isSocketAvailable function from cmd/mcpproxy-tray/main.go
  • ✅ Fixes linter warning

Testing

Linter: 0 issues
E2E Tests: 25/25 passed
Unit Tests: All passing

Impact

For JetBrains MCP Users:

  • SSE connections will survive longer than 3 minutes ⏱️
  • Connection drops are now logged and trigger automatic reconnection 🔄
  • OAuth flows accommodate slower manual authorization (e.g., SSH scenarios) 🔐

For Remote/Systemd Users:

  • Extended OAuth timeout (120s) allows time for manual browser opening 🌐
  • SSH session detection with helpful logging 📡
  • Better guidance when browser auto-launch is blocked 💡

Related

  • Addresses STABILITY_UPSTREAM.md Phase 2 & 3 requirements
  • Fixes issues identified for JetBrains MCP over SSE transport
  • Uses mcp-go v0.42.0 with OnConnectionLost support

Test Plan

  • SSE connections tested to survive >3 minutes
  • OnConnectionLost logging verified for disconnection scenarios
  • OAuth timeout extended and tested with SSH session detection
  • Linter passes with 0 issues
  • All E2E tests pass (25/25)
  • Manual testing with JetBrains MCP recommended

…narios

Addresses upstream MCP server stability issues identified in STABILITY_UPSTREAM.md,
particularly for JetBrains MCP and other SSE-based servers.

**Phase 2 - SSE Transport Hardening:**
- Remove hard-coded 180s `http.Client.Timeout` for SSE connections
  * The Timeout field applies to entire request duration, killing long-lived SSE streams
  * Now uses `IdleConnTimeout` (300s) and `ResponseHeaderTimeout` (30s) instead
  * Allows SSE streams to run indefinitely without forced disconnection
  * Fixes: internal/transport/http.go:225-279

**Phase 3 - OAuth Token Lifecycle:**
- Extend OAuth callback timeout from 30s to 120s (2 minutes)
  * Accommodates remote/systemd scenarios where manual browser opening is needed
  * Adds diagnostic logging for SSH session detection
  * Logs OAuth callback wait duration for debugging
  * Fixes: internal/upstream/core/connection.go:1756

**Enhanced Diagnostics:**
- Added detailed logging for SSE transport configuration
- Track OAuth authorization timing (wait start, completion duration)
- Detect and log remote SSH sessions with extended timeout info
- Log recommendations for manual URL opening when browser auto-launch blocked

**Testing Notes:**
- SSE connections should now survive longer than 3 minutes
- Remote users (SSH, systemd) have more time to complete OAuth flows
- Enhanced logs help diagnose connection issues in production

Fixes #[issue-number-if-applicable]
Addresses stability concerns in STABILITY_UPSTREAM.md Phase 2 & 3

**Phase 2 - Connection Loss Detection:**
- Register OnConnectionLost handlers for all SSE transport types (headers-auth, no-auth, OAuth)
- Detects GOAWAY frames and HTTP/2 idle resets from JetBrains MCP servers
- Logs connection drops with server name, error, and transport type
- Enables proactive reconnection through managed client's health monitoring
- Fixes: internal/upstream/core/connection.go:1133-1140, 1170-1177, 1255-1262

**Code Cleanup:**
- Remove unused isSocketAvailable function from tray main.go
- Fixes linter warning: unused function

**Benefits:**
- SSE connections now detect when server drops connection
- Network issues surface immediately in logs instead of hanging silently
- Reconnection logic can respond to connection loss events
- Improves diagnostic visibility for JetBrains MCP and other SSE servers

Related to STABILITY_UPSTREAM.md Phase 2 requirements
github-actions bot commented Oct 28, 2025

📦 Build Artifacts for PR #110

Version: v0.9.20-dev-fd9aa8f-pr110
Latest tag: v0.9.20

Available Artifacts

  • archive-linux-amd64 (9.4 MB)
  • archive-darwin-arm64 (16 MB)
  • installer-dmg-darwin-arm64 (18 MB)
  • archive-linux-arm64 (8.6 MB)
  • archive-darwin-amd64 (17 MB)
  • installer-dmg-darwin-amd64 (19 MB)
  • archive-windows-amd64 (17 MB)
  • archive-windows-arm64 (16 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 18918730354 --repo smart-mcp-proxy/mcpproxy-go

# Or download a specific artifact:
gh run download 18918730354 --name installer-dmg-darwin-arm64

Note: Artifacts expire in 14 days.

Dumbris and others added 2 commits October 28, 2025 09:55
…re clean exit

Problem:
- When Quit menu clicked, core was killed BEFORE SSE was stopped
- SSE detected disconnection and tried to reconnect
- Background goroutines continued running after shutdown
- Tray icon remained visible and unresponsive
- No synchronization between cleanup and program exit

Root Cause:
- Incorrect shutdown order in handleShutdown()
- Context cancelled after process termination
- No wait mechanism for goroutines to finish
- Race between cleanup and tray exit

Solution:
1. Reorder handleShutdown() to stop SSE FIRST, then health monitor, then core
2. Add 100ms delay after stopping SSE for goroutine cleanup
3. Reverse main shutdown flow: cleanup BEFORE context cancellation
4. Add shutdownComplete channel to synchronize shutdown
5. Wait for shutdown completion before exiting main()
6. Unified shutdown flow for both signal handler and Quit menu

Shutdown Sequence (Before):
1. Kill core process → SSE detects disconnection → reconnection attempts
2. Stop SSE (too late)
3. Cancel context
4. Quit tray (race condition)
5. Program exits while goroutines still running

Shutdown Sequence (After):
1. Stop SSE explicitly → no reconnection attempts
2. Wait 100ms for SSE goroutine to exit
3. Stop health monitor
4. Kill core process
5. Cancel context
6. Quit tray
7. Wait for shutdown complete
8. Program exits cleanly
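
A condensed sketch of that ordering (the helper functions are stand-ins for the tray's actual cleanup code):

```go
package main

import (
	"context"
	"time"
)

// Placeholder helpers standing in for the tray's real cleanup functions.
func stopSSE()           {}
func stopHealthMonitor() {}
func killCore()          {}
func quitTray()          {}

var shutdownComplete = make(chan struct{})

// handleShutdown runs the corrected ordering: cleanup first, context
// cancellation after, and a channel close so main() can wait.
func handleShutdown(cancel context.CancelFunc) {
	stopSSE()                          // 1: no reconnection attempts once the core dies
	time.Sleep(100 * time.Millisecond) // 2: let the SSE goroutine exit
	stopHealthMonitor()                // 3
	killCore()                         // 4
	cancel()                           // 5: cancel context only after cleanup
	quitTray()                         // 6
	close(shutdownComplete)            // 7: unblock main()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	_ = ctx                   // the real tray passes ctx to its goroutines
	go handleShutdown(cancel) // normally triggered by SIGTERM or the Quit menu
	<-shutdownComplete        // 8: exit only after shutdown completes
}
```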

Testing:
- Verified no SSE reconnection attempts after shutdown
- Verified tray exits cleanly
- Verified all processes terminate properly
- Tested with SIGTERM signal handler
- Checked logs for correct shutdown sequence

Fixes: User-reported issue where Quit menu click left tray icon visible and unresponsive

🤖 Generated with Claude Code

Fixes SSE endpoint discovery timeout issue where the logging transport
was consuming the stream before mcp-go could read it.

**Root Cause:**
The LoggingTransport's SSE reader goroutine was consuming the entire
SSE stream, preventing mcp-go's readSSE() goroutine from seeing the
endpoint frame, causing "timeout waiting for endpoint" errors even
though the endpoint frame arrived successfully within milliseconds.

**Solution:**
1. Implemented io.TeeReader to duplicate the SSE stream instead of consuming it (see the sketch after this list)
   - Main stream flows to mcp-go's readSSE() goroutine
   - Tee'd copy flows to logging goroutine via io.Pipe
   - Both can read the same data independently

2. Removed redundant 30-second timeout wrapper in cli/client.go
   - Now respects user's --timeout flag directly
   - Previously wrapped context with hard-coded 30s timeout
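
The tee mechanism is plain `io.TeeReader` plus `io.Pipe` from the standard library; a minimal sketch, with names illustrative rather than the actual implementation:

```go
package transport

import (
	"bufio"
	"io"
	"log"
)

// teeForLogging duplicates an SSE stream: the returned reader feeds mcp-go's
// readSSE() unchanged, while a goroutine logs an identical copy, so the
// endpoint frame is never consumed before mcp-go sees it.
func teeForLogging(body io.ReadCloser) io.ReadCloser {
	pr, pw := io.Pipe()
	tee := io.TeeReader(body, pw) // every byte read is mirrored into the pipe
	go func() {
		sc := bufio.NewScanner(pr)
		for sc.Scan() {
			log.Printf("SSE frame: %s", sc.Text()) // frame-by-frame tracing
		}
	}()
	return &teeCloser{Reader: tee, body: body, pw: pw}
}

// teeCloser closes both the underlying body and the pipe writer so the
// logging goroutine terminates when the connection is torn down.
type teeCloser struct {
	io.Reader
	body io.ReadCloser
	pw   *io.PipeWriter
}

func (t *teeCloser) Close() error {
	t.pw.Close()
	return t.body.Close()
}
```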

**Testing:**
- JetBrains IDEA SSE server now connects successfully
- All 20 tools discovered with --timeout=60s
- Frame-by-frame tracing works with --trace-transport flag
- Endpoint frame arrives and is processed in ~600µs

**Files Changed:**
- internal/transport/logging.go: TeeReader implementation
- internal/upstream/cli/client.go: Removed timeout wrapper
- internal/transport/http.go: Integrated logging transport
- cmd/mcpproxy/tools_cmd.go: Added --trace-transport flag

Closes issue with SSE endpoint discovery timeouts
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Melodeiro commented Oct 28, 2025

Hmm, it works better, but I was able to reproduce the issue with STDIO in a specific way:

  1. Start IntelliJ IDEA
  2. Connect to it from mcpproxy using stdio (you can copy the config from IDEA settings)
  3. Wait until it connects
  4. Quit IDEA
  5. Wait until you get a connection error in mcpproxy (without this step it was fine, even with a short disconnection)
  6. Start IDEA again
  7. Wait 3-30 seconds while trying to refresh the web dashboard
  8. The /servers request can't be performed again, but this time it is even worse; it hangs for 3-4 minutes for me

About SSE:
It still shows 0 tools in the UI/web dashboard, but when I query the tool list from an MCP client, I get all the tools. The dashboard still shows 0. When I tried to call a tool, it just timed out:
(screenshot: tool call timing out)
None of the tools responded; I tried about 3.

Melodeiro commented Oct 29, 2025

Oh, and you get OAuth errors because mcpproxy tries to cycle through authentication methods one by one, even if the MCP server doesn't support that kind of auth.

For example: the MCP server supports a simple bearer token; on connection issues, mcpproxy falls back to other kinds of auth until it tries the last one (OAuth) and then fails with an OAuth error. This is strange because, as you understand, there is no OAuth involved at all.

The problem, I believe, is that it misinterprets connection issues as auth failures, probably because the connection issue happens right in the middle of an auth attempt. That's my guess.

So the root cause might not be related to OAuth that much.
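
If that guess is right, one fix would be to classify transport failures before advancing the auth fallback chain. A hedged sketch of such a check (not code from this PR; the classification is an assumption):

```go
package core

import (
	"errors"
	"io"
	"net"
)

// isTransportError reports whether err looks like a network failure rather
// than an authentication rejection. Under the hypothesis above, a transport
// error should retry the same auth method instead of falling back to OAuth.
func isTransportError(err error) bool {
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true
	}
	return errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF)
}
```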

Dumbris and others added 4 commits October 29, 2025 17:47
…105)

Implements non-blocking quarantine inspection with circuit breaker pattern
to prevent UI freezes and cascading failures with unstable MCP servers.

## Problem (Issue #105)
Quarantine inspection used blocking 20s wait loop that froze the MCP
handler thread, causing dashboard hangs and requiring forced kills.
Unstable servers (like JetBrains MCP) triggered repeated timeouts.

## Solution

### 1. Non-blocking Connection Wait (mcp.go:1177-1269)
- Replaced blocking time.Sleep() loop with goroutine + channels
- Proper context cancellation support
- Progress logging every 2 seconds
- 20-second timeout with clean error messages
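
A sketch of the pattern (the `isConnected` helper and polling interval are assumptions):

```go
package server

import (
	"context"
	"fmt"
	"time"
)

// waitForConnection replaces the blocking sleep loop: a goroutine polls the
// connection state and signals through a channel, while the select honors
// context cancellation and the 20-second inspection timeout.
func waitForConnection(ctx context.Context, isConnected func() bool) error {
	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(500 * time.Millisecond)
		defer ticker.Stop()
		for !isConnected() {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(20 * time.Second):
		return fmt.Errorf("server did not connect within 20s (see issue #105)")
	}
}
```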

### 2. Circuit Breaker Pattern (supervisor.go:945-1059)
- Tracks consecutive inspection failures per server
- 3-failure threshold triggers 5-minute cooldown
- Prevents repeated timeout attempts on unstable servers
- Success resets failure counter
- Methods: CanInspect(), RecordInspectionFailure(), RecordInspectionSuccess()
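
A minimal sketch of that breaker, using the method names listed above (the struct layout is assumed):

```go
package server

import (
	"sync"
	"time"
)

// inspectionBreaker: three consecutive failures open the breaker for a
// five-minute cooldown, tracked per server name.
type inspectionBreaker struct {
	mu        sync.Mutex
	failures  map[string]int
	openUntil map[string]time.Time
}

func newInspectionBreaker() *inspectionBreaker {
	return &inspectionBreaker{
		failures:  make(map[string]int),
		openUntil: make(map[string]time.Time),
	}
}

// CanInspect is false while the server is inside its cooldown window.
func (b *inspectionBreaker) CanInspect(server string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil[server])
}

func (b *inspectionBreaker) RecordInspectionFailure(server string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[server]++
	if b.failures[server] >= 3 { // threshold from the description
		b.openUntil[server] = time.Now().Add(5 * time.Minute)
		b.failures[server] = 0
	}
}

// RecordInspectionSuccess resets the counter and clears any cooldown.
func (b *inspectionBreaker) RecordInspectionSuccess(server string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, server)
	delete(b.openUntil, server)
}
```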

### 3. Config Sync Fix (mcp.go:1568-1579)
- UpdateConfig() called after adding server
- Fixes ConfigService synchronization in test environments
- Ensures supervisor sees newly added quarantined servers

## Test Results

### Unit Tests (exemption_test.go)
- TestCircuitBreaker: 3-failure threshold ✅
- TestCircuitBreakerReset: Success resets counter ✅
- TestExemptionExpiry: Time-based expiration ✅

### Real-World Test (idea-ide-local)
- Calls 1-3: Each took ~20s (timeout), recorded failures ✅
- Call 4: Returned instantly with circuit breaker message ✅
- Cooldown: 5-minute protection active ✅

### E2E Test Updates
- Fixed tool name: upstream_servers → quarantine_security
- Improved error handling and debugging
- Better content type assertions

## Impact
- ✅ No more thread blocking during inspection
- ✅ Dashboard remains responsive during 20s timeout
- ✅ Circuit breaker prevents cascading failures
- ✅ Clear error messages reference issue #105

## Related
- Addresses feedback from PR #110 (SSE stability)
- Complements connection loss detection improvements
- Part of upstream stability improvements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…s servers

**Critical Bug Fix**: The ConnectServer() method was a no-op that only logged
and returned nil, never actually calling the managed client's Connect() method.
This caused quarantined servers with inspection exemptions to never connect.

## Problem
When quarantine inspection granted temporary exemptions:
1. Supervisor created "Connect" action for exempted server
2. Supervisor called ActorPoolSimple.ConnectServer()
3. ConnectServer() logged "Connecting server via manager" and returned nil
4. NO actual connection attempt was made
5. Inspection timed out after 20s waiting for connection that never happened
6. Server logs showed ZERO connection attempts during inspection

This was the root cause of issue #105 inspection failures for SSE servers
like idea-ide-local, which connect in ~175ms but never got the chance.

## Solution
Implemented ConnectServer() to actually call the managed client's Connect():
- Get client from manager
- Call client.Connect(ctx)
- Handle "already connecting/connected" errors gracefully (expected)
- Let supervisor's reconciliation handle retries if needed
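
A sketch of the fixed method; `ActorPoolSimple` is the type named above, while the manager interface and the error-string check are assumptions about the surrounding code:

```go
package runtime

import (
	"context"
	"fmt"
	"strings"
)

// Stand-ins for the real manager and client types.
type managedClient interface{ Connect(ctx context.Context) error }
type clientManager interface {
	GetClient(name string) (managedClient, bool)
}

type ActorPoolSimple struct{ manager clientManager }

// ConnectServer now dials for real instead of logging and returning nil.
func (p *ActorPoolSimple) ConnectServer(ctx context.Context, name string) error {
	client, ok := p.manager.GetClient(name)
	if !ok {
		return fmt.Errorf("no managed client for server %q", name)
	}
	if err := client.Connect(ctx); err != nil {
		// "already connecting/connected" is expected under races; treat it as
		// success and let the supervisor's reconciliation handle retries.
		if strings.Contains(err.Error(), "already") {
			return nil
		}
		return err
	}
	return nil
}
```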

## Evidence Fix Works
Before: Server logs show NO connection attempts during inspection (16:43-16:44)
After: Server logs show successful connection at 18:05:46:
  - Starting connection attempt: 18:05:46.762
  - Successfully connected: 18:05:46.937 (175ms)
  - Stayed connected: 30 seconds (vs previous 125ms flapping)

Unit tests: All 16 supervisor tests passing ✅

## Related
- Closes root cause of SSE inspection failures in #105
- Works with circuit breaker from commit 8221ef7
- Tested with idea-ide-local (JetBrains MCP Server)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Fixes #105 where SSE servers (idea-ide-local) show connected but 0 tools.

**Root Cause 1: Context Cancellation**
SSE stream context was canceled immediately after Connect() returns.
Managed client reconnection creates 5-min timeout context with defer cancel(),
which fires after Connect() completes. SSE stream runs in background goroutine
and needs context to stay alive for entire connection duration.

**Root Cause 2: Concurrent Requests**
SSE transport in mcp-go v0.42.0 has issues with concurrent requests.
When health checks and reactive discovery both call ListTools simultaneously,
responses get lost.

**Solution:**
1. Use context.Background() for SSE Start() calls (same as stdio transport)
2. Add sseRequestMu mutex to serialize SSE requests
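
A sketch of the serialization (the Client struct fields and the underlying request call are stand-ins for internal/upstream/core/client.go):

```go
package core

import (
	"context"
	"sync"
)

// Tool is a stand-in for the real tool metadata type.
type Tool struct{ Name string }

// Client sketches the relevant fields from internal/upstream/core/client.go.
type Client struct {
	sseRequestMu sync.Mutex // serializes SSE requests: concurrent ones lose responses in mcp-go v0.42.0
	isSSE        bool
}

// ListTools takes the mutex only on the SSE transport, so stdio and HTTP
// keep their concurrency; CallTool guards itself the same way.
func (c *Client) ListTools(ctx context.Context) ([]Tool, error) {
	if c.isSSE {
		c.sseRequestMu.Lock()
		defer c.sseRequestMu.Unlock()
	}
	return c.listToolsLocked(ctx)
}

// listToolsLocked is a placeholder for the real mcp-go request.
func (c *Client) listToolsLocked(ctx context.Context) ([]Tool, error) {
	return nil, nil
}
```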

**Changes:**
- internal/upstream/core/client.go: Add sseRequestMu mutex, serialize SSE requests in ListTools/CallTool
- internal/upstream/core/connection.go: Use context.Background() for SSE Start() in all auth strategies

**Testing:**
- idea-ide-local now shows tool_count: 20 (was 0)
- No "context canceled" errors in logs
- stdio and http transports continue working
- Health checks pass consistently

**Impact:**
- Fixes complete SSE support for production use
- No performance impact on other transports
- Backward compatible

Closes #105

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

**E2E Test Fix (TestE2E_InspectQuarantined):**
- Fixed type assertion for mcp.TextContent in quarantine inspection test
- Test was only checking for *mcp.TextContent (pointer) but content can be value type
- Added value type check consistent with debug logging (lines 1073-1078)

**Windows CI Fix:**
- Added PowerShell retry logic for npm ci on Windows (3 attempts with 5s wait)
- Windows has intermittent EPERM errors during npm cleanup
- Cleanup node_modules between retries to avoid stale state
- Uses --prefer-offline and --no-audit flags for faster installs
- Only applies to Windows matrix (Unix uses standard npm ci)

**Testing:**
- Verified TestE2E_InspectQuarantined passes locally
- Windows npm install will now retry on permission errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Dumbris Dumbris force-pushed the fix/upstream-stability branch from f7e0bc5 to f43b290 on October 29, 2025 at 18:46
@Dumbris Dumbris merged commit f7caea5 into main Oct 29, 2025
35 checks passed
@Dumbris Dumbris deleted the fix/upstream-stability branch October 29, 2025 18:58