# Upstream Stability Investigation Plan

## Observed Symptoms
- JetBrains MCP over SSE never surfaces tools, while stdio loads but drops connections; when this happens the dashboard freezes or the daemon hangs until forced termination (reported on Xubuntu 25.04 running as a user-scoped systemd service).
- Intermittent upstream instability is suspected to stem from SSE transport behaviour and OAuth refresh handling.

## Working Hypotheses (code references)
1. **SSE timeout ejects long-lived streams** – `internal/transport/http.go:225-279` hard-codes `http.Client{Timeout: 180 * time.Second}` for SSE. Go’s client timeout covers the entire request, so an otherwise healthy SSE stream is forcibly closed every three minutes, likely leaving the proxy in a bad state when the upstream cannot recover quickly (see the sketch after this list).
2. **Endpoint bootstrap deadline too aggressive** – the SSE transport waits only 30s for the `endpoint` event (`github.com/mark3labs/mcp-go@v0.38.0/client/transport/sse.go:176-187`). If JetBrains (or other) servers delay emitting the endpoint while doing OAuth/device checks, we fail before tools load.
3. **OAuth browser flow races with remote UX** – manual OAuth waits 30s for the callback (`internal/upstream/core/connection.go:1722-1759`). In a remote/systemd scenario the user may need more time (or use an out-of-band browser), causing repeated failures and triggering connection churn.
4. **Connection-loss handling gaps** – we never register `Client.OnConnectionLost(...)` on SSE transports, so HTTP/2 idle resets or GOAWAY frames (which JetBrains emits) go unnoticed until the next RPC, amplifying freeze perceptions. This also limits our ability to surface diagnostics in logs/UI.
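
A minimal sketch of the direction hypothesis 1 points at: bound connection setup via `http.Transport` deadlines instead of `http.Client.Timeout`, which covers the whole response and therefore kills healthy SSE streams. The constructor name and timeout values below are illustrative, not the current code in `internal/transport/http.go`.

```go
package transport

import (
	"net"
	"net/http"
	"time"
)

// newSSEHTTPClient is a sketch: bound the dial, TLS, and response-header
// phases, but leave the open SSE stream itself without an overall deadline.
func newSSEHTTPClient() *http.Client {
	return &http.Client{
		// No Client.Timeout: an SSE response is expected to stay open indefinitely.
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   10 * time.Second, // TCP connect deadline
				KeepAlive: 30 * time.Second, // detect dead peers on idle links
			}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 30 * time.Second, // server must start the stream promptly
			IdleConnTimeout:       90 * time.Second,
		},
	}
}
```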

## Phase 1 – Reproduce & Capture Baseline
- Configure two JetBrains upstreams (SSE and stdio) with `log.level=debug` and, if possible, `transport` trace logging.
- While exercising `scripts/run-web-smoke.sh` and manually navigating the UI, collect:
  - Upstream-specific logs under `~/.mcpproxy/logs/<server>.log`.
  - HTTP traces for `/events` (SSE) and `/api/v1` from the proxy (e.g. `MITM_PROXY=1 go run ./cmd/mcpproxy` or curl with `--trace-time`).
  - OAuth callback timing from `internal/oauth` logs to confirm how often the 30s deadline triggers.
- Inspect BoltDB (`bbolt` CLI or `scripts/db-dump.go`) for stored OAuth tokens to see whether refresh metadata is present and being updated (see the read-only dump sketch below).
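
Where `scripts/db-dump.go` is not handy, a read-only walk like the following shows whether token buckets exist and are being rewritten. The database path is an assumption; check the real store layout before trusting the output.

```go
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Path is an assumption; point this at the actual mcpproxy BoltDB file.
	db, err := bolt.Open(os.ExpandEnv("$HOME/.mcpproxy/data.db"), 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatalf("open bolt db: %v", err)
	}
	defer db.Close()

	// Print every bucket and key with the value size: enough to see whether
	// OAuth refresh metadata is being written at all, without decoding it.
	err = db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			fmt.Printf("bucket %q\n", name)
			return b.ForEach(func(k, v []byte) error {
				fmt.Printf("  %s (%d bytes)\n", k, len(v))
				return nil
			})
		})
	})
	if err != nil {
		log.Fatalf("read bolt db: %v", err)
	}
}
```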

**Verification checklist**
- [ ] Baseline reproduction yields “timeout waiting for endpoint” or “context deadline exceeded” in the logs when SSE fails.
- [ ] Confirm whether OAuth callback timeout entries align with user interaction delays.
- [ ] Identify whether the SSE stream closes at almost exactly 180s of uptime.

## Phase 2 – SSE Transport Hardening
- Audit the full SSE pipeline:
  - Replace the global `http.Client.Timeout` with per-request contexts or keepalive idle deadlines; ensure this does not regress HTTP fallback.
  - Capture GOAWAY/NO_ERROR disconnects by wiring `client.OnConnectionLost` inside `core.connectSSE` and propagate them to the managed client (see the sketch after this list).
  - Revisit the 30s endpoint wait; consider a JetBrains-specific delay or signal logging (e.g. log the time between `Start` and the first `endpoint` frame).
- Develop instrumentation hooks:
  - Record SSE connection uptime, retry counters, and last-error state in `StateManager`.
  - Emit structured events (e.g., `EventTypeUpstreamTransport`) with transport diagnostics for `/events`.
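
A sketch of the `OnConnectionLost` wiring, assuming a zap-style logger and a callback into whoever owns reconnect/state handling; the function shape is simplified relative to the real method in `internal/upstream/core/connection.go`.

```go
package core

import (
	"context"
	"fmt"

	"github.com/mark3labs/mcp-go/client"
	"go.uber.org/zap"
)

// connectSSE (simplified): create the SSE client, register a connection-lost
// handler before starting, and hand disconnect errors to the caller.
func connectSSE(ctx context.Context, name, url string, logger *zap.Logger, onLost func(error)) (*client.Client, error) {
	c, err := client.NewSSEMCPClient(url)
	if err != nil {
		return nil, fmt.Errorf("create sse client: %w", err)
	}

	// Surface GOAWAY frames and idle resets immediately instead of at the next RPC.
	c.OnConnectionLost(func(err error) {
		logger.Warn("sse connection lost", zap.String("upstream", name), zap.Error(err))
		onLost(err) // propagate to the managed client / StateManager
	})

	if err := c.Start(ctx); err != nil {
		return nil, fmt.Errorf("start sse client: %w", err)
	}
	return c, nil
}
```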

**Verification checklist**
- [ ] Stress an SSE upstream for >10 minutes and confirm no forced disconnect occurs due to the client timeout.
- [ ] Simulate endpoint delay (e.g., a proxy that waits 90s before emitting) and confirm the new logic handles it or logs actionable warnings.
- [ ] Ensure managed state transitions (`ready` → `error` → `reconnecting`) align with injected connection-lost scenarios.

## Phase 3 – OAuth Token Lifecycle Review
- Trace the refresh flow end-to-end:
  - Instrument `PersistentTokenStore.SaveToken/GetToken` to log token expiry deltas, guarded by debug level (see the sketch after this list).
  - Validate `MarkOAuthCompletedWithDB` propagation by queuing fake events in BoltDB and ensuring `Manager.processOAuthEvents` consumes them without double-processing.
  - Explore extending the OAuth callback wait window and providing CLI guidance for headless setups (e.g., print the verification URL instead of failing immediately).
- Consider tooling to introspect OAuth state (`/api/v1/oauth/status` or a tray dialog) so users can identify expired/invalid tokens.
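
One way to add the expiry-delta logging without touching the store itself is a decorator. The `Token` and `TokenStore` shapes below are stand-ins for whatever `PersistentTokenStore` actually implements; only the logging shape (debug-level, expiry delta, refresh-token presence) is the point.

```go
package oauth

import (
	"time"

	"go.uber.org/zap"
)

// Token and TokenStore mirror, hypothetically, the interface the persistent
// store satisfies; adjust to the real types before use.
type Token struct {
	AccessToken  string
	RefreshToken string
	Expiry       time.Time
}

type TokenStore interface {
	SaveToken(*Token) error
	GetToken() (*Token, error)
}

type loggingTokenStore struct {
	inner  TokenStore
	logger *zap.Logger
}

// WithExpiryLogging wraps a store so every save/load reports how long the
// token has left, at debug level only.
func WithExpiryLogging(inner TokenStore, logger *zap.Logger) TokenStore {
	return &loggingTokenStore{inner: inner, logger: logger}
}

func (s *loggingTokenStore) SaveToken(t *Token) error {
	s.logger.Debug("saving oauth token",
		zap.Duration("expires_in", time.Until(t.Expiry)),
		zap.Bool("has_refresh_token", t.RefreshToken != ""))
	return s.inner.SaveToken(t)
}

func (s *loggingTokenStore) GetToken() (*Token, error) {
	t, err := s.inner.GetToken()
	if err == nil && t != nil {
		s.logger.Debug("loaded oauth token",
			zap.Duration("expires_in", time.Until(t.Expiry)))
	}
	return t, err
}
```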

**Verification checklist**
- [ ] Refreshing an OAuth token updates BoltDB and triggers a reconnect without manual intervention.
- [ ] Extending the callback timeout (experimentally) eliminates repeated “OAuth authorization timeout” messages in remote environments.
- [ ] Cross-process completion events always drive a reconnect within the expected polling window (≤5s by default).

## Phase 4 – Introspection & User-Facing Diagnostics
- Design lightweight diagnostics:
  - A CLI subcommand (e.g., `mcpproxy debug upstream <name>`) to dump current transport stats, token expiry, last error, and SSE uptime.
  - An optional `/api/v1/diagnostics/upstream` endpoint returning the same payload for UI integration (see the payload sketch after this list).
- Expand logging guidance in `MANUAL_TESTING.md` for capturing SSE issues (e.g., enabling trace on the `transport` logger, how to tail upstream logs).
- Evaluate adding Prometheus-style counters (connection retries, OAuth failures) to aid longer-term monitoring.
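
A sketch of the payload the proposed endpoint could return, shared by the CLI subcommand. Field names and the handler shape are illustrative, not the project's existing API surface; durations serialize as Go nanosecond counts unless custom marshalling is added.

```go
package httpapi

import (
	"encoding/json"
	"net/http"
	"time"
)

// UpstreamDiagnostics is a hypothetical response body covering the signals
// named in the plan: transport stats, state, last error, and token expiry.
type UpstreamDiagnostics struct {
	Name           string        `json:"name"`
	Transport      string        `json:"transport"`        // "sse", "http", or "stdio"
	State          string        `json:"state"`            // ready / error / reconnecting
	ConnectedFor   time.Duration `json:"connected_for"`    // SSE uptime
	RetryCount     int           `json:"retry_count"`
	LastError      string        `json:"last_error,omitempty"`
	TokenExpiresIn time.Duration `json:"token_expires_in"` // zero when no OAuth is configured
}

// diagnosticsHandler resolves an upstream by name and returns its snapshot;
// the lookup callback stands in for however the real server reaches state.
func diagnosticsHandler(lookup func(name string) (UpstreamDiagnostics, bool)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		diag, ok := lookup(r.URL.Query().Get("name"))
		if !ok {
			http.Error(w, "unknown upstream", http.StatusNotFound)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(diag)
	}
}
```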

**Verification checklist**
- [ ] Diagnostics output surfaces enough context for a user to determine whether the issue is OAuth, the SSE transport, or an upstream crash.
- [ ] UI/tray can surface a human-readable warning when SSE drops repeatedly (without freezing).
- [ ] A fresh install following the updated guide can reproduce the troubleshooting steps, validating the documentation changes.
