-
Notifications
You must be signed in to change notification settings - Fork 193
fix: connection stats being enabled break subscriptions #2203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: connection stats being enabled break subscriptions #2203
Conversation
WalkthroughAdds subscription-focused connection metrics tests for Prometheus and OTLP. Introduces request-aware context/value extraction for tracing and connection metrics, updating related signatures and wiring. Renames and re-exports DialerFunc with syscall.Conn-aware wrapping. Extends anonymized usage reporting to include connection-stats flags for OTEL and Prometheus. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Pre-merge checks (2 passed, 1 warning)❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
✨ Finishing Touches
Comment |
Router-nonroot image scan passed✅ No security vulnerabilities found in image: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (8)
router/core/transport.go (1)
91-104: Guard against a nil req in callback.If req were ever nil, calling ActiveSubgraph(req) could panic depending on its implementation. Cheap defensive guard:
- getValuesFromRequest := func(ctx context.Context, req *http.Request) (*expr.Context, string) { + getValuesFromRequest := func(ctx context.Context, req *http.Request) (*expr.Context, string) { reqContext := getRequestContext(ctx) if reqContext == nil { return &expr.Context{}, "" } - var activeSubgraphName string - if activeSubgraph := reqContext.ActiveSubgraph(req); activeSubgraph != nil { - activeSubgraphName = activeSubgraph.Name - } + var activeSubgraphName string + if req != nil { + if activeSubgraph := reqContext.ActiveSubgraph(req); activeSubgraph != nil { + activeSubgraphName = activeSubgraph.Name + } + } return &reqContext.expressionContext, activeSubgraphName }Can you confirm traceclient never invokes the getter with a nil request?
router/core/trace_dialer.go (1)
34-37: Minor: return nil on dial error.Safer to avoid leaking a partially initialized conn in edge cases:
- if err != nil { - counter.Add(-1) - return conn, err - } + if err != nil { + counter.Add(-1) + return nil, err + }Please check callers don’t rely on a non-nil conn alongside an error.
router-tests/telemetry/connection_metrics_test.go (2)
253-260: Defer-close the WebSocket to avoid leaking goroutines/FDs across parallel testsClose the connection when the subtest ends to reduce cross-test interference.
- conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil) + conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil) + defer conn.Close()
272-276: Guard against nil before dereferencing the scope metricFail fast with a clear assertion instead of risking a panic if metrics aren’t yet emitted.
- scopeMetric := *integration.GetMetricScopeByName(rm.ScopeMetrics, "cosmo.router.connections") - excludePortFromMetrics(t, rm.ScopeMetrics) + sm := integration.GetMetricScopeByName(rm.ScopeMetrics, "cosmo.router.connections") + require.NotNil(t, sm) + excludePortFromMetrics(t, rm.ScopeMetrics) + scopeMetric := *smrouter-tests/prometheus_test.go (1)
4879-4886: Defer-close the WebSocket to ensure clean teardown in parallel runsPrevents leftover connections from affecting subsequent metric snapshots.
- conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil) + conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil) + defer conn.Close()router/internal/traceclient/traceclient.go (3)
41-51: Validate constructor input to avoid latent panicsFail fast if the callback isn’t provided.
func NewTraceInjectingRoundTripper( base http.RoundTripper, connectionMetricStore metric.ConnectionMetricStore, reqContextValuesGetter func(ctx context.Context, req *http.Request) (*expr.Context, string), ) *TraceInjectingRoundTripper { + if reqContextValuesGetter == nil { + panic("TraceInjectingRoundTripper: reqContextValuesGetter must not be nil") + } return &TraceInjectingRoundTripper{ base: base, connectionMetricStore: connectionMetricStore, reqContextValuesGetter: reqContextValuesGetter, } }
77-80: Use the request context when processing metricsSlightly more robust if middleware swaps contexts on the request path.
- t.processConnectionMetrics(ctx, req) + t.processConnectionMetrics(req.Context(), req)
104-106: Read client-trace from the request contextKeeps trace extraction aligned with the context actually used by httptrace.
-func (t *TraceInjectingRoundTripper) processConnectionMetrics(ctx context.Context, req *http.Request) { - trace := GetClientTraceFromContext(ctx) +func (t *TraceInjectingRoundTripper) processConnectionMetrics(ctx context.Context, req *http.Request) { + trace := GetClientTraceFromContext(req.Context())
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
router-tests/prometheus_test.go(1 hunks)router-tests/telemetry/connection_metrics_test.go(2 hunks)router/core/router_config.go(2 hunks)router/core/trace_dialer.go(4 hunks)router/core/transport.go(1 hunks)router/internal/traceclient/traceclient.go(4 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/core/router_config.go
📚 Learning: 2025-08-28T09:18:10.121Z
Learnt from: endigma
PR: wundergraph/cosmo#2141
File: router-tests/http_subscriptions_test.go:100-108
Timestamp: 2025-08-28T09:18:10.121Z
Learning: In router-tests/http_subscriptions_test.go heartbeat tests, the message ordering should remain strict with data messages followed by heartbeat messages, as the timing is deterministic and known by design in the Cosmo router implementation.
Applied to files:
router-tests/telemetry/connection_metrics_test.gorouter-tests/prometheus_test.go
📚 Learning: 2025-09-03T11:38:45.855Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2183
File: router/internal/traceclient/traceclient.go:115-118
Timestamp: 2025-09-03T11:38:45.855Z
Learning: In the Cosmo codebase, the exprContext obtained from getExprContext in TraceInjectingRoundTripper.processConnectionMetrics is expected to always be non-nil as part of the existing system behavior, so defensive nil checks are not needed.
Applied to files:
router/core/transport.gorouter/internal/traceclient/traceclient.go
📚 Learning: 2025-08-28T09:17:49.477Z
Learnt from: endigma
PR: wundergraph/cosmo#2141
File: router-tests/http_subscriptions_test.go:17-55
Timestamp: 2025-08-28T09:17:49.477Z
Learning: The Cosmo router uses a custom, intentionally rigid multipart implementation for GraphQL subscriptions. The multipart parsing in test files should remain strict and not be made more tolerant, as this rigidity is by design.
Applied to files:
router-tests/prometheus_test.go
🧬 Code graph analysis (5)
router/core/router_config.go (2)
router/pkg/metric/config.go (1)
OpenTelemetry(69-84)router/pkg/config/config.go (1)
Prometheus(97-111)
router-tests/telemetry/connection_metrics_test.go (4)
router-tests/testenv/testenv.go (5)
Run(105-122)Config(284-340)MetricOptions(263-277)Environment(1727-1763)WebSocketMessage(2279-2283)router-tests/utils.go (1)
GetMetricScopeByName(88-95)router/pkg/otel/attributes.go (3)
ServerAddress(93-93)WgClientReusedConnection(54-54)WgSubgraphName(23-23)router/pkg/otel/otelconfig/otelconfig.go (1)
CumulativeTemporality(17-17)
router/core/transport.go (2)
router/internal/expr/expr.go (2)
Context(35-39)Request(66-75)router/internal/traceclient/traceclient.go (1)
NewTraceInjectingRoundTripper(41-51)
router/internal/traceclient/traceclient.go (2)
router/pkg/metric/connection_metric_store.go (1)
ConnectionMetricStore(25-29)router/internal/expr/expr.go (2)
Context(35-39)Request(66-75)
router-tests/prometheus_test.go (1)
router-tests/testenv/testenv.go (5)
Run(105-122)Config(284-340)MetricOptions(263-277)Environment(1727-1763)WebSocketMessage(2279-2283)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: build-router
- GitHub Check: build_test
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: build_push_image
- GitHub Check: image_scan (nonroot)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: image_scan
- GitHub Check: build_test
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./events)
🔇 Additional comments (3)
router/core/router_config.go (1)
188-189: Usage flags for connection stats — looks good; verify downstream consumers.The additions align with the config types and are gated behind the respective enable flags. Please confirm that any telemetry/usage ingestion (dashboards, analytics jobs) recognizes:
- metrics_otel_connection_stats
- metrics_prometheus_connection_stats
Also applies to: 199-200
router/core/trace_dialer.go (1)
87-94: SyscallConn wrapper behavior — LGTM.Graceful fallback with an explicit error preserves interface semantics and avoids unsafe assertions.
router/internal/traceclient/traceclient.go (1)
113-118: Trace injection constructor migration verifiedAll usages of NewTraceInjectingRoundTripper now use the new getValuesFromRequest callback and no legacy getExprContext calls remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
router/core/graph_server.go (1)
133-136: Remove temporary/commented code and document the rationale.Keep TraceDialer always-on if that's the intended fix, but drop the commented guard and replace the note with a clear, permanent comment to avoid confusion.
- // NOTE: Temporarily enable to verify works for all tests / will revert - //if isConnStoreEnabled { - traceDialer = NewTraceDialer() - //} + // Always initialize TraceDialer to keep connection behavior consistent across HTTP/WebSocket/SSE; + // metrics emission remains gated by isConnStoreEnabled below. + traceDialer = NewTraceDialer()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
router/core/graph_server.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router/core/graph_server.go (1)
router/core/trace_dialer.go (1)
NewTraceDialer(15-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: image_scan (nonroot)
- GitHub Check: image_scan
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
router/core/graph_server.go (1)
131-146: Confirm no unintended side-effects when metrics disabled
Verified newHTTPTransport only wraps DialContext when traceDialer != nil (nil yields the unmodified transport) and TraceDialer.WrapDial performs only atomic counter updates without spawning goroutines or altering socket options; all newHTTPTransport call sites pass traceDialer explicitly.
…led-break-subscriptions
…led-break-subscriptions
Summary by CodeRabbit
Checklist