Skip to content

Conversation

@SkArchon
Copy link
Contributor

@SkArchon SkArchon commented Sep 10, 2025

Summary by CodeRabbit

  • New Features
    • Connection stats for WebSocket subscriptions available via Prometheus and OTLP (max/active connections, acquire duration) with richer labels (server address/port, subgraph).
    • Improved attribution of metrics and traces to the active subgraph.
  • Refactor
    • Internal networking and transport updated to better propagate request context and support syscall-aware connections for more accurate metrics/tracing.
  • Chores
    • Anonymized usage telemetry now indicates whether Prometheus/OTLP connection stats are enabled.
  • Tests
    • Added end-to-end coverage for subscription connection metrics and engine lifecycle metrics.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • I have read the Contributors Guide.

@coderabbitai
Copy link

coderabbitai bot commented Sep 10, 2025

Walkthrough

Adds subscription-focused connection metrics tests for Prometheus and OTLP. Introduces request-aware context/value extraction for tracing and connection metrics, updating related signatures and wiring. Renames and re-exports DialerFunc with syscall.Conn-aware wrapping. Extends anonymized usage reporting to include connection-stats flags for OTEL and Prometheus.

Changes

Cohort / File(s) Summary of Changes
Subscription connection metrics tests
router-tests/prometheus_test.go, router-tests/telemetry/connection_metrics_test.go
Add tests validating connection metrics during GraphQL WebSocket subscription lifecycle for Prometheus and OTLP, including gauges and histograms, attribute checks, and engine-level metrics assertions.
Request-aware tracing and metrics plumbing
router/core/transport.go, router/internal/traceclient/traceclient.go
Replace expr-context getter with request-aware getter returning expr.Context and active subgraph name; propagate req through RoundTrip and processConnectionMetrics; use subgraph name fallback for attributes.
Trace dialer API and syscall support
router/core/trace_dialer.go
Rename/re-export dialerFunc → DialerFunc; update WrapDial signature; add trackedConnWithSyscall to handle syscall.Conn; centralize onClose; add explicit SyscallConn error path.
Usage metrics flags
router/core/router_config.go
Include metrics_otel_connection_stats and metrics_prometheus_connection_stats in anonymized config when respective systems are enabled.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly identifies the primary focus of the changeset—resolving a bug where enabling connection metrics prevents subscriptions from functioning—and does so in a concise, file-agnostic manner that aligns with the pull request’s scope. It avoids unnecessary detail and accurately reflects the main change implemented across the telemetry, router, and trace client code.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e9944c and 22963fe.

📒 Files selected for processing (2)
  • router-tests/prometheus_test.go (1 hunks)
  • router-tests/telemetry/connection_metrics_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router-tests/telemetry/connection_metrics_test.go
  • router-tests/prometheus_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: Analyze (go)
✨ Finishing Touches
  • 📝 Generate Docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Sep 10, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-cae83e5cd8f7eeb3c83984b7d177cff241e42c92-nonroot

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (8)
router/core/transport.go (1)

91-104: Guard against a nil req in callback.

If req were ever nil, calling ActiveSubgraph(req) could panic depending on its implementation. Cheap defensive guard:

- getValuesFromRequest := func(ctx context.Context, req *http.Request) (*expr.Context, string) {
+ getValuesFromRequest := func(ctx context.Context, req *http.Request) (*expr.Context, string) {
   reqContext := getRequestContext(ctx)
   if reqContext == nil {
     return &expr.Context{}, ""
   }
-  var activeSubgraphName string
-  if activeSubgraph := reqContext.ActiveSubgraph(req); activeSubgraph != nil {
-    activeSubgraphName = activeSubgraph.Name
-  }
+  var activeSubgraphName string
+  if req != nil {
+    if activeSubgraph := reqContext.ActiveSubgraph(req); activeSubgraph != nil {
+      activeSubgraphName = activeSubgraph.Name
+    }
+  }
   return &reqContext.expressionContext, activeSubgraphName
 }

Can you confirm traceclient never invokes the getter with a nil request?

router/core/trace_dialer.go (1)

34-37: Minor: return nil on dial error.

Safer to avoid leaking a partially initialized conn in edge cases:

- if err != nil {
-   counter.Add(-1)
-   return conn, err
- }
+ if err != nil {
+   counter.Add(-1)
+   return nil, err
+ }

Please check callers don’t rely on a non-nil conn alongside an error.

router-tests/telemetry/connection_metrics_test.go (2)

253-260: Defer-close the WebSocket to avoid leaking goroutines/FDs across parallel tests

Close the connection when the subtest ends to reduce cross-test interference.

-			conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil)
+			conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil)
+			defer conn.Close()

272-276: Guard against nil before dereferencing the scope metric

Fail fast with a clear assertion instead of risking a panic if metrics aren’t yet emitted.

-			scopeMetric := *integration.GetMetricScopeByName(rm.ScopeMetrics, "cosmo.router.connections")
-			excludePortFromMetrics(t, rm.ScopeMetrics)
+			sm := integration.GetMetricScopeByName(rm.ScopeMetrics, "cosmo.router.connections")
+			require.NotNil(t, sm)
+			excludePortFromMetrics(t, rm.ScopeMetrics)
+			scopeMetric := *sm
router-tests/prometheus_test.go (1)

4879-4886: Defer-close the WebSocket to ensure clean teardown in parallel runs

Prevents leftover connections from affecting subsequent metric snapshots.

-			conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil)
+			conn := xEnv.InitGraphQLWebSocketConnection(nil, nil, nil)
+			defer conn.Close()
router/internal/traceclient/traceclient.go (3)

41-51: Validate constructor input to avoid latent panics

Fail fast if the callback isn’t provided.

 func NewTraceInjectingRoundTripper(
   base http.RoundTripper,
   connectionMetricStore metric.ConnectionMetricStore,
   reqContextValuesGetter func(ctx context.Context, req *http.Request) (*expr.Context, string),
 ) *TraceInjectingRoundTripper {
+	if reqContextValuesGetter == nil {
+		panic("TraceInjectingRoundTripper: reqContextValuesGetter must not be nil")
+	}
 	return &TraceInjectingRoundTripper{
 		base:                   base,
 		connectionMetricStore:  connectionMetricStore,
 		reqContextValuesGetter: reqContextValuesGetter,
 	}
 }

77-80: Use the request context when processing metrics

Slightly more robust if middleware swaps contexts on the request path.

-	t.processConnectionMetrics(ctx, req)
+	t.processConnectionMetrics(req.Context(), req)

104-106: Read client-trace from the request context

Keeps trace extraction aligned with the context actually used by httptrace.

-func (t *TraceInjectingRoundTripper) processConnectionMetrics(ctx context.Context, req *http.Request) {
-	trace := GetClientTraceFromContext(ctx)
+func (t *TraceInjectingRoundTripper) processConnectionMetrics(ctx context.Context, req *http.Request) {
+	trace := GetClientTraceFromContext(req.Context())
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6fad5f and 24aee79.

📒 Files selected for processing (6)
  • router-tests/prometheus_test.go (1 hunks)
  • router-tests/telemetry/connection_metrics_test.go (2 hunks)
  • router/core/router_config.go (2 hunks)
  • router/core/trace_dialer.go (4 hunks)
  • router/core/transport.go (1 hunks)
  • router/internal/traceclient/traceclient.go (4 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_config.go
📚 Learning: 2025-08-28T09:18:10.121Z
Learnt from: endigma
PR: wundergraph/cosmo#2141
File: router-tests/http_subscriptions_test.go:100-108
Timestamp: 2025-08-28T09:18:10.121Z
Learning: In router-tests/http_subscriptions_test.go heartbeat tests, the message ordering should remain strict with data messages followed by heartbeat messages, as the timing is deterministic and known by design in the Cosmo router implementation.

Applied to files:

  • router-tests/telemetry/connection_metrics_test.go
  • router-tests/prometheus_test.go
📚 Learning: 2025-09-03T11:38:45.855Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2183
File: router/internal/traceclient/traceclient.go:115-118
Timestamp: 2025-09-03T11:38:45.855Z
Learning: In the Cosmo codebase, the exprContext obtained from getExprContext in TraceInjectingRoundTripper.processConnectionMetrics is expected to always be non-nil as part of the existing system behavior, so defensive nil checks are not needed.

Applied to files:

  • router/core/transport.go
  • router/internal/traceclient/traceclient.go
📚 Learning: 2025-08-28T09:17:49.477Z
Learnt from: endigma
PR: wundergraph/cosmo#2141
File: router-tests/http_subscriptions_test.go:17-55
Timestamp: 2025-08-28T09:17:49.477Z
Learning: The Cosmo router uses a custom, intentionally rigid multipart implementation for GraphQL subscriptions. The multipart parsing in test files should remain strict and not be made more tolerant, as this rigidity is by design.

Applied to files:

  • router-tests/prometheus_test.go
🧬 Code graph analysis (5)
router/core/router_config.go (2)
router/pkg/metric/config.go (1)
  • OpenTelemetry (69-84)
router/pkg/config/config.go (1)
  • Prometheus (97-111)
router-tests/telemetry/connection_metrics_test.go (4)
router-tests/testenv/testenv.go (5)
  • Run (105-122)
  • Config (284-340)
  • MetricOptions (263-277)
  • Environment (1727-1763)
  • WebSocketMessage (2279-2283)
router-tests/utils.go (1)
  • GetMetricScopeByName (88-95)
router/pkg/otel/attributes.go (3)
  • ServerAddress (93-93)
  • WgClientReusedConnection (54-54)
  • WgSubgraphName (23-23)
router/pkg/otel/otelconfig/otelconfig.go (1)
  • CumulativeTemporality (17-17)
router/core/transport.go (2)
router/internal/expr/expr.go (2)
  • Context (35-39)
  • Request (66-75)
router/internal/traceclient/traceclient.go (1)
  • NewTraceInjectingRoundTripper (41-51)
router/internal/traceclient/traceclient.go (2)
router/pkg/metric/connection_metric_store.go (1)
  • ConnectionMetricStore (25-29)
router/internal/expr/expr.go (2)
  • Context (35-39)
  • Request (66-75)
router-tests/prometheus_test.go (1)
router-tests/testenv/testenv.go (5)
  • Run (105-122)
  • Config (284-340)
  • MetricOptions (263-277)
  • Environment (1727-1763)
  • WebSocketMessage (2279-2283)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
🔇 Additional comments (3)
router/core/router_config.go (1)

188-189: Usage flags for connection stats — looks good; verify downstream consumers.

The additions align with the config types and are gated behind the respective enable flags. Please confirm that any telemetry/usage ingestion (dashboards, analytics jobs) recognizes:

  • metrics_otel_connection_stats
  • metrics_prometheus_connection_stats

Also applies to: 199-200

router/core/trace_dialer.go (1)

87-94: SyscallConn wrapper behavior — LGTM.

Graceful fallback with an explicit error preserves interface semantics and avoids unsafe assertions.

router/internal/traceclient/traceclient.go (1)

113-118: Trace injection constructor migration verified

All usages of NewTraceInjectingRoundTripper now use the new getValuesFromRequest callback and no legacy getExprContext calls remain.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
router/core/graph_server.go (1)

133-136: Remove temporary/commented code and document the rationale.

Keep TraceDialer always-on if that's the intended fix, but drop the commented guard and replace the note with a clear, permanent comment to avoid confusion.

-	// NOTE: Temporarily enable to verify works for all tests / will revert
-	//if isConnStoreEnabled {
-	traceDialer = NewTraceDialer()
-	//}
+	// Always initialize TraceDialer to keep connection behavior consistent across HTTP/WebSocket/SSE;
+	// metrics emission remains gated by isConnStoreEnabled below.
+	traceDialer = NewTraceDialer()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24aee79 and d97cd0d.

📒 Files selected for processing (1)
  • router/core/graph_server.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router/core/graph_server.go (1)
router/core/trace_dialer.go (1)
  • NewTraceDialer (15-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
router/core/graph_server.go (1)

131-146: Confirm no unintended side-effects when metrics disabled
Verified newHTTPTransport only wraps DialContext when traceDialer != nil (nil yields the unmodified transport) and TraceDialer.WrapDial performs only atomic counter updates without spawning goroutines or altering socket options; all newHTTPTransport call sites pass traceDialer explicitly.

@SkArchon SkArchon merged commit 64f87e6 into main Sep 10, 2025
29 checks passed
@SkArchon SkArchon deleted the milinda/eng-8119-connection-stats-being-enabled-break-subscriptions branch September 10, 2025 12:41
@Noroth Noroth mentioned this pull request Sep 30, 2025
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants