
Conversation

@SkArchon SkArchon commented Aug 13, 2025

This PR adds event metrics for EDFS. We use generic produce and consume metrics, as documented by the OpenTelemetry messaging semantic conventions: https://opentelemetry.io/docs/specs/semconv/messaging/messaging-metrics/#producer-metrics

  • router.streams.sent.messages: counter of how many messages have been sent
  • router.streams.received.messages: counter of how many messages have been received

Both metrics share a set of common attributes:

  • wg_stream_operation_name: Name of the broker operation: publish, request, etc.
  • wg_provider_type: The provider type: "kafka", "redis", or "nats"
  • wg_error_type: Set only on failure, to a generic error type such as "publish_error"
  • wg_destination_name: Name of the target topic / channel
  • wg_provider_id: ID of the configured provider

The following is an example (extra common attributes such as federated_graph_id have been removed):

streams_sent_messages_total{
    wg_destination_name="employeeUpdatedMyNats.2",
    wg_provider_id="my-nats",
    wg_provider_type="nats",
    wg_stream_operation_name="publish"
} 1

streams_received_messages_total{
    wg_destination_name="employeeUpdatedMyNats.2",
    wg_provider_id="my-nats",
    wg_provider_type="nats",
    wg_stream_operation_name="publish"
} 1
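The exposition lines above come from labelled counters. As a rough, self-contained illustration (toy types; the router's real implementation uses OpenTelemetry instruments, not this), a counter keyed by the wg_* label set behaves like this:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// streamCounter is a toy stand-in for the router's stream counters: each
// distinct wg_* label combination identifies one time series.
type streamCounter struct {
	name   string
	counts map[string]int64
}

func newStreamCounter(name string) *streamCounter {
	return &streamCounter{name: name, counts: map[string]int64{}}
}

// Inc bumps the series identified by the given label set.
func (c *streamCounter) Inc(labels map[string]string) {
	c.counts[labelKey(labels)]++
}

// labelKey renders labels in sorted order so equal sets map to one series.
func labelKey(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return strings.Join(parts, ",")
}

// Expose renders Prometheus-style exposition lines, sorted for stability.
func (c *streamCounter) Expose() []string {
	lines := make([]string, 0, len(c.counts))
	for k, v := range c.counts {
		lines = append(lines, fmt.Sprintf("%s{%s} %d", c.name, k, v))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	sent := newStreamCounter("streams_sent_messages_total")
	sent.Inc(map[string]string{
		"wg_destination_name":      "employeeUpdatedMyNats.2",
		"wg_provider_id":           "my-nats",
		"wg_provider_type":         "nats",
		"wg_stream_operation_name": "publish",
	})
	for _, line := range sent.Expose() {
		fmt.Println(line)
	}
}
```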

Note that this PR does not contain the graphql operation name as an attribute. cc: @StarpTech

Summary by CodeRabbit

  • New Features

    • Per-provider stream observability for messaging (sent/received counters) across Kafka, NATS, and Redis with operation, provider, destination and provider-id labels.
    • New no-op and real stream metric backends (OTLP/Prometheus) and runtime store aggregation.
  • Configuration

    • New flags to enable stream metrics for OTLP and Prometheus (off by default).
  • Improvements

    • Safer metrics init/flush/shutdown handling.
  • Tests

    • End-to-end stream-metrics tests and shared helpers for producing messages.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • I have read the Contributors Guide.


coderabbitai bot commented Aug 13, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Threads a new StreamMetricStore through the pubsub/provider/adapter stack and metric subsystem: adds stream metrics implementations (OTLP, Prometheus, Noop), introduces datasource.ProviderOpts to pass the store to provider adapters (Kafka, NATS, Redis), instruments adapters to emit stream events, extends config/schema with Streams flags, and updates tests and helpers.

Changes

Cohort / File(s) Summary of changes
Metric stream infrastructure
router/pkg/metric/stream_metric_store.go, router/pkg/metric/stream_measurements.go, router/pkg/metric/oltp_stream_metric_store.go, router/pkg/metric/prom_stream_metric_store.go, router/pkg/metric/noop_stream_metrics.go, router/pkg/metric/config.go
Adds StreamMetricStore types, StreamMetricProvider implementations (OTLP/Prometheus/Noop), instruments/counters for stream events, constructor NewStreamMetricStore and wiring for provider selection.
Metric API & attributes
router/pkg/otel/attributes.go, router/pkg/metric/stream_metric_store.go, router/pkg/metric/stream_measurements.go
Adds stream-related OTEL attribute keys and metrics names/options for sent/received stream counters.
Metric init error handling
router/pkg/metric/oltp_connection_metric_store.go, router/pkg/metric/prom_connection_metric_store.go
startInitMetrics return value is now checked and errors propagated during metric initialization.
Config, schema, and testdata
router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json
Adds boolean Streams flags to Prometheus and OTLP metric configs; updates schema and testdata.
Core engine wiring
router/core/graph_server.go, router/core/factoryresolver.go, router/core/plan_generator.go, router/core/router.go
Adds StreamMetricStore to RouterEngineConfiguration, initializes Noop/real store based on config, wires store into BuildProvidersAndDataSources, ensures shutdown/flush integration, and wires Streams into metric config construction.
PubSub build pipeline
router/pkg/pubsub/pubsub.go, router/pkg/pubsub/datasource/provider.go, router/pkg/pubsub/datasource/mocks.go
Threads metric.StreamMetricStore through BuildProvidersAndDataSources and internal build helpers; introduces ProviderOpts { StreamMetricStore } and updates ProviderBuilder.BuildProvider signature and mocks to accept providerOpts.
Kafka provider
router/pkg/pubsub/kafka/adapter.go, router/pkg/pubsub/kafka/provider_builder.go, router/pkg/pubsub/kafka/provider_builder_test.go
Adds per-provider streamMetricStore on adapter, emits stream metrics on receive/produce, propagates providerId via PollerOpts, updates NewProviderAdapter/BuildProvider signatures and tests, stores adapters by provider ID.
NATS provider
router/pkg/pubsub/nats/adapter.go, router/pkg/pubsub/nats/provider_builder.go, router/pkg/pubsub/nats/provider_builder_test.go, demo/pkg/subgraphs/subgraphs.go
Adds streamMetricStore to adapter, instruments publish/request/receive with stream events, updates NewAdapter/Publish signatures to accept ProviderOpts, and updates demo to pass a Noop store.
Redis provider
router/pkg/pubsub/redis/adapter.go, router/pkg/pubsub/redis/provider_builder.go, router/pkg/pubsub/redis/provider_builder_test.go
Adds streamMetricStore to adapter, instruments publish/subscribe metrics, accepts datasource.ProviderOpts in NewProviderAdapter/BuildProvider, improves subscription shutdown handling, and updates tests to provide opts.
PubSub tests & mocks
router/pkg/pubsub/pubsub_test.go
Tests updated to pass a NoopStreamMetricStore and reflect updated BuildProvidersAndDataSources signature and mock expectations.
Events test helpers and packages
router-tests/events/event_helpers.go, router-tests/events/events_config_test.go, router-tests/events/kafka_events_test.go, router-tests/events/nats_events_test.go, router-tests/events/redis_events_test.go
Adds exported test helpers (ProduceKafkaMessage, EnsureTopicExists, ProduceRedisMessage), moves several tests from package events_test → events, and replaces local helpers with shared exported helpers.
Stream metrics integration tests
router-tests/prometheus_stream_metrics_test.go, router-tests/telemetry/stream_metrics_test.go
Adds end-to-end tests validating Prometheus/OTLP stream metrics across Kafka, NATS, and Redis, asserting operation, provider type/id, destination, and counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • feat: edfs stream metrics #2137 — Implements stream/event metrics and threads datasource.ProviderOpts (including StreamMetricStore) through provider builders and adapters; appears to be the same code-level change.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 83bc1f8 and 0da26e7.

📒 Files selected for processing (3)
  • router-tests/prometheus_stream_metrics_test.go (1 hunks)
  • router-tests/telemetry/stream_metrics_test.go (1 hunks)
  • router-tests/testenv/testenv.go (5 hunks)



github-actions bot commented Aug 13, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-486d820623f0b3dcde6d74c1cccd301da20e2506-nonroot

@SkArchon

@coderabbitai review


coderabbitai bot commented Aug 14, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🔭 Outside diff range comments (17)
router/pkg/metric/prom_connection_metric_store.go (2)

55-61: Initial max-connections observations omit base attributes

For consistency with the callback (which includes both base and per-subgraph attrs), include the provided base attributes in the initial MeasureMaxConnections calls. Otherwise, time series will differ in label sets between initial and subsequent observations.

Apply:

-		h.MeasureMaxConnections(context.Background(), maxConns, otelmetric.WithAttributes(attrs...))
+		h.MeasureMaxConnections(
+			context.Background(),
+			maxConns,
+			otelmetric.WithAttributes(attributes...),
+			otelmetric.WithAttributes(attrs...),
+		)

101-104: errors.Join usage drops previously collected errors

You overwrite err each iteration with errors.Join(regErr), losing prior errors. Accumulate instead.

Apply:

-		if regErr := reg.Unregister(); regErr != nil {
-			err = errors.Join(regErr)
-		}
+		if regErr := reg.Unregister(); regErr != nil {
+			err = errors.Join(err, regErr)
+		}
router/pkg/metric/oltp_connection_metric_store.go (2)

55-61: Include base attributes in initial max-connections measurements

Match the callback’s attribute set by adding baseAttributes to the initial MeasureMaxConnections calls. This avoids label-set drift between initial and subsequent observations.

Apply:

-		h.MeasureMaxConnections(context.Background(), maxConns, otelmetric.WithAttributes(attrs...))
+		h.MeasureMaxConnections(
+			context.Background(),
+			maxConns,
+			otelmetric.WithAttributes(attributes...),
+			otelmetric.WithAttributes(attrs...),
+		)

102-105: errors.Join usage drops previously collected errors

Same issue as in the Prometheus implementation—accumulate unregister errors instead of overwriting.

Apply:

-		if regErr := reg.Unregister(); regErr != nil {
-			err = errors.Join(regErr)
-		}
+		if regErr := reg.Unregister(); regErr != nil {
+			err = errors.Join(err, regErr)
+		}
router-tests/events/nats_events_test.go (2)

87-95: Build is broken: undefined type natsSubscriptionArgs (reintroduce the helper type).

The local helper type was removed but is still referenced throughout this file, causing the Router CI failure:
undefined: natsSubscriptionArgs

Reintroduce a minimal local type to unblock builds.

Apply this diff near the top (after the timeout const) to restore the type:

 const NatsWaitTimeout = time.Second * 30

+// helper args container for subscription callbacks
+type natsSubscriptionArgs struct {
+	dataValue []byte
+	errValue  error
+}

1548-1555: Slice preallocation bug: using make with a length and then appending leaves the first len(...) entries zero-valued.

You preallocate natsEventSources with length and then append, leaving leading zero-valued entries. This yields invalid provider entries and can alter test behavior.

Apply this diff to preallocate capacity only:

-				natsEventSources := make([]config.NatsEventSource, len(testenv.DemoNatsProviders))
-				for _, sourceName := range testenv.DemoNatsProviders {
-					natsEventSources = append(natsEventSources, config.NatsEventSource{
+				natsEventSources := make([]config.NatsEventSource, 0, len(testenv.DemoNatsProviders))
+				for _, sourceName := range testenv.DemoNatsProviders {
+					natsEventSources = append(natsEventSources, config.NatsEventSource{
 						ID:  sourceName,
 						URL: url,
 					})
 				}
router/pkg/config/config.go (1)

103-111: Apply Prometheus EventMetrics env var fix in config.go

Scan for METRICS_OTLP_EVENT_METRICS shows two definitions of EventMetrics in router/pkg/config/config.go. Only the Prometheus one (lines 103–111) should be updated; the OTLP toggle at lines 141–143 must remain using METRICS_OTLP_EVENT_METRICS.

• Modify Prometheus config at ~line 103:

-	EventMetrics bool `yaml:"event_metrics" envDefault:"false" env:"METRICS_OTLP_EVENT_METRICS"`
+	EventMetrics bool `yaml:"event_metrics" envDefault:"false" env:"PROMETHEUS_EVENT_METRICS"`

• Leave the OTLP config (lines 141–143) unchanged.
• Update config.schema.json and any docs to reflect PROMETHEUS_EVENT_METRICS.
• Add an integration test that sets PROMETHEUS_EVENT_METRICS=true to guard against regressions.

router/pkg/pubsub/datasource/mocks.go (1)

795-853: Mock BuildProvider must forward providerOpts to the testify mock

The autogenerated mock for BuildProvider currently ignores the providerOpts parameter when calling _mock.Called, so any expectations on providerOpts will silently be dropped. Please update router/pkg/pubsub/datasource/mocks.go to:

  • In the MockProviderBuilder.BuildProvider implementation (around line 795):
    • Change
    ret := _mock.Called(options)
    to
    ret := _mock.Called(options, providerOpts)
    • Adjust the return‐value logic to first handle callbacks of type
    func(P, ProviderOpts) (Provider, error)
    and
    func(P, ProviderOpts) Provider
    before falling back to the old single‐arg signatures.
  • In the expecter helper MockProviderBuilder_Expecter.BuildProvider (around line 829):
    • Add a second parameter for providerOpts interface{}
    • Change the call to
    _e.mock.On("BuildProvider", options, providerOpts)
  • In the call‐builder helpers (around lines 833–853):
    • Update Run to accept func(options P, providerOpts ProviderOpts) and extract args[1]
    • Update RunAndReturn to accept func(options P, providerOpts ProviderOpts) (Provider, error)

Applying these changes will allow tests to set and assert expectations on the new providerOpts argument.

router/pkg/pubsub/pubsub_test.go (1)

64-65: Mock expectations likely outdated: BuildProvider now accepts provider options.

Per the PR changes, MockProviderBuilder.BuildProvider takes providerOpts. The current mocks expect only a single argument and will mismatch once the new signature is in effect. Update the test mocks to include the additional argument (use mock.Anything to keep tests decoupled from the exact opts).

Apply these diffs:

@@
-	mockBuilder.On("BuildProvider", natsEventSources[0]).Return(mockPubSubProvider, nil)
+	mockBuilder.On("BuildProvider", natsEventSources[0], mock.Anything).Return(mockPubSubProvider, nil)
@@
-	mockBuilder.On("BuildProvider", natsEventSources[0]).Return(nil, errors.New("provider error"))
+	mockBuilder.On("BuildProvider", natsEventSources[0], mock.Anything).Return(nil, errors.New("provider error"))
@@
-	mockBuilder.On("BuildProvider", natsEventSources[1]).Return(mockPubSubUsedProvider, nil)
+	mockBuilder.On("BuildProvider", natsEventSources[1], mock.Anything).Return(mockPubSubUsedProvider, nil)

Add the missing import to use mock.Anything:

@@
-	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/mock"

Also applies to: 121-122, 239-240

router/pkg/metric/oltp_event_metric_store.go (1)

68-69: Interface mismatch: Shutdown signature must accept a context.

EventMetricStore requires Shutdown(ctx context.Context) error. The current method omits the context and won’t satisfy the interface. Also, delegating to meterProvider.Shutdown(ctx) ensures proper exporter teardown.

Apply this diff:

-func (o *otlpEventMetrics) Shutdown() error { return nil }
+func (o *otlpEventMetrics) Shutdown(ctx context.Context) error {
+	return o.meterProvider.Shutdown(ctx)
+}
router/core/factoryresolver.go (2)

472-480: Add nil-safe fallback for EventMetricStore

Add a guard around routerEngineConfig.EventMetricStore in factoryresolver.go to prevent nil stores from causing panics. Verified this is the only production call site that needs updating.

  • File: router/core/factoryresolver.go
 factoryProviders, factoryDataSources, err := pubsub.BuildProvidersAndDataSources(
   l.ctx,
   routerEngineConfig.Events,
+  func() rmetric.EventMetricStore {
+    if routerEngineConfig.EventMetricStore == nil {
+      return rmetric.NewNoopEventMetricStore()
+    }
+    return routerEngineConfig.EventMetricStore
+  }(),
   l.logger,
   pubSubDS,
   l.resolver.InstanceData().HostName,
   l.resolver.InstanceData().ListenAddress,
 )

210-216: Ensure all RouterEngineConfiguration literals set EventMetricStore

In router/core/plan_generator.go the zero-value literal leaves EventMetricStore nil. Update the struct instantiation to use a noop store, for example:

• router/core/plan_generator.go:256

- routerEngineConfig := RouterEngineConfiguration{}
+ routerEngineConfig := RouterEngineConfiguration{
+     EventMetricStore: rmetric.NewNoopEventMetricStore(),
+ }

The graph_server path already supplies a non-nil store.

router-tests/events/kafka_events_test.go (1)

71-79: Bug in EnsureTopicExists helper: preallocate+append yields empty topics.

Calls to EnsureTopicExists(t, xEnv, topics...) rely on router-tests/events/event_helpers.go. That helper constructs prefixedTopics := make([]string, len(topics)) and then appends, resulting in an initial slice of empty strings plus the desired topics. Deleting/creating topics with empty names will likely fail intermittently.

Fix in router-tests/events/event_helpers.go:

func EnsureTopicExists(t *testing.T, xEnv *testenv.Environment, topics ...string) {
	// Delete topic for idempotency
	deleteCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	prefixedTopics := make([]string, 0, len(topics))
	for _, topic := range topics {
		prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
	}

	_, err := xEnv.KafkaAdminClient.DeleteTopics(deleteCtx, prefixedTopics...)
	require.NoError(t, err)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	_, err = xEnv.KafkaAdminClient.CreateTopics(ctx, 1, 1, nil, prefixedTopics...)
	require.NoError(t, err)
}

Do you want me to open a PR patching this helper for you?

router/pkg/pubsub/redis/provider_builder.go (1)

46-66: Guard against missing adapter to avoid nil deref.

Unlike Kafka/NATS builders, BuildEngineDataSourceFactory does not check whether an adapter exists for providerId. If the configuration references a non-existent provider, b.adapters[providerId] will be nil and downstream usage can panic.

Align with other builders by checking existence and returning an error if not found.

Apply this diff:

 func (b *ProviderBuilder) BuildEngineDataSourceFactory(data *nodev1.RedisEventConfiguration) (datasource.EngineDataSourceFactory, error) {
   providerId := data.GetEngineEventConfiguration().GetProviderId()

+  adapter, ok := b.adapters[providerId]
+  if !ok {
+    return nil, fmt.Errorf("failed to get adapter for provider %s with ID %s", b.TypeID(), providerId)
+  }
+
   var eventType EventType
   switch data.GetEngineEventConfiguration().GetType() {
   case nodev1.EventType_PUBLISH:
     eventType = EventTypePublish
   case nodev1.EventType_SUBSCRIBE:
     eventType = EventTypeSubscribe
   default:
     return nil, fmt.Errorf("unsupported event type: %s", data.GetEngineEventConfiguration().GetType())
   }

   return &EngineDataSourceFactory{
-    RedisAdapter: b.adapters[providerId],
+    RedisAdapter: adapter,
     fieldName:    data.GetEngineEventConfiguration().GetFieldName(),
     eventType:    eventType,
     channels:     data.GetChannels(),
     providerId:   providerId,
   }, nil
 }
router/pkg/pubsub/kafka/adapter.go (1)

48-60: Use the subscription context for PollRecords to improve cancellation responsiveness.

topicPoller checks ctx.Done(), but PollRecords currently uses p.ctx, which can delay cancellation when the subscription ctx is canceled. Use ctx in PollRecords so the poll loop can exit promptly on unsubscribe.

Apply this diff:

-      fetches := client.PollRecords(p.ctx, 10_000)
+      fetches := client.PollRecords(ctx, 10_000)
router/pkg/pubsub/nats/adapter.go (2)

129-147: JetStream fetch loop terminates on no-message/timeout; should continue with backoff.

Using FetchNoWait(300) and returning on any error will stop the subscription when there are temporarily no messages (or on timeouts). This leads to dropped subscriptions and flakiness. Also, calling FetchNoWait in a tight loop risks a CPU spin if changed to continue.

Switch to Fetch with a short context timeout and continue on deadline/timeout errors.

-                    msgBatch, consumerFetchErr := consumer.FetchNoWait(300)
-                    if consumerFetchErr != nil {
-                        log.Error("error fetching messages", zap.Error(consumerFetchErr))
-                        return
-                    }
+                    fetchCtx, cancel := context.WithTimeout(p.ctx, 500*time.Millisecond)
+                    msgBatch, consumerFetchErr := consumer.Fetch(fetchCtx, 300)
+                    cancel()
+                    if consumerFetchErr != nil {
+                        // benign: no messages within the fetch window; keep polling
+                        if errors.Is(consumerFetchErr, context.DeadlineExceeded) || errors.Is(consumerFetchErr, nats.ErrTimeout) {
+                            continue
+                        }
+                        log.Error("error fetching messages", zap.Error(consumerFetchErr))
+                        return
+                    }

318-341: Default to a NoopEventMetricStore to prevent nil-pointer panics.

eventMetricStore is dereferenced in multiple code paths. If ProviderOpts.EventMetricStore is nil, this will panic. Initialize a NoopEventMetricStore by default.

 func NewAdapter(
   ctx context.Context,
   logger *zap.Logger,
   url string,
   opts []nats.Option,
   hostName string,
   routerListenAddr string,
   providerOpts datasource.ProviderOpts,
 ) (Adapter, error) {
   if logger == nil {
     logger = zap.NewNop()
   }
 
+  if providerOpts.EventMetricStore == nil {
+    providerOpts.EventMetricStore = metric.NewNoopEventMetricStore()
+  }
+
   return &ProviderAdapter{
     ctx:              ctx,
     logger:           logger.With(zap.String("pubsub", "nats")),
     closeWg:          sync.WaitGroup{},
     hostName:         hostName,
     routerListenAddr: routerListenAddr,
     url:              url,
     opts:             opts,
     flushTimeout:     10 * time.Second,
     eventMetricStore: providerOpts.EventMetricStore,
   }, nil
 }
🧹 Nitpick comments (32)
router/pkg/metric/prom_connection_metric_store.go (2)

46-49: Best-effort init is fine; consider documenting the intent and guarding against nil logger

Logging and continuing aligns with “best-effort” startup, but:

  • If logger can be nil, this will panic. Guard with zap.NewNop().
  • Add a short comment to document that init errors are intentionally non-fatal.

Apply this guard near the top of the constructor:

 func newPromConnectionMetrics(logger *zap.Logger, meterProvider *metric.MeterProvider, stats *ConnectionPoolStats, attributes []attribute.KeyValue) (*promConnectionMetrics, error) {
+	if logger == nil {
+		logger = zap.NewNop()
+	}
 	meter := meterProvider.Meter(

54-92: Receiver naming is inconsistent (h vs m)

This file mixes receiver names (startInitMetrics uses h; other methods use m). Unify for readability.

Would you like a follow-up patch to rename receivers consistently (e.g., m) across methods?

router/pkg/metric/oltp_connection_metric_store.go (2)

46-49: Best-effort init: log-and-continue is OK; add nil-logger guard and document behavior

Same note as the Prometheus variant:

  • Guard against nil logger to avoid panic.
  • Consider a brief comment explaining that initialization errors are intentionally non-fatal.

Suggested guard:

 func newOtlpConnectionMetrics(logger *zap.Logger, meterProvider *metric.MeterProvider, stats *ConnectionPoolStats, baseAttributes []attribute.KeyValue) (*otlpConnectionMetrics, error) {
+	if logger == nil {
+		logger = zap.NewNop()
+	}
 	meter := meterProvider.Meter(

54-85: Duplicate init logic across OTLP and Prometheus — consider DRYing up

startInitMetrics is nearly identical in both files. Extract common logic into a shared helper (e.g., a small internal function taking meter, instruments, stats, and base attrs) to reduce drift and ease future changes.

If you want, I can propose a small internal package-level helper to encapsulate the shared callback registration and initial measurements.

router-tests/events/nats_events_test.go (1)

464-464: Fix test name grammar.

Nit: “subscribe after message don't a boundary” → “subscribe after message: don't add a boundary”.

Apply this diff:

-	t.Run("subscribe after message don't a boundary", func(t *testing.T) {
+	t.Run("subscribe after message: don't add a boundary", func(t *testing.T) {
router/pkg/metric/event_measurements.go (1)

12-19: Metric names consistency check.

Names look coherent. Consider pluralizing failure counters for consistency (“failures”) if you plan to expose these externally for long-term stability:

  • router.events.publish.fail → router.events.publish.failures
  • router.nats.request.fail → router.nats.request.failures

If you decide to change names now (before they become part of a stable public contract), update both the constants and any downstream references in providers/stores.

router/pkg/pubsub/datasource/provider.go (1)

36-38: Add doc comment for ProviderOpts to clarify intent and future usage

A brief comment will help consumers understand what belongs in ProviderOpts vs. provider-specific options.

Apply:

+// ProviderOpts carries cross-cutting, runtime-provided facilities for providers
+// that don't belong to their static config (e.g. metric/event stores).
 type ProviderOpts struct {
   EventMetricStore metric.EventMetricStore
 }
demo/pkg/subgraphs/subgraphs.go (1)

214-221: Adapter calls updated with ProviderOpts — OK

Calls to natsPubsub.NewAdapter now include ProviderOpts{}. This keeps the demo compiling with the new signature. If you later want to demo event metrics, consider threading a real EventMetricStore here.

router/pkg/metric/oltp_event_metric_store.go (1)

23-40: Optional: guard against nil meterProvider and consider using logger or removing it.

If meterProvider is ever nil, newOtlpEventMetrics will panic. Either validate inputs or document invariants. Also, logger is stored but unused; either log instrumentation setup failure or drop the field.

Example guard:

 func newOtlpEventMetrics(logger *zap.Logger, meterProvider *metric.MeterProvider) (*otlpEventMetrics, error) {
+	if meterProvider == nil {
+		return nil, fmt.Errorf("meterProvider is nil")
+	}
 	meter := meterProvider.Meter(

And either remove logger or use it to log instrumentation errors:

 	instruments, err := newEventInstruments(meter)
 	if err != nil {
-		return nil, err
+		if logger != nil {
+			logger.Error("failed to initialize event instruments", zap.Error(err))
+		}
+		return nil, err
 	}
router/pkg/pubsub/redis/adapter.go (1)

90-96: Optional: close PubSub on cleanup to avoid resource leaks.

Unsubscribing is good, but explicitly closing the subscription ensures resources are freed promptly.

Apply this diff:

 	cleanup := func() {
 		err := sub.PUnsubscribe(ctx, event.Channels...)
 		if err != nil {
 			log.Error(fmt.Sprintf("error unsubscribing from redis for topics %v", event.Channels), zap.Error(err))
 		}
+		_ = sub.Close()
 	}
router/pkg/metric/prom_event_metric_store.go (2)

64-65: Prefer logging on ForceFlush errors.

Bubble up the error but also log it for visibility during tests and ops.

Apply this diff:

-func (p *promEventMetrics) Flush(ctx context.Context) error { return p.meterProvider.ForceFlush(ctx) }
-func (p *promEventMetrics) Shutdown() error                 { return nil }
+func (p *promEventMetrics) Flush(ctx context.Context) error {
+	err := p.meterProvider.ForceFlush(ctx)
+	if err != nil && p.logger != nil {
+		p.logger.Debug("prom event metrics flush failed", zap.Error(err))
+	}
+	return err
+}
+func (p *promEventMetrics) Shutdown() error { return nil }

16-21: Struct fields ‘meter’ (and previously ‘logger’) are currently unused in methods.

If you don’t plan to use meter, consider removing it to avoid lint warnings. You already use logger with the Flush suggestion above.

router-tests/testenv/testenv.go (1)

2825-2827: NATS test adapters now receive ProviderOpts with a Noop EventMetricStore.

Good default for tests. If you later add tests validating event metrics emitted by subgraph-side publishers, consider making this configurable to pass a real store.

router/core/graph_server.go (1)

759-764: Flush EventMetrics before Shutdown to avoid losing buffered data.

Shutdown sequencing should flush first to minimize metric loss on shutdown.

Apply this diff:

-	if s.eventMetricStore != nil {
-		if aErr := s.eventMetricStore.Shutdown(ctx); aErr != nil {
-			err = errors.Join(err, aErr)
-		}
-	}
+	if s.eventMetricStore != nil {
+		if aErr := s.eventMetricStore.Flush(ctx); aErr != nil {
+			err = errors.Join(err, aErr)
+		}
+		if aErr := s.eventMetricStore.Shutdown(ctx); aErr != nil {
+			err = errors.Join(err, aErr)
+		}
+	}
router/pkg/pubsub/kafka/adapter.go (1)

183-189: Emit publish metrics on both success and failure; consider guarding if store can be nil.

Metrics are emitted correctly on both success and failure, after the produce callback has completed. However, these calls panic if eventMetricStore is nil; the constructor-level default suggested above avoids that risk. If you prefer not to default, add nil checks here.

If you decide against constructor defaulting, here’s a guarded alternative:

-    p.eventMetricStore.KafkaPublishFailure(ctx, event.ProviderID, event.Topic)
+    if p.eventMetricStore != nil {
+      p.eventMetricStore.KafkaPublishFailure(ctx, event.ProviderID, event.Topic)
+    }
...
-    p.eventMetricStore.KafkaPublish(ctx, event.ProviderID, event.Topic)
+    if p.eventMetricStore != nil {
+      p.eventMetricStore.KafkaPublish(ctx, event.ProviderID, event.Topic)
+    }

Please confirm whether a no-op store is always injected across the codebase; if yes, we can keep the current calls without guards.

router-tests/events/event_helpers.go (2)

46-54: Optionally ignore “unknown topic” on delete to be idempotent.

DeleteTopics for non-existent topics can return an error. Since this function aims for idempotency, consider ignoring "unknown topic or partition" errors from the admin API rather than failing the test.

If the client returns typed per-topic results, iterate and only require.NoError on unexpected errors. If it returns a single error, pattern-match and ignore expected “unknown topic” errors.


16-35: Minor: make produce callback non-blocking.

pErrCh is unbuffered; the producer's callback goroutine can block if AwaitChannelWithT isn't receiving yet. Buffering to 1 makes the callback non-blocking and simplifies synchronization.

Apply this diff:

-  pErrCh := make(chan error)
+  pErrCh := make(chan error, 1)
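The difference the one-slot buffer makes can be shown in isolation. deliverResult below is a hypothetical stand-in for the producer callback, not the Kafka client API:

```go
package main

import (
	"fmt"
	"time"
)

// deliverResult mimics a producer callback handing its error to the test.
// With an unbuffered channel the send blocks until a receiver is ready;
// with capacity 1 the callback can finish immediately.
func deliverResult(ch chan error, err error) (completed bool) {
	done := make(chan struct{})
	go func() {
		ch <- err // blocks forever on an unbuffered channel with no reader
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(50 * time.Millisecond):
		return false // send still pending: callback would be stuck
	}
}

func main() {
	unbuffered := make(chan error)
	fmt.Println(deliverResult(unbuffered, nil)) // false: send blocks

	buffered := make(chan error, 1)
	fmt.Println(deliverResult(buffered, nil)) // true: send completes at once
}
```

With the buffer, AwaitChannelWithT can start receiving whenever it likes without ever stalling the producer's callback goroutine.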
router-tests/prometheus_event_metrics_test.go (3)

48-51: Guard against missing metric family to avoid nil dereference.

Add a nil assertion before using family to produce clearer failures.

-        family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+        family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+        require.NotNil(t, family, "metric family router_events_publish_messages_total not found")
         metrics := family.GetMetric()
         require.Len(t, metrics, 1)

112-126: Guard against missing metric family in Kafka subscribe path as well.

Nit for consistency: assert the metric family is found before use.

-                    family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+                    family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+                    require.NotNil(t, family, "metric family router_events_messages_received_total not found")
                     metrics := family.GetMetric()
                     require.Len(t, metrics, 1)

272-285: Add robustness checks before indexing metrics slice.

In this NATS subscribe path, you don't assert the number of metrics before accessing metrics[0]. Add a length assertion for symmetry with other tests and to avoid potential panics.

-                    family := findMetricFamilyByName(mf, "router_events_messages_received_total")
-                    metrics := family.GetMetric()
+                    family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+                    require.NotNil(t, family, "metric family router_events_messages_received_total not found")
+                    metrics := family.GetMetric()
+                    require.Len(t, metrics, 1)
router/pkg/metric/noop_event_metrics.go (1)

3-7: Remove unused noopEventMetricProvider and otel metric import.

noopEventMetricProvider is dead code and introduces an unnecessary otel dependency in this file. Simplify by removing it.

 import (
   "context"
-
-  otelmetric "go.opentelemetry.io/otel/metric"
 )
 
-// A noop metric provider so we do not need to do nil checks for each provider call from the store
-type noopEventMetricProvider struct{}
-
-func (n *noopEventMetricProvider) Publish(ctx context.Context, opts ...otelmetric.AddOption)        {}
-func (n *noopEventMetricProvider) PublishFailure(ctx context.Context, opts ...otelmetric.AddOption) {}
-func (n *noopEventMetricProvider) MessagesReceived(ctx context.Context, opts ...otelmetric.AddOption) {
-}
-
-func (n *noopEventMetricProvider) NatsRequest(ctx context.Context, opts ...otelmetric.AddOption) {}
-func (n *noopEventMetricProvider) NatsRequestFailure(ctx context.Context, opts ...otelmetric.AddOption) {
-}
-func (n *noopEventMetricProvider) Flush(ctx context.Context) error { return nil }
-func (n *noopEventMetricProvider) Shutdown() error                 { return nil }
router-tests/telemetry/event_metrics_test.go (6)

54-60: Avoid panics: assert data points are present before indexing.

Guard against empty data point sets to produce clearer failures. Apply similarly to other occurrences in this file.

-                sum, ok := metric.Data.(metricdata.Sum[int64])
-                require.True(t, ok)
-
-                dataPoint := sum.DataPoints[0]
+                sum, ok := metric.Data.(metricdata.Sum[int64])
+                require.True(t, ok)
+                require.NotEmpty(t, sum.DataPoints)
+                dataPoint := sum.DataPoints[0]

118-126: Repeat the data-point presence check in Kafka subscribe path.

Same defensive assertion recommended here before accessing sum.DataPoints[0].


213-231: Repeat the data-point presence check in NATS request path.

Add require.NotEmpty(t, sum.DataPoints) before indexing.


293-314: Repeat the data-point presence check in NATS subscribe path.

Add require.NotEmpty(t, sum.DataPoints) before indexing.


345-366: Repeat the data-point presence check in Redis publish path.

Add require.NotEmpty(t, sum.DataPoints) before indexing.


413-434: Repeat the data-point presence check in Redis subscribe path.

Add require.NotEmpty(t, sum.DataPoints) before indexing.

router/pkg/metric/event_metric_store.go (5)

16-30: Optionally pass context to Shutdown to align with OTel APIs

OTel components typically require context for Shutdown. Consider threading ctx through the provider Shutdown to support deadlines/cancellation and align with go.opentelemetry.io APIs. This change will touch provider implementations.

Apply this diff (breaking change within this PR):

 type EventMetricProvider interface {
   // unified publish/receive for brokers (kafka, redis, nats)
   Publish(ctx context.Context, opts ...otelmetric.AddOption)
   PublishFailure(ctx context.Context, opts ...otelmetric.AddOption)
   MessagesReceived(ctx context.Context, opts ...otelmetric.AddOption)

   // keep NATS request separate
   NatsRequest(ctx context.Context, opts ...otelmetric.AddOption)
   NatsRequestFailure(ctx context.Context, opts ...otelmetric.AddOption)

   Flush(ctx context.Context) error
-  Shutdown() error
+  Shutdown(ctx context.Context) error
 }
 func (e *EventMetrics) Shutdown(ctx context.Context) error {
   var err error

   if errFlush := e.Flush(ctx); errFlush != nil {
     err = errors.Join(err, fmt.Errorf("failed to flush metrics: %w", errFlush))
   }

-  if errProm := e.promMetrics.Shutdown(); errProm != nil {
+  if errProm := e.promMetrics.Shutdown(ctx); errProm != nil {
     err = errors.Join(err, fmt.Errorf("failed to shutdown prom metrics: %w", errProm))
   }

-  if errOtlp := e.otlpMetrics.Shutdown(); errOtlp != nil {
+  if errOtlp := e.otlpMetrics.Shutdown(ctx); errOtlp != nil {
     err = errors.Join(err, fmt.Errorf("failed to shutdown otlp metrics: %w", errOtlp))
   }

   return err
 }

Also applies to: 209-226


60-85: Guard against nil metricsConfig to avoid panics

If callers accidentally pass a nil metricsConfig, the current code panics. A small guard keeps the store safe and no-op by default.

Apply this diff:

 func NewEventMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *sdkmetric.MeterProvider, metricsConfig *Config) (*EventMetrics, error) {
   store := &EventMetrics{
     baseAttributes: baseAttributes,
     logger:         logger,
     otlpMetrics:    &noopEventMetricProvider{},
     promMetrics:    &noopEventMetricProvider{},
   }

+  if metricsConfig == nil {
+    // No metrics configured; return noop store.
+    return store, nil
+  }

92-180: Reduce repetition with a small helper for common attributes

Most methods repeat provider.type and provider.id. A tiny helper can centralize this and reduce copy/paste errors.

Apply this diff to add a helper (usage can be incrementally adopted):

 func (e *EventMetrics) withAttrs(attrs ...attribute.KeyValue) otelmetric.AddOption {
   copied := append([]attribute.KeyValue{}, e.baseAttributes...)
   return otelmetric.WithAttributes(append(copied, attrs...)...)
 }
+
+// withProviderAttrs composes common provider attributes with additional ones.
+func (e *EventMetrics) withProviderAttrs(providerType, providerID string, extra ...attribute.KeyValue) otelmetric.AddOption {
+  attrs := []attribute.KeyValue{
+    otelattrs.WgEventProviderType.String(providerType),
+    otelattrs.WgEventProviderID.String(providerID),
+  }
+  attrs = append(attrs, extra...)
+  return e.withAttrs(attrs...)
+}

92-170: Optional: replace magic strings with typed constants

The literal "kafka", "redis", "nats" appear multiple times. Introduce constants to avoid typos and enable future refactors.

Apply this diff to define constants:

 package metric

 import (
@@
   otelattrs "github.com/wundergraph/cosmo/router/pkg/otel"
 )
 
+const (
+  providerTypeKafka = "kafka"
+  providerTypeRedis = "redis"
+  providerTypeNats  = "nats"
+)

Then replace occurrences of "kafka"/"redis"/"nats" with the respective constants (can be done progressively, especially if you adopt withProviderAttrs).


16-23: Naming consistency: consider singular MessageReceived

Provider interface uses MessagesReceived (plural) whereas store methods are KafkaMessageReceived, RedisMessageReceived, NatsMessageReceived (singular). Consider aligning to singular for consistency. This is cosmetic but avoids cognitive overhead.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 36083a7 and 7f8e869.

📒 Files selected for processing (34)
  • demo/pkg/subgraphs/employees/subgraph/schema.resolvers.go (1 hunks)
  • demo/pkg/subgraphs/subgraphs.go (2 hunks)
  • router-tests/events/event_helpers.go (1 hunks)
  • router-tests/events/events_config_test.go (1 hunks)
  • router-tests/events/kafka_events_test.go (29 hunks)
  • router-tests/events/nats_events_test.go (1 hunks)
  • router-tests/events/redis_events_test.go (19 hunks)
  • router-tests/prometheus_event_metrics_test.go (1 hunks)
  • router-tests/telemetry/event_metrics_test.go (1 hunks)
  • router-tests/testenv/testenv.go (5 hunks)
  • router/core/factoryresolver.go (3 hunks)
  • router/core/graph_server.go (5 hunks)
  • router/core/router.go (2 hunks)
  • router/pkg/config/config.go (2 hunks)
  • router/pkg/config/config.schema.json (2 hunks)
  • router/pkg/metric/config.go (2 hunks)
  • router/pkg/metric/event_measurements.go (1 hunks)
  • router/pkg/metric/event_metric_store.go (1 hunks)
  • router/pkg/metric/noop_event_metrics.go (1 hunks)
  • router/pkg/metric/oltp_connection_metric_store.go (1 hunks)
  • router/pkg/metric/oltp_event_metric_store.go (1 hunks)
  • router/pkg/metric/prom_connection_metric_store.go (1 hunks)
  • router/pkg/metric/prom_event_metric_store.go (1 hunks)
  • router/pkg/otel/attributes.go (2 hunks)
  • router/pkg/pubsub/datasource/mocks.go (1 hunks)
  • router/pkg/pubsub/datasource/provider.go (2 hunks)
  • router/pkg/pubsub/kafka/adapter.go (6 hunks)
  • router/pkg/pubsub/kafka/provider_builder.go (2 hunks)
  • router/pkg/pubsub/nats/adapter.go (9 hunks)
  • router/pkg/pubsub/nats/provider_builder.go (2 hunks)
  • router/pkg/pubsub/pubsub.go (6 hunks)
  • router/pkg/pubsub/pubsub_test.go (3 hunks)
  • router/pkg/pubsub/redis/adapter.go (4 hunks)
  • router/pkg/pubsub/redis/provider_builder.go (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (23)
router/pkg/pubsub/datasource/mocks.go (1)
router/pkg/pubsub/datasource/provider.go (2)
  • ProviderOpts (36-38)
  • Provider (18-24)
router/pkg/metric/config.go (1)
router/pkg/metric/event_metric_store.go (1)
  • EventMetrics (52-58)
router/pkg/metric/oltp_connection_metric_store.go (1)
router/pkg/pubsub/datasource/error.go (1)
  • Error (3-6)
router-tests/events/event_helpers.go (2)
router-tests/testenv/testenv.go (1)
  • Environment (1725-1761)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router/pkg/pubsub/datasource/provider.go (1)
router/pkg/metric/event_metric_store.go (1)
  • EventMetricStore (32-49)
router/core/router.go (3)
router/pkg/metric/event_metric_store.go (1)
  • EventMetrics (52-58)
router/pkg/config/config.go (2)
  • Metrics (127-132)
  • Prometheus (97-111)
router/pkg/metric/metric_store.go (1)
  • Metrics (105-124)
router/pkg/pubsub/redis/provider_builder.go (3)
router/pkg/pubsub/datasource/provider.go (3)
  • ProviderBuilder (27-34)
  • ProviderOpts (36-38)
  • Provider (18-24)
router/pkg/config/config.go (1)
  • RedisEventSource (611-615)
router/pkg/pubsub/redis/adapter.go (1)
  • NewProviderAdapter (28-38)
demo/pkg/subgraphs/subgraphs.go (2)
router/pkg/pubsub/nats/adapter.go (1)
  • NewAdapter (318-342)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/config/config.go (1)
router/pkg/metric/event_metric_store.go (1)
  • EventMetrics (52-58)
router/pkg/pubsub/kafka/adapter.go (3)
router/pkg/metric/event_metric_store.go (1)
  • EventMetricStore (32-49)
router/pkg/pubsub/datasource/error.go (1)
  • NewError (12-17)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/metric/event_metric_store.go (2)
router/pkg/metric/config.go (2)
  • Config (107-131)
  • OpenTelemetry (69-84)
router/pkg/otel/attributes.go (5)
  • WgEventProviderType (66-66)
  • WgEventProviderID (65-65)
  • WgKafkaTopic (67-67)
  • WgRedisChannel (69-69)
  • WgNatsSubject (68-68)
router/pkg/pubsub/redis/adapter.go (5)
router/pkg/pubsub/kafka/adapter.go (3)
  • NewProviderAdapter (232-246)
  • Adapter (25-30)
  • ProviderAdapter (37-45)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/metric/event_metric_store.go (1)
  • EventMetricStore (32-49)
router/internal/persistedoperation/operationstorage/redis/rdcloser.go (1)
  • RDCloser (14-16)
router/pkg/pubsub/datasource/error.go (1)
  • NewError (12-17)
router-tests/prometheus_event_metrics_test.go (4)
router-tests/testenv/testenv.go (8)
  • Run (105-122)
  • Config (284-339)
  • ConfigWithEdfsKafkaJSONTemplate (85-85)
  • MetricOptions (263-277)
  • Environment (1725-1761)
  • GraphQLRequest (1893-1901)
  • ConfigWithEdfsNatsJSONTemplate (87-87)
  • ConfigWithEdfsRedisJSONTemplate (89-89)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-77)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router/pkg/config/config.go (1)
  • EngineExecutionConfiguration (370-398)
router/pkg/pubsub/nats/adapter.go (4)
router/pkg/metric/event_metric_store.go (1)
  • EventMetricStore (32-49)
router/pkg/pubsub/nats/engine_datasource.go (1)
  • PublishAndRequestEventConfiguration (28-32)
router/pkg/pubsub/datasource/error.go (1)
  • NewError (12-17)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router-tests/telemetry/event_metrics_test.go (5)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-77)
router-tests/utils.go (1)
  • GetMetricByName (79-86)
router/pkg/otel/attributes.go (5)
  • WgEventProviderID (65-65)
  • WgEventProviderType (66-66)
  • WgKafkaTopic (67-67)
  • WgNatsSubject (68-68)
  • WgRedisChannel (69-69)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router/pkg/config/config.go (1)
  • EngineExecutionConfiguration (370-398)
router/pkg/pubsub/nats/provider_builder.go (2)
router/pkg/pubsub/datasource/provider.go (3)
  • ProviderBuilder (27-34)
  • ProviderOpts (36-38)
  • Provider (18-24)
router/pkg/pubsub/nats/adapter.go (2)
  • Adapter (22-33)
  • NewAdapter (318-342)
router/core/factoryresolver.go (1)
router/pkg/metric/event_metric_store.go (1)
  • EventMetricStore (32-49)
router-tests/events/kafka_events_test.go (2)
router-tests/events/event_helpers.go (2)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router/pkg/pubsub/kafka/provider_builder.go (3)
router/pkg/pubsub/datasource/provider.go (3)
  • ProviderBuilder (27-34)
  • ProviderOpts (36-38)
  • Provider (18-24)
router/pkg/config/config.go (1)
  • KafkaEventSource (599-605)
router/pkg/pubsub/kafka/adapter.go (2)
  • Adapter (25-30)
  • NewProviderAdapter (232-246)
router-tests/events/redis_events_test.go (1)
router-tests/events/event_helpers.go (1)
  • ProduceRedisMessage (56-77)
router/pkg/pubsub/pubsub.go (3)
router/pkg/metric/event_metric_store.go (1)
  • EventMetricStore (32-49)
router/pkg/pubsub/datasource/provider.go (3)
  • Provider (18-24)
  • ProviderBuilder (27-34)
  • ProviderOpts (36-38)
router/pkg/pubsub/kafka/provider_builder.go (2)
  • NewProviderBuilder (167-180)
  • ProviderBuilder (21-27)
router-tests/testenv/testenv.go (4)
router/pkg/metric/event_metric_store.go (2)
  • EventMetrics (52-58)
  • EventMetricStore (32-49)
router/pkg/pubsub/nats/adapter.go (1)
  • NewAdapter (318-342)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (50-50)
router/core/graph_server.go (5)
router/pkg/metric/event_metric_store.go (3)
  • EventMetricStore (32-49)
  • EventMetrics (52-58)
  • NewEventMetricStore (60-85)
router/pkg/metric/noop_metrics.go (1)
  • NewNoopMetrics (63-65)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (50-50)
router/pkg/metric/config.go (1)
  • OpenTelemetry (69-84)
router/pkg/config/config.go (1)
  • Prometheus (97-111)
🪛 GitHub Actions: Router CI
router-tests/events/nats_events_test.go

[error] 87-87: Test run failed: undefined: natsSubscriptionArgs. Command: gotestsum -f github-actions -- -ldflags=-extldflags=-Wl,-ld_classic -run '^Test[^(Flaky)]' --timeout=5m -p 1 --parallel 10 -race ./events

🔇 Additional comments (33)
router-tests/events/nats_events_test.go (1)

281-283: Comment contradicts code: sleep is shorter than configured read timeout.

WebSocketClientReadTimeout is set to 1s (Line 239) but we sleep only 200ms here while the comment says “Wait longer than the read timeout”.

Consider updating the sleep to exceed the timeout (or fix the comment). Example:

-			// Wait longer than the read timeout to ensure that read timeouts are handled correctly
-			time.Sleep(time.Millisecond * 200)
+			// Wait longer than the read timeout to ensure that read timeouts are handled correctly
+			time.Sleep(time.Second + 200*time.Millisecond)

If a longer sleep makes the test too slow, alternatively correct the comment instead of increasing the delay.

router-tests/events/events_config_test.go (1)

1-1: Package rename to internal test package looks good.

Using package events aligns with the other event tests and allows access to unexported internals where needed.

router/pkg/metric/config.go (2)

38-39: Prometheus config: EventMetrics toggle added — LGTM.

Adding EventMetrics bool to PrometheusConfig is consistent with the new event metrics feature flag surface.


82-84: OTLP config: EventMetrics toggle added — LGTM.

Adding EventMetrics bool to OpenTelemetry config aligns with the new metrics wiring.

router/pkg/metric/event_measurements.go (2)

52-103: Initialize counters — looks correct and defensive.

The instruments are created with error checks and wrapped messages. Good grouping and separation for generic events vs NATS request metrics.


21-39: Everywhere in the repo you already use otelmetric.Int64CounterOption and otelmetric.WithDescription, with no imports from go.opentelemetry.io/otel/metric/instrument. The new options in event_measurements.go are consistent with that existing usage. No changes needed.

Action: none; merge as is.

router/pkg/config/config.go (1)

141-145: OTLP EventMetrics toggle — LGTM.

The OTLP event metrics toggle/tag looks correct and consistent with the rest of the configuration surface.

router/pkg/config/config.schema.json (1)

1061-1065: OTLP event_metrics flag added — looks good

The schema addition is consistent with the surrounding options and uses a safe default (false). No issues spotted here.

router/core/router.go (1)

2242-2259: Correctly wires event metrics flags into metric config

Mapping cfg.Metrics.OTLP.EventMetrics and cfg.Metrics.Prometheus.EventMetrics into rmetric.Config is accurate and aligns with the schema additions. No issues.

router/pkg/pubsub/pubsub_test.go (2)

17-76: Minor: test covers happy-path build correctly.

No issues spotted in the positive-path assertions (provider count, root nodes), given the wiring changes.


293-294: Production already defaults to a no-op EventMetricStore — no change needed

Confirmed: tests pass nil intentionally, and production paths initialize and pass a no-op store so adapters won't panic. BuildProvidersAndDataSources simply forwards the store (it doesn't itself substitute nil).

Key locations:

  • router/pkg/pubsub/pubsub_test.go — tests call BuildProvidersAndDataSources(..., nil, ...) (lines ~287, 340, 393).
  • router/pkg/pubsub/pubsub.go — BuildProvidersAndDataSources signature accepts store and forwards it into provider opts (signature at line ~55; forwarded at build(...) -> ProviderOpts.EventMetricStore).
  • router/pkg/metric/noop_event_metrics.go — Noop implementation and constructor: NewNoopEventMetricStore().
  • router/core/graph_server.go — buildGraphMux initializes gm.eventMetricStore = rmetric.NewNoopEventMetricStore() and only replaces it with a real store when metrics are enabled (see gm creation and later assignment).
  • router/core/factoryresolver.go — calls pubsub.BuildProvidersAndDataSources(..., routerEngineConfig.EventMetricStore, ...) (around line ~472), so production callers supply the non-nil gm.eventMetricStore.

Result: production callers already supply a no-op default; tests can pass nil safely for focused testing.

router-tests/events/redis_events_test.go (2)

1-1: Package switch to internal events package is sensible for shared helpers.

Moving to package events allows direct use of shared test helpers without import cycles. This improves reuse and reduces duplication.


107-107: Consolidated to the shared ProduceRedisMessage helper — LGTM.

Replacing the local publisher with ProduceRedisMessage centralizes behavior and keeps tests DRY. The message payloads and assertions remain unchanged, preserving test intent.

Also applies to: 173-173, 185-185, 195-195, 205-205, 276-276, 357-357, 378-378, 454-454, 512-512, 517-517, 593-593, 666-666, 795-795, 856-856, 869-869, 880-880, 891-891, 994-994

router/pkg/otel/attributes.go (1)

63-71: Event metric attribute keys look consistent and future-proof.

Provider-scoped and backend-scoped labels are clearly named and consistent with existing wg.* attributes. This should make metric queries straightforward.

router/core/factoryresolver.go (1)

7-7: Import alias for metrics looks correct and scoped.

Alias rmetric is used for the new EventMetricStore type; no issues here.

router-tests/testenv/testenv.go (3)

273-277: Config flags for EventMetrics are well-scoped and consistent.

Adding EnableOTLPEventMetrics and EnablePrometheusEventMetrics to MetricOptions keeps test toggles symmetric with existing metrics flags.


1508-1514: Prometheus EventMetrics wiring is correct.

PrometheusConfig.EventMetrics mirrors the new flag and integrates with the metrics bootstrap.


1530-1539: OTLP EventMetrics wiring is correct.

Metrics.OTLP.EventMetrics is correctly controlled by the test flag.

router/core/graph_server.go (4)

517-518: Graph mux now tracks an EventMetricStore.

This aligns with the engine wiring and allows lifecycle control per-mux. Looks good.


779-783: Sane defaults: Noop stores on initialization.

Defaulting both metric stores to Noop avoids nil checks everywhere and ensures safe shutdown.


879-891: Conditional creation of real EventMetricStore is correct.

Creating a concrete store only when either OTLP or Prometheus event metrics are enabled keeps overhead minimal.


1134-1135: Engine configuration now includes EventMetricStore.

Passing the store into RouterEngineConfiguration is the right place for downstream pub/sub provider wiring.

router-tests/events/kafka_events_test.go (2)

110-116: Migration to exported test helpers LGTM.

Using ProduceKafkaMessage from the shared helpers removes duplication and improves readability.


434-455: Multipart test flow remains correct with helper usage.

The switch to EnsureTopicExists + ProduceKafkaMessage preserves behavior. No issues spotted.

router/pkg/pubsub/redis/provider_builder.go (1)

69-75: Provider options propagation looks correct.

The new BuildProvider signature and passing providerOpts into NewProviderAdapter is consistent with the generic ProviderBuilder interface change and with Kafka/NATS builders.

router/pkg/pubsub/kafka/provider_builder.go (2)

59-68: ProviderOpts threading is correct; map registration retained.

BuildProvider now accepts providerOpts and registers the adapter in p.adapters after creation. This matches the interface change and maintains downstream EngineDataSourceFactory lookups.


153-165: Forwarding providerOpts into adapter creation is correct.

buildProvider threads providerOpts into NewProviderAdapter. Error wrapping includes provider ID which is helpful for diagnostics.

router/pkg/pubsub/nats/provider_builder.go (2)

67-75: ProviderOpts propagation and adapter registration LGTM.

BuildProvider now forwards providerOpts through buildProvider and stores the adapter for later lookup. Consistent with Kafka.


121-133: ProviderOpts passed into NewAdapter correctly.

buildProvider wires providerOpts into NewAdapter, enabling metrics injection. Error messages include provider ID which is good.

router/pkg/pubsub/kafka/adapter.go (1)

93-97: Record “message received” with provider and topic looks good.

The metric emission at message receipt is correctly placed after iterating a record and before updater.Update.

router-tests/prometheus_event_metrics_test.go (1)

41-42: Potential flakiness from EnsureTopicExists helper (prefixedTopics building bug).

The helper events.EnsureTopicExists is used here. Based on router-tests/events/event_helpers.go, prefixedTopics is created with length len(topics) and then appended to, which doubles the slice's length and leaves its first half as empty strings. This can cause unexpected behavior and test flakiness.

Please fix the helper as follows:

// router-tests/events/event_helpers.go
func EnsureTopicExists(t *testing.T, xEnv *testenv.Environment, topics ...string) {
  // Delete topic for idempotency
  deleteCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
  defer cancel()
- prefixedTopics := make([]string, len(topics))
- for _, topic := range topics {
-   prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
- }
+ prefixedTopics := make([]string, 0, len(topics))
+ for _, topic := range topics {
+   prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
+ }

  _, err := xEnv.KafkaAdminClient.DeleteTopics(deleteCtx, prefixedTopics...)
  require.NoError(t, err)

  ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
  defer cancel()

  _, err = xEnv.KafkaAdminClient.CreateTopics(ctx, 1, 1, nil, prefixedTopics...)
  require.NoError(t, err)
}
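The underlying make/append pitfall, independent of Kafka, is that make with a non-zero length pre-fills the slice with zero values, and append then adds after them:

```go
package main

import "fmt"

func main() {
	topics := []string{"a", "b", "c"}

	// Bug: length 3 up front, so append adds *after* three empty strings.
	buggy := make([]string, len(topics))
	for _, tp := range topics {
		buggy = append(buggy, "prefix."+tp)
	}
	fmt.Println(len(buggy)) // 6: first half is empty strings

	// Fix: length 0, capacity 3, so append fills from the start.
	fixed := make([]string, 0, len(topics))
	for _, tp := range topics {
		fixed = append(fixed, "prefix."+tp)
	}
	fmt.Println(len(fixed), fixed[0]) // 3 prefix.a
}
```

Passing empty topic names to DeleteTopics/CreateTopics is what makes the helper flaky in the buggy variant.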
router/pkg/metric/event_metric_store.go (2)

87-90: Attribute merging helper is correct and copy-safe

Copy-on-append avoids mutating baseAttributes and ensures per-call isolation. Good practice.
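For readers unfamiliar with the aliasing hazard that copy-on-append prevents, here is a minimal stdlib sketch. Plain strings stand in for attribute.KeyValue, and withAttrsUnsafe/withAttrsSafe are illustrative names, not the store's methods:

```go
package main

import "fmt"

// withAttrsUnsafe appends directly to base; if base has spare capacity,
// repeated (or concurrent) calls can overwrite each other's extras.
func withAttrsUnsafe(base []string, extra ...string) []string {
	return append(base, extra...)
}

// withAttrsSafe copies base first, so each call gets isolated backing storage.
func withAttrsSafe(base []string, extra ...string) []string {
	copied := append([]string{}, base...)
	return append(copied, extra...)
}

func main() {
	base := make([]string, 1, 4) // spare capacity triggers the aliasing
	base[0] = "graph_id=g1"

	a := withAttrsUnsafe(base, "topic=a")
	_ = withAttrsUnsafe(base, "topic=b") // reuses the same backing array...
	fmt.Println(a[1])                    // topic=b: first result was clobbered

	c := withAttrsSafe(base, "topic=a")
	_ = withAttrsSafe(base, "topic=b")
	fmt.Println(c[1]) // topic=a: unaffected
}
```

This is why the per-call copy in withAttrs matters when metric calls from multiple adapters share one baseAttributes slice.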


194-207: No action needed: Go version supports errors.Join
All go.mod files declare Go 1.23.x (≥ 1.20), so using errors.Join is safe.

@wundergraph wundergraph deleted a comment from coderabbitai bot Aug 14, 2025
@wundergraph wundergraph deleted a comment from coderabbitai bot Aug 14, 2025
@SkArchon
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Aug 14, 2025

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🔭 Outside diff range comments (16)
router/pkg/metric/prom_connection_metric_store.go (1)

102-104: errors.Join usage bug: this drops previously accumulated errors

errors.Join(regErr) with a single arg is effectively just regErr. You’re overwriting err each time and not accumulating. Use errors.Join(err, regErr) to collect all unregister errors.

-		if regErr := reg.Unregister(); regErr != nil {
-			err = errors.Join(regErr)
-		}
+		if regErr := reg.Unregister(); regErr != nil {
+			err = errors.Join(err, regErr)
+		}
router/pkg/metric/oltp_connection_metric_store.go (1)

102-104: Fix error accumulation in Shutdown

As in the Prometheus store, errors.Join(regErr) drops previously collected errors. Use errors.Join(err, regErr) to aggregate all failures.

-		if regErr := reg.Unregister(); regErr != nil {
-			err = errors.Join(regErr)
-		}
+		if regErr := reg.Unregister(); regErr != nil {
+			err = errors.Join(err, regErr)
+		}
router-tests/events/nats_events_test.go (2)

1367-1389: Fix blocking ReadLine calls causing CI timeout in subscribe_sse_with_filter

CI failed with a 5m timeout in this test. The direct calls to reader.ReadLine() block indefinitely if an expected SSE frame is not delivered (e.g., due to filter logic differences or timing). Wrap the reads with a timeout to fail fast and avoid hanging the entire suite.

Apply this diff to replace blocking reads with a helper that enforces NatsWaitTimeout:

-            eventNext, _, gErr := reader.ReadLine()
-            require.NoError(t, gErr)
-            require.Equal(t, "event: next", string(eventNext))
-            data, _, gErr := reader.ReadLine()
-            require.NoError(t, gErr)
-            require.Equal(t, fmt.Sprintf("data: {\"data\":{\"filteredEmployeeUpdated\":{\"id\":%d,\"details\":{\"forename\":\"%s\",\"surname\":\"%s\"}}}}", 1, testData[1].forename, testData[1].surname), string(data))
-            line, _, gErr := reader.ReadLine()
-            require.NoError(t, gErr)
-            require.Equal(t, "", string(line))
+            eventNext := readLineWithTimeout(t, reader, NatsWaitTimeout)
+            require.Equal(t, "event: next", eventNext)
+            data := readLineWithTimeout(t, reader, NatsWaitTimeout)
+            require.Equal(t, fmt.Sprintf("data: {\"data\":{\"filteredEmployeeUpdated\":{\"id\":%d,\"details\":{\"forename\":\"%s\",\"surname\":\"%s\"}}}}", 1, testData[1].forename, testData[1].surname), data)
+            line := readLineWithTimeout(t, reader, NatsWaitTimeout)
+            require.Equal(t, "", line)
-                case 1, 3, 4, 5, 7, 8, 11:
-                    eventNext, _, gErr = reader.ReadLine()
-                    require.NoError(t, gErr)
-                    require.Equal(t, "event: next", string(eventNext))
-                    data, _, gErr = reader.ReadLine()
-                    require.NoError(t, gErr)
-                    require.Equal(t, fmt.Sprintf("data: {\"data\":{\"filteredEmployeeUpdated\":{\"id\":%d,\"details\":{\"forename\":\"%s\",\"surname\":\"%s\"}}}}", i, testData[i].forename, testData[i].surname), string(data))
-                    line, _, gErr = reader.ReadLine()
-                    require.NoError(t, gErr)
-                    require.Equal(t, "", string(line))
+                case 1, 3, 4, 5, 7, 8, 11:
+                    eventNext := readLineWithTimeout(t, reader, NatsWaitTimeout)
+                    require.Equal(t, "event: next", eventNext)
+                    data := readLineWithTimeout(t, reader, NatsWaitTimeout)
+                    require.Equal(t, fmt.Sprintf("data: {\"data\":{\"filteredEmployeeUpdated\":{\"id\":%d,\"details\":{\"forename\":\"%s\",\"surname\":\"%s\"}}}}", i, testData[i].forename, testData[i].surname), data)
+                    line := readLineWithTimeout(t, reader, NatsWaitTimeout)
+                    require.Equal(t, "", line)

Add this helper within this test file (outside the selected range), near the other test helpers:

// helper to prevent indefinite blocking on streaming reads
func readLineWithTimeout(t *testing.T, reader *bufio.Reader, timeout time.Duration) string {
	t.Helper()
	lineCh := make(chan string, 1)
	errCh := make(chan error, 1)
	go func() {
		line, _, err := reader.ReadLine()
		if err != nil {
			errCh <- err
			return
		}
		lineCh <- string(line)
	}()
	select {
	case s := <-lineCh:
		return s
	case err := <-errCh:
		require.NoError(t, err)
		return ""
	case <-time.After(timeout):
		t.Fatalf("timed out after %s waiting for stream line", timeout)
		return ""
	}
}

1548-1560: Slice preallocation bug: duplicate zero-value entries in providers list

The slice is pre-sized and then appended to, yielding len = 2x providers with the first half being zero-values. Use zero length with capacity preallocation.

-                natsEventSources := make([]config.NatsEventSource, len(testenv.DemoNatsProviders))
+                natsEventSources := make([]config.NatsEventSource, 0, len(testenv.DemoNatsProviders))
                 for _, sourceName := range testenv.DemoNatsProviders {
                     natsEventSources = append(natsEventSources, config.NatsEventSource{
                         ID:  sourceName,
                         URL: url,
                     })
                 }

This prevents zero-value providers from being added unintentionally. Even though this test expects failure, the current code introduces misleading extra entries.

router-tests/prometheus_event_metrics_test.go (1)

26-424: Reduce flakiness by polling metrics until expected values appear

Prometheus collection can lag behind the event emission. Right after triggering mutations/publishes, gathering once may intermittently return stale values. Consider polling the registry for the expected counter with a timeout to make this test deterministic.

I can factor a small helper like awaitPromCounter(registry, familyName, labelKey, labelSuffix, expected, timeout) used across the subtests. Want me to draft it?

router/pkg/metric/config.go (1)

1-164: Fix Prometheus EventMetrics env tag

The EventMetrics field in the Prometheus config is incorrectly using the OTLP env key. Update it to its own prefix so users can toggle Prometheus event metrics independently.

Locations to update:

  • router/pkg/config/config.go: line 103

Diff:

--- a/router/pkg/config/config.go
+++ b/router/pkg/config/config.go
@@ -100,7 +100,7 @@ type PrometheusConfig struct {
-   EventMetrics        bool        `yaml:"event_metrics" envDefault:"false" env:"METRICS_OTLP_EVENT_METRICS"`
+   EventMetrics        bool        `yaml:"event_metrics" envDefault:"false" env:"PROMETHEUS_EVENT_METRICS"`

Keep the OTLP EventMetrics at line 141 unchanged.

router/pkg/pubsub/datasource/mocks.go (1)

795-820: Mock does not account for new providerOpts parameter; expectations will silently mismatch

The mock ignores providerOpts when invoking _mock.Called(...), and the expecter helpers don’t include the second argument in On(...), Run(...), or RunAndReturn(...). Any tests that set expectations with both args or want to use providerOpts in return functions will break or be impossible to write.

Fix by threading providerOpts through the mock call, expecter, and function pointer types:

 func (_mock *MockProviderBuilder[P, E]) BuildProvider(options P, providerOpts ProviderOpts) (Provider, error) {
-	ret := _mock.Called(options)
+	ret := _mock.Called(options, providerOpts)

 	if len(ret) == 0 {
 		panic("no return value specified for BuildProvider")
 	}

 	var r0 Provider
 	var r1 error
-	if returnFunc, ok := ret.Get(0).(func(P) (Provider, error)); ok {
-		return returnFunc(options)
+	if returnFunc, ok := ret.Get(0).(func(P, ProviderOpts) (Provider, error)); ok {
+		return returnFunc(options, providerOpts)
 	}
-	if returnFunc, ok := ret.Get(0).(func(P) Provider); ok {
-		r0 = returnFunc(options)
+	if returnFunc, ok := ret.Get(0).(func(P, ProviderOpts) Provider); ok {
+		r0 = returnFunc(options, providerOpts)
 	} else {
 		if ret.Get(0) != nil {
 			r0 = ret.Get(0).(Provider)
 		}
 	}
-	if returnFunc, ok := ret.Get(1).(func(P) error); ok {
-		r1 = returnFunc(options)
+	if returnFunc, ok := ret.Get(1).(func(P, ProviderOpts) error); ok {
+		r1 = returnFunc(options, providerOpts)
 	} else {
 		r1 = ret.Error(1)
 	}
 	return r0, r1
 }
@@
-//   - options P
-func (_e *MockProviderBuilder_Expecter[P, E]) BuildProvider(options interface{}) *MockProviderBuilder_BuildProvider_Call[P, E] {
-	return &MockProviderBuilder_BuildProvider_Call[P, E]{Call: _e.mock.On("BuildProvider", options)}
+//   - options P
+//   - providerOpts ProviderOpts
+func (_e *MockProviderBuilder_Expecter[P, E]) BuildProvider(options interface{}, providerOpts interface{}) *MockProviderBuilder_BuildProvider_Call[P, E] {
+	return &MockProviderBuilder_BuildProvider_Call[P, E]{Call: _e.mock.On("BuildProvider", options, providerOpts)}
 }
 
-func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) Run(run func(options P)) *MockProviderBuilder_BuildProvider_Call[P, E] {
+func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) Run(run func(options P, providerOpts ProviderOpts)) *MockProviderBuilder_BuildProvider_Call[P, E] {
 	_c.Call.Run(func(args mock.Arguments) {
 		var arg0 P
 		if args[0] != nil {
 			arg0 = args[0].(P)
 		}
-		run(
-			arg0,
-		)
+		var arg1 ProviderOpts
+		if args[1] != nil {
+			arg1 = args[1].(ProviderOpts)
+		}
+		run(
+			arg0,
+			arg1,
+		)
 	})
 	return _c
 }
 
 func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) Return(provider Provider, err error) *MockProviderBuilder_BuildProvider_Call[P, E] {
 	_c.Call.Return(provider, err)
 	return _c
 }
 
-func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) RunAndReturn(run func(options P) (Provider, error)) *MockProviderBuilder_BuildProvider_Call[P, E] {
+func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) RunAndReturn(run func(options P, providerOpts ProviderOpts) (Provider, error)) *MockProviderBuilder_BuildProvider_Call[P, E] {
 	_c.Call.Return(run)
 	return _c
 }

Also applies to: 827-854

router/pkg/pubsub/pubsub_test.go (4)

3-16: Import testify/mock to enable argument matchers

You’ll need mock.Anything to update BuildProvider expectations (see below).

 import (
 	"context"
 	"errors"
 	rmetric "github.com/wundergraph/cosmo/router/pkg/metric"
 	"testing"

 	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/mock"
 	"github.com/stretchr/testify/require"
 	nodev1 "github.com/wundergraph/cosmo/router/gen/proto/wg/cosmo/node/v1"
 	"github.com/wundergraph/cosmo/router/pkg/config"
 	"github.com/wundergraph/cosmo/router/pkg/pubsub/datasource"
 	"github.com/wundergraph/graphql-go-tools/v2/pkg/engine/plan"
 	"go.uber.org/zap"
 )

64-66: Update mock to match new BuildProvider signature

BuildProvider now receives providerOpts; update the mock to include it.

-	mockBuilder.On("TypeID").Return("nats")
-	mockBuilder.On("BuildProvider", natsEventSources[0]).Return(mockPubSubProvider, nil)
+	mockBuilder.On("TypeID").Return("nats")
+	mockBuilder.On("BuildProvider", natsEventSources[0], mock.Anything).Return(mockPubSubProvider, nil)

239-241: Fix failing mock expectation for “not used” provider

Update to include providerOpts.

-mockBuilder.On("TypeID").Return("nats")
-mockBuilder.On("BuildProvider", natsEventSources[1]).Return(mockPubSubUsedProvider, nil)
+mockBuilder.On("TypeID").Return("nats")
+mockBuilder.On("BuildProvider", natsEventSources[1], mock.Anything).Return(mockPubSubUsedProvider, nil)

288-295: Default nil metric store to NewNoopEventMetricStore in BuildProvidersAndDataSources

BuildProvidersAndDataSources currently accepts a nil EventMetricStore and passes it through, violating the “never nil” invariant. Add a guard at the top of the function so that any nil store is replaced with metric.NewNoopEventMetricStore():

File: router/pkg/pubsub/pubsub.go
Location: start of func BuildProvidersAndDataSources (around line 55)

Suggested diff:

 func BuildProvidersAndDataSources(
     ctx context.Context,
     config config.EventsConfiguration,
    store metric.EventMetricStore,
     logger *zap.Logger,
     dsConfs []DataSourceConfigurationWithMetadata,
     hostName string,
     routerListenAddr string,
 ) ([]pubsub_datasource.Provider, []plan.DataSource, error) {
+   // Ensure metric store is never nil
+   if store == nil {
+       store = metric.NewNoopEventMetricStore()
+   }
     var pubSubProviders []pubsub_datasource.Provider
     var outs []plan.DataSource
     …

This change will uphold the null-object pattern and satisfy tests that pass nil for the metric store.

router-tests/testenv/testenv.go (1)

1632-1644: Fix port-collision flake in CI: avoid pre-selecting ports for httptest servers

The pipeline shows: “could not listen on port ... bind: address already in use” at Line 1637. The current pattern obtains a free port and then binds later, which races under parallel tests.

Use the OS to select an ephemeral port directly (":0") to eliminate TOCTOU races.

Apply this refactor to make the listener robust:

func makeSafeHttpTestServer(t testing.TB, handler http.Handler) *httptest.Server {
	s := httptest.NewUnstartedServer(handler)
	l, err := net.Listen("tcp", "127.0.0.1:0") // Let OS pick an available port
	if err != nil {
		t.Fatalf("could not listen on port: %s", err.Error())
	}
	_ = s.Listener.Close()
	s.Listener = l
	s.Start()
	return s
}

Consider applying the same pattern to other helpers that pre-pick ports (e.g., makeHttpTestServerWithPort/SetupCDNServer) or add a fallback to ":0" on bind failure.
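A minimal sketch of the suggested fallback for helpers that must keep accepting a preferred port (the name `listenWithFallback` is illustrative, not an existing helper):

```go
package main

import "net"

// listenWithFallback tries the preferred address first and, if the bind fails
// (e.g. a pre-picked port was grabbed by a parallel test in the meantime),
// falls back to an OS-assigned ephemeral port.
func listenWithFallback(preferred string) (net.Listener, error) {
	if l, err := net.Listen("tcp", preferred); err == nil {
		return l, nil
	}
	// Preferred port is busy or invalid: let the OS choose a free one.
	return net.Listen("tcp", "127.0.0.1:0")
}
```

This keeps the existing call sites working while eliminating the hard failure under port contention.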

router/pkg/pubsub/nats/provider_builder.go (1)

77-119: Fix potential nil deref and make logger usage safe.

  • Possible nil deref: eventSource.Authentication.UserInfo may be nil before accessing Username/Password.
  • Logger may be nil; using it in handlers will panic. Default to zap.NewNop() to be safe.

Apply this diff:

 func buildNatsOptions(eventSource config.NatsEventSource, logger *zap.Logger) ([]nats.Option, error) {
+  if logger == nil {
+    logger = zap.NewNop()
+  }
   opts := []nats.Option{
     nats.Name(fmt.Sprintf("cosmo.router.edfs.nats.%s", eventSource.ID)),
     nats.ReconnectJitter(500*time.Millisecond, 2*time.Second),
     nats.ClosedHandler(func(conn *nats.Conn) {
       logger.Info("NATS connection closed", zap.String("provider_id", eventSource.ID), zap.Error(conn.LastError()))
     }),
     nats.ConnectHandler(func(nc *nats.Conn) {
       logger.Info("NATS connection established", zap.String("provider_id", eventSource.ID), zap.String("url", nc.ConnectedUrlRedacted()))
     }),
     nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
       if err != nil {
         logger.Error("NATS disconnected; will attempt to reconnect", zap.Error(err), zap.String("provider_id", eventSource.ID))
       } else {
         logger.Info("NATS disconnected", zap.String("provider_id", eventSource.ID))
       }
     }),
     nats.ErrorHandler(func(conn *nats.Conn, subscription *nats.Subscription, err error) {
       if errors.Is(err, nats.ErrSlowConsumer) {
         logger.Warn(
           "NATS slow consumer detected. Events are being dropped. Please consider increasing the buffer size or reducing the number of messages being sent.",
           zap.Error(err),
           zap.String("provider_id", eventSource.ID),
         )
       } else {
         logger.Error("NATS error", zap.Error(err))
       }
     }),
     nats.ReconnectHandler(func(conn *nats.Conn) {
       logger.Info("NATS reconnected", zap.String("provider_id", eventSource.ID), zap.String("url", conn.ConnectedUrlRedacted()))
     }),
   }

   if eventSource.Authentication != nil {
     if eventSource.Authentication.Token != nil {
       opts = append(opts, nats.Token(*eventSource.Authentication.Token))
-    } else if eventSource.Authentication.UserInfo.Username != nil && eventSource.Authentication.UserInfo.Password != nil {
+    } else if eventSource.Authentication.UserInfo != nil &&
+      eventSource.Authentication.UserInfo.Username != nil &&
+      eventSource.Authentication.UserInfo.Password != nil {
       opts = append(opts, nats.UserInfo(*eventSource.Authentication.UserInfo.Username, *eventSource.Authentication.UserInfo.Password))
     }
   }
router/pkg/pubsub/kafka/adapter.go (1)

48-99: Close per-subscription Kafka clients to avoid leaks.

topicPoller returns on cancellations and errors but never closes the client created for subscriptions. Add a defer to ensure the client is closed when the poller exits.

 func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, providerId string) error {
+  // Ensure client is closed when the poller exits
+  defer client.Close()
   for {
     select {
router/pkg/pubsub/nats/adapter.go (2)

115-147: JetStream fetch loop can hot-spin and abort on “no messages”; handle non-fatal cases with backoff

FetchNoWait in a tight default loop will spin CPU and spam the server when no messages are available. Also, transient “no messages/timeout” conditions shouldn’t terminate the subscription.

Add non-fatal handling with a short sleep/backoff and only abort on real errors.

Apply this diff:

       default:
-          msgBatch, consumerFetchErr := consumer.FetchNoWait(300)
-          if consumerFetchErr != nil {
-            log.Error("error fetching messages", zap.Error(consumerFetchErr))
-            return
-          }
+          msgBatch, consumerFetchErr := consumer.FetchNoWait(300)
+          if consumerFetchErr != nil {
+            // Treat "no messages" / timeouts as non-fatal to avoid tearing down the loop.
+            // Prevent a hot loop with a small backoff.
+            if errors.Is(consumerFetchErr, jetstream.ErrNoMessages) || errors.Is(consumerFetchErr, nats.ErrTimeout) {
+              time.Sleep(50 * time.Millisecond)
+              continue
+            }
+            log.Error("error fetching messages", zap.Error(consumerFetchErr))
+            return
+          }
 
           for msg := range msgBatch.Messages() {
             log.Debug("subscription update", zap.String("message_subject", msg.Subject()), zap.ByteString("data", msg.Data()))
 
-            p.eventMetricStore.NatsMessageReceived(p.ctx, event.ProviderID, msg.Subject())
+            p.eventMetricStore.NatsMessageReceived(ctx, event.ProviderID, msg.Subject())
             updater.Update(msg.Data())

156-165: Unbuffered ChanSubscribe risks slow-consumer backpressure; use a buffered channel

Using an unbuffered channel for ChanSubscribe can stall delivery and trigger slow-consumer issues. Allocate a reasonable buffer.

Apply this diff:

- msgChan := make(chan *nats.Msg)
+ // Use a buffered channel to avoid slow-consumer backpressure.
+ msgChan := make(chan *nats.Msg, 1024)
   subscriptions := make([]*nats.Subscription, len(event.Subjects))
♻️ Duplicate comments (9)
router/pkg/config/config.schema.json (1)

1171-1175: LGTM: Prometheus event_metrics schema flag, but verify env var wiring in code (duplicate from prior review)

Schema is correct. As noted previously, ensure Prometheus.EventMetrics in router/pkg/config/config.go does not reuse the OTLP env tag. It should be independently toggleable.

Run this to confirm struct tags:

#!/bin/bash
set -euo pipefail

echo "EventMetrics struct tags (expect distinct env keys for Prometheus and OTLP):"
rg -n 'EventMetrics\s+bool\s+`[^`]*`' router/pkg/config/config.go
router/pkg/pubsub/redis/adapter.go (2)

114-115: Recording RedisMessageReceived on subscribe — LGTM

Instrumentation is in the hot path and uses the provided context. Per learnings, EventMetricStore is guaranteed non-nil via noop store when disabled.


153-158: Publish success/failure instrumentation — LGTM

Failure and success paths are instrumented correctly with provider/channel attributes.

router/pkg/pubsub/kafka/adapter.go (1)

231-244: Previous concern about nil EventMetricStore is addressed by the null-object pattern.

Earlier feedback suggested defaulting EventMetricStore when nil. With the repository-wide guarantee that EventMetricStore is never nil (Noop store when disabled), this constructor is fine as-is.

router-tests/events/event_helpers.go (2)

41-44: Bug: prefixedTopics incorrectly constructed (leading empty entries).

Initialize with length 0 and capacity len(topics), or assign by index. This avoids empty topic names during delete/create.

-  prefixedTopics := make([]string, len(topics))
-  for _, topic := range topics {
-    prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
-  }
+  prefixedTopics := make([]string, 0, len(topics))
+  for _, topic := range topics {
+    prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
+  }

56-77: Close Redis client to avoid leaking connections in tests.

Ensure the client/cluster client is closed after Publish.

   if !xEnv.RedisWithClusterMode {
     redisConn = redis.NewClient(&redis.Options{
       Addr: parsedURL.Host,
     })
   } else {
     redisConn = redis.NewClusterClient(&redis.ClusterOptions{
       Addrs: []string{parsedURL.Host},
     })
   }
+  defer func() { _ = redisConn.Close() }()

   intCmd := redisConn.Publish(ctx, xEnv.GetPubSubName(topicName), message)
   require.NoError(t, intCmd.Err())
router/pkg/pubsub/pubsub.go (1)

55-55: Default to NoopEventMetricStore when store is nil to harden provider construction

This prevents nil derefs in adapters if upstream wiring ever passes a nil store. It follows the null-object pattern used across the codebase.

Apply this diff:

 func BuildProvidersAndDataSources(ctx context.Context, config config.EventsConfiguration, store metric.EventMetricStore, logger *zap.Logger, dsConfs []DataSourceConfigurationWithMetadata, hostName string, routerListenAddr string) ([]pubsub_datasource.Provider, []plan.DataSource, error) {
+  if store == nil {
+    store = metric.NewNoopEventMetricStore()
+  }
   var pubSubProviders []pubsub_datasource.Provider
   var outs []plan.DataSource
router/pkg/metric/event_metric_store.go (2)

8-13: Wrong import alias causes confusion/compile issues; use sdkmetric and update signature

Alias sdk/metric as sdkmetric and update NewEventMetricStore signature to use *sdkmetric.MeterProvider.

Apply this diff:

 import (
   "context"
   "errors"
   "fmt"
 
   "go.opentelemetry.io/otel/attribute"
   otelmetric "go.opentelemetry.io/otel/metric"
-  "go.opentelemetry.io/otel/sdk/metric"
+  sdkmetric "go.opentelemetry.io/otel/sdk/metric"
   "go.uber.org/zap"
 
   otelattrs "github.com/wundergraph/cosmo/router/pkg/otel"
 )
@@
-func NewEventMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *metric.MeterProvider, metricsConfig *Config) (*EventMetrics, error) {
+func NewEventMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *sdkmetric.MeterProvider, metricsConfig *Config) (*EventMetrics, error) {

Also applies to: line 60


182-192: Add provider.type label to NATS request metrics for label-set parity

Other broker ops include wg.event.provider.type; request/request-failure omit it, causing inconsistent labels and downstream aggregation issues.

Apply this diff:

 func (e *EventMetrics) NatsRequest(ctx context.Context, providerID string, subject string) {
-  opts := e.withAttrs(otelattrs.WgEventProviderID.String(providerID), otelattrs.WgNatsSubject.String(subject))
+  opts := e.withAttrs(
+    otelattrs.WgEventProviderType.String("nats"),
+    otelattrs.WgEventProviderID.String(providerID),
+    otelattrs.WgNatsSubject.String(subject),
+  )
   e.otlpMetrics.NatsRequest(ctx, opts)
   e.promMetrics.NatsRequest(ctx, opts)
 }
 
 func (e *EventMetrics) NatsRequestFailure(ctx context.Context, providerID string, subject string) {
-  opts := e.withAttrs(otelattrs.WgEventProviderID.String(providerID), otelattrs.WgNatsSubject.String(subject))
+  opts := e.withAttrs(
+    otelattrs.WgEventProviderType.String("nats"),
+    otelattrs.WgEventProviderID.String(providerID),
+    otelattrs.WgNatsSubject.String(subject),
+  )
   e.otlpMetrics.NatsRequestFailure(ctx, opts)
   e.promMetrics.NatsRequestFailure(ctx, opts)
 }
🧹 Nitpick comments (30)
router/pkg/metric/prom_connection_metric_store.go (1)

70-73: Avoid multiple WithAttributes calls; combine attributes once to prevent subtle option handling differences

Passing multiple WithAttributes options relies on API-specific merging behavior. Combine base and dynamic attributes explicitly to make behavior deterministic and future-proof.

-			o.ObserveInt64(h.instruments.connectionsActive, activeConnections,
-				otelmetric.WithAttributes(attributes...),
-				otelmetric.WithAttributes(attrs...),
-			)
+			combined := make([]attribute.KeyValue, 0, len(attributes)+len(attrs))
+			combined = append(combined, attributes...)
+			combined = append(combined, attrs...)
+			o.ObserveInt64(
+				h.instruments.connectionsActive,
+				activeConnections,
+				otelmetric.WithAttributes(combined...),
+			)
router/pkg/metric/oltp_connection_metric_store.go (1)

71-74: Merge base and dynamic attributes before ObserveInt64

Relying on multiple WithAttributes calls is avoidable and can be brittle. Construct a single attribute slice to pass once.

-			o.ObserveInt64(h.instruments.connectionsActive, activeConnections,
-				otelmetric.WithAttributes(attributes...),
-				otelmetric.WithAttributes(attrs...),
-			)
+			combined := make([]attribute.KeyValue, 0, len(attributes)+len(attrs))
+			combined = append(combined, attributes...)
+			combined = append(combined, attrs...)
+			o.ObserveInt64(
+				h.instruments.connectionsActive,
+				activeConnections,
+				otelmetric.WithAttributes(combined...),
+			)
router-tests/events/nats_events_test.go (2)

469-471: Nit: fix test name grammar for clarity

Small wording improvement.

-    t.Run("subscribe after message don't a boundary", func(t *testing.T) {
+    t.Run("subscribe after message doesn't add a boundary", func(t *testing.T) {

142-146: Make log-count assertions less brittle

Exact counts on log snapshots tend to be flaky as logging evolves. Prefer lower bounds.

-            require.Len(t, natsLogs, 2)
+            require.GreaterOrEqual(t, len(natsLogs), 2)
-            require.Len(t, providerIDFields, 3)
+            require.GreaterOrEqual(t, len(providerIDFields), 3)

Please confirm whether tightening/loosening these assertions affects intended coverage.

router/pkg/metric/event_measurements.go (2)

21-39: Add explicit units to counters

Counters should declare units to aid downstream backends. For pure counts, use unit "1".

-    eventsPublishMessagesOptions = []otelmetric.Int64CounterOption{
-        otelmetric.WithDescription("Number of event messages published"),
-    }
+    eventsPublishMessagesOptions = []otelmetric.Int64CounterOption{
+        otelmetric.WithDescription("Number of event messages published"),
+        otelmetric.WithUnit("1"),
+    }
-    eventsPublishFailuresOptions = []otelmetric.Int64CounterOption{
-        otelmetric.WithDescription("Number of event publish failures"),
-    }
+    eventsPublishFailuresOptions = []otelmetric.Int64CounterOption{
+        otelmetric.WithDescription("Number of event publish failures"),
+        otelmetric.WithUnit("1"),
+    }
-    eventsMessagesReceivedOptions = []otelmetric.Int64CounterOption{
-        otelmetric.WithDescription("Number of event messages received"),
-    }
+    eventsMessagesReceivedOptions = []otelmetric.Int64CounterOption{
+        otelmetric.WithDescription("Number of event messages received"),
+        otelmetric.WithUnit("1"),
+    }
-    natsRequestsOptions = []otelmetric.Int64CounterOption{
-        otelmetric.WithDescription("Number of NATS requests"),
-    }
+    natsRequestsOptions = []otelmetric.Int64CounterOption{
+        otelmetric.WithDescription("Number of NATS requests"),
+        otelmetric.WithUnit("1"),
+    }
-    natsRequestFailuresOptions = []otelmetric.Int64CounterOption{
-        otelmetric.WithDescription("Number of NATS request failures"),
-    }
+    natsRequestFailuresOptions = []otelmetric.Int64CounterOption{
+        otelmetric.WithDescription("Number of NATS request failures"),
+        otelmetric.WithUnit("1"),
+    }

9-19: Consider naming consistency for NATS metrics

Minor: for uniformity with event metrics naming, consider a "router.events.nats.request" prefix. This is optional and only if it doesn't break existing dashboards.

Confirm whether these names are already referenced downstream (dashboards/alerts). If so, skip renaming to avoid breakage.

router-tests/prometheus_event_metrics_test.go (7)

51-68: Guard against nil metric family before dereferencing

If findMetricFamilyByName returns nil, family.GetMetric() will panic. Add a nil check to make failures explicit and less flaky.

Apply this diff:

             mf, err := promRegistry.Gather()
             require.NoError(t, err)

             family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+            require.NotNil(t, family, "router_events_publish_messages_total not found in Prometheus registry")
             metrics := family.GetMetric()
             require.Len(t, metrics, 1)

117-134: Kafka subscribe: add nil check for metric family

Same issue here when accessing family.GetMetric().

Apply this diff:

                     mf, err := promRegistry.Gather()
                     require.NoError(t, err)

                     family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+                    require.NotNil(t, family, "router_events_messages_received_total not found in Prometheus registry")
                     metrics := family.GetMetric()
                     require.Len(t, metrics, 1)

166-183: NATS publish: add nil check for metric family

Prevent a potential panic if the family is not present yet.

Apply this diff:

                 mf, err := promRegistry.Gather()
                 require.NoError(t, err)

                 family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+                require.NotNil(t, family, "router_events_publish_messages_total not found in Prometheus registry")
                 metrics := family.GetMetric()
                 require.Len(t, metrics, 1)

208-222: NATS request: add nil check for metric family

Same pattern as above.

Apply this diff:

                 mf, err := promRegistry.Gather()
                 require.NoError(t, err)

                 family := findMetricFamilyByName(mf, "router_nats_request_total")
+                require.NotNil(t, family, "router_nats_request_total not found in Prometheus registry")
                 metrics := family.GetMetric()
                 require.Len(t, metrics, 1)

285-301: NATS subscribe: assert non-empty metrics before indexing

This block indexes metrics[0] without asserting there is at least one metric. Add a check to avoid panics on slow collectors.

Apply this diff:

                     family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+                    require.NotNil(t, family, "router_events_messages_received_total not found in Prometheus registry")
                     metrics := family.GetMetric()
+                    require.NotEmpty(t, metrics, "no metrics found for router_events_messages_received_total")

335-352: Redis publish: add nil check for metric family

Prevent a potential panic when the family is missing.

Apply this diff:

                 mf, err := promRegistry.Gather()
                 require.NoError(t, err)

                 family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+                require.NotNil(t, family, "router_events_publish_messages_total not found in Prometheus registry")
                 metrics := family.GetMetric()
                 require.Len(t, metrics, 1)

400-416: Redis subscribe: add nil check for metric family

Same defensive pattern.

Apply this diff:

                     mf, err := promRegistry.Gather()
                     require.NoError(t, err)

                     family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+                    require.NotNil(t, family, "router_events_messages_received_total not found in Prometheus registry")
                     metrics := family.GetMetric()
                     require.Len(t, metrics, 1)
router/pkg/pubsub/datasource/provider.go (1)

36-38: Document ProviderOpts and null-object expectation

Add a brief doc comment stating that EventMetricStore is always non-nil (noop when disabled). This encodes the invariant and removes the need for nil checks downstream.

Apply this diff:

-type ProviderOpts struct {
-	EventMetricStore metric.EventMetricStore
-}
+// ProviderOpts carries cross-cutting configuration for providers/adapters.
+// EventMetricStore is never nil; when metrics are disabled, a no-op implementation is provided.
+type ProviderOpts struct {
+	EventMetricStore metric.EventMetricStore
+}
demo/pkg/subgraphs/subgraphs.go (2)

215-217: DRY: reuse a single ProviderOpts instance

You duplicate ProviderOpts construction for both adapters. Minor cleanup: create once and reuse.

Apply this diff:

-    defaultAdapter, err := natsPubsub.NewAdapter(ctx, zap.NewNop(), url, []nats.Option{}, "hostname", "test", datasource.ProviderOpts{
-        EventMetricStore: rmetric.NewNoopEventMetricStore(),
-    })
+    po := datasource.ProviderOpts{EventMetricStore: rmetric.NewNoopEventMetricStore()}
+    defaultAdapter, err := natsPubsub.NewAdapter(ctx, zap.NewNop(), url, []nats.Option{}, "hostname", "test", po)

223-225: DRY: reuse ProviderOpts for my-nats adapter as well

Follow-up to the previous comment.

Apply this diff:

-    myNatsAdapter, err := natsPubsub.NewAdapter(ctx, zap.NewNop(), url, []nats.Option{}, "hostname", "test", datasource.ProviderOpts{
-        EventMetricStore: rmetric.NewNoopEventMetricStore(),
-    })
+    myNatsAdapter, err := natsPubsub.NewAdapter(ctx, zap.NewNop(), url, []nats.Option{}, "hostname", "test", po)
router/pkg/metric/config.go (2)

36-39: Clarify semantics of EventMetrics (Prometheus)

Add a doc comment so readers know which metrics are controlled and where they surface.

Apply this diff:

 type PrometheusConfig struct {
@@
 	// Prometheus schema field usage configuration
 	PromSchemaFieldUsage PrometheusSchemaFieldUsage
-	EventMetrics         bool
+	// EventMetrics enables Prometheus event metrics (Kafka, Redis, NATS).
+	EventMetrics bool
 }

81-84: Clarify semantics of EventMetrics (OTLP)

Mirror the documentation for OTLP config as well.

Apply this diff:

 type OpenTelemetry struct {
@@
 	// TestReader is used for testing purposes. If set, the reader will be used instead of the configured exporters.
-	TestReader   sdkmetric.Reader
-	EventMetrics bool
+	TestReader sdkmetric.Reader
+	// EventMetrics enables OTLP event metrics (Kafka, Redis, NATS).
+	EventMetrics bool
 }
router/core/graph_server.go (1)

759-763: Remove unnecessary nil-check on EventMetricStore in Shutdown

Per our repo conventions (null-object pattern), EventMetricStore is always non-nil (noop when disabled), so this nil check is redundant and slightly obscures the invariant.

Using the learning from prior work in this codebase: EventMetricStore fields are never nil—NewNoopEventMetricStore() is used when metrics are disabled.

Apply this small cleanup:

-	if s.eventMetricStore != nil {
-		if aErr := s.eventMetricStore.Shutdown(ctx); aErr != nil {
-			err = errors.Join(err, aErr)
-		}
-	}
+	if aErr := s.eventMetricStore.Shutdown(ctx); aErr != nil {
+		err = errors.Join(err, aErr)
+	}
router/pkg/pubsub/kafka/provider_builder_test.go (2)

37-39: Avoid brittle assertion on options count in TLS test

Hard-coding the exact length makes this test fragile as builder options can evolve. Prefer asserting “more than baseline” like your SASL tests.

-		// Can't directly check for TLS options, but we can verify more options are present
-		require.Equal(t, len(opts), 5)
+		// Can't directly check for TLS options; verify more options are present
+		require.Greater(t, len(opts), 1)

97-107: Assert provider identity/type to strengthen the test

You already assert adapter/logger presence. Adding ID and TypeID checks strengthens the contract of the builder.

 	kafkaProvider, ok := provider.(*datasource.PubSubProvider)
 	require.True(t, ok)
 	assert.NotNil(t, kafkaProvider.Logger)
 	assert.NotNil(t, kafkaProvider.Adapter)
+	assert.Equal(t, providerId, provider.ID())
+	assert.Equal(t, "kafka", provider.TypeID())
router-tests/testenv/testenv.go (1)

2825-2827: Use provider-specific NATS params when constructing adapters

You’re iterating over DemoNatsProviders but always using natsData.Params[0]. If future providers use different URLs/options, both adapters will point to the same connection parameters.

A minimal refactor is to index Params in step with the providers (or map by provider ID) to avoid accidental cross-wiring. For example:

// Outside this range, adjust the loop to capture index:
for i, sourceName := range DemoNatsProviders {
    params := natsData.Params[i]
    adapter, err := pubsubNats.NewAdapter(ctx, logger, params.Url, params.Opts, "hostname", "listenaddr", datasource.ProviderOpts{
        EventMetricStore: rmetric.NewNoopEventMetricStore(), // per learnings: never nil
    })
    // ...
}

If Params ordering differs from provider IDs, prefer a lookup keyed by provider ID.

router-tests/telemetry/event_metrics_test.go (1)

60-74: Make metric DP assertions resilient: filter by attributes instead of expecting a single datapoint

Asserting require.Len(sum.DataPoints, 1) is brittle if additional datapoints with different attributes appear (e.g., multiple subjects/topics in future changes). Instead, select the datapoint whose attributes match the expected provider and backend topic/subject/channel.

Example utility to locate the right datapoint:

func findSumDP(d metricdata.Sum[int64], match func(attribute.Set) bool) *metricdata.DataPoint[int64] {
	for i := range d.DataPoints {
		if match(d.DataPoints[i].Attributes) {
			return &d.DataPoints[i]
		}
	}
	return nil
}

// Usage:
sum, _ := metricEntry.Data.(metricdata.Sum[int64])
dp := findSumDP(sum, func(attrs attribute.Set) bool {
  id, _ := attrs.Value(otelattrs.WgEventProviderID)
  tp, _ := attrs.Value(otelattrs.WgEventProviderType)
  topic, _ := attrs.Value(otelattrs.WgKafkaTopic) // or WgNatsSubject/WgRedisChannel
  return id.AsString() == "my-kafka" && tp.AsString() == "kafka" && strings.HasSuffix(topic.AsString(), "employeeUpdated")
})
require.NotNil(t, dp)
require.Equal(t, int64(2), dp.Value)

Apply the same pattern across similar assertions in this file to reduce flakiness as coverage grows.

router/pkg/pubsub/nats/provider_builder_test.go (2)

45-46: Avoid brittle assertions on the count of NATS options

The “> 7” check is tied to internal implementation details and may break with harmless changes. The NotEmpty assertion is sufficient for this unit test’s purpose.

Apply this diff:

-        // Can't directly check for token options, but we can verify options are present
-        require.Greater(t, len(opts), 7) // Basic options (7) + token option
+        // Options are present; detailed introspection is not stable across versions.
+        require.NotEmpty(t, opts)

67-68: Same here: remove “> 7” assertion for user/password test

Keep tests robust against harmless changes in option composition.

Apply this diff:

-        // Can't directly check for auth options, but we can verify options are present
-        require.Greater(t, len(opts), 7) // Basic options (7) + user info option
+        // Options are present; detailed introspection is not stable across versions.
+        require.NotEmpty(t, opts)
router/pkg/pubsub/redis/adapter.go (1)

28-37: Annotate logger with a “pubsub=redis” field for consistent logs

Kafka’s adapter tags the logger; doing the same here improves log filtering. Since the null-object pattern guarantees a non-nil metric store, no nil checks are needed.

Apply this diff:

 func NewProviderAdapter(ctx context.Context, logger *zap.Logger, urls []string, clusterEnabled bool, opts datasource.ProviderOpts) Adapter {
 	ctx, cancel := context.WithCancel(ctx)
-	return &ProviderAdapter{
-		ctx:              ctx,
-		cancel:           cancel,
-		logger:           logger,
+	if logger == nil {
+		logger = zap.NewNop()
+	}
+	return &ProviderAdapter{
+		ctx:              ctx,
+		cancel:           cancel,
+		logger:           logger.With(zap.String("pubsub", "redis")),
 		urls:             urls,
 		clusterEnabled:   clusterEnabled,
 		eventMetricStore: opts.EventMetricStore,
 	}
 }
router/pkg/pubsub/nats/adapter.go (3)

136-141: Use the operation context for metric emission instead of adapter context

Consistency: elsewhere (publish/request) you pass the method’s ctx to the store. Use the subscription ctx here as well.

Apply this diff:

-            p.eventMetricStore.NatsMessageReceived(p.ctx, event.ProviderID, msg.Subject())
+            p.eventMetricStore.NatsMessageReceived(ctx, event.ProviderID, msg.Subject())

And:

-        p.eventMetricStore.NatsMessageReceived(p.ctx, event.ProviderID, msg.Subject)
+        p.eventMetricStore.NatsMessageReceived(ctx, event.ProviderID, msg.Subject)

Also applies to: 172-177


205-228: Optional: ensure publish is flushed within ctx to detect broker errors earlier

NATS Publish is fire-and-forget. If you need delivery guarantees for metrics/tests, flush after publish using the provided ctx. Consider this behind a feature flag to avoid latency impact.

Apply this diff:

   err := p.client.Publish(event.Subject, event.Data)
   if err != nil {
     log.Error("publish error", zap.Error(err))
     p.eventMetricStore.NatsPublishFailure(ctx, event.ProviderID, event.Subject)
     return datasource.NewError(fmt.Sprintf("error publishing to NATS subject %s", event.Subject), err)
   } else {
     p.eventMetricStore.NatsPublish(ctx, event.ProviderID, event.Subject)
+    // Optionally flush to ensure the server processed the publish.
+    if fErr := p.flush(ctx); fErr != nil {
+      log.Warn("flush after publish failed", zap.Error(fErr))
+    }
   }

318-341: Defensive default for EventMetricStore to prevent nil deref across adapters

While the codebase aims to always pass a non-nil store (null-object pattern), a guard here prevents panics if callers regress.

Apply this diff:

 func NewAdapter(
   ctx context.Context,
   logger *zap.Logger,
   url string,
   opts []nats.Option,
   hostName string,
   routerListenAddr string,
   providerOpts datasource.ProviderOpts,
 ) (Adapter, error) {
   if logger == nil {
     logger = zap.NewNop()
   }
 
+  ems := providerOpts.EventMetricStore
+  if ems == nil {
+    ems = metric.NewNoopEventMetricStore()
+  }
+
   return &ProviderAdapter{
     ctx:              ctx,
     logger:           logger.With(zap.String("pubsub", "nats")),
     closeWg:          sync.WaitGroup{},
     hostName:         hostName,
     routerListenAddr: routerListenAddr,
     url:              url,
     opts:             opts,
     flushTimeout:     10 * time.Second,
-    eventMetricStore: providerOpts.EventMetricStore,
+    eventMetricStore: ems,
   }, nil
 }

Note: This aligns with the repository-wide null-object pattern for metrics (per learnings).

router/pkg/metric/noop_event_metrics.go (1)

1-51: Add compile-time interface assertions for safety

Assert implementations at compile time to catch signature drift early.

Apply this diff:

 package metric
 
 import (
   "context"
 
   otelmetric "go.opentelemetry.io/otel/metric"
 )
 
 // A noop metric provider so we do not need to do nil checks for each provider call from the store
 type noopEventMetricProvider struct{}
 
 func (n *noopEventMetricProvider) Publish(ctx context.Context, opts ...otelmetric.AddOption)        {}
 func (n *noopEventMetricProvider) PublishFailure(ctx context.Context, opts ...otelmetric.AddOption) {}
 func (n *noopEventMetricProvider) MessagesReceived(ctx context.Context, opts ...otelmetric.AddOption) {
 }
 
 func (n *noopEventMetricProvider) NatsRequest(ctx context.Context, opts ...otelmetric.AddOption) {}
 func (n *noopEventMetricProvider) NatsRequestFailure(ctx context.Context, opts ...otelmetric.AddOption) {
 }
 func (n *noopEventMetricProvider) Flush(ctx context.Context) error { return nil }
 func (n *noopEventMetricProvider) Shutdown() error                 { return nil }
 
 type NoopEventMetricStore struct{}
 
 func (n *NoopEventMetricStore) KafkaPublish(ctx context.Context, providerID string, topic string) {}
 func (n *NoopEventMetricStore) KafkaPublishFailure(ctx context.Context, providerID string, topic string) {
 }
 func (n *NoopEventMetricStore) KafkaMessageReceived(ctx context.Context, providerID string, topic string) {
 }
 
 func (n *NoopEventMetricStore) RedisPublish(ctx context.Context, providerID string, channel string) {}
 func (n *NoopEventMetricStore) RedisPublishFailure(ctx context.Context, providerID string, channel string) {
 }
 func (n *NoopEventMetricStore) RedisMessageReceived(ctx context.Context, providerID string, channel string) {
 }
 
 func (n *NoopEventMetricStore) NatsPublish(ctx context.Context, providerID string, subject string) {}
 func (n *NoopEventMetricStore) NatsPublishFailure(ctx context.Context, providerID string, subject string) {
 }
 func (n *NoopEventMetricStore) NatsMessageReceived(ctx context.Context, providerID string, subject string) {
 }
 
 func (n *NoopEventMetricStore) NatsRequest(ctx context.Context, providerID string, subject string) {}
 func (n *NoopEventMetricStore) NatsRequestFailure(ctx context.Context, providerID string, subject string) {
 }
 
 func (n *NoopEventMetricStore) Flush(ctx context.Context) error    { return nil }
 func (n *NoopEventMetricStore) Shutdown(ctx context.Context) error { return nil }
 
 func NewNoopEventMetricStore() *NoopEventMetricStore { return &NoopEventMetricStore{} }
+
+// Compile-time assertions
+var _ EventMetricStore = (*NoopEventMetricStore)(nil)
+var _ EventMetricProvider = (*noopEventMetricProvider)(nil)

@wundergraph wundergraph deleted a comment from coderabbitai bot Aug 14, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
router-tests/prometheus_event_metrics_test.go (4)

26-26: Rename the test for clarity (it’s not intended to be flaky).

Consider renaming to better reflect intent and avoid confusion during triage.

Apply:

-func TestFlakyEventMetrics(t *testing.T) {
+func TestPrometheusEventMetrics(t *testing.T) {

51-69: Stabilize metric assertions with eventual/polling checks

Prometheus collection via OTel can be slightly eventual; asserting immediately after producing/consuming can be flaky under load. Consider polling Gather until the family appears and the expected count matches or a timeout is reached.

If you start seeing CI flakes, add a tiny helper and replace direct Gather calls:

// Add once in this file:
func awaitCounterValue(t *testing.T, reg *prometheus.Registry, metricName string, expected float64, timeout time.Duration) []*io_prometheus_client.Metric {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for {
		mf, err := reg.Gather()
		require.NoError(t, err)
		family := findMetricFamilyByName(mf, metricName)
		if family != nil {
			metrics := family.GetMetric()
			if len(metrics) > 0 && metrics[0].GetCounter() != nil && metrics[0].GetCounter().GetValue() == expected {
				return metrics
			}
		}
		if time.Now().After(deadline) {
			require.Failf(t, "metric not observed", "metric %s did not reach value %v before timeout", metricName, expected)
		}
		time.Sleep(50 * time.Millisecond)
	}
}

Then, e.g.:

- mf, err := promRegistry.Gather()
- require.NoError(t, err)
- family := findMetricFamilyByName(mf, "messaging_client_sent_messages_total")
- require.NotNil(t, family)
- metrics := family.GetMetric()
+ metrics := awaitCounterValue(t, promRegistry, "messaging_client_sent_messages_total", 2, WaitTimeout)

Note: you’ll need the import io_prometheus_client "github.com/prometheus/client_model/go".

Also applies to: 118-136, 168-185, 210-228, 284-301, 335-353, 401-418


27-27: Nested t.Parallel() may over-stress shared brokers in CI

Running all backends and their subtests in parallel can increase flakiness on constrained CI hosts (Kafka/NATS/Redis are shared Docker services). Consider reducing parallelism at one level or gating with a single top-level Parallel.

Would you like me to propose a table-driven structure with selective parallelism to keep broker load bounded?

Also applies to: 29-31, 72-74, 147-149, 188-190, 231-233, 317-319, 356-358


241-242: 1s WebSocketClientReadTimeout might be too aggressive

On slow CI, a 1s read timeout can prematurely trigger disconnects. Consider 2–5s unless there’s a strong reason to keep it at 1s.

- ModifyEngineExecutionConfiguration: func(ec *config.EngineExecutionConfiguration) { ec.WebSocketClientReadTimeout = time.Second },
+ ModifyEngineExecutionConfiguration: func(ec *config.EngineExecutionConfiguration) { ec.WebSocketClientReadTimeout = 3 * time.Second },
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e72b9ae and 2ac4afa.

📒 Files selected for processing (3)
  • router-tests/prometheus_event_metrics_test.go (1 hunks)
  • router-tests/telemetry/event_metrics_test.go (1 hunks)
  • router/pkg/pubsub/nats/adapter.go (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/pkg/pubsub/nats/adapter.go
  • router-tests/telemetry/event_metrics_test.go
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
🧬 Code Graph Analysis (1)
router-tests/prometheus_event_metrics_test.go (5)
router-tests/testenv/testenv.go (8)
  • Run (105-122)
  • Config (284-340)
  • ConfigWithEdfsKafkaJSONTemplate (85-85)
  • MetricOptions (263-277)
  • Environment (1727-1763)
  • GraphQLRequest (1895-1903)
  • ConfigWithEdfsNatsJSONTemplate (87-87)
  • ConfigWithEdfsRedisJSONTemplate (89-89)
demo/pkg/subgraphs/subgraphs.go (1)
  • Config (79-84)
router-tests/events/event_helpers.go (2)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router/pkg/config/config.go (1)
  • EngineExecutionConfiguration (370-398)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router-tests/prometheus_event_metrics_test.go (1)

29-69: Strong end-to-end coverage and label assertions across Kafka, NATS, and Redis

Good integration coverage validating both producer and consumer counters, per-backend operation names, and destination/system labels. Per-test Prometheus registry and manual reader scoping reduces cross-test interference. Subscription lifecycle is handled cleanly with WaitForSubscriptionCount/ConnectionCount and client.Close waits.

Also applies to: 147-185, 188-229, 231-311, 317-354, 356-425


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🔭 Outside diff range comments (1)
router/pkg/pubsub/kafka/adapter.go (1)

145-161: Close the per-subscription consumer client when the poller exits

Each Subscribe creates a client; it’s never closed. Ensure client.Close() is called when the goroutine finishes to avoid leaking resources.

Apply:

 	go func() {
 
 		defer p.closeWg.Done()
+		defer client.Close()
 
 		err := p.topicPoller(ctx, client, updater, PollerOpts{providerId: event.ProviderID})
🧹 Nitpick comments (17)
router/pkg/otel/attributes.go (1)

63-70: Align WG Messaging Attributes with OpenTelemetry SemConv

Per the OpenTelemetry semantic conventions, standard messaging attributes use keys like messaging.system, messaging.operation.name, messaging.operation.type, messaging.destination.name, etc., and errors are represented via error.type, error.message, plus exception event attributes (exception.type, exception.message, exception.stacktrace) on spans.

In router/pkg/otel/attributes.go (lines 63–70), we currently define only vendor-prefixed keys:

const (
  MessagingOperationName   = attribute.Key("wg.messaging.operation.name")
  MessagingSystem          = attribute.Key("wg.messaging.system")
  MessagingError           = attribute.Key("wg.messaging.error")
  MessagingDestinationName = attribute.Key("wg.messaging.destination.name")
  ProviderId               = attribute.Key("wg.provider.id")
)

Recommendations:

  • Emit the standard semconv keys alongside the wg.* keys. For example:
    attribute.Key("messaging.system")
    attribute.Key("messaging.operation.name")
    attribute.Key("messaging.operation.type")
    attribute.Key("messaging.destination.name")
    • …and so on for other relevant attributes.
  • For errors, prefer setting error.type (low-cardinality) and error.message, and record exceptions via the exception semantic event attributes rather than a custom wg.messaging.error key.
  • Optionally, introduce a bridging/mapping layer so that downstream consumers querying only standard semconv keys can work without awareness of the wg.* prefixes.
router/pkg/metric/messaging_event_measurements.go (2)

14-21: Tighten counter descriptions for clarity

Descriptions are a bit awkward and include “event” redundantly. Make them clear and user-facing.

-	messagingSentMessagesOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of messaging event sent messages"),
-	}
+	messagingSentMessagesOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of messages sent by the client"),
+	}
-	messagingConsumedMessagesOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of messaging event consumed messages"),
-	}
+	messagingConsumedMessagesOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of messages consumed by the client"),
+	}

37-43: Keep error messages consistent with chosen terminology

If we standardize on “consumed” instead of “received,” update the error string.

-	if err != nil {
-		return nil, fmt.Errorf("failed to create received messages counter: %w", err)
-	}
+	if err != nil {
+		return nil, fmt.Errorf("failed to create consumed messages counter: %w", err)
+	}
router/pkg/metric/oltp_messaging_event_metric_store.go (2)

16-21: Remove unused logger field or put it to use

logger is stored but never used. Either remove it to reduce noise, or put it to use with debug logs when creating instruments or flushing.

 type otlpMessagingEventMetrics struct {
 	instruments   *eventInstruments
 	meterProvider *metric.MeterProvider
-	logger        *zap.Logger
 	meter         otelmetric.Meter
 }
 
 func newOtlpMessagingEventMetrics(logger *zap.Logger, meterProvider *metric.MeterProvider) (*otlpMessagingEventMetrics, error) {
@@
 	return &otlpMessagingEventMetrics{
 		instruments:   instruments,
 		meterProvider: meterProvider,
-		logger:        logger,
 		meter:         meter,
 	}, nil
 }

If you plan to add logs shortly, keeping it is fine; otherwise consider dropping it.


50-52: Clarify Flush intent (tests vs prod)

ForceFlush can be expensive with some SDKs/exporters. If intended primarily for tests/shutdown paths, add a brief comment to discourage hot-path use.

router/core/graph_server.go (1)

763-768: Redundant nil check (optional)

Given we initialize with NewNoopEventMetricStore(), this cannot be nil. Safe to leave as-is; optional to remove for brevity.

router-tests/telemetry/messaging_event_metrics_test.go (2)

29-31: Rename test to remove “Flaky” for clarity.

The name suggests instability; consider a neutral name like TestMessagingEventMetrics to avoid confusion in CI dashboards.

Apply this change:

-func TestFlakyEventMetrics(t *testing.T) {
+func TestMessagingEventMetrics(t *testing.T) {

224-275: NATS request path coverage is valuable; consider adding a consumer-side check (optional).

You assert messaging.event.sent.messages for request. If feasible, also add a received-messages assertion on the reply side to cover both directions. Optional, not blocking.

router-tests/prometheus_messaging_event_metrics_test.go (2)

26-28: Rename test to remove “Flaky” for clarity.

Same suggestion as the OTLP test suite: neutral naming reads better in CI and issue triage.

Apply this change:

-func TestFlakyEventMetrics(t *testing.T) {
+func TestPrometheusMessagingEventMetrics(t *testing.T) {

206-253: NATS request: complete Prometheus coverage for send path.

Optional: you could assert that only a single timeseries exists (require.Len == 1 already does) and consider adding a receive-side metric assertion if surfaced.

router/pkg/pubsub/redis/adapter.go (1)

114-118: Don’t terminate the subscription loop on a nil message

Current behavior logs “skipping” but returns, which stops the goroutine. Prefer continuing the loop to avoid tearing down the subscription due to a single nil message.

Apply:

-				if msg == nil {
-					log.Debug("empty message received on subscription update, skipping")
-					return
-				}
+				if msg == nil {
+					log.Debug("empty message received on subscription update, skipping")
+					continue
+				}
router/pkg/pubsub/kafka/adapter.go (2)

24-27: Optional: standardize producer operation name to “publish” for consistency

Kafka uses “produce” while Redis/NATS use “publish”. If you want uniform Prom series and dashboards, consider renaming to “publish”. Not a blocker if you deliberately keep broker-native verbs.

Apply:

-const (
-	kafkaReceive = "receive"
-	kafkaProduce = "produce"
-)
+const (
+	kafkaReceive = "receive"
+	kafkaPublish = "publish"
+)

And update usage:

-			OperationName:   kafkaProduce,
+			OperationName:   kafkaPublish,

102-108: Use the subscription context for metric emission

Using ctx instead of p.ctx ties the metric to the subscription lifecycle and supports finer-grained cancellation/attr propagation (mirrors the select paths).

Apply:

-				p.messagingEventMetricStore.Consume(p.ctx, metric.MessagingEvent{
+				p.messagingEventMetricStore.Consume(ctx, metric.MessagingEvent{
router/pkg/metric/messaging_event_metric_store.go (2)

24-31: Go naming: rename ProviderId to ProviderID

Go prefers initialisms to be capitalized (ID vs Id). Consider renaming the field; the attribute key can remain wg.provider.id.

Apply:

 type MessagingEvent struct {
-	ProviderId      string       // The id of the provider defined in the configuration
+	ProviderID      string       // The id of the provider defined in the configuration
 	OperationName   string       // The operation name that is specific to the messaging system
 	MessagingSystem ProviderType // The messaging system type that are supported
 	Error           bool         // Indicates if the operation resulted in an error or not (true or false)
 	DestinationName string       // The name of the destination queue / topic / channel
 }

Follow-ups: update call sites and attribute construction accordingly.

Would you like me to prepare a repo-wide codemod patch to rename the field and update references?


88-105: DRY: unify attribute construction for Produce/Consume

Produce and Consume duplicate attribute building. Extract into a helper to reduce drift and simplify future additions.

Apply within these methods:

 func (e *MessagingEventMetrics) Produce(ctx context.Context, event MessagingEvent) {
-	attrs := []attribute.KeyValue{
-		otel.MessagingOperationName.String(event.OperationName),
-		otel.MessagingSystem.String(string(event.MessagingSystem)),
-		otel.MessagingError.Bool(event.Error),
-	}
-	if event.ProviderId != "" {
-		attrs = append(attrs, otel.ProviderId.String(event.ProviderId))
-	}
-	if event.DestinationName != "" {
-		attrs = append(attrs, otel.MessagingDestinationName.String(event.DestinationName))
-	}
-	opt := e.withAttrs(attrs...)
+	opt := e.withAttrs(e.eventAttrs(event)...)
 	for _, provider := range e.providers {
 		provider.Produce(ctx, opt)
 	}
 }
 
 func (e *MessagingEventMetrics) Consume(ctx context.Context, event MessagingEvent) {
-	attrs := []attribute.KeyValue{
-		otel.MessagingOperationName.String(event.OperationName),
-		otel.MessagingSystem.String(string(event.MessagingSystem)),
-		otel.MessagingError.Bool(event.Error),
-	}
-	if event.ProviderId != "" {
-		attrs = append(attrs, otel.ProviderId.String(event.ProviderId))
-	}
-	if event.DestinationName != "" {
-		attrs = append(attrs, otel.MessagingDestinationName.String(event.DestinationName))
-	}
-	opt := e.withAttrs(attrs...)
+	opt := e.withAttrs(e.eventAttrs(event)...)
 	for _, provider := range e.providers {
 		provider.Consume(ctx, opt)
 	}
 }

Add this helper (outside of the selected lines):

// eventAttrs builds the per-event attributes.
func (e *MessagingEventMetrics) eventAttrs(event MessagingEvent) []attribute.KeyValue {
	attrs := []attribute.KeyValue{
		otel.MessagingOperationName.String(event.OperationName),
		otel.MessagingSystem.String(string(event.MessagingSystem)),
		otel.MessagingError.Bool(event.Error),
	}
	// Keep optional labels sparse to reduce cardinality.
	if event.ProviderId != "" {
		attrs = append(attrs, otel.ProviderId.String(event.ProviderId))
	}
	if event.DestinationName != "" {
		attrs = append(attrs, otel.MessagingDestinationName.String(event.DestinationName))
	}
	return attrs
}

Also applies to: 107-125

router/pkg/pubsub/nats/adapter.go (2)

144-149: Use the subscription context for metric emission (durable path)

Prefer ctx over p.ctx so the emission aligns with the subscription scope and cancellation.

Apply:

-						p.messagingEventMetricStore.Consume(p.ctx, metric.MessagingEvent{
+						p.messagingEventMetricStore.Consume(ctx, metric.MessagingEvent{

187-192: Use the subscription context for metric emission (chan subscribe path)

Same rationale as the durable path.

Apply:

-				p.messagingEventMetricStore.Consume(p.ctx, metric.MessagingEvent{
+				p.messagingEventMetricStore.Consume(ctx, metric.MessagingEvent{
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2ac4afa and fdbe1f8.

📒 Files selected for processing (23)
  • demo/pkg/subgraphs/subgraphs.go (2 hunks)
  • router-tests/prometheus_messaging_event_metrics_test.go (1 hunks)
  • router-tests/telemetry/messaging_event_metrics_test.go (1 hunks)
  • router-tests/testenv/testenv.go (5 hunks)
  • router/core/factoryresolver.go (3 hunks)
  • router/core/graph_server.go (5 hunks)
  • router/core/router.go (2 hunks)
  • router/pkg/config/config.go (2 hunks)
  • router/pkg/config/config.schema.json (2 hunks)
  • router/pkg/config/testdata/config_defaults.json (2 hunks)
  • router/pkg/config/testdata/config_full.json (2 hunks)
  • router/pkg/metric/config.go (2 hunks)
  • router/pkg/metric/messaging_event_measurements.go (1 hunks)
  • router/pkg/metric/messaging_event_metric_store.go (1 hunks)
  • router/pkg/metric/noop_event_metrics.go (1 hunks)
  • router/pkg/metric/oltp_messaging_event_metric_store.go (1 hunks)
  • router/pkg/metric/prom_messaging_event_metric_store.go (1 hunks)
  • router/pkg/otel/attributes.go (2 hunks)
  • router/pkg/pubsub/datasource/provider.go (2 hunks)
  • router/pkg/pubsub/kafka/adapter.go (7 hunks)
  • router/pkg/pubsub/nats/adapter.go (9 hunks)
  • router/pkg/pubsub/pubsub.go (7 hunks)
  • router/pkg/pubsub/redis/adapter.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • router/pkg/config/testdata/config_full.json
  • demo/pkg/subgraphs/subgraphs.go
  • router/pkg/pubsub/datasource/provider.go
  • router/pkg/metric/config.go
  • router/pkg/metric/noop_event_metrics.go
  • router/pkg/config/config.go
  • router/pkg/config/testdata/config_defaults.json
  • router/core/router.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/pkg/metric/oltp_messaging_event_metric_store.go
  • router/pkg/metric/prom_messaging_event_metric_store.go
  • router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:46:00.905Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/pkg/metric/oltp_messaging_event_metric_store.go
  • router/pkg/metric/prom_messaging_event_metric_store.go
  • router/core/graph_server.go
  • router/pkg/metric/messaging_event_metric_store.go
  • router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:45:57.468Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/pkg/metric/oltp_messaging_event_metric_store.go
  • router/pkg/metric/prom_messaging_event_metric_store.go
  • router/core/graph_server.go
  • router/pkg/metric/messaging_event_metric_store.go
  • router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T16:47:03.713Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.713Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/pkg/metric/oltp_messaging_event_metric_store.go
  • router/pkg/pubsub/kafka/adapter.go
🧬 Code Graph Analysis (10)
router-tests/telemetry/messaging_event_metrics_test.go (5)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-81)
router-tests/utils.go (2)
  • GetMetricScopeByName (88-95)
  • GetMetricByName (79-86)
demo/cmd/subscriptiontest/main.go (1)
  • Data (32-34)
router/pkg/otel/attributes.go (5)
  • MessagingOperationName (65-65)
  • MessagingSystem (66-66)
  • MessagingDestinationName (68-68)
  • ProviderId (69-69)
  • MessagingError (67-67)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router-tests/prometheus_messaging_event_metrics_test.go (4)
router-tests/testenv/testenv.go (3)
  • Run (105-122)
  • MetricOptions (263-277)
  • Environment (1727-1763)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-81)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router/pkg/config/config.go (1)
  • EngineExecutionConfiguration (370-398)
router/core/graph_server.go (5)
router/pkg/metric/messaging_event_metric_store.go (3)
  • MessagingEventMetricStore (41-47)
  • MessagingEventMetrics (50-54)
  • NewMessagingEventMetricStore (56-81)
router/pkg/metric/noop_metrics.go (1)
  • NewNoopMetrics (63-65)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (15-15)
router/pkg/metric/config.go (1)
  • OpenTelemetry (69-84)
router/pkg/config/config.go (1)
  • Prometheus (97-111)
router/core/factoryresolver.go (2)
router/pkg/config/config.go (4)
  • EngineExecutionConfiguration (370-398)
  • HeaderRules (250-255)
  • EventsConfiguration (627-629)
  • SubgraphErrorPropagationConfiguration (746-758)
router/pkg/metric/messaging_event_metric_store.go (1)
  • MessagingEventMetricStore (41-47)
router/pkg/metric/messaging_event_metric_store.go (3)
router/pkg/otel/attributes.go (5)
  • ProviderId (69-69)
  • MessagingSystem (66-66)
  • MessagingOperationName (65-65)
  • MessagingError (67-67)
  • MessagingDestinationName (68-68)
router/pkg/config/config.go (2)
  • Config (979-1053)
  • Prometheus (97-111)
router/pkg/metric/config.go (2)
  • Config (107-131)
  • OpenTelemetry (69-84)
router/pkg/pubsub/pubsub.go (6)
router/pkg/metric/messaging_event_metric_store.go (1)
  • MessagingEventMetricStore (41-47)
router/pkg/pubsub/datasource/provider.go (3)
  • Provider (18-24)
  • ProviderBuilder (27-34)
  • ProviderOpts (36-38)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (15-15)
router/pkg/pubsub/kafka/provider_builder.go (1)
  • ProviderBuilder (21-27)
router/pkg/pubsub/nats/provider_builder.go (1)
  • ProviderBuilder (18-24)
router/pkg/pubsub/redis/provider_builder.go (1)
  • ProviderBuilder (16-22)
router/pkg/pubsub/nats/adapter.go (5)
router/pkg/metric/messaging_event_metric_store.go (3)
  • MessagingEventMetricStore (41-47)
  • MessagingEvent (25-31)
  • ProviderTypeNats (20-20)
router/pkg/pubsub/kafka/adapter.go (2)
  • ProviderAdapter (42-50)
  • Adapter (30-35)
router/pkg/pubsub/nats/engine_datasource.go (1)
  • PublishAndRequestEventConfiguration (28-32)
router/pkg/pubsub/datasource/error.go (2)
  • Error (3-6)
  • NewError (12-17)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/pubsub/redis/adapter.go (3)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/metric/messaging_event_metric_store.go (3)
  • MessagingEventMetricStore (41-47)
  • MessagingEvent (25-31)
  • ProviderTypeRedis (21-21)
router/internal/persistedoperation/operationstorage/redis/rdcloser.go (1)
  • RDCloser (14-16)
router/pkg/pubsub/kafka/adapter.go (5)
router/pkg/metric/messaging_event_metric_store.go (3)
  • MessagingEventMetricStore (41-47)
  • MessagingEvent (25-31)
  • ProviderTypeKafka (19-19)
router/pkg/otel/attributes.go (2)
  • ProviderId (69-69)
  • MessagingSystem (66-66)
router/pkg/pubsub/datasource/error.go (2)
  • Error (3-6)
  • NewError (12-17)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (15-15)
router-tests/testenv/testenv.go (5)
router/pkg/config/config.go (1)
  • CircuitBreaker (213-224)
router/pkg/metric/messaging_event_metric_store.go (2)
  • MessagingEventMetrics (50-54)
  • MessagingEventMetricStore (41-47)
router/pkg/pubsub/nats/adapter.go (1)
  • NewAdapter (356-372)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (15-15)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (42)
router/pkg/metric/oltp_messaging_event_metric_store.go (1)

42-48: LGTM: produce/consume increment with flexible attributes

Incrementing by 1 and accepting AddOptions enables the store wrapper to attach attributes consistently. Good separation of concerns.

router/core/graph_server.go (3)

521-522: LGTM: graphMux gains MessagingEventMetricStore

The field addition is aligned with how other metric stores are hung off graphMux.


783-786: Good defaulting to Noop store (null object pattern)

Initializing messagingEventMetricStore to a no-op avoids nil checks across the codebase and aligns with existing patterns.


883-895: LGTM: Conditional wiring of messaging event metrics

Creation is gated behind exporter-specific toggles and uses base attributes/meter providers. Error propagation is correct.

router/pkg/config/config.schema.json (1)

1069-1073: LGTM: OTLP messaging event metrics flag added

The naming matches code (MessagingEventMetrics) and the description is clear.

router-tests/testenv/testenv.go (5)

13-13: New datasource import is appropriate for ProviderOpts wiring.

Importing the datasource package here is necessary to pass ProviderOpts to the NATS adapter. Looks good.


273-277: Event-metric toggles added to MetricOptions are well-scoped.

Adding EnableOTLPEventMetrics and EnablePrometheusEventMetrics keeps the testenv interface consistent with other metric toggles and makes enabling/disabling explicit per backend.


1508-1512: Prometheus config: event metrics and exclusions are correctly propagated.

Wiring MessagingEventMetrics and the exclusion lists (metrics, labels, scope info) into PrometheusConfig is consistent with how other metric features are toggled. No issues.


1530-1535: OTLP config: event metrics flag properly plumbed.

Setting Enabled and propagating RouterRuntime, GraphqlCache, MessagingEventMetrics, and ConnectionStats from test options when a manual MetricReader is present matches the pattern used elsewhere.


2827-2829: All NewAdapter calls updated to include ProviderOpts

Verified that every invocation of NewAdapter now passes the new providerOpts argument. No callsites were found using the old signature.

• router/pkg/pubsub/nats/provider_builder.go
• router-tests/testenv/testenv.go
• demo/pkg/subgraphs/subgraphs.go

router-tests/telemetry/messaging_event_metrics_test.go (6)

55-83: Kafka publish metric assertions are precise and aligned with conventions.

Validating scope, metric name, attributes (operation/system/destination/provider/error), and count covers the critical surface. Good use of a manual reader to deterministically collect datapoints.


86-166: End-to-end subscription path validation (Kafka) is solid.

The test synchronizes on subscription establishment, triggers a message, asserts payload, and validates the received.messages counter and attributes. This provides good signal against regressions.


169-222: NATS publish path instrumentation looks correct.

Checking publish operation semantics for NATS with the expected attribute set keeps parity with Kafka and Redis coverage. LGTM.


277-370: NATS subscribe path: stability tweak and assertions are appropriate.

Lowering WebSocketClientReadTimeout and explicitly waiting for subscription/connection counts helps reduce flakes. Attribute checks are consistent with the publish/request tests. Good.


373-424: Redis publish path assertions mirror other backends well.

The label set and counter value assertions look correct and consistent. LGTM.


426-505: Redis subscribe path: complete and consistent.

Good synchronization, payload validation, and metric assertions. No issues spotted.

router/pkg/metric/prom_messaging_event_metric_store.go (1)

16-21: Instrument names verified and correct. The newMessagingEventInstruments function registers counters named

  • messaging.event.sent.messages
  • messaging.event.received.messages

The Prometheus exporter will convert these to messaging_event_sent_messages_total and messaging_event_received_messages_total, matching the tests. No changes required.

router/core/factoryresolver.go (3)

7-7: Metric import alias addition is correct.

Bringing in rmetric for the new store wiring is appropriate here.


210-216: RouterEngineConfiguration: new MessagingEventMetricStore field is well-placed.

Placing the store alongside Events and other execution config keeps the loader’s dependencies cohesive.


472-480: All BuildProvidersAndDataSources calls match the updated signature
Both the resolver and all tests now pass the store (or nil) before the logger, matching:

BuildProvidersAndDataSources(
  ctx,                     // 1
  config.EventsConfiguration, // 2
  store,                   // 3 ← MessagingEventMetricStore or nil
  logger,                  // 4
  dsConfs,                 // 5
  hostName,                // 6
  listenAddr,              // 7
)

Verified locations:

  • router/core/factoryresolver.go (line 472)
  • router/pkg/pubsub/pubsub_test.go (all three tests)

No further changes required.

router-tests/prometheus_messaging_event_metrics_test.go (7)

35-75: Kafka publish: Prometheus metric and label assertions are on point.

Validating metric family name, label set, and counter value demonstrates the end-to-end exposition is correct.


81-153: Kafka subscribe: solid E2E and Prom assertions.

Subscription establishment, message production, and subsequent registry scrape/verify pattern is good. LGTM.


162-204: NATS publish Prometheus assertions mirror OTLP coverage well.

Checks are consistent and comprehensive for the label set. Looks good.


255-341: NATS subscribe: stable pattern with explicit waits and full label verification.

This matches the approach in OTLP tests and should be robust in CI.


347-390: Redis publish: label/value assertions are correct.

Good parity with Kafka and NATS tests. No issues.


392-467: Redis subscribe: end-to-end and Prom exposition verification are consistent.

Solid assertions and cleanup. LGTM.


51-75: Helper functions confirmed
The tests in router-tests/prometheus_messaging_event_metrics_test.go use findMetricFamilyByName and findMetricLabelByName, which are defined in router-tests/prometheus_test.go (lines 5527–5536). No changes needed.

router/pkg/pubsub/redis/adapter.go (5)

16-19: Good: backend-specific operation name constants defined

Defining redisPublish/redisReceive improves readability and reduces typos across emit sites.


33-42: Wiring metrics via ProviderOpts is correct (null-object pattern assumed)

Accepting ProviderOpts and storing opts.MessagingEventMetricStore aligns with the new build flow. Given the repo uses a no-op store when disabled, keeping this field without extra nil checks is fine.


119-124: Consume metrics include required attributes

Emits ProviderId, operation, system, and destination. This satisfies the shared label set and matches the store’s attribute handling.


163-169: Produce error path metrics look good

Captures ProviderId, operation, system, destination, and sets Error=true. Matches the store schema.


173-178: Produce success path metrics look good

Same attribute set without the error flag. Consistent with the error path emission.

router/pkg/pubsub/kafka/adapter.go (3)

197-206: Produce error path metrics look good

Captures ProviderId, operation, system, destination, and sets Error=true. Matches the store schema and Kafka adapter semantics.


210-215: Produce success path metrics look good

Same attribute set without the error flag. Consistent with the error path emission.


257-277: Good: no-op fallback when providerOpts store is nil

This guarantees messagingEventMetricStore is always usable without nil checks (aligns with the project’s null-object pattern).

router/pkg/metric/messaging_event_metric_store.go (1)

140-149: LGTM: Shutdown performs a flush; provider shutdown is orchestrated elsewhere

Per project conventions, meter providers are shut down at the router level. Keeping Shutdown to just a Flush is correct.

router/pkg/pubsub/pubsub.go (2)

55-59: Good: defaulting to a no-op store ensures adapters don’t need nil checks

This maintains a consistent contract across all providers and adapters.


114-120: Correct propagation of MessagingEventMetricStore through builder pipeline

Forwarding the store into each provider via ProviderOpts is aligned with the new interfaces and keeps wiring centralized.

Also applies to: 140-142

router/pkg/pubsub/nats/adapter.go (4)

21-25: Good: explicit operation name constants

natsRequest/natsPublish/natsReceive reduce typos and improve observability consistency within the backend.


221-252: Publish metrics (error/success) look correct

Emits ProviderId, operation, system, destination, and toggles Error as expected. Mirrors Redis/Kafka patterns.


270-289: Request metrics (error/success) look correct

Emits the same attribute set as Publish. Comment clarifies writer errors aren’t counted, which is reasonable.


356-371: Adapter construction: metrics store wiring aligns with ProviderOpts

Wires the store from providerOpts; consistent with the rest of the providers.

@SkArchon SkArchon requested a review from StarpTech August 18, 2025 21:16

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
router/pkg/metric/prom_messaging_event_metric_store.go (1)

50-52: Add no-op Shutdown to align with lifecycle and interface expectations.

Per repository conventions, meter providers are shut down centrally at the router level. This provider should expose a no-op Shutdown(ctx) error to conform to the expected interface and avoid subtle breakages.

Apply this diff:

 func (p *promMessagingEventMetrics) Flush(ctx context.Context) error {
   return p.meterProvider.ForceFlush(ctx)
 }
+
+// Shutdown is intentionally a no-op; meter providers are shut down at the router level.
+func (p *promMessagingEventMetrics) Shutdown(ctx context.Context) error {
+  return nil
+}
🧹 Nitpick comments (3)
router-tests/testenv/testenv.go (1)

2827-2829: Confirm intent: using Noop store in subgraph NATS adapters.

Given tests validate router-level event metrics, it’s fine for demo subgraph adapters to use a no-op store. If you later want to assert adapter-level producer metrics from demo subgraphs, consider threading a real MessagingEventMetricStore here when event metrics are enabled in the test env.

Would you like me to propose a small extension to SubgraphOptions and test env config to optionally inject a real metrics store into these adapters for deeper end-to-end coverage?

router-tests/prometheus_messaging_event_metrics_test.go (2)

26-27: Rename test to reflect scope and avoid “flaky” in name.

Naming a test “Flaky” sets the wrong expectation. Consider a neutral, descriptive name.

Apply this diff:

-func TestFlakyEventMetrics(t *testing.T) {
+func TestPrometheusMessagingEventMetrics(t *testing.T) {

29-466: Reduce duplication with small helper(s) for metric assertions.

There is repeated code across publish/subscribe variants for:

  • Gathering metrics
  • Finding a family
  • Validating common labels like wg_messaging_error, wg_messaging_system, destination suffix, wg_provider_id
  • Checking the counter value

Extracting a helper would improve readability and future maintainability.

I can sketch a helper like assertMessagingCounter(t, reg, family, expectedOp, system, destSuffix, provider, expectedVal) if helpful.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between fdbe1f8 and e945f7d.

📒 Files selected for processing (11)
  • demo/pkg/subgraphs/subgraphs.go (2 hunks)
  • router-tests/prometheus_messaging_event_metrics_test.go (1 hunks)
  • router-tests/telemetry/messaging_event_metrics_test.go (1 hunks)
  • router-tests/testenv/testenv.go (5 hunks)
  • router/core/plan_generator.go (2 hunks)
  • router/pkg/config/config.schema.json (2 hunks)
  • router/pkg/metric/messaging_event_measurements.go (1 hunks)
  • router/pkg/metric/messaging_event_metric_store.go (1 hunks)
  • router/pkg/metric/oltp_messaging_event_metric_store.go (1 hunks)
  • router/pkg/metric/prom_messaging_event_metric_store.go (1 hunks)
  • router/pkg/otel/attributes.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • demo/pkg/subgraphs/subgraphs.go
  • router/pkg/metric/messaging_event_measurements.go
  • router-tests/telemetry/messaging_event_metrics_test.go
  • router/pkg/metric/messaging_event_metric_store.go
  • router/pkg/otel/attributes.go
  • router/pkg/metric/oltp_messaging_event_metric_store.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
📚 Learning: 2025-08-14T16:47:03.713Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.713Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/plan_generator.go
  • router/pkg/metric/prom_messaging_event_metric_store.go
📚 Learning: 2025-08-14T17:45:57.468Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/pkg/metric/prom_messaging_event_metric_store.go
  • router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:46:00.905Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/pkg/metric/prom_messaging_event_metric_store.go
  • router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router-tests/testenv/testenv.go
🧬 Code Graph Analysis (2)
router/core/plan_generator.go (3)
router/core/factoryresolver.go (1)
  • RouterEngineConfiguration (210-216)
router/pkg/metric/messaging_event_metric_store.go (1)
  • MessagingEventMetricStore (41-47)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (15-15)
router-tests/testenv/testenv.go (4)
router/pkg/metric/messaging_event_metric_store.go (2)
  • MessagingEventMetrics (50-54)
  • MessagingEventMetricStore (41-47)
router/pkg/pubsub/nats/adapter.go (1)
  • NewAdapter (356-372)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (15-15)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (16)
router/pkg/config/config.schema.json (2)

1069-1073: Consistency fix looks good: unified key and description for messaging event metrics (OTLP).

The property name and description now align with the code and Prometheus schema. This resolves the previously noted mismatch.


1179-1183: Approve unified messaging_event_metrics schema

All references to the old key have been removed and no stale occurrences were found in markdown docs or the router code. Ready to merge.

router/core/plan_generator.go (2)

10-11: Importing metric package is appropriate for default no-op wiring.

Import is used below to set a default store; no unused import issues.


258-260: Good default: initialize MessagingEventMetricStore with no-op.

This adheres to the null-object pattern used across the codebase and avoids nil checks downstream.

router/pkg/metric/prom_messaging_event_metric_store.go (1)

42-48: No interface mismatch: promMessagingEventMetrics implements the Provider interface

All Prom and OTLP metric types (promMessagingEventMetrics and otlpMessagingEventMetrics) define:

  • Produce(ctx context.Context, opts ...otelmetric.AddOption)
  • Consume(ctx context.Context, opts ...otelmetric.AddOption)
  • Flush(ctx context.Context) error

These signatures exactly match the MessagingEventMetricProvider interface in messaging_event_metric_store.go (lines 34–39). The MessagingEventMetricStore interface (lines 41–44) is implemented by MessagingEventMetrics (which takes a MessagingEvent), so there is no breaking mismatch here. No changes required.

router-tests/testenv/testenv.go (4)

66-67: New import is correct for ProviderOpts wiring.

Needed for passing MessagingEventMetricStore into adapters.


273-277: MetricOptions extended with event metrics flags: LGTM.

The additions are clear and symmetric across OTLP/Prom backends.


1507-1512: Prometheus config now exposes MessagingEventMetrics: wiring looks correct.

Flag is plumbed through to PrometheusConfig.MessagingEventMetrics and coexists with other exclusions. No ordering issues observed.


1530-1535: OTLP config receives MessagingEventMetrics flag: LGTM.

Consistent naming and placement alongside other toggles.

router-tests/prometheus_messaging_event_metrics_test.go (7)

32-75: Kafka publish flow assertions are precise and correct.

  • Validates counter family, labels (operation/system/destination/provider), and value.
  • Checks wg_messaging_error = "false".

Solid coverage for the success path.


78-153: Kafka subscribe flow is well-synchronized and asserts expected labels.

Use of WaitForSubscriptionCount and channel-based Await keeps flakiness low. Label/value assertions are consistent with conventions.


159-204: NATS publish path mirrors Kafka assertions correctly.

Operation label “publish” and destination naming check look good.


206-253: NATS request path is a nice addition; label/value checks look right.

Covers an additional operation mode and validates “request” label.


255-341: NATS subscribe flow is carefully coordinated; assertions are correct.

  • The temporary read timeout tweak is a pragmatic stabilizer.
  • Ensures provider “default” for the default connection.

All good.


347-390: Redis publish flow is consistent with other backends.

Assertions on labels and counter value are appropriate.


392-466: Redis subscribe flow follows the same robust pattern.

Label checks and counter validation are consistent.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/pkg/pubsub/kafka/adapter.go (1)

56-71: Subscription cancel not respected; PollRecords should stop when ctx is canceled

topicPoller passes p.ctx to client.PollRecords, so if the subscription ctx is canceled while the adapter context remains active, PollRecords may block until new records arrive or the adapter shuts down, leaking goroutines and delaying unsubscribe.

Use a merged context that cancels when either p.ctx or ctx is done.

Apply this diff:

-func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, pollerOpts PollerOpts) error {
-	for {
-		select {
-		case <-p.ctx.Done(): // Close the poller if the application context was canceled
-			return p.ctx.Err()
-		case <-ctx.Done(): // Close the poller if the subscription context was canceled
-			return ctx.Err()
-		default:
-			// Try to fetch max records from any subscribed topics
-			fetches := client.PollRecords(p.ctx, 10_000)
+func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, pollerOpts PollerOpts) error {
+	// pollCtx cancels when either the adapter context or the subscription context is canceled.
+	pollCtx, cancel := context.WithCancel(p.ctx)
+	defer cancel()
+	go func() {
+		select {
+		case <-ctx.Done():
+			cancel()
+		case <-pollCtx.Done():
+			// adapter shutdown
+		}
+	}()
+	for {
+		select {
+		case <-pollCtx.Done():
+			return pollCtx.Err()
+		default:
+			// Try to fetch max records from any subscribed topics
+			fetches := client.PollRecords(pollCtx, 10_000)
@@
-				p.messagingEventMetricStore.Consume(p.ctx, metric.MessagingEvent{
+				p.messagingEventMetricStore.Consume(pollCtx, metric.MessagingEvent{

Also applies to: 98-109

🧹 Nitpick comments (8)
router-tests/telemetry/messaging_event_metrics_test.go (3)

27-27: Minor: Consolidate WaitTimeout to a shared test util

WaitTimeout is duplicated across test files. Consider placing a single const in a shared testutils package (e.g., router-tests/testenv or router-tests/utils) and reusing it here to avoid drift.


65-76: Also assert attribute presence for core keys

For operation/system/destination, assert the “found” bool returned by Attributes.Value(...) before comparing strings to surface clearer failures when attributes are absent.

Example pattern:

  • operation, ok := attrs.Value(otelattrs.WgMessagingOperationName); require.True(t, ok)
  • system, ok := attrs.Value(otelattrs.WgMessagingSystem); require.True(t, ok)
  • destination, ok := attrs.Value(otelattrs.WgMessagingDestinationName); require.True(t, ok)

This is a robustness improvement; apply similarly to each block.

Also applies to: 142-153, 201-213, 253-265, 338-350, 400-412, 476-488


29-36: Parallelism layering may increase flakiness; consider dialing it back

You mark the top test and each nested subtest with t.Parallel(). Given shared external deps (brokers), heavy parallelism can increase nondeterminism. Consider parallelizing only the backend groups (kafka/nats/redis) or only the leaf tests.

Would you like me to generate a quick test plan to run these subtests serially per backend to compare stability?

router-tests/prometheus_messaging_event_metrics_test.go (2)

24-24: Minor: Consolidate WaitTimeout to a shared test util

Same as the OTLP tests, consider centralizing WaitTimeout to avoid duplication.


159-172: Avoid blind assumptions on labels; assert presence first

Before reading label.GetValue(), consider asserting the label exists (non-nil) for all core labels like wg_messaging_operation_name, wg_messaging_system, wg_messaging_destination_name, wg_provider_id to produce clearer failures.

Also applies to: 210-220, 259-268, 351-362, 396-405

router/pkg/metric/messaging_event_metric_store.go (2)

50-54: Nit: logger field is unused in the struct

logger is only needed during provider construction. Consider removing it from the struct to reduce state footprint, unless you plan to log within Produce/Consume later.

 type MessagingEventMetrics struct {
-	baseAttributes []attribute.KeyValue
-	logger         *zap.Logger
-	providers      []MessagingEventMetricProvider
+	baseAttributes []attribute.KeyValue
+	providers      []MessagingEventMetricProvider
 }
 
 func NewMessagingEventMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *metric.MeterProvider, metricsConfig *Config) (*MessagingEventMetrics, error) {
@@
 	store := &MessagingEventMetrics{
 		baseAttributes: baseAttributes,
-		logger:         logger,
 		providers:      providers,
 	}

59-66: Optional: guard against nil providers if configs are misaligned

If metricsConfig enables a backend but its MeterProvider is nil due to upstream misconfiguration, newOtlpMessagingEventMetrics/newPromMessagingEventMetrics may panic. Consider a short guard with a helpful error.

Also applies to: 67-74

router/pkg/pubsub/kafka/adapter.go (1)

178-193: Minor: logging full message payloads may expose PII

log.Debug("publish", zap.ByteString("data", event.Data)) can leak message contents under debug logs. Consider logging size or a hash instead, unless you’re certain messages don’t contain sensitive data.

-	log.Debug("publish", zap.ByteString("data", event.Data))
+	log.Debug("publish", zap.Int("payload_bytes", len(event.Data)))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e945f7d and eed45fc.

📒 Files selected for processing (8)
  • router-tests/prometheus_messaging_event_metrics_test.go (1 hunks)
  • router-tests/telemetry/messaging_event_metrics_test.go (1 hunks)
  • router/pkg/metric/messaging_event_measurements.go (1 hunks)
  • router/pkg/metric/messaging_event_metric_store.go (1 hunks)
  • router/pkg/otel/attributes.go (2 hunks)
  • router/pkg/pubsub/kafka/adapter.go (7 hunks)
  • router/pkg/pubsub/nats/adapter.go (9 hunks)
  • router/pkg/pubsub/redis/adapter.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • router/pkg/metric/messaging_event_measurements.go
  • router/pkg/otel/attributes.go
  • router/pkg/pubsub/nats/adapter.go
  • router/pkg/pubsub/redis/adapter.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T16:47:03.713Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.713Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/pkg/pubsub/kafka/adapter.go
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/pkg/metric/messaging_event_metric_store.go
📚 Learning: 2025-08-14T17:45:57.468Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/pkg/metric/messaging_event_metric_store.go
📚 Learning: 2025-08-14T17:46:00.905Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/pkg/metric/messaging_event_metric_store.go
🧬 Code Graph Analysis (4)
router/pkg/pubsub/kafka/adapter.go (5)
router/pkg/metric/messaging_event_metric_store.go (3)
  • MessagingEventMetricStore (41-47)
  • MessagingEvent (25-31)
  • ProviderTypeKafka (19-19)
router/pkg/pubsub/nats/adapter.go (1)
  • ProviderAdapter (42-54)
router/pkg/pubsub/redis/adapter.go (2)
  • ProviderAdapter (56-65)
  • NewProviderAdapter (33-54)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (36-38)
router/pkg/metric/noop_event_metrics.go (1)
  • NewNoopEventMetricStore (15-15)
router/pkg/metric/messaging_event_metric_store.go (3)
router/pkg/config/config.go (2)
  • Config (979-1053)
  • Prometheus (97-111)
router/pkg/metric/config.go (2)
  • Config (107-131)
  • OpenTelemetry (69-84)
router/pkg/otel/attributes.go (5)
  • WgMessagingOperationName (65-65)
  • WgMessagingSystem (66-66)
  • WgErrorType (70-70)
  • WgProviderId (69-69)
  • WgMessagingDestinationName (68-68)
router-tests/telemetry/messaging_event_metrics_test.go (6)
router-tests/prometheus_messaging_event_metrics_test.go (2)
  • TestFlakyEventMetrics (26-468)
  • WaitTimeout (24-24)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • Config (284-340)
  • MetricOptions (263-277)
  • Environment (1727-1763)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-81)
router-tests/utils.go (2)
  • GetMetricScopeByName (88-95)
  • GetMetricByName (79-86)
router/pkg/otel/attributes.go (5)
  • WgMessagingOperationName (65-65)
  • WgMessagingSystem (66-66)
  • WgMessagingDestinationName (68-68)
  • WgProviderId (69-69)
  • WgErrorType (70-70)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router-tests/prometheus_messaging_event_metrics_test.go (5)
router-tests/telemetry/messaging_event_metrics_test.go (1)
  • TestFlakyEventMetrics (29-500)
router-tests/testenv/testenv.go (6)
  • Run (105-122)
  • Config (284-340)
  • ConfigWithEdfsKafkaJSONTemplate (85-85)
  • MetricOptions (263-277)
  • Environment (1727-1763)
  • GraphQLRequest (1895-1903)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-81)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router/pkg/config/config.go (1)
  • EngineExecutionConfiguration (370-398)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: build_test
  • GitHub Check: Analyze (go)
🔇 Additional comments (6)
router-tests/prometheus_messaging_event_metrics_test.go (1)

51-75: Helper Functions Already Present in Router-Tests Package

The findMetricFamilyByName and findMetricLabelByName helpers are defined in router-tests/prometheus_test.go and live in the same router-tests package as prometheus_messaging_event_metrics_test.go. No additional imports or definitions are needed here.

router/pkg/metric/messaging_event_metric_store.go (3)

56-81: LGTM: Pluggable providers and construction flow look sound

Conditional wiring of OTLP/Prom providers is clean and errors are surfaced early. Null-object pattern downstream keeps call sites simple.


83-87: Attribute construction matches tests and keeps cardinality under control

Only sets provider, destination, and error.type when present, and centralizes base attributes via withAttrs. Nice.

Also applies to: 88-107, 109-129


131-142: Good: explicit Flush/Shutdown with error aggregation

Using errors.Join preserves all provider errors and avoids partial masking. Consistent with the central lifecycle design.

Also applies to: 144-153

router/pkg/pubsub/kafka/adapter.go (2)

174-215: LGTM: publish path emits success/error metrics with expected attributes

On error you emit ErrorType "publish_error"; on success you omit it. OperationName is "produce" for Kafka, which matches the tests.


257-277: LGTM: No-op metric store fallback keeps adapter code simple

Choosing a noop store when provider opts are absent avoids nil checks at call-sites and aligns with the project’s pattern.

@SkArchon SkArchon changed the title feat: event metrics feat: stream metrics Aug 19, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
router/pkg/pubsub/pubsub_test.go (1)

66-66: Fix mock signature: add providerOpts argument

BuildProvider now accepts providerOpts; this expectation will not match and will fail.

-mockBuilder.On("BuildProvider", natsEventSources[0]).Return(mockPubSubProvider, nil)
+mockBuilder.On("BuildProvider", natsEventSources[0], mock.Anything).Return(mockPubSubProvider, nil)
router/pkg/pubsub/kafka/adapter.go (1)

56-113: Fix poller cancellation and close the read client to avoid resource leaks

  • PollRecords currently uses p.ctx. If the subscription ctx (ctx) is canceled while p.ctx remains active, the poller can block longer than necessary.
  • The per-subscription client created in Subscribe is never closed, leaking connections.

Use the subscription ctx in PollRecords and ensure client.Close() is called when the poller exits.

Apply:

-func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, pollerOpts PollerOpts) error {
+func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, pollerOpts PollerOpts) error {
+	// ensure per-subscription client is closed
+	defer client.Close()
 	for {
@@
-			// Try to fetch max records from any subscribed topics
-			fetches := client.PollRecords(p.ctx, 10_000)
+			// Try to fetch max records from any subscribed topics
+			fetches := client.PollRecords(ctx, 10_000)
@@
-				p.streamMetricStore.Consume(p.ctx, metric.StreamsEvent{
+				p.streamMetricStore.Consume(ctx, metric.StreamsEvent{
 					ProviderId:          pollerOpts.providerId,
 					StreamOperationName: kafkaReceive,
 					ProviderType:        metric.ProviderTypeKafka,
 					DestinationName:     r.Topic,
 				})
♻️ Duplicate comments (1)
router/pkg/otel/attributes.go (1)

65-70: Use correct Go acronym casing: WgProviderID (not WgProviderId)

Matches existing conventions like WgFederatedGraphID and prior guidance.

Apply this diff:

-	WgProviderId          = attribute.Key("wg.provider.id")
+	WgProviderID          = attribute.Key("wg.provider.id")

Note: update all call sites accordingly.

🧹 Nitpick comments (21)
router/pkg/metric/stream_measurements.go (1)

15-21: Tighten descriptions and use consistent terminology ("received" vs "consumed")

Descriptions read awkwardly and mix terms with the metric name. Prefer succinct, consistent phrasing.

Apply this diff:

-messagingSentMessagesOptions = []otelmetric.Int64CounterOption{
-	otelmetric.WithDescription("Number of stream sent messages"),
-}
+messagingSentMessagesOptions = []otelmetric.Int64CounterOption{
+	otelmetric.WithDescription("Number of messages sent"),
+}

-messagingConsumedMessagesOptions = []otelmetric.Int64CounterOption{
-	otelmetric.WithDescription("Number of stream consumed messages"),
-}
+messagingConsumedMessagesOptions = []otelmetric.Int64CounterOption{
+	otelmetric.WithDescription("Number of messages received"),
+}
router/core/graph_server.go (1)

763-767: Nit: redundant nil-check for streamMetricStore

gm.streamMetricStore is initialized with a Noop in buildGraphMux, so the nil-guard isn’t needed. Safe to keep, but it adds noise.

router/pkg/metric/prom_stream_metric_store.go (1)

11-14: Optional: unify meter naming across providers

OTLP uses "cosmo.router.streams", Prom uses "cosmo.router.streams.prometheus". Consider a consistent scheme (either suffix provider for both or neither) to make source filtering predictable in pipelines.

router/pkg/metric/noop_stream_metrics.go (1)

7-15: Add compile-time interface assertion and brief doc comments

A small safeguard and docs improve maintainability and catch drift if the interface changes.

 type NoopStreamMetricStore struct{}
 
+// Ensure NoopStreamMetricStore implements StreamMetricStore at compile time.
+var _ StreamMetricStore = (*NoopStreamMetricStore)(nil)
+
-func (n *NoopStreamMetricStore) Produce(ctx context.Context, event StreamsEvent) {}
+// Produce is a no-op.
+func (n *NoopStreamMetricStore) Produce(ctx context.Context, event StreamsEvent) {}
-func (n *NoopStreamMetricStore) Consume(ctx context.Context, event StreamsEvent) {}
+// Consume is a no-op.
+func (n *NoopStreamMetricStore) Consume(ctx context.Context, event StreamsEvent) {}
 
-func (n *NoopStreamMetricStore) Flush(ctx context.Context) error    { return nil }
-func (n *NoopStreamMetricStore) Shutdown(ctx context.Context) error { return nil }
+// Flush is a no-op.
+func (n *NoopStreamMetricStore) Flush(ctx context.Context) error    { return nil }
+// Shutdown is a no-op.
+func (n *NoopStreamMetricStore) Shutdown(ctx context.Context) error { return nil }
 
-func NewNoopStreamMetricStore() *NoopStreamMetricStore { return &NoopStreamMetricStore{} }
+// NewNoopStreamMetricStore returns a StreamMetricStore that discards all metrics.
+func NewNoopStreamMetricStore() *NoopStreamMetricStore { return &NoopStreamMetricStore{} }
router/pkg/pubsub/pubsub_test.go (1)

296-296: Prefer Noop StreamMetricStore over nil to follow null-object pattern

Tests currently pass nil for streamMetricStore in BuildProvidersAndDataSources. Aligning with the repo’s null-object pattern (using NewNoopStreamMetricStore) avoids nil-handling across the stack and reduces surprises if defaults change.

-}, nil, zap.NewNop(), dsConfs, "host", "addr")
+}, rmetric.NewNoopStreamMetricStore(), zap.NewNop(), dsConfs, "host", "addr")

Apply the same change at Lines 296, 349, and 402.

Also applies to: 349-349, 402-402

router-tests/prometheus_stream_metrics_test.go (2)

54-56: Guard against missing metric families before dereferencing

If findMetricFamilyByName returns nil, dereferencing will panic. Add an explicit NotNil assertion for clearer failures.

- family := findMetricFamilyByName(mf, "streams_sent_messages_total")
+ family := findMetricFamilyByName(mf, "streams_sent_messages_total")
+ require.NotNil(t, family, "expected metric family streams_sent_messages_total to be present")

- family := findMetricFamilyByName(mf, "streams_received_messages_total")
+ family := findMetricFamilyByName(mf, "streams_received_messages_total")
+ require.NotNil(t, family, "expected metric family streams_received_messages_total to be present")

Apply in each location indicated by the selected ranges.

Also applies to: 127-130, 183-186, 232-235, 312-314, 369-372, 441-444


35-37: Optional: drop MetricReader to keep tests Prometheus-only and reduce overhead

Providing a ManualReader enables OTLP as well; it’s not needed when only Prometheus metrics are asserted. Dropping it simplifies setup and may speed up tests.

- metricReader := metric.NewManualReader()
- promRegistry := prometheus.NewRegistry()
+ promRegistry := prometheus.NewRegistry()
 ...
- MetricReader:             metricReader,
+ // MetricReader omitted: Prometheus-only test
  PrometheusRegistry:       promRegistry,

Also applies to: 81-83, 162-164, 210-212, 259-261, 351-353, 396-398

router/core/factoryresolver.go (2)

216-217: Expose StreamMetricStore on RouterEngineConfiguration

This cleanly threads the store through the engine. Consider defaulting to a no-op store where the struct is constructed to reduce nil surface area.


473-481: Default to NoopStreamMetricStore when none is provided

Keeps with the null-object pattern used elsewhere and prevents downstream nil checks.

- factoryProviders, factoryDataSources, err := pubsub.BuildProvidersAndDataSources(
+ store := routerEngineConfig.StreamMetricStore
+ if store == nil {
+ 	store = rmetric.NewNoopStreamMetricStore()
+ }
+ factoryProviders, factoryDataSources, err := pubsub.BuildProvidersAndDataSources(
  	l.ctx,
  	routerEngineConfig.Events,
- 	routerEngineConfig.StreamMetricStore,
+ 	store,
  	l.logger,
  	pubSubDS,
  	l.resolver.InstanceData().HostName,
  	l.resolver.InstanceData().ListenAddress,
  )
router/pkg/pubsub/redis/adapter.go (1)

125-129: Don't terminate the subscription loop on nil messages

Returning on msg == nil ends the goroutine; better to continue and wait for the next message. Channel-closure is already handled via !ok.

Apply:

-				if msg == nil {
-					log.Debug("empty message received on subscription update, skipping")
-					return
-				}
+				if msg == nil {
+					log.Debug("empty message received on subscription update, skipping")
+					continue
+				}
router-tests/telemetry/stream_metrics_test.go (2)

29-31: Consider renaming the test to avoid “Flaky” in the name

The test appears deterministic; the name might confuse readers or tooling that tracks flaky tests.

Apply:

-func TestFlakyEventMetrics(t *testing.T) {
+func TestEventMetrics(t *testing.T) {

170-219: NATS tests cover publish, request, and subscribe paths thoroughly

Great coverage across all operations, labels, and counts. Consider a follow-up negative-path test for request/publish error to assert wg.error.type is present.

If helpful, I can add a failing publish/request (e.g., to a non-existent subject or by injecting a client error) and extend assertions for WgErrorType.

Also applies to: 221-271, 273-366

router/pkg/pubsub/nats/adapter.go (2)

144-149: Prefer subscription context when emitting consume metrics (durable path)

Using ctx ties metric lifetimes to the subscription, mirroring shutdown behavior and avoiding emissions after cancellation.

Apply:

-						p.streamMetricStore.Consume(p.ctx, metric.StreamsEvent{
+						p.streamMetricStore.Consume(ctx, metric.StreamsEvent{
 							ProviderId:          event.ProviderID,
 							StreamOperationName: natsReceive,
 							ProviderType:        metric.ProviderTypeNats,
 							DestinationName:     msg.Subject(),
 						})

187-193: Same as above for channel subscription path: use ctx over p.ctx

Keeps emission scoped to the subscription lifecycle.

Apply:

-				p.streamMetricStore.Consume(p.ctx, metric.StreamsEvent{
+				p.streamMetricStore.Consume(ctx, metric.StreamsEvent{
 					ProviderId:          event.ProviderID,
 					StreamOperationName: natsReceive,
 					ProviderType:        metric.ProviderTypeNats,
 					DestinationName:     msg.Subject,
 				})
router/pkg/pubsub/pubsub.go (1)

126-134: Minor: use a set for provider ID membership checks

Current code does O(n^2) slices.Contains scans. Replace with maps for clarity and linear time; negligible at small sizes but cleaner.

Apply:

-	usedProviderIds := []string{}
-	for _, dsConf := range dsConfs {
-		for _, event := range dsConf.events {
-			if !slices.Contains(usedProviderIds, event.GetEngineEventConfiguration().GetProviderId()) {
-				usedProviderIds = append(usedProviderIds, event.GetEngineEventConfiguration().GetProviderId())
-			}
-		}
-	}
+	usedProviderIds := make(map[string]struct{})
+	for _, dsConf := range dsConfs {
+		for _, event := range dsConf.events {
+			usedProviderIds[event.GetEngineEventConfiguration().GetProviderId()] = struct{}{}
+		}
+	}
@@
-	providerIds := []string{}
+	providerIds := make(map[string]struct{})
 	for _, providerData := range providersData {
-		if !slices.Contains(usedProviderIds, providerData.GetID()) {
+		if _, ok := usedProviderIds[providerData.GetID()]; !ok {
 			continue
 		}
 		provider, err := builder.BuildProvider(providerData, pubsub_datasource.ProviderOpts{
 			StreamMetricStore: store,
 		})
 		if err != nil {
 			return nil, nil, err
 		}
 		pubSubProviders = append(pubSubProviders, provider)
-		providerIds = append(providerIds, provider.ID())
+		providerIds[provider.ID()] = struct{}{}
 	}
@@
-	for _, providerId := range usedProviderIds {
-		if !slices.Contains(providerIds, providerId) {
+	for providerId := range usedProviderIds {
+		if _, ok := providerIds[providerId]; !ok {
 			return pubSubProviders, nil, &ProviderNotDefinedError{
 				ProviderID:     providerId,
 				ProviderTypeID: builder.TypeID(),
 			}
 		}
 	}

Also applies to: 136-149, 151-159

router/pkg/metric/stream_metric_store.go (6)

10-11: Alias sdk/metric as sdkmetric to avoid confusion with local package metric.

The codebase elsewhere (e.g., config.go) aliases OpenTelemetry's sdk/metric as sdkmetric. Using metric here is confusing since this file is in package metric. Aligning the alias improves readability and consistency.

Apply:

-	"go.opentelemetry.io/otel/sdk/metric"
+	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
@@
-func NewStreamMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *metric.MeterProvider, metricsConfig *Config) (*StreamMetrics, error) {
+func NewStreamMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *sdkmetric.MeterProvider, metricsConfig *Config) (*StreamMetrics, error) {

Also applies to: 56-56


83-86: DRY up attribute construction and preallocate to reduce allocations.

Produce and Consume duplicate attribute building. Factor it into a helper; also preallocate the combined slice to avoid intermediate allocations in hot paths.

 func (e *StreamMetrics) withAttrs(attrs ...attribute.KeyValue) otelmetric.AddOption {
-	copied := append([]attribute.KeyValue{}, e.baseAttributes...)
-	return otelmetric.WithAttributes(append(copied, attrs...)...)
+	copied := make([]attribute.KeyValue, 0, len(e.baseAttributes)+len(attrs))
+	copied = append(copied, e.baseAttributes...)
+	copied = append(copied, attrs...)
+	return otelmetric.WithAttributes(copied...)
 }
 
+func (e *StreamMetrics) addOptionFor(event StreamsEvent) otelmetric.AddOption {
+	attrs := []attribute.KeyValue{
+		otel.WgStreamOperationName.String(event.StreamOperationName),
+		otel.WgProviderType.String(string(event.ProviderType)),
+	}
+	if event.ErrorType != "" {
+		attrs = append(attrs, otel.WgErrorType.String(event.ErrorType))
+	}
+	if event.ProviderId != "" {
+		attrs = append(attrs, otel.WgProviderId.String(event.ProviderId))
+	}
+	if event.DestinationName != "" {
+		attrs = append(attrs, otel.WgDestinationName.String(event.DestinationName))
+	}
+	return e.withAttrs(attrs...)
+}
+
 func (e *StreamMetrics) Produce(ctx context.Context, event StreamsEvent) {
-	attrs := []attribute.KeyValue{
-		otel.WgStreamOperationName.String(event.StreamOperationName),
-		otel.WgProviderType.String(string(event.ProviderType)),
-	}
-	if event.ErrorType != "" {
-		attrs = append(attrs, otel.WgErrorType.String(event.ErrorType))
-	}
-	if event.ProviderId != "" {
-		attrs = append(attrs, otel.WgProviderId.String(event.ProviderId))
-	}
-	if event.DestinationName != "" {
-		attrs = append(attrs, otel.WgDestinationName.String(event.DestinationName))
-	}
-	opt := e.withAttrs(attrs...)
+	opt := e.addOptionFor(event)
 
 	for _, provider := range e.providers {
 		provider.Produce(ctx, opt)
 	}
 }
 
 func (e *StreamMetrics) Consume(ctx context.Context, event StreamsEvent) {
-	attrs := []attribute.KeyValue{
-		otel.WgStreamOperationName.String(event.StreamOperationName),
-		otel.WgProviderType.String(string(event.ProviderType)),
-	}
-	if event.ErrorType != "" {
-		attrs = append(attrs, otel.WgErrorType.String(event.ErrorType))
-	}
-	if event.ProviderId != "" {
-		attrs = append(attrs, otel.WgProviderId.String(event.ProviderId))
-	}
-	if event.DestinationName != "" {
-		attrs = append(attrs, otel.WgDestinationName.String(event.DestinationName))
-	}
-
-	opt := e.withAttrs(attrs...)
+	opt := e.addOptionFor(event)
 
 	for _, provider := range e.providers {
 		provider.Consume(ctx, opt)
 	}
 }

Also applies to: 88-107, 109-129


73-79: Log when no providers are configured to aid debugging.

If both OTLP and Prometheus streams are disabled, the store becomes a no-op. A debug log helps explain why metrics aren’t observed.

 	providers := make([]StreamMetricProvider, 0)
@@
 	if metricsConfig.Prometheus.Streams {
 		promMetrics, err := newPromStreamEventMetrics(logger, promProvider)
 		if err != nil {
 			return nil, fmt.Errorf("failed to create prometheus stream event metrics: %w", err)
 		}
 		providers = append(providers, promMetrics)
 	}
 
+	if len(providers) == 0 {
+		logger.Debug("no stream metrics providers configured; StreamMetrics will be a no-op")
+	}
+
 	store := &StreamMetrics{
 		baseAttributes: baseAttributes,
 		logger:         logger,
 		providers:      providers,
 	}

135-139: Neutral variable naming in flush loop.

Use a provider-agnostic name instead of errOtlp in a loop over all providers.

-	for _, provider := range e.providers {
-		if errOtlp := provider.Flush(ctx); errOtlp != nil {
-			err = errors.Join(err, fmt.Errorf("failed to flush metrics: %w", errOtlp))
-		}
-	}
+	for _, provider := range e.providers {
+		if flushErr := provider.Flush(ctx); flushErr != nil {
+			err = errors.Join(err, fmt.Errorf("failed to flush metrics: %w", flushErr))
+		}
+	}

144-146: Clarify Shutdown docstring to match behavior and project conventions.

Per repo conventions (router manages meter provider lifecycle), this store’s Shutdown only flushes. Update the comment to avoid implying it stops observers.

-// Shutdown flushes the metrics and stops observers if any.
+// Shutdown flushes the metrics. Provider lifecycle (including stopping observers, if any)
+// is managed at the router level.

26-31: Polish field comments for grammar and clarity.

Minor wording tweaks.

 type StreamsEvent struct {
-	ProviderId          string       // The id of the provider defined in the configuration
-	StreamOperationName string       // The stream operation name that is specific to the messaging system
-	ProviderType        ProviderType // The messaging system type that are supported
-	ErrorType           string       // Optional error type, e.g., "publish_error" or "receive_error". If empty, the attribute is not set
-	DestinationName     string       // The name of the destination queue / topic / channel
+	ProviderId          string       // The provider id defined in the configuration.
+	StreamOperationName string       // The operation name specific to the messaging system.
+	ProviderType        ProviderType // The messaging system type being used.
+	ErrorType           string       // Optional error type (e.g., "publish_error", "receive_error"). If empty, the attribute is not set.
+	DestinationName     string       // The destination name (queue/topic/channel).
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between eed45fc and 1667595.

📒 Files selected for processing (25)
  • demo/pkg/subgraphs/subgraphs.go (2 hunks)
  • router-tests/prometheus_stream_metrics_test.go (1 hunks)
  • router-tests/telemetry/stream_metrics_test.go (1 hunks)
  • router-tests/testenv/testenv.go (5 hunks)
  • router/core/factoryresolver.go (3 hunks)
  • router/core/graph_server.go (5 hunks)
  • router/core/plan_generator.go (3 hunks)
  • router/core/router.go (2 hunks)
  • router/pkg/config/config.go (2 hunks)
  • router/pkg/config/config.schema.json (2 hunks)
  • router/pkg/config/testdata/config_defaults.json (2 hunks)
  • router/pkg/config/testdata/config_full.json (2 hunks)
  • router/pkg/metric/config.go (2 hunks)
  • router/pkg/metric/noop_stream_metrics.go (1 hunks)
  • router/pkg/metric/oltp_stream_metric_store.go (1 hunks)
  • router/pkg/metric/prom_stream_metric_store.go (1 hunks)
  • router/pkg/metric/stream_measurements.go (1 hunks)
  • router/pkg/metric/stream_metric_store.go (1 hunks)
  • router/pkg/otel/attributes.go (2 hunks)
  • router/pkg/pubsub/datasource/provider.go (2 hunks)
  • router/pkg/pubsub/kafka/adapter.go (7 hunks)
  • router/pkg/pubsub/nats/adapter.go (9 hunks)
  • router/pkg/pubsub/pubsub.go (7 hunks)
  • router/pkg/pubsub/pubsub_test.go (8 hunks)
  • router/pkg/pubsub/redis/adapter.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • router/pkg/pubsub/datasource/provider.go
  • router/pkg/config/testdata/config_full.json
  • router/pkg/metric/config.go
  • router/pkg/config/testdata/config_defaults.json
  • router/pkg/config/config.go
  • router/core/router.go
  • demo/pkg/subgraphs/subgraphs.go
  • router/core/plan_generator.go
  • router/pkg/config/config.schema.json
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T16:47:03.713Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.713Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/pkg/metric/noop_stream_metrics.go
  • router/pkg/metric/oltp_stream_metric_store.go
  • router/pkg/metric/prom_stream_metric_store.go
  • router/pkg/pubsub/kafka/adapter.go
📚 Learning: 2025-08-14T17:46:00.905Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/pkg/metric/noop_stream_metrics.go
  • router/pkg/metric/stream_metric_store.go
  • router/pkg/metric/oltp_stream_metric_store.go
  • router/pkg/metric/prom_stream_metric_store.go
  • router-tests/testenv/testenv.go
  • router/core/graph_server.go
📚 Learning: 2025-08-14T17:45:57.468Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/pkg/metric/stream_metric_store.go
  • router/pkg/metric/oltp_stream_metric_store.go
  • router/pkg/metric/prom_stream_metric_store.go
  • router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/pkg/metric/stream_metric_store.go
  • router/pkg/metric/oltp_stream_metric_store.go
  • router-tests/testenv/testenv.go
🧬 Code Graph Analysis (12)
router/pkg/metric/noop_stream_metrics.go (1)
router/pkg/metric/stream_metric_store.go (1)
  • StreamsEvent (25-31)
router-tests/prometheus_stream_metrics_test.go (4)
router-tests/telemetry/stream_metrics_test.go (1)
  • TestFlakyEventMetrics (29-500)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • Config (284-340)
  • MetricOptions (263-277)
  • Environment (1727-1763)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-81)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router/pkg/metric/stream_metric_store.go (2)
router/pkg/metric/config.go (2)
  • Config (107-131)
  • OpenTelemetry (69-84)
router/pkg/otel/attributes.go (5)
  • WgStreamOperationName (65-65)
  • WgProviderType (66-66)
  • WgErrorType (69-69)
  • WgProviderId (68-68)
  • WgDestinationName (67-67)
router/core/factoryresolver.go (1)
router/pkg/metric/stream_metric_store.go (1)
  • StreamMetricStore (41-47)
router-tests/telemetry/stream_metrics_test.go (5)
router-tests/prometheus_stream_metrics_test.go (1)
  • TestFlakyEventMetrics (26-468)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • Config (284-340)
  • MetricOptions (263-277)
  • Environment (1727-1763)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-81)
router-tests/utils.go (2)
  • GetMetricScopeByName (88-95)
  • GetMetricByName (79-86)
router/pkg/otel/attributes.go (5)
  • WgStreamOperationName (65-65)
  • WgProviderType (66-66)
  • WgDestinationName (67-67)
  • WgProviderId (68-68)
  • WgErrorType (69-69)
router/pkg/pubsub/redis/adapter.go (5)
router/pkg/pubsub/kafka/adapter.go (3)
  • NewProviderAdapter (257-278)
  • Adapter (30-35)
  • ProviderAdapter (42-50)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (37-39)
router/pkg/metric/stream_metric_store.go (4)
  • StreamMetricStore (41-47)
  • StreamsEvent (25-31)
  • ProviderType (16-16)
  • ProviderTypeRedis (21-21)
router/pkg/metric/noop_stream_metrics.go (1)
  • NewNoopStreamMetricStore (15-15)
router/internal/persistedoperation/operationstorage/redis/rdcloser.go (1)
  • RDCloser (14-16)
router/pkg/pubsub/kafka/adapter.go (5)
router/pkg/metric/stream_metric_store.go (4)
  • StreamMetricStore (41-47)
  • StreamsEvent (25-31)
  • ProviderType (16-16)
  • ProviderTypeKafka (19-19)
router/pkg/pubsub/nats/adapter.go (1)
  • ProviderAdapter (42-54)
router/pkg/pubsub/redis/adapter.go (2)
  • ProviderAdapter (56-65)
  • NewProviderAdapter (33-54)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (37-39)
router/pkg/metric/noop_stream_metrics.go (1)
  • NewNoopStreamMetricStore (15-15)
router/pkg/pubsub/pubsub.go (4)
router/pkg/config/config.go (1)
  • EventsConfiguration (627-629)
router/pkg/metric/stream_metric_store.go (1)
  • StreamMetricStore (41-47)
router/pkg/pubsub/datasource/provider.go (3)
  • Provider (19-25)
  • ProviderBuilder (28-35)
  • ProviderOpts (37-39)
router/pkg/metric/noop_stream_metrics.go (1)
  • NewNoopStreamMetricStore (15-15)
router-tests/testenv/testenv.go (4)
router/pkg/pubsub/nats/adapter.go (1)
  • NewAdapter (356-379)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (37-39)
router/pkg/metric/stream_metric_store.go (1)
  • StreamMetricStore (41-47)
router/pkg/metric/noop_stream_metrics.go (1)
  • NewNoopStreamMetricStore (15-15)
router/pkg/pubsub/pubsub_test.go (2)
router/pkg/metric/noop_stream_metrics.go (1)
  • NewNoopStreamMetricStore (15-15)
demo/pkg/subgraphs/subgraphs.go (1)
  • New (207-280)
router/pkg/pubsub/nats/adapter.go (4)
router/pkg/metric/stream_metric_store.go (4)
  • StreamMetricStore (41-47)
  • StreamsEvent (25-31)
  • ProviderType (16-16)
  • ProviderTypeNats (20-20)
router/pkg/pubsub/nats/engine_datasource.go (1)
  • PublishAndRequestEventConfiguration (28-32)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (37-39)
router/pkg/metric/noop_stream_metrics.go (1)
  • NewNoopStreamMetricStore (15-15)
router/core/graph_server.go (5)
router/pkg/metric/stream_metric_store.go (2)
  • StreamMetricStore (41-47)
  • NewStreamMetricStore (56-81)
router/pkg/metric/noop_metrics.go (1)
  • NewNoopMetrics (63-65)
router/pkg/metric/noop_stream_metrics.go (1)
  • NewNoopStreamMetricStore (15-15)
router/pkg/metric/config.go (1)
  • OpenTelemetry (69-84)
router/pkg/config/config.go (1)
  • Prometheus (97-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: image_scan
  • GitHub Check: build-router
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (34)
router/pkg/metric/oltp_stream_metric_store.go (1)

50-52: No-op Shutdown on StreamMetricProvider isn’t required
The StreamMetricProvider interface only defines Produce, Consume, and Flush—it does not include Shutdown. As a result, adding a no-op Shutdown method to otlpStreamEventMetrics (or promStreamEventMetrics) would neither satisfy any interface nor be called by the store. The StreamMetrics store itself implements Shutdown(ctx) by flushing all providers, which aligns with the central shutdown of meter providers at the router level.

Likely an incorrect or invalid review comment.

router/pkg/otel/attributes.go (1)

63-70: Attribute keys diverge from PR’s documented labels; confirm final naming

PR text mentions:

  • wg.messaging.operation.name
  • wg.messaging.system
  • wg.messaging.destination.name
  • wg.provider.id
  • wg.error (boolean)

Current additions are:

  • wg.stream.operation.name
  • wg.provider.type
  • wg.destination.name
  • wg.provider.id
  • wg.error.type

This inconsistency will confuse users and break documented examples. Decide on one scheme (prefer a single “stream” or “messaging” prefix, and boolean wg.error vs error.type).

Would you like a follow-up diff to align keys and the call sites across the metrics store and adapters once you confirm the chosen naming?

router/core/graph_server.go (3)

783-786: Good defaulting to Noop stores

Initializing both metricStore and streamMetricStore to no-op avoids scattered nil checks and simplifies shutdown paths.


883-895: Stream store wiring looks correct; verify config gating matches intent

Creation is gated on OpenTelemetry.Streams || Prometheus.Streams, independent of the top-level metricsEnabled. If intended to allow enabling streams independently, this is good. Otherwise, consider also gating on metricsEnabled.


1135-1140: StreamMetricStore set in all RouterEngineConfiguration literals

Verified that every literal includes the new field:

  • router/core/plan_generator.go (line 258): StreamMetricStore: metric.NewNoopStreamMetricStore()
  • router/core/graph_server.go (lines 1135–1140): StreamMetricStore: gm.streamMetricStore

No missing instances found.

router/pkg/metric/noop_stream_metrics.go (1)

7-15: Solid null-object implementation; fits the codebase pattern

NoopStreamMetricStore cleanly satisfies the StreamMetricStore contract with no side effects. Matches the router-wide approach of central provider lifecycle management and no-op Shutdown/Flush.

router/pkg/pubsub/pubsub_test.go (3)

70-71: Correct: pass StreamMetricStore into build()

Good call to thread rmetric.NewNoopStreamMetricStore() into build for deterministic tests.


123-124: Correct: error-path mock updated to include providerOpts

The expectation now matches the updated BuildProvider signature.


241-243: Correct: OK-path mock updated to include providerOpts

Expectation and line break formatting look good.

router-tests/testenv/testenv.go (5)

66-68: Correct imports for ProviderOpts and metrics

New imports are used below to inject a no-op stream metric store into NATS adapters during tests.


273-277: Well-scoped flags for enabling stream metrics

EnableOTLPEventMetrics and EnablePrometheusEventMetrics integrate cleanly with the existing MetricOptions without coupling.


1494-1516: Prometheus config: proper wiring of Streams flag and exclusions

The Streams toggle flows into PrometheusConfig; consistent with other metric groups and exclusions.


1519-1547: OTLP config: proper wiring of Streams flag and test reader

Streams toggle and TestReader integration are consistent with the rest of the telemetry setup.


2827-2829: Good: inject NoopStreamMetricStore into NATS adapters in tests

This matches the codebase’s null-object pattern for metric stores; avoids nil checks and keeps adapters simple. Referencing your prior approach for event metrics is consistent.

router-tests/prometheus_stream_metrics_test.go (2)

24-24: Test timeout is reasonable

30s affords enough time for external brokers in CI; keep as-is unless flakes are observed.


259-268: Nice: shorter read timeout improves NATS subscribe determinism

Tuning WebSocketClientReadTimeout to 1s helps these subscription tests complete reliably.

router/core/factoryresolver.go (1)

11-12: Import alias is appropriate

Using rmetric keeps the StreamMetricStore type reference clear and avoids collision.

router/pkg/pubsub/kafka/adapter.go (4)

24-27: Operation-name constants are clear and aligned with tests

Defining kafkaReceive/kafkaProduce centrally is good. Kafka uses "produce" (as asserted in tests) while Redis/NATS use "publish"/"request".


143-161: Goroutine wiring and provider ID propagation look good

Spawning a poller per subscription and threading providerId via PollerOpts is clean and keeps metrics attributable per provider.


197-215: Produce-path metrics are complete (provider labels and error type included)

Both success and error paths emit StreamOperationName and include ProviderType, ProviderId, DestinationName; error path sets "publish_error". Matches OTLP/Prom expectations and tests.


257-277: Null-object fallback for StreamMetricStore is correct and consistent

Defaulting to NewNoopStreamMetricStore when no store is provided aligns with the codebase’s pattern (no nil checks at call sites).

router/pkg/pubsub/redis/adapter.go (3)

33-54: ProviderOpts plumbing and Noop metric store fallback are solid

Constructor threads StreamMetricStore through and defaults to Noop, ensuring non-nil store everywhere.


130-136: Consume-path metrics include provider and destination attributes

Emitting ProviderId, ProviderType=redis, operation=receive, and DestinationName=channel is correct and matches tests.


174-180: Publish-path metrics are complete and consistent

Error path uses ErrorType="publish_error"; both paths include ProviderId/Type and DestinationName. Good coverage.

Also applies to: 184-189

router-tests/telemetry/stream_metrics_test.go (3)

38-82: Kafka publish assertions are precise and aligned with instrumentation

Validates operation="produce", provider/type/destination labels, and correct count aggregation. Nice.


88-164: Kafka subscribe test synchronizes well before collecting metrics

Waiting for subscription, producing one message, then Collect-ing is a good pattern to reduce flakiness.


371-418: Redis publish/subscribe tests are consistent with the metric spec

Attributes and counts match the store’s semantics. Good use of testenv helpers.

Also applies to: 420-499

router/pkg/pubsub/nats/adapter.go (3)

221-255: Publish metrics include provider labels and error classification

Both success and error paths set ProviderId/Type, destination, and appropriate operation; error path uses "publish_error". Matches tests.


257-299: Request metrics are complete; error type is specific

Operation="request" and "request_error" classification is useful; provider/destination labels included. Good.


356-379: Constructor uses Noop store fallback and scopes logger to “nats”

Consistent with other providers; avoids nil checks downstream.

router/pkg/pubsub/pubsub.go (2)

56-60: Good: default to Noop stream metric store

Guarding against nil here ensures provider builders always receive a non-nil store, matching the project’s null-object pattern.


73-76: Passing the store through build() per provider is the right abstraction

Centralizes metric plumbing without leaking concerns to data source planning.

Also applies to: 89-92, 105-108

router/pkg/metric/stream_metric_store.go (2)

90-92: Verify attribute set for cardinality and conventions.

  • Destination name and provider id as labels can increase cardinality. Ensure this aligns with your Prom/OTLP backends and that Config.CardinalityLimit is set appropriately where applicable.
  • Confirm wg.stream.operation.name is the intended attribute (product notes mentioned omitting operation name). If you plan to exclude it, implement that upstream or gate it via config.

Also applies to: 96-101, 110-113, 118-122


33-39: Interfaces look clean and future-proof.

Separation between provider and store (with only Flush/Shutdown on the store) aligns with the router-level provider lifecycle. Good abstraction.

Also applies to: 41-47

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
router-tests/telemetry/stream_metrics_test.go (5)

60-62: Assert type assertions for metric sums to fail fast with a clear message

Currently the code discards the ok result of the type assertion. If the metric type changes, tests will fail with a vague "len=0" rather than an explicit type mismatch. Assert ok to get actionable failures.

Apply this diff (pattern repeated across all occurrences):

-				sum, _ := metricEntry.Data.(metricdata.Sum[int64])
-				require.Len(t, sum.DataPoints, 1)
+				sum, ok := metricEntry.Data.(metricdata.Sum[int64])
+				require.True(t, ok, "expected router stream metric %q to be Sum[int64], got %T", metricEntry.Name, metricEntry.Data)
+				require.Len(t, sum.DataPoints, 1)

Also applies to: 137-140, 197-200, 249-252, 333-337, 395-399, 471-475


281-283: Increase WebSocket client read timeout to reduce CI flakiness

A 1s WebSocket client read timeout can be tight under CI load. Bumping this slightly gives headroom without materially slowing the test.

-				ModifyEngineExecutionConfiguration: func(ec *config.EngineExecutionConfiguration) { ec.WebSocketClientReadTimeout = time.Second },
+				ModifyEngineExecutionConfiguration: func(ec *config.EngineExecutionConfiguration) { ec.WebSocketClientReadTimeout = 3 * time.Second },

101-109: Prefer typed assertions over raw JSON string comparisons for subscription payloads

You pass a typed selection struct to the subscription but only assert on the raw JSON bytes. Decoding into a typed struct (or asserting on typed fields) yields clearer diffs and compile-time safety. At minimum, align the ID type with GraphQL Int.

Apply this small type improvement:

-					employeeUpdatedMyKafka struct {
-						ID      float64 `graphql:"id"`
+					employeeUpdatedMyKafka struct {
+						ID      int `graphql:"id"`
 						Details struct {
 							Forename string `graphql:"forename"`
 							Surname  string `graphql:"surname"`
 						} `graphql:"details"`
 					} `graphql:"employeeUpdatedMyKafka(employeeID: 3)"`
-					employeeUpdated struct {
-						ID      float64 `graphql:"id"`
+					employeeUpdated struct {
+						ID      int `graphql:"id"`
 						Details struct {
 							Forename string `graphql:"forename"`
 							Surname  string `graphql:"surname"`
 						} `graphql:"details"`
 					} `graphql:"employeeUpdated(employeeID: 3)"`
-					employeeUpdates struct {
-						ID      float64 `graphql:"id"`
+					employeeUpdates struct {
+						ID      int `graphql:"id"`
 						Details struct {
 							Forename string `graphql:"forename"`
 							Surname  string `graphql:"surname"`
 						} `graphql:"details"`
 					} `graphql:"employeeUpdates"`

Optionally, unmarshal args.dataValue into a small typed struct and assert fields rather than the whole JSON blob.

Also applies to: 284-292, 433-441


52-82: Validate metric temporality/monotonicity for completeness (optional)

You already validate the counter values and attributes. As a nice-to-have, also assert that the metric is monotonic and has cumulative temporality to match OTel semantic expectations for counters.

Example (apply similarly in other sections):

require.True(t, sum.IsMonotonic, "expected counter to be monotonic")
require.Equal(t, metricdata.CumulativeTemporality, sum.Temporality)

1-500: Reduce duplication with a small assertion helper (optional)

The scope/metric lookup, type assertion, attribute checks, and value assertions are repeated across backends. A helper would make failures uniform and the tests shorter.

Example helper (place in router-tests/telemetry/assert_helpers.go):

package telemetry

import (
	"strings"
	"testing"

	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"

	integration "github.com/wundergraph/cosmo/router-tests"
	otelattrs "github.com/wundergraph/cosmo/router/pkg/otel"
)

type StreamMetricExpect struct {
	MetricName      string
	Operation       string
	ProviderType    string
	DestinationSfx  string
	ProviderID      string
	ExpectedValue   int64
}

func AssertOTLPStreamCounter(t *testing.T, rm metricdata.ResourceMetrics, exp StreamMetricExpect) {
	t.Helper()
	scope := integration.GetMetricScopeByName(rm.ScopeMetrics, "cosmo.router.streams")
	require.NotNil(t, scope)
	metricEntry := integration.GetMetricByName(scope, exp.MetricName)
	require.NotNil(t, metricEntry)

	sum, ok := metricEntry.Data.(metricdata.Sum[int64])
	require.True(t, ok, "expected %q to be Sum[int64], got %T", metricEntry.Name, metricEntry.Data)

	require.Len(t, sum.DataPoints, 1)
	dp := sum.DataPoints[0]
	attrs := dp.Attributes

	op, _ := attrs.Value(otelattrs.WgStreamOperationName)
	sys, _ := attrs.Value(otelattrs.WgProviderType)
	dst, _ := attrs.Value(otelattrs.WgDestinationName)
	provider, hasProvider := attrs.Value(otelattrs.WgProviderId)
	_, hasErr := attrs.Value(otelattrs.WgErrorType)

	require.Equal(t, exp.Operation, op.AsString())
	require.Equal(t, exp.ProviderType, sys.AsString())
	require.True(t, strings.HasSuffix(dst.AsString(), exp.DestinationSfx), dst.AsString())
	require.True(t, hasProvider)
	require.Equal(t, exp.ProviderID, provider.AsString())
	require.False(t, hasErr)
	require.Equal(t, exp.ExpectedValue, dp.Value)
}

Use in tests:

rm := metricdata.ResourceMetrics{}
require.NoError(t, metricReader.Collect(context.Background(), &rm))
AssertOTLPStreamCounter(t, rm, StreamMetricExpect{
  MetricName: "router.streams.sent.messages",
  Operation: "produce",
  ProviderType: "kafka",
  DestinationSfx: "employeeUpdated",
  ProviderID: "my-kafka",
  ExpectedValue: 2,
})

I can push a refactor PR on top of this if helpful.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1667595 and f4393c0.

📒 Files selected for processing (3)
  • router-tests/prometheus_stream_metrics_test.go (1 hunks)
  • router-tests/telemetry/stream_metrics_test.go (1 hunks)
  • router/pkg/metric/stream_measurements.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/pkg/metric/stream_measurements.go
  • router-tests/prometheus_stream_metrics_test.go
🧰 Additional context used
🧬 Code Graph Analysis (1)
router-tests/telemetry/stream_metrics_test.go (6)
router-tests/prometheus_stream_metrics_test.go (2)
  • TestFlakyEventMetrics (26-468)
  • WaitTimeout (24-24)
router-tests/testenv/testenv.go (4)
  • Run (105-122)
  • Config (284-340)
  • MetricOptions (263-277)
  • Environment (1727-1763)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-81)
router-tests/utils.go (2)
  • GetMetricScopeByName (88-95)
  • GetMetricByName (79-86)
router/pkg/otel/attributes.go (5)
  • WgStreamOperationName (65-65)
  • WgProviderType (66-66)
  • WgDestinationName (67-67)
  • WgProviderId (68-68)
  • WgErrorType (69-69)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)

@StarpTech StarpTech (Contributor) left a comment

LGTM

@SkArchon SkArchon changed the title feat: stream metrics feat: edfs stream metrics Aug 19, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
router-tests/prometheus_stream_metrics_test.go (1)

32-154: Fix race condition: Kafka subtests share topics.

The "publish" and "subscribe" Kafka subtests run in parallel but operate on the same topic ("employeeUpdated"). Since EnsureTopicExists deletes and recreates the topic, concurrent execution causes race conditions.

Remove parallel execution from the Kafka subtests:

 	t.Run("publish", func(t *testing.T) {
-		t.Parallel()
+		// Not parallel: shares topic with subscribe test

 	t.Run("subscribe", func(t *testing.T) {
-		t.Parallel()
+		// Not parallel: shares topic with publish test
router-tests/telemetry/stream_metrics_test.go (1)

35-37: Fix race condition: Kafka subtests share topics.

Same issue as in the Prometheus test - the Kafka subtests run in parallel but share the same topic.

Remove parallel execution from the Kafka subtests:

 	t.Run("publish", func(t *testing.T) {
-		t.Parallel()
+		// Not parallel: shares topic with subscribe test

 	t.Run("subscribe", func(t *testing.T) {
-		t.Parallel()
+		// Not parallel: shares topic with publish test
🧹 Nitpick comments (3)
router-tests/prometheus_stream_metrics_test.go (2)

26-26: Consider renaming test to reflect its purpose.

The test name TestFlakyEventMetrics suggests it's testing flaky behavior, but it appears to be testing normal stream metrics functionality. Consider renaming to something more descriptive like TestPrometheusStreamMetrics.

-func TestFlakyEventMetrics(t *testing.T) {
+func TestPrometheusStreamMetrics(t *testing.T) {

54-75: Add helper functions for metric assertions.

The metric assertion logic is repeated throughout the test with only minor variations. Consider extracting helper functions to reduce duplication and improve readability.

Add these helper functions at the beginning of the file:

func findMetricFamilyByName(mf []*dto.MetricFamily, name string) *dto.MetricFamily {
	for _, family := range mf {
		if family.GetName() == name {
			return family
		}
	}
	return nil
}

func findMetricLabelByName(metrics []*dto.Metric, name string) *dto.LabelPair {
	if len(metrics) == 0 {
		return nil
	}
	for _, label := range metrics[0].GetLabel() {
		if label.GetName() == name {
			return label
		}
	}
	return nil
}

Then use them consistently throughout the test.

router-tests/telemetry/stream_metrics_test.go (1)

29-29: Consider renaming test to reflect its purpose.

Similar to the Prometheus test, the name TestFlakyEventMetrics is misleading. Consider renaming to TestOTLPStreamMetrics for clarity.

-func TestFlakyEventMetrics(t *testing.T) {
+func TestOTLPStreamMetrics(t *testing.T) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 83bc1f8 and 0da26e7.

📒 Files selected for processing (3)
  • router-tests/prometheus_stream_metrics_test.go (1 hunks)
  • router-tests/telemetry/stream_metrics_test.go (1 hunks)
  • router-tests/testenv/testenv.go (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router-tests/testenv/testenv.go
🧬 Code Graph Analysis (2)
router-tests/prometheus_stream_metrics_test.go (3)
router-tests/testenv/testenv.go (7)
  • Run (105-122)
  • Config (284-340)
  • ConfigWithEdfsKafkaJSONTemplate (85-85)
  • MetricOptions (263-277)
  • Environment (1727-1763)
  • ConfigWithEdfsNatsJSONTemplate (87-87)
  • ConfigWithEdfsRedisJSONTemplate (89-89)
router-tests/events/event_helpers.go (3)
  • EnsureTopicExists (37-54)
  • ProduceKafkaMessage (16-35)
  • ProduceRedisMessage (56-81)
router-tests/testenv/utils.go (1)
  • AwaitChannelWithT (10-19)
router-tests/testenv/testenv.go (4)
router/pkg/pubsub/nats/adapter.go (1)
  • NewAdapter (356-379)
router/pkg/pubsub/datasource/provider.go (1)
  • ProviderOpts (37-39)
router/pkg/metric/stream_metric_store.go (1)
  • StreamMetricStore (41-47)
router/pkg/metric/noop_stream_metrics.go (1)
  • NewNoopStreamMetricStore (15-15)
🔇 Additional comments (4)
router-tests/testenv/testenv.go (4)

273-276: LGTM! Stream metrics configuration added correctly.

The addition of EnableOTLPStreamMetrics and EnablePrometheusStreamMetrics fields to the MetricOptions struct properly extends the test environment to support stream metrics for both telemetry exporters.


1510-1510: LGTM! Prometheus stream metrics wired correctly.

The Streams field is properly configured from EnablePrometheusStreamMetrics.


1533-1533: LGTM! OTLP stream metrics wired correctly.

The Streams field is properly configured from EnableOTLPStreamMetrics.


2827-2829: LGTM! NATS adapter initialization with stream metrics.

The NATS adapter is properly initialized with the ProviderOpts containing the no-op stream metric store for testing.

@SkArchon SkArchon merged commit 49df81f into main Aug 19, 2025
65 of 66 checks passed
@SkArchon SkArchon deleted the milinda/eng-7785-subscriptions-metrics branch August 19, 2025 19:44
@Noroth Noroth mentioned this pull request Sep 30, 2025