feat: edfs stream metrics #2137

Conversation
> **Caution:** Review failed. The pull request is closed.

**Walkthrough**

Threads a new StreamMetricStore through the pubsub/provider/adapter stack and metric subsystem: adds stream metrics implementations (OTLP, Prometheus, Noop), introduces datasource.ProviderOpts to pass the store to provider adapters (Kafka, NATS, Redis), instruments adapters to emit stream events, extends config/schema with Streams flags, and updates tests and helpers.

**Changes**

**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~75 minutes

**Possibly related PRs**
**Router-nonroot image scan passed** — ✅ No security vulnerabilities found in image.
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 11
🔭 Outside diff range comments (17)
router/pkg/metric/prom_connection_metric_store.go (2)
**55-61: Initial max-connections observations omit base attributes**

For consistency with the callback (which includes both base and per-subgraph attrs), include the provided base attributes in the initial MeasureMaxConnections calls. Otherwise, time series will differ in label sets between initial and subsequent observations.
Apply:
```diff
- h.MeasureMaxConnections(context.Background(), maxConns, otelmetric.WithAttributes(attrs...))
+ h.MeasureMaxConnections(
+     context.Background(),
+     maxConns,
+     otelmetric.WithAttributes(attributes...),
+     otelmetric.WithAttributes(attrs...),
+ )
```
**101-104: errors.Join usage drops previously collected errors**

You overwrite err each iteration with errors.Join(regErr), losing prior errors. Accumulate instead.
Apply:
```diff
- if regErr := reg.Unregister(); regErr != nil {
-     err = errors.Join(regErr)
- }
+ if regErr := reg.Unregister(); regErr != nil {
+     err = errors.Join(err, regErr)
+ }
```

router/pkg/metric/oltp_connection_metric_store.go (2)
**55-61: Include base attributes in initial max-connections measurements**

Match the callback's attribute set by adding baseAttributes to the initial MeasureMaxConnections calls. This avoids label-set drift between initial and subsequent observations.
Apply:
```diff
- h.MeasureMaxConnections(context.Background(), maxConns, otelmetric.WithAttributes(attrs...))
+ h.MeasureMaxConnections(
+     context.Background(),
+     maxConns,
+     otelmetric.WithAttributes(attributes...),
+     otelmetric.WithAttributes(attrs...),
+ )
```
**102-105: errors.Join usage drops previously collected errors**

Same issue as in the Prometheus implementation — accumulate unregister errors instead of overwriting.
Apply:
```diff
- if regErr := reg.Unregister(); regErr != nil {
-     err = errors.Join(regErr)
- }
+ if regErr := reg.Unregister(); regErr != nil {
+     err = errors.Join(err, regErr)
+ }
```

router-tests/events/nats_events_test.go (2)
**87-95: Build is broken: undefined type `natsSubscriptionArgs` (reintroduce the helper type).**

The local helper type was removed but is still referenced throughout this file, causing the Router CI failure:

```
undefined: natsSubscriptionArgs
```

Reintroduce a minimal local type to unblock builds.
Apply this diff near the top (after the timeout const) to restore the type:
```diff
 const NatsWaitTimeout = time.Second * 30

+// helper args container for subscription callbacks
+type natsSubscriptionArgs struct {
+    dataValue []byte
+    errValue  error
+}
```
**1548-1555: Slice preallocation bug: using make with len and then appending doubles the slice size with zero entries.**

You preallocate `natsEventSources` with length and then append, leaving leading zero-valued entries. This yields invalid provider entries and can alter test behavior.

Apply this diff to preallocate capacity only:
```diff
- natsEventSources := make([]config.NatsEventSource, len(testenv.DemoNatsProviders))
+ natsEventSources := make([]config.NatsEventSource, 0, len(testenv.DemoNatsProviders))
  for _, sourceName := range testenv.DemoNatsProviders {
      natsEventSources = append(natsEventSources, config.NatsEventSource{
          ID:  sourceName,
          URL: url,
      })
  }
```

router/pkg/config/config.go (1)
**103-111: Apply Prometheus EventMetrics env var fix in config.go**

Scanning for `METRICS_OTLP_EVENT_METRICS` shows two definitions of `EventMetrics` in `router/pkg/config/config.go`. Only the Prometheus one (lines 103–111) should be updated; the OTLP toggle at lines 141–143 must remain using `METRICS_OTLP_EVENT_METRICS`.

- Modify Prometheus config at ~line 103:

```diff
- EventMetrics bool `yaml:"event_metrics" envDefault:"false" env:"METRICS_OTLP_EVENT_METRICS"`
+ EventMetrics bool `yaml:"event_metrics" envDefault:"false" env:"PROMETHEUS_EVENT_METRICS"`
```

- Leave the OTLP config (lines 141–143) unchanged.
- Update `config.schema.json` and any docs to reflect `PROMETHEUS_EVENT_METRICS`.
- Add an integration test that sets `PROMETHEUS_EVENT_METRICS=true` to guard against regressions.

router/pkg/pubsub/datasource/mocks.go (1)
**795-853: Mock BuildProvider must forward providerOpts to the testify mock**

The autogenerated mock for BuildProvider currently ignores the providerOpts parameter when calling _mock.Called, so any expectations on providerOpts will silently be dropped. Please update router/pkg/pubsub/datasource/mocks.go to:

- In the MockProviderBuilder.BuildProvider implementation (around line 795):
  - Change `ret := _mock.Called(options)` to `ret := _mock.Called(options, providerOpts)`
  - Adjust the return-value logic to first handle callbacks of type `func(P, ProviderOpts) (Provider, error)` and `func(P, ProviderOpts) Provider` before falling back to the old single-arg signatures.
- In the expecter helper MockProviderBuilder_Expecter.BuildProvider (around line 829):
  - Add a second parameter for providerOpts interface{}
  - Change the call to `_e.mock.On("BuildProvider", options, providerOpts)`
- In the call-builder helpers (around lines 833–853):
  - Update Run to accept `func(options P, providerOpts ProviderOpts)` and extract args[1]
  - Update RunAndReturn to accept `func(options P, providerOpts ProviderOpts) (Provider, error)`

Applying these changes will allow tests to set and assert expectations on the new providerOpts argument; a condensed sketch of the resulting method follows below.
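For orientation, here is a condensed sketch of how the regenerated method ends up looking once both arguments are forwarded (the full mechanical diff appears in the follow-up review further down; the expecter helpers change analogously):

```go
// Condensed sketch: BuildProvider forwarding both arguments to testify.
func (_mock *MockProviderBuilder[P, E]) BuildProvider(options P, providerOpts ProviderOpts) (Provider, error) {
	ret := _mock.Called(options, providerOpts) // match expectations on both args

	// Prefer two-arg callbacks registered via RunAndReturn.
	if returnFunc, ok := ret.Get(0).(func(P, ProviderOpts) (Provider, error)); ok {
		return returnFunc(options, providerOpts)
	}

	var r0 Provider
	if ret.Get(0) != nil {
		r0 = ret.Get(0).(Provider)
	}
	return r0, ret.Error(1)
}
```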
router/pkg/pubsub/pubsub_test.go (1)
**64-65: Mock expectations likely outdated: BuildProvider now accepts provider options.**

Per the PR changes, MockProviderBuilder.BuildProvider takes providerOpts. The current mocks expect only a single argument and will mismatch once the new signature is in effect. Update the test mocks to include the additional argument (use mock.Anything to keep tests decoupled from the exact opts).
Apply these diffs:
```diff
@@
- mockBuilder.On("BuildProvider", natsEventSources[0]).Return(mockPubSubProvider, nil)
+ mockBuilder.On("BuildProvider", natsEventSources[0], mock.Anything).Return(mockPubSubProvider, nil)
@@
- mockBuilder.On("BuildProvider", natsEventSources[0]).Return(nil, errors.New("provider error"))
+ mockBuilder.On("BuildProvider", natsEventSources[0], mock.Anything).Return(nil, errors.New("provider error"))
@@
- mockBuilder.On("BuildProvider", natsEventSources[1]).Return(mockPubSubUsedProvider, nil)
+ mockBuilder.On("BuildProvider", natsEventSources[1], mock.Anything).Return(mockPubSubUsedProvider, nil)
```

Add the missing import to use mock.Anything:

```diff
@@
  "github.com/stretchr/testify/assert"
+ "github.com/stretchr/testify/mock"
```

Also applies to: 121-122, 239-240
router/pkg/metric/oltp_event_metric_store.go (1)
**68-69: Interface mismatch: Shutdown signature must accept a context.**

EventMetricStore requires Shutdown(ctx context.Context) error. The current method omits the context and won't satisfy the interface. Also, delegating to meterProvider.Shutdown(ctx) ensures proper exporter teardown.
Apply this diff:
```diff
-func (o *otlpEventMetrics) Shutdown() error { return nil }
+func (o *otlpEventMetrics) Shutdown(ctx context.Context) error {
+    return o.meterProvider.Shutdown(ctx)
+}
```

router/core/factoryresolver.go (2)
**472-480: Add nil-safe fallback for EventMetricStore**

Add a guard around `routerEngineConfig.EventMetricStore` in factoryresolver.go to prevent nil stores from causing panics. Verified this is the only production call site that needs updating.

- File: router/core/factoryresolver.go

```diff
 factoryProviders, factoryDataSources, err := pubsub.BuildProvidersAndDataSources(
     l.ctx,
     routerEngineConfig.Events,
+    func() rmetric.EventMetricStore {
+        if routerEngineConfig.EventMetricStore == nil {
+            return rmetric.NewNoopEventMetricStore()
+        }
+        return routerEngineConfig.EventMetricStore
+    }(),
     l.logger,
     pubSubDS,
     l.resolver.InstanceData().HostName,
     l.resolver.InstanceData().ListenAddress,
 )
```
**210-216: Ensure all RouterEngineConfiguration literals set EventMetricStore**

In router/core/plan_generator.go the zero-value literal leaves EventMetricStore nil. Update the struct instantiation to use a noop store, for example:

- router/core/plan_generator.go:256

```diff
- routerEngineConfig := RouterEngineConfiguration{}
+ routerEngineConfig := RouterEngineConfiguration{
+     EventMetricStore: rmetric.NewNoopEventMetricStore(),
+ }
```

The graph_server path already supplies a non-nil store.
router-tests/events/kafka_events_test.go (1)
**71-79: Bug in EnsureTopicExists helper: preallocate+append yields empty topics.**

Calls to `EnsureTopicExists(t, xEnv, topics...)` rely on router-tests/events/event_helpers.go. That helper constructs `prefixedTopics := make([]string, len(topics))` and then appends, resulting in an initial slice of empty strings plus the desired topics. Deleting/creating topics with empty names will likely fail intermittently.

Fix in router-tests/events/event_helpers.go:
```go
func EnsureTopicExists(t *testing.T, xEnv *testenv.Environment, topics ...string) {
	// Delete topic for idempotency
	deleteCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	prefixedTopics := make([]string, 0, len(topics))
	for _, topic := range topics {
		prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
	}

	_, err := xEnv.KafkaAdminClient.DeleteTopics(deleteCtx, prefixedTopics...)
	require.NoError(t, err)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	_, err = xEnv.KafkaAdminClient.CreateTopics(ctx, 1, 1, nil, prefixedTopics...)
	require.NoError(t, err)
}
```

Do you want me to open a PR patching this helper for you?
router/pkg/pubsub/redis/provider_builder.go (1)
**46-66: Guard against missing adapter to avoid nil deref.**

Unlike Kafka/NATS builders, BuildEngineDataSourceFactory does not check whether an adapter exists for providerId. If the configuration references a non-existent provider, b.adapters[providerId] will be nil and downstream usage can panic.
Align with other builders by checking existence and returning an error if not found.
Apply this diff:
```diff
 func (b *ProviderBuilder) BuildEngineDataSourceFactory(data *nodev1.RedisEventConfiguration) (datasource.EngineDataSourceFactory, error) {
     providerId := data.GetEngineEventConfiguration().GetProviderId()

+    adapter, ok := b.adapters[providerId]
+    if !ok {
+        return nil, fmt.Errorf("failed to get adapter for provider %s with ID %s", b.TypeID(), providerId)
+    }
+
     var eventType EventType
     switch data.GetEngineEventConfiguration().GetType() {
     case nodev1.EventType_PUBLISH:
         eventType = EventTypePublish
     case nodev1.EventType_SUBSCRIBE:
         eventType = EventTypeSubscribe
     default:
         return nil, fmt.Errorf("unsupported event type: %s", data.GetEngineEventConfiguration().GetType())
     }

     return &EngineDataSourceFactory{
-        RedisAdapter: b.adapters[providerId],
+        RedisAdapter: adapter,
         fieldName:    data.GetEngineEventConfiguration().GetFieldName(),
         eventType:    eventType,
         channels:     data.GetChannels(),
         providerId:   providerId,
     }, nil
 }
```

router/pkg/pubsub/kafka/adapter.go (1)
**48-60: Use the subscription context for PollRecords to improve cancellation responsiveness.**

topicPoller checks ctx.Done(), but PollRecords currently uses p.ctx, which can delay cancellation when the subscription ctx is canceled. Use ctx in PollRecords so the poll loop can exit promptly on unsubscribe.
Apply this diff:
```diff
- fetches := client.PollRecords(p.ctx, 10_000)
+ fetches := client.PollRecords(ctx, 10_000)
```

router/pkg/pubsub/nats/adapter.go (2)
**129-147: JetStream fetch loop terminates on no-message/timeout; should continue with backoff.**

Using FetchNoWait(300) and returning on any error will stop the subscription when there are temporarily no messages (or on timeouts). This leads to dropped subscriptions and flakiness. Also, calling FetchNoWait in a tight loop risks a CPU spin if changed to continue.
Switch to Fetch with a short context timeout and continue on deadline/timeout errors.
```diff
- msgBatch, consumerFetchErr := consumer.FetchNoWait(300)
- if consumerFetchErr != nil {
-     log.Error("error fetching messages", zap.Error(consumerFetchErr))
-     return
- }
+ fetchCtx, cancel := context.WithTimeout(p.ctx, 500*time.Millisecond)
+ msgBatch, consumerFetchErr := consumer.Fetch(fetchCtx, 300)
+ cancel()
+ if consumerFetchErr != nil {
+     // benign: no messages within the fetch window; keep polling
+     if errors.Is(consumerFetchErr, context.DeadlineExceeded) || errors.Is(consumerFetchErr, nats.ErrTimeout) {
+         continue
+     }
+     log.Error("error fetching messages", zap.Error(consumerFetchErr))
+     return
+ }
```

**318-341: Default to a NoopEventMetricStore to prevent nil-pointer panics.**

eventMetricStore is dereferenced in multiple code paths. If ProviderOpts.EventMetricStore is nil, this will panic. Initialize a NoopEventMetricStore by default.
```diff
 func NewAdapter(
     ctx context.Context,
     logger *zap.Logger,
     url string,
     opts []nats.Option,
     hostName string,
     routerListenAddr string,
     providerOpts datasource.ProviderOpts,
 ) (Adapter, error) {
     if logger == nil {
         logger = zap.NewNop()
     }

+    if providerOpts.EventMetricStore == nil {
+        providerOpts.EventMetricStore = metric.NewNoopEventMetricStore()
+    }
+
     return &ProviderAdapter{
         ctx:              ctx,
         logger:           logger.With(zap.String("pubsub", "nats")),
         closeWg:          sync.WaitGroup{},
         hostName:         hostName,
         routerListenAddr: routerListenAddr,
         url:              url,
         opts:             opts,
         flushTimeout:     10 * time.Second,
         eventMetricStore: providerOpts.EventMetricStore,
     }, nil
 }
```
🧹 Nitpick comments (32)
router/pkg/metric/prom_connection_metric_store.go (2)
**46-49: Best-effort init is fine; consider documenting the intent and guarding against nil logger**

Logging and continuing aligns with "best-effort" startup, but:
- If logger can be nil, this will panic. Guard with zap.NewNop().
- Add a short comment to document that init errors are intentionally non-fatal.
Apply this guard near the top of the constructor:
```diff
 func newPromConnectionMetrics(logger *zap.Logger, meterProvider *metric.MeterProvider, stats *ConnectionPoolStats, attributes []attribute.KeyValue) (*promConnectionMetrics, error) {
+    if logger == nil {
+        logger = zap.NewNop()
+    }
     meter := meterProvider.Meter(
```
**54-92: Receiver naming is inconsistent (h vs m)**

This file mixes receiver names (startInitMetrics uses h; other methods use m). Unify for readability.
Would you like a follow-up patch to rename receivers consistently (e.g., m) across methods?
router/pkg/metric/oltp_connection_metric_store.go (2)
**46-49: Best-effort init: log-and-continue is OK; add nil-logger guard and document behavior**

Same note as the Prometheus variant:
- Guard against nil logger to avoid panic.
- Consider a brief comment explaining that initialization errors are intentionally non-fatal.
Suggested guard:
```diff
 func newOtlpConnectionMetrics(logger *zap.Logger, meterProvider *metric.MeterProvider, stats *ConnectionPoolStats, baseAttributes []attribute.KeyValue) (*otlpConnectionMetrics, error) {
+    if logger == nil {
+        logger = zap.NewNop()
+    }
     meter := meterProvider.Meter(
```
**54-85: Duplicate init logic across OTLP and Prometheus — consider DRYing up**

startInitMetrics is nearly identical in both files. Extract common logic into a shared helper (e.g., a small internal function taking meter, instruments, stats, and base attrs) to reduce drift and ease future changes.
If you want, I can propose a small internal package-level helper to encapsulate the shared callback registration and initial measurements.
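One possible shape for that helper, as a hedged sketch — the callback signature and the per-subgraph map are assumptions about what the two implementations share, not the actual fields:

```go
// Hypothetical shared helper both the OTLP and Prometheus constructors could
// call for the initial per-subgraph max-connection observations.
func initMaxConnectionMeasurements(
	measure func(ctx context.Context, maxConns int64, opts ...otelmetric.AddOption),
	maxConnsPerSubgraph map[string]int64, // assumed shape of ConnectionPoolStats
	baseAttributes []attribute.KeyValue,
) {
	for subgraph, maxConns := range maxConnsPerSubgraph {
		attrs := append([]attribute.KeyValue{}, baseAttributes...) // copy-on-append
		attrs = append(attrs, attribute.String("wg.subgraph.name", subgraph))
		measure(context.Background(), maxConns, otelmetric.WithAttributes(attrs...))
	}
}
```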
router-tests/events/nats_events_test.go (1)
**464-464: Fix test name grammar.**

Nit: "subscribe after message don't a boundary" → "subscribe after message: don't add a boundary".
Apply this diff:
- t.Run("subscribe after message don't a boundary", func(t *testing.T) { + t.Run("subscribe after message: don't add a boundary", func(t *testing.T) {router/pkg/metric/event_measurements.go (1)
**12-19: Metric names consistency check.**

Names look coherent. Consider pluralizing failure counters for consistency ("failures") if you plan to expose these externally for long-term stability:
- router.events.publish.fail → router.events.publish.failures
- router.nats.request.fail → router.nats.request.failures
If you decide to change names now (before they become part of a stable public contract), update both the constants and any downstream references in providers/stores.
router/pkg/pubsub/datasource/provider.go (1)
**36-38: Add doc comment for ProviderOpts to clarify intent and future usage**

A brief comment will help consumers understand what belongs in ProviderOpts vs. provider-specific options.
Apply:
```diff
+// ProviderOpts carries cross-cutting, runtime-provided facilities for providers
+// that don't belong to their static config (e.g. metric/event stores).
 type ProviderOpts struct {
     EventMetricStore metric.EventMetricStore
 }
```

demo/pkg/subgraphs/subgraphs.go (2)
**214-221: Adapter calls updated with ProviderOpts — OK**

Calls to natsPubsub.NewAdapter now include ProviderOpts{}. This keeps the demo compiling with the new signature. If you later want to demo event metrics, consider threading a real EventMetricStore here, as sketched below.
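If that demo wiring is ever wanted, the change is confined to the call site (sketch; `realStore` is a stand-in for however the demo would construct an EventMetricStore, and the other arguments follow the existing call):

```go
// Instead of passing an empty datasource.ProviderOpts{}:
adapter, err := natsPubsub.NewAdapter(ctx, logger, natsURL, natsOpts, hostName, listenAddr,
	datasource.ProviderOpts{EventMetricStore: realStore}, // realStore: hypothetical
)
if err != nil {
	return nil, err
}
```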
router/pkg/metric/oltp_event_metric_store.go (1)
**23-40: Optional: guard against nil meterProvider and consider using logger or removing it.**

If meterProvider is ever nil, newOtlpEventMetrics will panic. Either validate inputs or document invariants. Also, logger is stored but unused; either log instrumentation setup failure or drop the field.
Example guard:
```diff
 func newOtlpEventMetrics(logger *zap.Logger, meterProvider *metric.MeterProvider) (*otlpEventMetrics, error) {
+    if meterProvider == nil {
+        return nil, fmt.Errorf("meterProvider is nil")
+    }
     meter := meterProvider.Meter(
```

And either remove logger or use it to log instrumentation errors:
```diff
 instruments, err := newEventInstruments(meter)
 if err != nil {
-    return nil, err
+    if logger != nil {
+        logger.Error("failed to initialize event instruments", zap.Error(err))
+    }
+    return nil, err
 }
```

router/pkg/pubsub/redis/adapter.go (1)
**90-96: Optional: close PubSub on cleanup to avoid resource leaks.**

Unsubscribing is good, but explicitly closing the subscription ensures resources are freed promptly.
Apply this diff:
```diff
 cleanup := func() {
     err := sub.PUnsubscribe(ctx, event.Channels...)
     if err != nil {
         log.Error(fmt.Sprintf("error unsubscribing from redis for topics %v", event.Channels), zap.Error(err))
     }
+    _ = sub.Close()
 }
```

router/pkg/metric/prom_event_metric_store.go (2)
**64-65: Prefer logging on ForceFlush errors.**

Bubble up the error but also log it for visibility during tests and ops.
Apply this diff:
```diff
-func (p *promEventMetrics) Flush(ctx context.Context) error { return p.meterProvider.ForceFlush(ctx) }
-func (p *promEventMetrics) Shutdown() error { return nil }
+func (p *promEventMetrics) Flush(ctx context.Context) error {
+    err := p.meterProvider.ForceFlush(ctx)
+    if err != nil && p.logger != nil {
+        p.logger.Debug("prom event metrics flush failed", zap.Error(err))
+    }
+    return err
+}
+func (p *promEventMetrics) Shutdown() error { return nil }
```
**16-21: Struct fields 'meter' (and previously 'logger') are currently unused in methods.**

If you don't plan to use `meter`, consider removing it to avoid lint warnings. You already use `logger` with the Flush suggestion above.

router-tests/testenv/testenv.go (1)
**2825-2827: NATS test adapters now receive ProviderOpts with a Noop EventMetricStore.**

Good default for tests. If you later add tests validating event metrics emitted by subgraph-side publishers, consider making this configurable to pass a real store; see the sketch below.
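A minimal sketch of what that could look like — the option struct and helper are hypothetical, not part of the current testenv API:

```go
// Hypothetical testenv extension: let a test inject a real store; default to noop.
type EventMetricStoreOption struct {
	Store rmetric.EventMetricStore // nil means "use the noop store"
}

func resolveEventMetricStore(o EventMetricStoreOption) rmetric.EventMetricStore {
	if o.Store != nil {
		return o.Store
	}
	return rmetric.NewNoopEventMetricStore()
}
```

The resolved store would then be threaded into datasource.ProviderOpts{EventMetricStore: ...} at the NATS adapter construction site.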
router/core/graph_server.go (1)
**759-764: Flush EventMetrics before Shutdown to avoid losing buffered data.**

Shutdown sequencing should flush first to minimize metric loss on shutdown.
Apply this diff:
```diff
- if s.eventMetricStore != nil {
-     if aErr := s.eventMetricStore.Shutdown(ctx); aErr != nil {
-         err = errors.Join(err, aErr)
-     }
- }
+ if s.eventMetricStore != nil {
+     if aErr := s.eventMetricStore.Flush(ctx); aErr != nil {
+         err = errors.Join(err, aErr)
+     }
+     if aErr := s.eventMetricStore.Shutdown(ctx); aErr != nil {
+         err = errors.Join(err, aErr)
+     }
+ }
```

router/pkg/pubsub/kafka/adapter.go (1)
**183-189: Emit publish metrics on both success and failure; consider guarding if store can be nil.**

Metrics on failure and success are emitted correctly and after waiting for the produce callback. This will panic if eventMetricStore is nil; the constructor-level default suggested above avoids that risk. If you prefer not to default, add nil checks here.
If you decide against constructor defaulting, here’s a guarded alternative:
```diff
- p.eventMetricStore.KafkaPublishFailure(ctx, event.ProviderID, event.Topic)
+ if p.eventMetricStore != nil {
+     p.eventMetricStore.KafkaPublishFailure(ctx, event.ProviderID, event.Topic)
+ }
 ...
- p.eventMetricStore.KafkaPublish(ctx, event.ProviderID, event.Topic)
+ if p.eventMetricStore != nil {
+     p.eventMetricStore.KafkaPublish(ctx, event.ProviderID, event.Topic)
+ }
```

Please confirm whether a no-op store is always injected across the codebase; if yes, we can keep the current calls without guards.
router-tests/events/event_helpers.go (2)
**46-54: Optionally ignore "unknown topic" on delete to be idempotent.**

DeleteTopics for non-existent topics can return an error. Since this function aims for idempotency, consider ignoring "unknown topic or partition" errors from the admin API rather than failing the test.
If the client returns typed per-topic results, iterate and only require.NoError on unexpected errors. If it returns a single error, pattern-match and ignore expected “unknown topic” errors.
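Assuming the admin client is franz-go's kadm client (which returns per-topic results), a hedged sketch of the tolerant delete; kerr.UnknownTopicOrPartition comes from github.com/twmb/franz-go/pkg/kerr:

```go
// Tolerant delete: request-level errors stay fatal, per-topic "unknown topic
// or partition" responses are treated as already-deleted and ignored.
resps, err := xEnv.KafkaAdminClient.DeleteTopics(deleteCtx, prefixedTopics...)
require.NoError(t, err)
for _, resp := range resps {
	if resp.Err != nil && !errors.Is(resp.Err, kerr.UnknownTopicOrPartition) {
		t.Fatalf("deleting topic %s: %v", resp.Topic, resp.Err)
	}
}
```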
**16-35: Minor: make produce callback non-blocking.**

pErrCh is unbuffered; the producer's callback goroutine can block if AwaitChannelWithT isn't receiving yet. Buffering to 1 makes the callback non-blocking and simplifies synchronization.
Apply this diff:
```diff
- pErrCh := make(chan error)
+ pErrCh := make(chan error, 1)
```

router-tests/prometheus_event_metrics_test.go (3)
**48-51: Guard against missing metric family to avoid nil dereference.**

Add a nil assertion before using family to produce clearer failures.

```diff
  family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+ require.NotNil(t, family, "metric family router_events_publish_messages_total not found")
  metrics := family.GetMetric()
  require.Len(t, metrics, 1)
```
**112-126: Guard against missing metric family in Kafka subscribe path as well.**

Nit for consistency: assert the metric family is found before use.

```diff
  family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+ require.NotNil(t, family, "metric family router_events_messages_received_total not found")
  metrics := family.GetMetric()
  require.Len(t, metrics, 1)
```
**272-285: Add robustness checks before indexing metrics slice.**

In this NATS subscribe path, you don't assert the number of metrics before accessing metrics[0]. Add a length assertion for symmetry with other tests and to avoid potential panics.

```diff
  family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+ require.NotNil(t, family, "metric family router_events_messages_received_total not found")
  metrics := family.GetMetric()
+ require.Len(t, metrics, 1)
```

router/pkg/metric/noop_event_metrics.go (1)
**3-7: Remove unused noopEventMetricProvider and otel metric import.**

noopEventMetricProvider is dead code and introduces an unnecessary otel dependency in this file. Simplify by removing it.

```diff
 import (
     "context"
-
-    otelmetric "go.opentelemetry.io/otel/metric"
 )

-// A noop metric provider so we do not need to do nil checks for each provider call from the store
-type noopEventMetricProvider struct{}
-
-func (n *noopEventMetricProvider) Publish(ctx context.Context, opts ...otelmetric.AddOption) {}
-func (n *noopEventMetricProvider) PublishFailure(ctx context.Context, opts ...otelmetric.AddOption) {}
-func (n *noopEventMetricProvider) MessagesReceived(ctx context.Context, opts ...otelmetric.AddOption) {
-}
-
-func (n *noopEventMetricProvider) NatsRequest(ctx context.Context, opts ...otelmetric.AddOption) {}
-func (n *noopEventMetricProvider) NatsRequestFailure(ctx context.Context, opts ...otelmetric.AddOption) {
-}
-func (n *noopEventMetricProvider) Flush(ctx context.Context) error { return nil }
-func (n *noopEventMetricProvider) Shutdown() error { return nil }
```

router-tests/telemetry/event_metrics_test.go (6)
**54-60: Avoid panics: assert data points are present before indexing.**

Guard against empty data point sets to produce clearer failures. Apply similarly to other occurrences in this file.

```diff
- sum, ok := metric.Data.(metricdata.Sum[int64])
- require.True(t, ok)
-
- dataPoint := sum.DataPoints[0]
+ sum, ok := metric.Data.(metricdata.Sum[int64])
+ require.True(t, ok)
+ require.NotEmpty(t, sum.DataPoints)
+ dataPoint := sum.DataPoints[0]
```
**118-126: Repeat the data-point presence check in Kafka subscribe path.**

Same defensive assertion recommended here before accessing sum.DataPoints[0].
**213-231: Repeat the data-point presence check in NATS request path.**

Add require.NotEmpty(t, sum.DataPoints) before indexing.

**293-314: Repeat the data-point presence check in NATS subscribe path.**

Add require.NotEmpty(t, sum.DataPoints) before indexing.

**345-366: Repeat the data-point presence check in Redis publish path.**

Add require.NotEmpty(t, sum.DataPoints) before indexing.

**413-434: Repeat the data-point presence check in Redis subscribe path.**

Add require.NotEmpty(t, sum.DataPoints) before indexing; since the same guard recurs, a shared helper is sketched below.
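Since the same three-line guard recurs across all six paths, a tiny shared helper could replace them (sketch; uses the metricdata types from go.opentelemetry.io/otel/sdk/metric/metricdata):

```go
// firstSumDataPoint asserts the metric is a non-empty int64 Sum and returns
// its first data point, failing with a readable message otherwise.
func firstSumDataPoint(t *testing.T, m metricdata.Metrics) metricdata.DataPoint[int64] {
	t.Helper()
	sum, ok := m.Data.(metricdata.Sum[int64])
	require.True(t, ok, "metric %s is not an int64 Sum", m.Name)
	require.NotEmpty(t, sum.DataPoints, "metric %s has no data points", m.Name)
	return sum.DataPoints[0]
}
```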
router/pkg/metric/event_metric_store.go (5)
**16-30: Optionally pass context to Shutdown to align with OTel APIs**

OTel components typically require context for Shutdown. Consider threading ctx through the provider Shutdown to support deadlines/cancellation and align with go.opentelemetry.io APIs. This change will touch provider implementations.
Apply this diff (breaking change within this PR):
```diff
 type EventMetricProvider interface {
     // unified publish/receive for brokers (kafka, redis, nats)
     Publish(ctx context.Context, opts ...otelmetric.AddOption)
     PublishFailure(ctx context.Context, opts ...otelmetric.AddOption)
     MessagesReceived(ctx context.Context, opts ...otelmetric.AddOption)
     // keep NATS request separate
     NatsRequest(ctx context.Context, opts ...otelmetric.AddOption)
     NatsRequestFailure(ctx context.Context, opts ...otelmetric.AddOption)
     Flush(ctx context.Context) error
-    Shutdown() error
+    Shutdown(ctx context.Context) error
 }
```

```diff
 func (e *EventMetrics) Shutdown(ctx context.Context) error {
     var err error
     if errFlush := e.Flush(ctx); errFlush != nil {
         err = errors.Join(err, fmt.Errorf("failed to flush metrics: %w", errFlush))
     }
-    if errProm := e.promMetrics.Shutdown(); errProm != nil {
+    if errProm := e.promMetrics.Shutdown(ctx); errProm != nil {
         err = errors.Join(err, fmt.Errorf("failed to shutdown prom metrics: %w", errProm))
     }
-    if errOtlp := e.otlpMetrics.Shutdown(); errOtlp != nil {
+    if errOtlp := e.otlpMetrics.Shutdown(ctx); errOtlp != nil {
         err = errors.Join(err, fmt.Errorf("failed to shutdown otlp metrics: %w", errOtlp))
     }
     return err
 }
```

Also applies to: 209-226
**60-85: Guard against nil metricsConfig to avoid panics**

If callers accidentally pass a nil metricsConfig, the current code panics. A small guard keeps the store safe and no-op by default.
Apply this diff:
```diff
 func NewEventMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *sdkmetric.MeterProvider, metricsConfig *Config) (*EventMetrics, error) {
     store := &EventMetrics{
         baseAttributes: baseAttributes,
         logger:         logger,
         otlpMetrics:    &noopEventMetricProvider{},
         promMetrics:    &noopEventMetricProvider{},
     }

+    if metricsConfig == nil {
+        // No metrics configured; return noop store.
+        return store, nil
+    }
```
**92-180: Reduce repetition with a small helper for common attributes**

Most methods repeat provider.type and provider.id. A tiny helper can centralize this and reduce copy/paste errors.
Apply this diff to add a helper (usage can be incrementally adopted):
```diff
 func (e *EventMetrics) withAttrs(attrs ...attribute.KeyValue) otelmetric.AddOption {
     copied := append([]attribute.KeyValue{}, e.baseAttributes...)
     return otelmetric.WithAttributes(append(copied, attrs...)...)
 }
+
+// withProviderAttrs composes common provider attributes with additional ones.
+func (e *EventMetrics) withProviderAttrs(providerType, providerID string, extra ...attribute.KeyValue) otelmetric.AddOption {
+    attrs := []attribute.KeyValue{
+        otelattrs.WgEventProviderType.String(providerType),
+        otelattrs.WgEventProviderID.String(providerID),
+    }
+    attrs = append(attrs, extra...)
+    return e.withAttrs(attrs...)
+}
```
**92-170: Optional: replace magic strings with typed constants**

The literals "kafka", "redis", and "nats" appear multiple times. Introduce constants to avoid typos and enable future refactors.
Apply this diff to define constants:
```diff
 package metric

 import (
 @@
     otelattrs "github.com/wundergraph/cosmo/router/pkg/otel"
 )

+const (
+    providerTypeKafka = "kafka"
+    providerTypeRedis = "redis"
+    providerTypeNats  = "nats"
+)
```

Then replace occurrences of "kafka"/"redis"/"nats" with the respective constants (can be done progressively, especially if you adopt withProviderAttrs).
**16-23: Naming consistency: consider singular MessageReceived**

The provider interface uses MessagesReceived (plural) whereas store methods are KafkaMessageReceived, RedisMessageReceived, NatsMessageReceived (singular). Consider aligning to singular for consistency. This is cosmetic but avoids cognitive overhead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these settings in your CodeRabbit configuration.
📒 Files selected for processing (34)
- demo/pkg/subgraphs/employees/subgraph/schema.resolvers.go (1 hunks)
- demo/pkg/subgraphs/subgraphs.go (2 hunks)
- router-tests/events/event_helpers.go (1 hunks)
- router-tests/events/events_config_test.go (1 hunks)
- router-tests/events/kafka_events_test.go (29 hunks)
- router-tests/events/nats_events_test.go (1 hunks)
- router-tests/events/redis_events_test.go (19 hunks)
- router-tests/prometheus_event_metrics_test.go (1 hunks)
- router-tests/telemetry/event_metrics_test.go (1 hunks)
- router-tests/testenv/testenv.go (5 hunks)
- router/core/factoryresolver.go (3 hunks)
- router/core/graph_server.go (5 hunks)
- router/core/router.go (2 hunks)
- router/pkg/config/config.go (2 hunks)
- router/pkg/config/config.schema.json (2 hunks)
- router/pkg/metric/config.go (2 hunks)
- router/pkg/metric/event_measurements.go (1 hunks)
- router/pkg/metric/event_metric_store.go (1 hunks)
- router/pkg/metric/noop_event_metrics.go (1 hunks)
- router/pkg/metric/oltp_connection_metric_store.go (1 hunks)
- router/pkg/metric/oltp_event_metric_store.go (1 hunks)
- router/pkg/metric/prom_connection_metric_store.go (1 hunks)
- router/pkg/metric/prom_event_metric_store.go (1 hunks)
- router/pkg/otel/attributes.go (2 hunks)
- router/pkg/pubsub/datasource/mocks.go (1 hunks)
- router/pkg/pubsub/datasource/provider.go (2 hunks)
- router/pkg/pubsub/kafka/adapter.go (6 hunks)
- router/pkg/pubsub/kafka/provider_builder.go (2 hunks)
- router/pkg/pubsub/nats/adapter.go (9 hunks)
- router/pkg/pubsub/nats/provider_builder.go (2 hunks)
- router/pkg/pubsub/pubsub.go (6 hunks)
- router/pkg/pubsub/pubsub_test.go (3 hunks)
- router/pkg/pubsub/redis/adapter.go (4 hunks)
- router/pkg/pubsub/redis/provider_builder.go (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (23)
router/pkg/pubsub/datasource/mocks.go (1)
- router/pkg/pubsub/datasource/provider.go (2): ProviderOpts (36-38), Provider (18-24)

router/pkg/metric/config.go (1)
- router/pkg/metric/event_metric_store.go (1): EventMetrics (52-58)

router/pkg/metric/oltp_connection_metric_store.go (1)
- router/pkg/pubsub/datasource/error.go (1): Error (3-6)

router-tests/events/event_helpers.go (2)
- router-tests/testenv/testenv.go (1): Environment (1725-1761)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)

router/pkg/pubsub/datasource/provider.go (1)
- router/pkg/metric/event_metric_store.go (1): EventMetricStore (32-49)

router/core/router.go (3)
- router/pkg/metric/event_metric_store.go (1): EventMetrics (52-58)
- router/pkg/config/config.go (2): Metrics (127-132), Prometheus (97-111)
- router/pkg/metric/metric_store.go (1): Metrics (105-124)

router/pkg/pubsub/redis/provider_builder.go (3)
- router/pkg/pubsub/datasource/provider.go (3): ProviderBuilder (27-34), ProviderOpts (36-38), Provider (18-24)
- router/pkg/config/config.go (1): RedisEventSource (611-615)
- router/pkg/pubsub/redis/adapter.go (1): NewProviderAdapter (28-38)

demo/pkg/subgraphs/subgraphs.go (2)
- router/pkg/pubsub/nats/adapter.go (1): NewAdapter (318-342)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (36-38)

router/pkg/config/config.go (1)
- router/pkg/metric/event_metric_store.go (1): EventMetrics (52-58)

router/pkg/pubsub/kafka/adapter.go (3)
- router/pkg/metric/event_metric_store.go (1): EventMetricStore (32-49)
- router/pkg/pubsub/datasource/error.go (1): NewError (12-17)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (36-38)

router/pkg/metric/event_metric_store.go (2)
- router/pkg/metric/config.go (2): Config (107-131), OpenTelemetry (69-84)
- router/pkg/otel/attributes.go (5): WgEventProviderType (66-66), WgEventProviderID (65-65), WgKafkaTopic (67-67), WgRedisChannel (69-69), WgNatsSubject (68-68)

router/pkg/pubsub/redis/adapter.go (5)
- router/pkg/pubsub/kafka/adapter.go (3): NewProviderAdapter (232-246), Adapter (25-30), ProviderAdapter (37-45)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (36-38)
- router/pkg/metric/event_metric_store.go (1): EventMetricStore (32-49)
- router/internal/persistedoperation/operationstorage/redis/rdcloser.go (1): RDCloser (14-16)
- router/pkg/pubsub/datasource/error.go (1): NewError (12-17)

router-tests/prometheus_event_metrics_test.go (4)
- router-tests/testenv/testenv.go (8): Run (105-122), Config (284-339), ConfigWithEdfsKafkaJSONTemplate (85-85), MetricOptions (263-277), Environment (1725-1761), GraphQLRequest (1893-1901), ConfigWithEdfsNatsJSONTemplate (87-87), ConfigWithEdfsRedisJSONTemplate (89-89)
- router-tests/events/event_helpers.go (3): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-77)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)
- router/pkg/config/config.go (1): EngineExecutionConfiguration (370-398)

router/pkg/pubsub/nats/adapter.go (4)
- router/pkg/metric/event_metric_store.go (1): EventMetricStore (32-49)
- router/pkg/pubsub/nats/engine_datasource.go (1): PublishAndRequestEventConfiguration (28-32)
- router/pkg/pubsub/datasource/error.go (1): NewError (12-17)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (36-38)

router-tests/telemetry/event_metrics_test.go (5)
- router-tests/events/event_helpers.go (3): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-77)
- router-tests/utils.go (1): GetMetricByName (79-86)
- router/pkg/otel/attributes.go (5): WgEventProviderID (65-65), WgEventProviderType (66-66), WgKafkaTopic (67-67), WgNatsSubject (68-68), WgRedisChannel (69-69)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)
- router/pkg/config/config.go (1): EngineExecutionConfiguration (370-398)

router/pkg/pubsub/nats/provider_builder.go (2)
- router/pkg/pubsub/datasource/provider.go (3): ProviderBuilder (27-34), ProviderOpts (36-38), Provider (18-24)
- router/pkg/pubsub/nats/adapter.go (2): Adapter (22-33), NewAdapter (318-342)

router/core/factoryresolver.go (1)
- router/pkg/metric/event_metric_store.go (1): EventMetricStore (32-49)

router-tests/events/kafka_events_test.go (2)
- router-tests/events/event_helpers.go (2): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)

router/pkg/pubsub/kafka/provider_builder.go (3)
- router/pkg/pubsub/datasource/provider.go (3): ProviderBuilder (27-34), ProviderOpts (36-38), Provider (18-24)
- router/pkg/config/config.go (1): KafkaEventSource (599-605)
- router/pkg/pubsub/kafka/adapter.go (2): Adapter (25-30), NewProviderAdapter (232-246)

router-tests/events/redis_events_test.go (1)
- router-tests/events/event_helpers.go (1): ProduceRedisMessage (56-77)

router/pkg/pubsub/pubsub.go (3)
- router/pkg/metric/event_metric_store.go (1): EventMetricStore (32-49)
- router/pkg/pubsub/datasource/provider.go (3): Provider (18-24), ProviderBuilder (27-34), ProviderOpts (36-38)
- router/pkg/pubsub/kafka/provider_builder.go (2): NewProviderBuilder (167-180), ProviderBuilder (21-27)

router-tests/testenv/testenv.go (4)
- router/pkg/metric/event_metric_store.go (2): EventMetrics (52-58), EventMetricStore (32-49)
- router/pkg/pubsub/nats/adapter.go (1): NewAdapter (318-342)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (36-38)
- router/pkg/metric/noop_event_metrics.go (1): NewNoopEventMetricStore (50-50)

router/core/graph_server.go (5)
- router/pkg/metric/event_metric_store.go (3): EventMetricStore (32-49), EventMetrics (52-58), NewEventMetricStore (60-85)
- router/pkg/metric/noop_metrics.go (1): NewNoopMetrics (63-65)
- router/pkg/metric/noop_event_metrics.go (1): NewNoopEventMetricStore (50-50)
- router/pkg/metric/config.go (1): OpenTelemetry (69-84)
- router/pkg/config/config.go (1): Prometheus (97-111)
🪛 GitHub Actions: Router CI
router-tests/events/nats_events_test.go
[error] 87-87: Test run failed: undefined: natsSubscriptionArgs. Command: gotestsum -f github-actions -- -ldflags=-extldflags=-Wl,-ld_classic -run '^Test[^(Flaky)]' --timeout=5m -p 1 --parallel 10 -race ./events
🔇 Additional comments (33)
router-tests/events/nats_events_test.go (1)
**281-283: Comment contradicts code: sleep is shorter than configured read timeout.**

`WebSocketClientReadTimeout` is set to 1s (Line 239) but we sleep only 200ms here while the comment says "Wait longer than the read timeout".

Consider updating the sleep to exceed the timeout (or fix the comment). Example:

```diff
- // Wait longer than the read timeout to ensure that read timeouts are handled correctly
- time.Sleep(time.Millisecond * 200)
+ // Wait longer than the read timeout to ensure that read timeouts are handled correctly
+ time.Sleep(time.Second + 200*time.Millisecond)
```

If a longer sleep makes the test too slow, alternatively correct the comment instead of increasing the delay.
router-tests/events/events_config_test.go (1)
**1-1: Package rename to internal test package looks good.**

Using `package events` aligns with the other event tests and allows access to unexported internals where needed.

router/pkg/metric/config.go (2)
**38-39: Prometheus config: EventMetrics toggle added — LGTM.**

Adding `EventMetrics bool` to PrometheusConfig is consistent with the new event metrics feature flag surface.
**82-84: OTLP config: EventMetrics toggle added — LGTM.**

Adding `EventMetrics bool` to the OpenTelemetry config aligns with the new metrics wiring.

router/pkg/metric/event_measurements.go (2)
**52-103: Initialize counters — looks correct and defensive.**

The instruments are created with error checks and wrapped messages. Good grouping and separation for generic events vs NATS request metrics.
**21-39:** It looks like everywhere in the repo you're using `otelmetric.Int64CounterOption` and `otelmetric.WithDescription`, with no imports from `go.opentelemetry.io/otel/metric/instrument`. The new options in event_measurements.go are consistent with existing usage. No changes needed.

Action: None — merge as is.
router/pkg/config/config.go (1)
**141-145: OTLP EventMetrics toggle — LGTM.**

The OTLP event metrics toggle/tag looks correct and consistent with the rest of the configuration surface.
router/pkg/config/config.schema.json (1)
**1061-1065: OTLP event_metrics flag added — looks good**

The schema addition is consistent with the surrounding options and uses a safe default (false). No issues spotted here.
router/core/router.go (1)
**2242-2259: Correctly wires event metrics flags into metric config**

Mapping cfg.Metrics.OTLP.EventMetrics and cfg.Metrics.Prometheus.EventMetrics into rmetric.Config is accurate and aligns with the schema additions. No issues.
router/pkg/pubsub/pubsub_test.go (2)
**17-76: Minor: test covers happy-path build correctly.**

No issues spotted in the positive-path assertions (provider count, root nodes), given the wiring changes.
**293-294: Production already defaults to a no-op EventMetricStore — no change needed**

Confirmed: tests pass nil intentionally, and production paths initialize and pass a no-op store so adapters won't panic. BuildProvidersAndDataSources simply forwards the store (it doesn't itself substitute nil).
Key locations:
- router/pkg/pubsub/pubsub_test.go — tests call BuildProvidersAndDataSources(..., nil, ...) (lines ~287, 340, 393).
- router/pkg/pubsub/pubsub.go — BuildProvidersAndDataSources signature accepts store and forwards it into provider opts (signature at line ~55; forwarded at build(...) -> ProviderOpts.EventMetricStore).
- router/pkg/metric/noop_event_metrics.go — Noop implementation and constructor: NewNoopEventMetricStore().
- router/core/graph_server.go — buildGraphMux initializes gm.eventMetricStore = rmetric.NewNoopEventMetricStore() and only replaces it with a real store when metrics are enabled (see gm creation and later assignment).
- router/core/factoryresolver.go — calls pubsub.BuildProvidersAndDataSources(..., routerEngineConfig.EventMetricStore, ...) (around line ~472), so production callers supply the non-nil gm.eventMetricStore.
Result: production callers already supply a no-op default; tests can pass nil safely for focused testing.
router-tests/events/redis_events_test.go (2)
**1-1: Package switch to internal events package is sensible for shared helpers.**

Moving to package events allows direct use of shared test helpers without import cycles. This improves reuse and reduces duplication.
**107-107: Consolidated to the shared ProduceRedisMessage helper — LGTM.**

Replacing the local publisher with ProduceRedisMessage centralizes behavior and keeps tests DRY. The message payloads and assertions remain unchanged, preserving test intent.
Also applies to: 173-173, 185-185, 195-195, 205-205, 276-276, 357-357, 378-378, 454-454, 512-512, 517-517, 593-593, 666-666, 795-795, 856-856, 869-869, 880-880, 891-891, 994-994
router/pkg/otel/attributes.go (1)
**63-71: Event metric attribute keys look consistent and future-proof.**

Provider-scoped and backend-scoped labels are clearly named and consistent with existing wg.* attributes. This should make metric queries straightforward.
router/core/factoryresolver.go (1)
**7-7: Import alias for metrics looks correct and scoped.**

Alias `rmetric` is used for the new EventMetricStore type; no issues here.

router-tests/testenv/testenv.go (3)
**273-277: Config flags for EventMetrics are well-scoped and consistent.**

Adding `EnableOTLPEventMetrics` and `EnablePrometheusEventMetrics` to MetricOptions keeps test toggles symmetric with existing metrics flags.
**1508-1514: Prometheus EventMetrics wiring is correct.**

`PrometheusConfig.EventMetrics` mirrors the new flag and integrates with the metrics bootstrap.

**1530-1539: OTLP EventMetrics wiring is correct.**

`Metrics.OTLP.EventMetrics` is correctly controlled by the test flag.

router/core/graph_server.go (4)
**517-518: Graph mux now tracks an EventMetricStore.**

This aligns with the engine wiring and allows lifecycle control per-mux. Looks good.

**779-783: Sane defaults: Noop stores on initialization.**

Defaulting both metric stores to Noop avoids nil checks everywhere and ensures safe shutdown.

**879-891: Conditional creation of real EventMetricStore is correct.**

Creating a concrete store only when either OTLP or Prometheus event metrics are enabled keeps overhead minimal.

**1134-1135: Engine configuration now includes EventMetricStore.**

Passing the store into RouterEngineConfiguration is the right place for downstream pub/sub provider wiring.
router-tests/events/kafka_events_test.go (2)
**110-116: Migration to exported test helpers LGTM.**

Using `ProduceKafkaMessage` from the shared helpers removes duplication and improves readability.
**434-455: Multipart test flow remains correct with helper usage.**

The switch to `EnsureTopicExists` + `ProduceKafkaMessage` preserves behavior. No issues spotted.

router/pkg/pubsub/redis/provider_builder.go (1)
**69-75: Provider options propagation looks correct.**

The new BuildProvider signature and passing providerOpts into NewProviderAdapter is consistent with the generic ProviderBuilder interface change and with Kafka/NATS builders.
router/pkg/pubsub/kafka/provider_builder.go (2)
**59-68: ProviderOpts threading is correct; map registration retained.**

BuildProvider now accepts providerOpts and registers the adapter in p.adapters after creation. This matches the interface change and maintains downstream EngineDataSourceFactory lookups.

**153-165: Forwarding providerOpts into adapter creation is correct.**

buildProvider threads providerOpts into NewProviderAdapter. Error wrapping includes provider ID which is helpful for diagnostics.
router/pkg/pubsub/nats/provider_builder.go (2)
**67-75: ProviderOpts propagation and adapter registration LGTM.**

BuildProvider now forwards providerOpts through buildProvider and stores the adapter for later lookup. Consistent with Kafka.

**121-133: ProviderOpts passed into NewAdapter correctly.**

buildProvider wires providerOpts into NewAdapter, enabling metrics injection. Error messages include provider ID which is good.
router/pkg/pubsub/kafka/adapter.go (1)
**93-97: Record "message received" with provider and topic looks good.**

The metric emission at message receipt is correctly placed after iterating a record and before updater.Update.
router-tests/prometheus_event_metrics_test.go (1)
**41-42: Potential flakiness from EnsureTopicExists helper (prefixedTopics building bug).**

The helper events.EnsureTopicExists is used here. Based on router-tests/events/event_helpers.go, prefixedTopics is created with length len(topics) and then appended to, which doubles the length and inserts empty strings. This can cause unexpected behavior and test flakiness.
Please fix the helper as follows:
```diff
 // router-tests/events/event_helpers.go
 func EnsureTopicExists(t *testing.T, xEnv *testenv.Environment, topics ...string) {
     // Delete topic for idempotency
     deleteCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
     defer cancel()

-    prefixedTopics := make([]string, len(topics))
-    for _, topic := range topics {
-        prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
-    }
+    prefixedTopics := make([]string, 0, len(topics))
+    for _, topic := range topics {
+        prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
+    }

     _, err := xEnv.KafkaAdminClient.DeleteTopics(deleteCtx, prefixedTopics...)
     require.NoError(t, err)

     ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
     defer cancel()

     _, err = xEnv.KafkaAdminClient.CreateTopics(ctx, 1, 1, nil, prefixedTopics...)
     require.NoError(t, err)
 }
```

router/pkg/metric/event_metric_store.go (2)
**87-90: Attribute merging helper is correct and copy-safe**

Copy-on-append avoids mutating baseAttributes and ensures per-call isolation. Good practice.
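The pitfall this guards against is easy to miss, so here is a standalone illustration of why the copy matters (illustrative Go, not code from the PR):

```go
base := make([]attribute.KeyValue, 1, 4) // spare capacity is the trap
base[0] = attribute.String("service", "router")

// Both appends fit within base's backing array, so the second overwrites the first:
a := append(base, attribute.String("provider", "kafka"))
b := append(base, attribute.String("provider", "nats"))
_, _ = a, b // a[1] now reads "nats" — concurrent metric calls would corrupt each other

// Copy-on-append (the withAttrs pattern) gives every call its own backing array:
safe := append([]attribute.KeyValue{}, base...)
safe = append(safe, attribute.String("provider", "kafka"))
```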
**194-207: No action needed: Go version supports errors.Join**

All go.mod files declare Go 1.23.x (≥ 1.20), so using `errors.Join` is safe.
@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 11
🔭 Outside diff range comments (16)
router/pkg/metric/prom_connection_metric_store.go (1)
**102-104: errors.Join usage bug: this drops previously accumulated errors**

`errors.Join(regErr)` with a single arg is effectively just `regErr`. You're overwriting `err` each time and not accumulating. Use `errors.Join(err, regErr)` to collect all unregister errors.

```diff
- if regErr := reg.Unregister(); regErr != nil {
-     err = errors.Join(regErr)
- }
+ if regErr := reg.Unregister(); regErr != nil {
+     err = errors.Join(err, regErr)
+ }
```

router/pkg/metric/oltp_connection_metric_store.go (1)
**102-105: Fix error accumulation in Shutdown**

As in the Prometheus store, `errors.Join(regErr)` drops previously collected errors. Use `errors.Join(err, regErr)` to aggregate all failures.

```diff
- if regErr := reg.Unregister(); regErr != nil {
-     err = errors.Join(regErr)
- }
+ if regErr := reg.Unregister(); regErr != nil {
+     err = errors.Join(err, regErr)
+ }
```

router-tests/events/nats_events_test.go (2)
**1367-1389: Fix blocking ReadLine calls causing CI timeout in subscribe_sse_with_filter**

CI failed with a 5m timeout in this test. The direct calls to reader.ReadLine() block indefinitely if an expected SSE frame is not delivered (e.g., due to filter logic differences or timing). Wrap the reads with a timeout to fail fast and avoid hanging the entire suite.
Apply this diff to replace blocking reads with a helper that enforces NatsWaitTimeout:
```diff
- eventNext, _, gErr := reader.ReadLine()
- require.NoError(t, gErr)
- require.Equal(t, "event: next", string(eventNext))
- data, _, gErr := reader.ReadLine()
- require.NoError(t, gErr)
- require.Equal(t, fmt.Sprintf("data: {\"data\":{\"filteredEmployeeUpdated\":{\"id\":%d,\"details\":{\"forename\":\"%s\",\"surname\":\"%s\"}}}}", 1, testData[1].forename, testData[1].surname), string(data))
- line, _, gErr := reader.ReadLine()
- require.NoError(t, gErr)
- require.Equal(t, "", string(line))
+ eventNext := readLineWithTimeout(t, reader, NatsWaitTimeout)
+ require.Equal(t, "event: next", eventNext)
+ data := readLineWithTimeout(t, reader, NatsWaitTimeout)
+ require.Equal(t, fmt.Sprintf("data: {\"data\":{\"filteredEmployeeUpdated\":{\"id\":%d,\"details\":{\"forename\":\"%s\",\"surname\":\"%s\"}}}}", 1, testData[1].forename, testData[1].surname), data)
+ line := readLineWithTimeout(t, reader, NatsWaitTimeout)
+ require.Equal(t, "", line)
```

```diff
- case 1, 3, 4, 5, 7, 8, 11:
-     eventNext, _, gErr = reader.ReadLine()
-     require.NoError(t, gErr)
-     require.Equal(t, "event: next", string(eventNext))
-     data, _, gErr = reader.ReadLine()
-     require.NoError(t, gErr)
-     require.Equal(t, fmt.Sprintf("data: {\"data\":{\"filteredEmployeeUpdated\":{\"id\":%d,\"details\":{\"forename\":\"%s\",\"surname\":\"%s\"}}}}", i, testData[i].forename, testData[i].surname), string(data))
-     line, _, gErr = reader.ReadLine()
-     require.NoError(t, gErr)
-     require.Equal(t, "", string(line))
+ case 1, 3, 4, 5, 7, 8, 11:
+     eventNext := readLineWithTimeout(t, reader, NatsWaitTimeout)
+     require.Equal(t, "event: next", eventNext)
+     data := readLineWithTimeout(t, reader, NatsWaitTimeout)
+     require.Equal(t, fmt.Sprintf("data: {\"data\":{\"filteredEmployeeUpdated\":{\"id\":%d,\"details\":{\"forename\":\"%s\",\"surname\":\"%s\"}}}}", i, testData[i].forename, testData[i].surname), data)
+     line := readLineWithTimeout(t, reader, NatsWaitTimeout)
+     require.Equal(t, "", line)
```

Add this helper within this test file (outside the selected range), near the other test helpers:

```go
// helper to prevent indefinite blocking on streaming reads
func readLineWithTimeout(t *testing.T, reader *bufio.Reader, timeout time.Duration) string {
	t.Helper()
	lineCh := make(chan string, 1)
	errCh := make(chan error, 1)
	go func() {
		line, _, err := reader.ReadLine()
		if err != nil {
			errCh <- err
			return
		}
		lineCh <- string(line)
	}()
	select {
	case s := <-lineCh:
		return s
	case err := <-errCh:
		require.NoError(t, err)
		return ""
	case <-time.After(timeout):
		t.Fatalf("timed out after %s waiting for stream line", timeout)
		return ""
	}
}
```
**1548-1560: Slice preallocation bug: duplicate zero-value entries in providers list**

The slice is pre-sized and then appended to, yielding len = 2x providers with the first half being zero-values. Use zero length with capacity preallocation.
```diff
- natsEventSources := make([]config.NatsEventSource, len(testenv.DemoNatsProviders))
+ natsEventSources := make([]config.NatsEventSource, 0, len(testenv.DemoNatsProviders))
  for _, sourceName := range testenv.DemoNatsProviders {
      natsEventSources = append(natsEventSources, config.NatsEventSource{
          ID:  sourceName,
          URL: url,
      })
  }
```

This avoids invalid empty providers from being added unintentionally. Even though this test expects failure, the current code introduces misleading extra entries.
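The underlying mistake, shown in isolation (illustrative; `u` is a placeholder URL):

```go
s := make([]config.NatsEventSource, 2) // len 2: two zero-valued entries already present
s = append(s, config.NatsEventSource{ID: "a", URL: u})
// len(s) == 3; s[0] and s[1] are empty sources that look like misconfigured providers

ok := make([]config.NatsEventSource, 0, 2) // len 0, cap 2: append fills from the front
ok = append(ok, config.NatsEventSource{ID: "a", URL: u})
```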
router-tests/prometheus_event_metrics_test.go (1)
**26-424: Reduce flakiness by polling metrics until expected values appear**

Prometheus collection can lag behind the event emission. Right after triggering mutations/publishes, gathering once may intermittently return stale values. Consider polling the registry for the expected counter with a timeout to make this test deterministic.
I can factor a small helper like awaitPromCounter(registry, familyName, labelKey, labelSuffix, expected, timeout) used across the subtests. Want me to draft it?
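Here is one way that helper could look, following the proposed signature (a sketch; uses prometheus.Gatherer from client_golang and testify's require.Eventually, with a suffix match to account for prefixed topic labels):

```go
func awaitPromCounter(t *testing.T, g prometheus.Gatherer, family, labelKey, labelSuffix string, expected float64, timeout time.Duration) {
	t.Helper()
	require.Eventually(t, func() bool {
		mfs, err := g.Gather()
		if err != nil {
			return false
		}
		for _, mf := range mfs {
			if mf.GetName() != family {
				continue
			}
			for _, m := range mf.GetMetric() {
				for _, lp := range m.GetLabel() {
					if lp.GetName() == labelKey && strings.HasSuffix(lp.GetValue(), labelSuffix) {
						return m.GetCounter().GetValue() == expected
					}
				}
			}
		}
		return false
	}, timeout, 50*time.Millisecond)
}
```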
router/pkg/metric/config.go (1)
**1-164: Fix Prometheus EventMetrics env tag**

The `EventMetrics` field in the Prometheus config is incorrectly using the OTLP env key. Update it to its own prefix so users can toggle Prometheus event metrics independently.
- router/pkg/config/config.go: line 103
Diff:
--- a/router/pkg/config/config.go +++ b/router/pkg/config/config.go @@ -100,7 +100,7 @@ type PrometheusConfig struct { - EventMetrics bool `yaml:"event_metrics" envDefault:"false" env:"METRICS_OTLP_EVENT_METRICS"` + EventMetrics bool `yaml:"event_metrics" envDefault:"false" env:"PROMETHEUS_EVENT_METRICS"`Keep the OTLP
EventMetricsat line 141 unchanged.router/pkg/pubsub/datasource/mocks.go (1)
795-820: Mock does not account for newproviderOptsparameter; expectations will silently mismatchThe mock ignores
providerOptswhen invoking_mock.Called(...), and the expecter helpers don’t include the second argument inOn(...),Run(...), orRunAndReturn(...). Any tests that set expectations with both args or want to useproviderOptsin return functions will break or be impossible to write.Fix by threading
providerOptsthrough the mock call, expecter, and function pointer types:func (_mock *MockProviderBuilder[P, E]) BuildProvider(options P, providerOpts ProviderOpts) (Provider, error) { - ret := _mock.Called(options) + ret := _mock.Called(options, providerOpts) if len(ret) == 0 { panic("no return value specified for BuildProvider") } var r0 Provider var r1 error - if returnFunc, ok := ret.Get(0).(func(P) (Provider, error)); ok { - return returnFunc(options) + if returnFunc, ok := ret.Get(0).(func(P, ProviderOpts) (Provider, error)); ok { + return returnFunc(options, providerOpts) } - if returnFunc, ok := ret.Get(0).(func(P) Provider); ok { - r0 = returnFunc(options) + if returnFunc, ok := ret.Get(0).(func(P, ProviderOpts) Provider); ok { + r0 = returnFunc(options, providerOpts) } else { if ret.Get(0) != nil { r0 = ret.Get(0).(Provider) } } - if returnFunc, ok := ret.Get(1).(func(P) error); ok { - r1 = returnFunc(options) + if returnFunc, ok := ret.Get(1).(func(P, ProviderOpts) error); ok { + r1 = returnFunc(options, providerOpts) } else { r1 = ret.Error(1) } return r0, r1 } @@ -// - options P -func (_e *MockProviderBuilder_Expecter[P, E]) BuildProvider(options interface{}) *MockProviderBuilder_BuildProvider_Call[P, E] { - return &MockProviderBuilder_BuildProvider_Call[P, E]{Call: _e.mock.On("BuildProvider", options)} +// - options P +// - providerOpts ProviderOpts +func (_e *MockProviderBuilder_Expecter[P, E]) BuildProvider(options interface{}, providerOpts interface{}) *MockProviderBuilder_BuildProvider_Call[P, E] { + return &MockProviderBuilder_BuildProvider_Call[P, E]{Call: _e.mock.On("BuildProvider", options, providerOpts)} } -func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) Run(run func(options P)) *MockProviderBuilder_BuildProvider_Call[P, E] { +func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) Run(run func(options P, providerOpts ProviderOpts)) *MockProviderBuilder_BuildProvider_Call[P, E] { _c.Call.Run(func(args mock.Arguments) { var arg0 P if args[0] != nil { arg0 = args[0].(P) } - run( - arg0, - ) + var arg1 ProviderOpts + if args[1] != nil { + arg1 = args[1].(ProviderOpts) + } + run( + arg0, + arg1, + ) }) return _c } func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) Return(provider Provider, err error) *MockProviderBuilder_BuildProvider_Call[P, E] { _c.Call.Return(provider, err) return _c } -func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) RunAndReturn(run func(options P) (Provider, error)) *MockProviderBuilder_BuildProvider_Call[P, E] { +func (_c *MockProviderBuilder_BuildProvider_Call[P, E]) RunAndReturn(run func(options P, providerOpts ProviderOpts) (Provider, error)) *MockProviderBuilder_BuildProvider_Call[P, E] { _c.Call.Return(run) return _c }Also applies to: 827-854
router/pkg/pubsub/pubsub_test.go (4)
3-16: Import testify/mock to enable argument matchers
You'll need `mock.Anything` to update the `BuildProvider` expectations (see below).

```diff
 import (
 	"context"
 	"errors"
 	rmetric "github.com/wundergraph/cosmo/router/pkg/metric"
 	"testing"

 	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/mock"
 	"github.com/stretchr/testify/require"
 	nodev1 "github.com/wundergraph/cosmo/router/gen/proto/wg/cosmo/node/v1"
 	"github.com/wundergraph/cosmo/router/pkg/config"
 	"github.com/wundergraph/cosmo/router/pkg/pubsub/datasource"
 	"github.com/wundergraph/graphql-go-tools/v2/pkg/engine/plan"
 	"go.uber.org/zap"
 )
```
64-66: Update mock to match new BuildProvider signature
`BuildProvider` now receives `providerOpts`; update the mock to include it.

```diff
-	mockBuilder.On("TypeID").Return("nats")
-	mockBuilder.On("BuildProvider", natsEventSources[0]).Return(mockPubSubProvider, nil)
+	mockBuilder.On("TypeID").Return("nats")
+	mockBuilder.On("BuildProvider", natsEventSources[0], mock.Anything).Return(mockPubSubProvider, nil)
```
239-241: Fix failing mock expectation for "not used" provider
Update it to include `providerOpts`.

```diff
-mockBuilder.On("TypeID").Return("nats")
-mockBuilder.On("BuildProvider", natsEventSources[1]).Return(mockPubSubUsedProvider, nil)
+mockBuilder.On("TypeID").Return("nats")
+mockBuilder.On("BuildProvider", natsEventSources[1], mock.Anything).Return(mockPubSubUsedProvider, nil)
```
288-295: Default nil metric store to NewNoopEventMetricStore in BuildProvidersAndDataSources
`BuildProvidersAndDataSources` currently accepts a nil `EventMetricStore` and passes it through, violating the "never nil" invariant. Add a guard at the top of the function so that any nil `store` is replaced with `metric.NewNoopEventMetricStore()`:
storeis replaced withmetric.NewNoopEventMetricStore():File: router/pkg/pubsub/pubsub.go
Location: start of `func BuildProvidersAndDataSources` (around line 55)

Suggested diff:

```diff
 func BuildProvidersAndDataSources(
 	ctx context.Context,
 	config config.EventsConfiguration,
-	store metric.EventMetricStore,
+	store metric.EventMetricStore,
 	logger *zap.Logger,
 	dsConfs []DataSourceConfigurationWithMetadata,
 	hostName string,
 	routerListenAddr string,
 ) ([]pubsub_datasource.Provider, []plan.DataSource, error) {
+	// Ensure metric store is never nil
+	if store == nil {
+		store = metric.NewNoopEventMetricStore()
+	}
 	var pubSubProviders []pubsub_datasource.Provider
 	var outs []plan.DataSource
 	…
```
nilfor the metric store.router-tests/testenv/testenv.go (1)
1632-1644: Fix port-collision flake in CI: avoid pre-selecting ports for httptest servers
The pipeline shows "could not listen on port ... bind: address already in use" at line 1637. The current pattern obtains a free port and then binds later, which races under parallel tests.
Use the OS to select an ephemeral port directly (":0") to eliminate TOCTOU races.
Apply this refactor to make the listener robust:
```go
func makeSafeHttpTestServer(t testing.TB, handler http.Handler) *httptest.Server {
	s := httptest.NewUnstartedServer(handler)
	l, err := net.Listen("tcp", "127.0.0.1:0") // Let OS pick an available port
	if err != nil {
		t.Fatalf("could not listen on port: %s", err.Error())
	}
	_ = s.Listener.Close()
	s.Listener = l
	s.Start()
	return s
}
```

Consider applying the same pattern to other helpers that pre-pick ports (e.g., makeHttpTestServerWithPort/SetupCDNServer) or add a fallback to ":0" on bind failure.
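For the fallback variant, a minimal sketch under stated assumptions — the helper name `listenWithFallback` is hypothetical and not part of the existing test environment:

```go
package testenv

import (
	"net"
	"testing"
)

// listenWithFallback (hypothetical helper) tries the requested address first
// and falls back to an OS-assigned ephemeral port when the bind fails, which
// avoids hard failures when a pre-selected port is already taken.
func listenWithFallback(t testing.TB, addr string) net.Listener {
	t.Helper()
	l, err := net.Listen("tcp", addr)
	if err == nil {
		return l
	}
	l, err = net.Listen("tcp", "127.0.0.1:0") // let the OS pick a free port
	if err != nil {
		t.Fatalf("could not listen on fallback port: %s", err)
	}
	return l
}
```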
router/pkg/pubsub/nats/provider_builder.go (1)
77-119: Fix potential nil deref and make logger usage safe.
- Possible nil deref: eventSource.Authentication.UserInfo may be nil before accessing Username/Password.
- Logger may be nil; using it in handlers will panic. Default to zap.NewNop() to be safe.
Apply this diff:
```diff
 func buildNatsOptions(eventSource config.NatsEventSource, logger *zap.Logger) ([]nats.Option, error) {
+	if logger == nil {
+		logger = zap.NewNop()
+	}
 	opts := []nats.Option{
 		nats.Name(fmt.Sprintf("cosmo.router.edfs.nats.%s", eventSource.ID)),
 		nats.ReconnectJitter(500*time.Millisecond, 2*time.Second),
 		nats.ClosedHandler(func(conn *nats.Conn) {
 			logger.Info("NATS connection closed", zap.String("provider_id", eventSource.ID), zap.Error(conn.LastError()))
 		}),
 		nats.ConnectHandler(func(nc *nats.Conn) {
 			logger.Info("NATS connection established", zap.String("provider_id", eventSource.ID), zap.String("url", nc.ConnectedUrlRedacted()))
 		}),
 		nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
 			if err != nil {
 				logger.Error("NATS disconnected; will attempt to reconnect", zap.Error(err), zap.String("provider_id", eventSource.ID))
 			} else {
 				logger.Info("NATS disconnected", zap.String("provider_id", eventSource.ID))
 			}
 		}),
 		nats.ErrorHandler(func(conn *nats.Conn, subscription *nats.Subscription, err error) {
 			if errors.Is(err, nats.ErrSlowConsumer) {
 				logger.Warn(
 					"NATS slow consumer detected. Events are being dropped. Please consider increasing the buffer size or reducing the number of messages being sent.",
 					zap.Error(err),
 					zap.String("provider_id", eventSource.ID),
 				)
 			} else {
 				logger.Error("NATS error", zap.Error(err))
 			}
 		}),
 		nats.ReconnectHandler(func(conn *nats.Conn) {
 			logger.Info("NATS reconnected", zap.String("provider_id", eventSource.ID), zap.String("url", conn.ConnectedUrlRedacted()))
 		}),
 	}
 	if eventSource.Authentication != nil {
 		if eventSource.Authentication.Token != nil {
 			opts = append(opts, nats.Token(*eventSource.Authentication.Token))
-		} else if eventSource.Authentication.UserInfo.Username != nil && eventSource.Authentication.UserInfo.Password != nil {
+		} else if eventSource.Authentication.UserInfo != nil &&
+			eventSource.Authentication.UserInfo.Username != nil &&
+			eventSource.Authentication.UserInfo.Password != nil {
 			opts = append(opts, nats.UserInfo(*eventSource.Authentication.UserInfo.Username, *eventSource.Authentication.UserInfo.Password))
 		}
 	}
```
48-99: Close per-subscription Kafka clients to avoid leaks
`topicPoller` returns on cancellations and errors but never closes the client created for subscriptions. Add a defer to ensure the client is closed when the poller exits.

```diff
-func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, providerId string) error {
+func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, providerId string) error {
+	// Ensure client is closed when the poller exits
+	defer client.Close()
 	for {
 		select {
```
115-147: JetStream fetch loop can hot-spin and abort on "no messages"; handle non-fatal cases with backoff
`FetchNoWait` in a tight default loop will spin CPU and spam the server when no messages are available. Also, transient "no messages"/timeout conditions shouldn't terminate the subscription.
Add non-fatal handling with a short sleep/backoff and only abort on real errors.
Apply this diff:
```diff
 		default:
-			msgBatch, consumerFetchErr := consumer.FetchNoWait(300)
-			if consumerFetchErr != nil {
-				log.Error("error fetching messages", zap.Error(consumerFetchErr))
-				return
-			}
+			msgBatch, consumerFetchErr := consumer.FetchNoWait(300)
+			if consumerFetchErr != nil {
+				// Treat "no messages" / timeouts as non-fatal to avoid tearing down the loop.
+				// Prevent a hot loop with a small backoff.
+				if errors.Is(consumerFetchErr, jetstream.ErrNoMessages) || errors.Is(consumerFetchErr, nats.ErrTimeout) {
+					time.Sleep(50 * time.Millisecond)
+					continue
+				}
+				log.Error("error fetching messages", zap.Error(consumerFetchErr))
+				return
+			}
 			for msg := range msgBatch.Messages() {
 				log.Debug("subscription update", zap.String("message_subject", msg.Subject()), zap.ByteString("data", msg.Data()))
-				p.eventMetricStore.NatsMessageReceived(p.ctx, event.ProviderID, msg.Subject())
+				p.eventMetricStore.NatsMessageReceived(ctx, event.ProviderID, msg.Subject())
 				updater.Update(msg.Data())
```
156-165: Unbuffered ChanSubscribe risks slow-consumer backpressure; use a buffered channel
Using an unbuffered channel for `ChanSubscribe` can stall delivery and trigger slow-consumer issues. Allocate a reasonable buffer.
Apply this diff:
```diff
-	msgChan := make(chan *nats.Msg)
+	// Use a buffered channel to avoid slow-consumer backpressure.
+	msgChan := make(chan *nats.Msg, 1024)
 	subscriptions := make([]*nats.Subscription, len(event.Subjects))
```
♻️ Duplicate comments (9)
router/pkg/config/config.schema.json (1)
1171-1175: LGTM: Prometheus event_metrics schema flag, but verify env var wiring in code (duplicate from prior review)
Schema is correct. As noted previously, ensure `Prometheus.EventMetrics` in router/pkg/config/config.go does not reuse the OTLP env tag. It should be independently toggleable.
Run this to confirm struct tags:
```bash
#!/bin/bash
set -euo pipefail

echo "Prometheus.EventMetrics env tag:"
rg -n $'type\\s+Prometheus\\s+struct\\s*{[\\s\\S]*?EventMetrics\\s+bool\\s+`[^`]*env:([^`]*)`[\\s\\S]*?}' router/pkg/config/config.go -N -U -S -A 5

echo
echo "MetricsOTLP.EventMetrics env tag:"
rg -n $'type\\s+MetricsOTLP\\s+struct\\s*{[\\s\\S]*?EventMetrics\\s+bool\\s+`[^`]*env:([^`]*)`[\\s\\S]*?}' router/pkg/config/config.go -N -U -S -A 5
```
114-115: Recording RedisMessageReceived on subscribe — LGTM
Instrumentation is in the hot path and uses the provided context. Per learnings, EventMetricStore is guaranteed non-nil via the noop store when disabled.
153-158: Publish success/failure instrumentation — LGTM
Failure and success paths are instrumented correctly with provider/channel attributes.
router/pkg/pubsub/kafka/adapter.go (1)
231-244: Previous concern about nil EventMetricStore is addressed by the null-object pattern
Earlier feedback suggested defaulting EventMetricStore when nil. With the repository-wide guarantee that EventMetricStore is never nil (noop store when disabled), this constructor is fine as-is.
router-tests/events/event_helpers.go (2)
41-44: Bug: prefixedTopics incorrectly constructed (leading empty entries)
Initialize with length 0 and capacity `len(topics)`, or assign by index. This avoids empty topic names during delete/create.

```diff
-	prefixedTopics := make([]string, len(topics))
-	for _, topic := range topics {
-		prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
-	}
+	prefixedTopics := make([]string, 0, len(topics))
+	for _, topic := range topics {
+		prefixedTopics = append(prefixedTopics, xEnv.GetPubSubName(topic))
+	}
```
56-77: Close Redis client to avoid leaking connections in tests
Ensure the client/cluster client is closed after Publish.

```diff
 	if !xEnv.RedisWithClusterMode {
 		redisConn = redis.NewClient(&redis.Options{
 			Addr: parsedURL.Host,
 		})
 	} else {
 		redisConn = redis.NewClusterClient(&redis.ClusterOptions{
 			Addrs: []string{parsedURL.Host},
 		})
 	}
+	defer func() { _ = redisConn.Close() }()

 	intCmd := redisConn.Publish(ctx, xEnv.GetPubSubName(topicName), message)
 	require.NoError(t, intCmd.Err())
```
55-55: Default to NoopEventMetricStore when store is nil to harden provider construction
This prevents nil derefs in adapters if upstream wiring ever passes a nil store. It follows the null-object pattern used across the codebase.
Apply this diff:
```diff
 func BuildProvidersAndDataSources(ctx context.Context, config config.EventsConfiguration, store metric.EventMetricStore, logger *zap.Logger, dsConfs []DataSourceConfigurationWithMetadata, hostName string, routerListenAddr string) ([]pubsub_datasource.Provider, []plan.DataSource, error) {
+	if store == nil {
+		store = metric.NewNoopEventMetricStore()
+	}
 	var pubSubProviders []pubsub_datasource.Provider
 	var outs []plan.DataSource
```
8-13: Wrong import alias causes confusion/compile issues; use sdkmetric and update signature
Alias `sdk/metric` as `sdkmetric` and update the `NewEventMetricStore` signature to use `*sdkmetric.MeterProvider`.
Apply this diff:
```diff
 import (
 	"context"
 	"errors"
 	"fmt"

 	"go.opentelemetry.io/otel/attribute"
 	otelmetric "go.opentelemetry.io/otel/metric"
-	"go.opentelemetry.io/otel/sdk/metric"
+	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
 	"go.uber.org/zap"

 	otelattrs "github.com/wundergraph/cosmo/router/pkg/otel"
 )
@@
-func NewEventMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *metric.MeterProvider, metricsConfig *Config) (*EventMetrics, error) {
+func NewEventMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *sdkmetric.MeterProvider, metricsConfig *Config) (*EventMetrics, error) {
```

Also applies to: 60-60
182-192: Add provider.type label to NATS request metrics for label-set parity
Other broker ops include `wg.event.provider.type`; request/request-failure omit it, causing inconsistent labels and downstream aggregation issues.
Apply this diff:
```diff
 func (e *EventMetrics) NatsRequest(ctx context.Context, providerID string, subject string) {
-	opts := e.withAttrs(otelattrs.WgEventProviderID.String(providerID), otelattrs.WgNatsSubject.String(subject))
+	opts := e.withAttrs(
+		otelattrs.WgEventProviderType.String("nats"),
+		otelattrs.WgEventProviderID.String(providerID),
+		otelattrs.WgNatsSubject.String(subject),
+	)
 	e.otlpMetrics.NatsRequest(ctx, opts)
 	e.promMetrics.NatsRequest(ctx, opts)
 }

 func (e *EventMetrics) NatsRequestFailure(ctx context.Context, providerID string, subject string) {
-	opts := e.withAttrs(otelattrs.WgEventProviderID.String(providerID), otelattrs.WgNatsSubject.String(subject))
+	opts := e.withAttrs(
+		otelattrs.WgEventProviderType.String("nats"),
+		otelattrs.WgEventProviderID.String(providerID),
+		otelattrs.WgNatsSubject.String(subject),
+	)
 	e.otlpMetrics.NatsRequestFailure(ctx, opts)
 	e.promMetrics.NatsRequestFailure(ctx, opts)
 }
```
🧹 Nitpick comments (30)
router/pkg/metric/prom_connection_metric_store.go (1)
70-73: Avoid multiple WithAttributes calls; combine attributes once to prevent subtle option handling differences
Passing multiple `WithAttributes` options relies on API-specific merging behavior. Combine base and dynamic attributes explicitly to make behavior deterministic and future-proof.

```diff
-	o.ObserveInt64(h.instruments.connectionsActive, activeConnections,
-		otelmetric.WithAttributes(attributes...),
-		otelmetric.WithAttributes(attrs...),
-	)
+	combined := make([]attribute.KeyValue, 0, len(attributes)+len(attrs))
+	combined = append(combined, attributes...)
+	combined = append(combined, attrs...)
+	o.ObserveInt64(
+		h.instruments.connectionsActive,
+		activeConnections,
+		otelmetric.WithAttributes(combined...),
+	)
```

router/pkg/metric/oltp_connection_metric_store.go (1)
71-74: Merge base and dynamic attributes before ObserveInt64
Relying on multiple `WithAttributes` calls is avoidable and can be brittle. Construct a single attribute slice to pass once.

```diff
-	o.ObserveInt64(h.instruments.connectionsActive, activeConnections,
-		otelmetric.WithAttributes(attributes...),
-		otelmetric.WithAttributes(attrs...),
-	)
+	combined := make([]attribute.KeyValue, 0, len(attributes)+len(attrs))
+	combined = append(combined, attributes...)
+	combined = append(combined, attrs...)
+	o.ObserveInt64(
+		h.instruments.connectionsActive,
+		activeConnections,
+		otelmetric.WithAttributes(combined...),
+	)
```

router-tests/events/nats_events_test.go (2)
469-471: Nit: fix test name grammar for clarity
Small wording improvement.

```diff
-	t.Run("subscribe after message don't a boundary", func(t *testing.T) {
+	t.Run("subscribe after message doesn't add a boundary", func(t *testing.T) {
```
142-146: Make log-count assertions less brittle
Exact counts on log snapshots tend to be flaky as logging evolves. Prefer lower bounds.

```diff
-	require.Len(t, natsLogs, 2)
+	require.GreaterOrEqual(t, len(natsLogs), 2)

-	require.Len(t, providerIDFields, 3)
+	require.GreaterOrEqual(t, len(providerIDFields), 3)
```

Please confirm whether tightening/loosening these assertions affects intended coverage.
router/pkg/metric/event_measurements.go (2)
21-39: Add explicit units to counters
Counters should declare units to aid downstream backends. For pure counts, use unit "1".

```diff
-	eventsPublishMessagesOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of event messages published"),
-	}
+	eventsPublishMessagesOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of event messages published"),
+		otelmetric.WithUnit("1"),
+	}

-	eventsPublishFailuresOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of event publish failures"),
-	}
+	eventsPublishFailuresOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of event publish failures"),
+		otelmetric.WithUnit("1"),
+	}

-	eventsMessagesReceivedOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of event messages received"),
-	}
+	eventsMessagesReceivedOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of event messages received"),
+		otelmetric.WithUnit("1"),
+	}

-	natsRequestsOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of NATS requests"),
-	}
+	natsRequestsOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of NATS requests"),
+		otelmetric.WithUnit("1"),
+	}

-	natsRequestFailuresOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of NATS request failures"),
-	}
+	natsRequestFailuresOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of NATS request failures"),
+		otelmetric.WithUnit("1"),
+	}
```
9-19: Consider naming consistency for NATS metrics
Minor: for uniformity with event metrics naming, consider a "router.events.nats.request" prefix. This is optional and only if it doesn't break existing dashboards.
Confirm whether these names are already referenced downstream (dashboards/alerts). If so, skip renaming to avoid breakage.
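If the rename were pursued, here is a minimal sketch of what the instrument creation could look like — the name `router.events.nats.requests` is illustrative only and not a name decided in this PR:

```go
package metric

import (
	"fmt"

	otelmetric "go.opentelemetry.io/otel/metric"
)

// newNatsRequestCounter (hypothetical) creates the NATS request counter under
// the "router.events." prefix for consistency with the other event metrics.
// Existing dashboards would need to migrate before adopting the new name.
func newNatsRequestCounter(meter otelmetric.Meter) (otelmetric.Int64Counter, error) {
	counter, err := meter.Int64Counter(
		"router.events.nats.requests", // illustrative name
		otelmetric.WithDescription("Number of NATS requests"),
		otelmetric.WithUnit("1"),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create NATS requests counter: %w", err)
	}
	return counter, nil
}
```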
router-tests/prometheus_event_metrics_test.go (7)
51-68: Guard against nil metric family before dereferencing
If `findMetricFamilyByName` returns nil, `family.GetMetric()` will panic. Add a nil check to make failures explicit and less flaky.

Apply this diff:

```diff
-	mf, err := promRegistry.Gather()
+	mf, err := promRegistry.Gather()
 	require.NoError(t, err)
-	family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+	family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+	require.NotNil(t, family, "router_events_publish_messages_total not found in Prometheus registry")
 	metrics := family.GetMetric()
 	require.Len(t, metrics, 1)
```
117-134: Kafka subscribe: add nil check for metric family
Same issue here when accessing `family.GetMetric()`.

Apply this diff:

```diff
-	mf, err := promRegistry.Gather()
+	mf, err := promRegistry.Gather()
 	require.NoError(t, err)
-	family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+	family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+	require.NotNil(t, family, "router_events_messages_received_total not found in Prometheus registry")
 	metrics := family.GetMetric()
 	require.Len(t, metrics, 1)
```
166-183: NATS publish: add nil check for metric family
Prevent a potential panic if the family is not present yet.

Apply this diff:

```diff
-	mf, err := promRegistry.Gather()
+	mf, err := promRegistry.Gather()
 	require.NoError(t, err)
-	family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+	family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+	require.NotNil(t, family, "router_events_publish_messages_total not found in Prometheus registry")
 	metrics := family.GetMetric()
 	require.Len(t, metrics, 1)
```
208-222: NATS request: add nil check for metric family
Same pattern as above.

Apply this diff:

```diff
-	mf, err := promRegistry.Gather()
+	mf, err := promRegistry.Gather()
 	require.NoError(t, err)
-	family := findMetricFamilyByName(mf, "router_nats_request_total")
+	family := findMetricFamilyByName(mf, "router_nats_request_total")
+	require.NotNil(t, family, "router_nats_request_total not found in Prometheus registry")
 	metrics := family.GetMetric()
 	require.Len(t, metrics, 1)
```
285-301: NATS subscribe: assert non-empty metrics before indexing
This block indexes `metrics[0]` without asserting there is at least one metric. Add a check to avoid panics on slow collectors.

Apply this diff:

```diff
-	family := findMetricFamilyByName(mf, "router_events_messages_received_total")
-	metrics := family.GetMetric()
+	family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+	require.NotNil(t, family, "router_events_messages_received_total not found in Prometheus registry")
+	metrics := family.GetMetric()
+	require.NotEmpty(t, metrics, "no metrics found for router_events_messages_received_total")
```
335-352: Redis publish: add nil check for metric family
Prevent a potential panic when the family is missing.

Apply this diff:

```diff
-	mf, err := promRegistry.Gather()
+	mf, err := promRegistry.Gather()
 	require.NoError(t, err)
-	family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+	family := findMetricFamilyByName(mf, "router_events_publish_messages_total")
+	require.NotNil(t, family, "router_events_publish_messages_total not found in Prometheus registry")
 	metrics := family.GetMetric()
 	require.Len(t, metrics, 1)
```
400-416: Redis subscribe: add nil check for metric family
Same defensive pattern.

Apply this diff:

```diff
-	mf, err := promRegistry.Gather()
+	mf, err := promRegistry.Gather()
 	require.NoError(t, err)
-	family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+	family := findMetricFamilyByName(mf, "router_events_messages_received_total")
+	require.NotNil(t, family, "router_events_messages_received_total not found in Prometheus registry")
 	metrics := family.GetMetric()
 	require.Len(t, metrics, 1)
```
36-38: Document ProviderOpts and null-object expectationAdd a brief doc comment stating that EventMetricStore is always non-nil (noop when disabled). This encodes the invariant and removes the need for nil checks downstream.
Apply this diff:
-type ProviderOpts struct { - EventMetricStore metric.EventMetricStore -} +// ProviderOpts carries cross-cutting configuration for providers/adapters. +// EventMetricStore is never nil; when metrics are disabled, a no-op implementation is provided. +type ProviderOpts struct { + EventMetricStore metric.EventMetricStore +}demo/pkg/subgraphs/subgraphs.go (2)
215-217: DRY: reuse a single ProviderOpts instanceYou duplicate ProviderOpts construction for both adapters. Minor cleanup: create once and reuse.
Apply this diff:
- defaultAdapter, err := natsPubsub.NewAdapter(ctx, zap.NewNop(), url, []nats.Option{}, "hostname", "test", datasource.ProviderOpts{ - EventMetricStore: rmetric.NewNoopEventMetricStore(), - }) + po := datasource.ProviderOpts{EventMetricStore: rmetric.NewNoopEventMetricStore()} + defaultAdapter, err := natsPubsub.NewAdapter(ctx, zap.NewNop(), url, []nats.Option{}, "hostname", "test", po)
223-225: DRY: reuse ProviderOpts for the my-nats adapter as well
Follow-up to the previous comment.
Apply this diff:
```diff
-	myNatsAdapter, err := natsPubsub.NewAdapter(ctx, zap.NewNop(), url, []nats.Option{}, "hostname", "test", datasource.ProviderOpts{
-		EventMetricStore: rmetric.NewNoopEventMetricStore(),
-	})
+	myNatsAdapter, err := natsPubsub.NewAdapter(ctx, zap.NewNop(), url, []nats.Option{}, "hostname", "test", po)
```
36-39: Clarify semantics of EventMetrics (Prometheus)Add a doc comment so readers know which metrics are controlled and where they surface.
Apply this diff:
```diff
 type PrometheusConfig struct {
@@
 	// Prometheus schema field usage configuration
 	PromSchemaFieldUsage PrometheusSchemaFieldUsage

-	EventMetrics bool
+	// EventMetrics enables Prometheus event metrics (Kafka, Redis, NATS).
+	EventMetrics bool
 }
```
81-84: Clarify semantics of EventMetrics (OTLP)
Mirror the documentation for the OTLP config as well.
Apply this diff:
```diff
 type OpenTelemetry struct {
@@
 	// TestReader is used for testing purposes. If set, the reader will be used instead of the configured exporters.
-	TestReader sdkmetric.Reader
-	EventMetrics bool
+	TestReader sdkmetric.Reader
+	// EventMetrics enables OTLP event metrics (Kafka, Redis, NATS).
+	EventMetrics bool
 }
```
759-763: Remove unnecessary nil-check on EventMetricStore in ShutdownPer our repo conventions (null-object pattern), EventMetricStore is always non-nil (noop when disabled), so this nil check is redundant and slightly obscures the invariant.
Using the learning from prior work in this codebase: EventMetricStore fields are never nil—
NewNoopEventMetricStore()is used when metrics are disabled.Apply this small cleanup:
```diff
-	if s.eventMetricStore != nil {
-		if aErr := s.eventMetricStore.Shutdown(ctx); aErr != nil {
-			err = errors.Join(err, aErr)
-		}
-	}
+	if aErr := s.eventMetricStore.Shutdown(ctx); aErr != nil {
+		err = errors.Join(err, aErr)
+	}
```
37-39: Avoid brittle assertion on options count in TLS test
Hard-coding the exact length makes this test fragile as builder options can evolve. Prefer asserting "more than baseline" like your SASL tests.
```diff
-	// Can't directly check for TLS options, but we can verify more options are present
-	require.Equal(t, len(opts), 5)
+	// Can't directly check for TLS options; verify more options are present
+	require.Greater(t, len(opts), 1)
```
97-107: Assert provider identity/type to strengthen the test
You already assert adapter/logger presence. Adding ID and TypeID checks strengthens the contract of the builder.
```diff
 	kafkaProvider, ok := provider.(*datasource.PubSubProvider)
 	require.True(t, ok)
 	assert.NotNil(t, kafkaProvider.Logger)
 	assert.NotNil(t, kafkaProvider.Adapter)
+	assert.Equal(t, providerId, provider.ID())
+	assert.Equal(t, "kafka", provider.TypeID())
```
2825-2827: Use provider-specific NATS params when constructing adapters
You're iterating over DemoNatsProviders but always use `natsData.Params[0]`. If future providers use different URLs/options, both adapters will point to the same connection parameters.
A minimal refactor is to index Params in step with the providers (or map by provider ID) to avoid accidental cross-wiring. For example:
```go
// Outside this range, adjust the loop to capture the index:
for i, sourceName := range DemoNatsProviders {
	params := natsData.Params[i]
	adapter, err := pubsubNats.NewAdapter(ctx, logger, params.Url, params.Opts, "hostname", "listenaddr", datasource.ProviderOpts{
		EventMetricStore: rmetric.NewNoopEventMetricStore(), // per learnings: never nil
	})
	// ...
}
```

If Params ordering differs from provider IDs, prefer a lookup keyed by provider ID.

router-tests/telemetry/event_metrics_test.go (1)
router-tests/telemetry/event_metrics_test.go (1)
60-74: Make metric DP assertions resilient: filter by attributes instead of expecting a single datapointAsserting require.Len(sum.DataPoints, 1) is brittle if additional datapoints with different attributes appear (e.g., multiple subjects/topics in future changes). Focus on the datapoint matching provider attributes and backend-topic/subject/channel.
Example utility to locate the right datapoint:
```go
func findSumDP(d metricdata.Sum[int64], match func(attribute.Set) bool) *metricdata.DataPoint[int64] {
	for i := range d.DataPoints {
		if match(d.DataPoints[i].Attributes) {
			return &d.DataPoints[i]
		}
	}
	return nil
}

// Usage:
sum, _ := metricEntry.Data.(metricdata.Sum[int64])
dp := findSumDP(sum, func(attrs attribute.Set) bool {
	id, _ := attrs.Value(otelattrs.WgEventProviderID)
	tp, _ := attrs.Value(otelattrs.WgEventProviderType)
	topic, _ := attrs.Value(otelattrs.WgKafkaTopic) // or WgNatsSubject/WgRedisChannel
	return id.AsString() == "my-kafka" && tp.AsString() == "kafka" && strings.HasSuffix(topic.AsString(), "employeeUpdated")
})
require.NotNil(t, dp)
require.Equal(t, int64(2), dp.Value)
```

Apply the same pattern across similar assertions in this file to reduce flakiness as coverage grows.
router/pkg/pubsub/nats/provider_builder_test.go (2)
45-46: Avoid brittle assertions on the count of NATS options
The "> 7" check is tied to internal implementation details and may break with harmless changes. The NotEmpty assertion is sufficient for this unit test's purpose.
Apply this diff:
```diff
-	// Can't directly check for token options, but we can verify options are present
-	require.Greater(t, len(opts), 7) // Basic options (7) + token option
+	// Options are present; detailed introspection is not stable across versions.
+	require.NotEmpty(t, opts)
```
67-68: Same here: remove the "> 7" assertion for the user/password test
Keep tests robust against harmless changes in option composition.
Apply this diff:
```diff
-	// Can't directly check for auth options, but we can verify options are present
-	require.Greater(t, len(opts), 7) // Basic options (7) + user info option
+	// Options are present; detailed introspection is not stable across versions.
+	require.NotEmpty(t, opts)
```

router/pkg/pubsub/redis/adapter.go (1)
28-37: Annotate the logger with a "pubsub=redis" field for consistent logs
Kafka's adapter tags the logger; doing the same here improves log filtering. Respecting the null-object pattern for metrics, no nil checks are needed.
Apply this diff:
```diff
 func NewProviderAdapter(ctx context.Context, logger *zap.Logger, urls []string, clusterEnabled bool, opts datasource.ProviderOpts) Adapter {
 	ctx, cancel := context.WithCancel(ctx)
-	return &ProviderAdapter{
-		ctx:    ctx,
-		cancel: cancel,
-		logger: logger,
+	if logger == nil {
+		logger = zap.NewNop()
+	}
+	return &ProviderAdapter{
+		ctx:    ctx,
+		cancel: cancel,
+		logger: logger.With(zap.String("pubsub", "redis")),
 		urls:             urls,
 		clusterEnabled:   clusterEnabled,
 		eventMetricStore: opts.EventMetricStore,
 	}
 }
```
136-141: Use the operation context for metric emission instead of adapter contextConsistency: elsewhere (publish/request) you pass the method’s ctx to the store. Use the subscription ctx here as well.
Apply this diff:
```diff
-	p.eventMetricStore.NatsMessageReceived(p.ctx, event.ProviderID, msg.Subject())
+	p.eventMetricStore.NatsMessageReceived(ctx, event.ProviderID, msg.Subject())
```

And:
```diff
-	p.eventMetricStore.NatsMessageReceived(p.ctx, event.ProviderID, msg.Subject)
+	p.eventMetricStore.NatsMessageReceived(ctx, event.ProviderID, msg.Subject)
```

Also applies to: 172-177
205-228: Optional: ensure publish is flushed within ctx to detect broker errors earlier
NATS Publish is fire-and-forget. If you need delivery guarantees for metrics/tests, flush after publish using the provided ctx. Consider putting this behind a feature flag to avoid latency impact.
Apply this diff:
```diff
 	err := p.client.Publish(event.Subject, event.Data)
 	if err != nil {
 		log.Error("publish error", zap.Error(err))
 		p.eventMetricStore.NatsPublishFailure(ctx, event.ProviderID, event.Subject)
 		return datasource.NewError(fmt.Sprintf("error publishing to NATS subject %s", event.Subject), err)
 	} else {
 		p.eventMetricStore.NatsPublish(ctx, event.ProviderID, event.Subject)
+		// Optionally flush to ensure the server processed the publish.
+		if fErr := p.flush(ctx); fErr != nil {
+			log.Warn("flush after publish failed", zap.Error(fErr))
+		}
 	}
```
318-341: Defensive default for EventMetricStore to prevent nil deref across adapters
While the codebase aims to always pass a non-nil store (null-object pattern), a guard here prevents panics if callers regress.
Apply this diff:
```diff
 func NewAdapter(
 	ctx context.Context,
 	logger *zap.Logger,
 	url string,
 	opts []nats.Option,
 	hostName string,
 	routerListenAddr string,
 	providerOpts datasource.ProviderOpts,
 ) (Adapter, error) {
 	if logger == nil {
 		logger = zap.NewNop()
 	}
+	ems := providerOpts.EventMetricStore
+	if ems == nil {
+		ems = metric.NewNoopEventMetricStore()
+	}
+
 	return &ProviderAdapter{
 		ctx:              ctx,
 		logger:           logger.With(zap.String("pubsub", "nats")),
 		closeWg:          sync.WaitGroup{},
 		hostName:         hostName,
 		routerListenAddr: routerListenAddr,
 		url:              url,
 		opts:             opts,
 		flushTimeout:     10 * time.Second,
-		eventMetricStore: providerOpts.EventMetricStore,
+		eventMetricStore: ems,
 	}, nil
 }
```
router/pkg/metric/noop_event_metrics.go (1)
1-51: Add compile-time interface assertions for safety
Assert the implementations at compile time to catch signature drift early.
Apply this diff:
```diff
 package metric

 import (
 	"context"

 	otelmetric "go.opentelemetry.io/otel/metric"
 )

 // A noop metric provider so we do not need to do nil checks for each provider call from the store
 type noopEventMetricProvider struct{}

 func (n *noopEventMetricProvider) Publish(ctx context.Context, opts ...otelmetric.AddOption)        {}
 func (n *noopEventMetricProvider) PublishFailure(ctx context.Context, opts ...otelmetric.AddOption) {}
 func (n *noopEventMetricProvider) MessagesReceived(ctx context.Context, opts ...otelmetric.AddOption) {
 }
 func (n *noopEventMetricProvider) NatsRequest(ctx context.Context, opts ...otelmetric.AddOption) {}
 func (n *noopEventMetricProvider) NatsRequestFailure(ctx context.Context, opts ...otelmetric.AddOption) {
 }
 func (n *noopEventMetricProvider) Flush(ctx context.Context) error { return nil }
 func (n *noopEventMetricProvider) Shutdown() error                 { return nil }

 type NoopEventMetricStore struct{}

 func (n *NoopEventMetricStore) KafkaPublish(ctx context.Context, providerID string, topic string) {}
 func (n *NoopEventMetricStore) KafkaPublishFailure(ctx context.Context, providerID string, topic string) {
 }
 func (n *NoopEventMetricStore) KafkaMessageReceived(ctx context.Context, providerID string, topic string) {
 }
 func (n *NoopEventMetricStore) RedisPublish(ctx context.Context, providerID string, channel string) {}
 func (n *NoopEventMetricStore) RedisPublishFailure(ctx context.Context, providerID string, channel string) {
 }
 func (n *NoopEventMetricStore) RedisMessageReceived(ctx context.Context, providerID string, channel string) {
 }
 func (n *NoopEventMetricStore) NatsPublish(ctx context.Context, providerID string, subject string) {}
 func (n *NoopEventMetricStore) NatsPublishFailure(ctx context.Context, providerID string, subject string) {
 }
 func (n *NoopEventMetricStore) NatsMessageReceived(ctx context.Context, providerID string, subject string) {
 }
 func (n *NoopEventMetricStore) NatsRequest(ctx context.Context, providerID string, subject string) {}
 func (n *NoopEventMetricStore) NatsRequestFailure(ctx context.Context, providerID string, subject string) {
 }
 func (n *NoopEventMetricStore) Flush(ctx context.Context) error    { return nil }
 func (n *NoopEventMetricStore) Shutdown(ctx context.Context) error { return nil }

 func NewNoopEventMetricStore() *NoopEventMetricStore { return &NoopEventMetricStore{} }
+
+// Compile-time assertions
+var _ EventMetricStore = (*NoopEventMetricStore)(nil)
+var _ EventMetricProvider = (*noopEventMetricProvider)(nil)
```
Actionable comments posted: 1
🧹 Nitpick comments (4)
router-tests/prometheus_event_metrics_test.go (4)
26-26: Rename the test for clarity (it's not intended to be flaky)
Consider renaming it to better reflect intent and avoid confusion during triage.
Apply:
```diff
-func TestFlakyEventMetrics(t *testing.T) {
+func TestPrometheusEventMetrics(t *testing.T) {
```
51-69: Stabilize metric assertions with eventual/polling checks
Prometheus collection via OTel can be slightly eventual; asserting immediately after producing/consuming can be flaky under load. Consider polling Gather until the family appears and the expected count matches, or a timeout is reached.
If you start seeing CI flakes, add a tiny helper and replace direct Gather calls:
```go
// Add once in this file:
func awaitCounterValue(t *testing.T, reg *prometheus.Registry, metricName string, expected float64, timeout time.Duration) []*io_prometheus_client.Metric {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for {
		mf, err := reg.Gather()
		require.NoError(t, err)
		family := findMetricFamilyByName(mf, metricName)
		if family != nil {
			metrics := family.GetMetric()
			if len(metrics) > 0 && metrics[0].GetCounter() != nil && metrics[0].GetCounter().GetValue() == expected {
				return metrics
			}
		}
		if time.Now().After(deadline) {
			require.Failf(t, "metric not observed", "metric %s did not reach value %v before timeout", metricName, expected)
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```
- mf, err := promRegistry.Gather() - require.NoError(t, err) - family := findMetricFamilyByName(mf, "messaging_client_sent_messages_total") - require.NotNil(t, family) - metrics := family.GetMetric() + metrics := awaitCounterValue(t, promRegistry, "messaging_client_sent_messages_total", 2, WaitTimeout)Note: you’ll need to import io_prometheus_client: "github.com/prometheus/client_model/go" as io_prometheus_client.
Also applies to: 118-136, 168-185, 210-228, 284-301, 335-353, 401-418
27-27: Nested t.Parallel() may over-stress shared brokers in CI
Running all backends and their subtests in parallel can increase flakiness on constrained CI hosts (Kafka/NATS/Redis are shared Docker services). Consider reducing parallelism at one level or gating with a single top-level Parallel.
Would you like me to propose a table-driven structure with selective parallelism to keep broker load bounded? A sketch of the shape is shown after the list below.
Also applies to: 29-31, 72-74, 147-149, 188-190, 231-233, 317-319, 356-358
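A minimal sketch of the table-driven shape this could take, under stated assumptions — the per-backend runner helpers (`runKafkaSubtests` and friends) are hypothetical and would need to wrap the existing subtests:

```go
package events_test

import "testing"

// Hypothetical per-backend suites; bodies elided in this sketch.
func runKafkaSubtests(t *testing.T) { t.Helper() /* kafka subtests */ }
func runNatsSubtests(t *testing.T)  { t.Helper() /* nats subtests */ }
func runRedisSubtests(t *testing.T) { t.Helper() /* redis subtests */ }

func TestEventMetricsBounded(t *testing.T) {
	backends := []struct {
		name string
		run  func(t *testing.T)
	}{
		{"kafka", runKafkaSubtests},
		{"nats", runNatsSubtests},
		{"redis", runRedisSubtests},
	}
	for _, be := range backends {
		be := be
		// No t.Parallel() at this level: backends run sequentially so only one
		// shared broker is under load at a time; subtests inside each runner
		// may still parallelize among themselves.
		t.Run(be.name, func(t *testing.T) { be.run(t) })
	}
}
```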
241-242: 1s WebSocketClientReadTimeout might be too aggressive
On slow CI, a 1s read timeout can prematurely trigger disconnects. Consider 2–5s unless there's a strong reason to keep it at 1s.
```diff
-	ModifyEngineExecutionConfiguration: func(ec *config.EngineExecutionConfiguration) { ec.WebSocketClientReadTimeout = time.Second },
+	ModifyEngineExecutionConfiguration: func(ec *config.EngineExecutionConfiguration) { ec.WebSocketClientReadTimeout = 3 * time.Second },
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- router-tests/prometheus_event_metrics_test.go (1 hunks)
- router-tests/telemetry/event_metrics_test.go (1 hunks)
- router/pkg/pubsub/nats/adapter.go (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- router/pkg/pubsub/nats/adapter.go
- router-tests/telemetry/event_metrics_test.go
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
🧬 Code Graph Analysis (1)
router-tests/prometheus_event_metrics_test.go (5)
- router-tests/testenv/testenv.go (8)
  - Run (105-122), Config (284-340), ConfigWithEdfsKafkaJSONTemplate (85-85), MetricOptions (263-277), Environment (1727-1763), GraphQLRequest (1895-1903), ConfigWithEdfsNatsJSONTemplate (87-87), ConfigWithEdfsRedisJSONTemplate (89-89)
- demo/pkg/subgraphs/subgraphs.go (1)
  - Config (79-84)
- router-tests/events/event_helpers.go (2)
  - EnsureTopicExists (37-54), ProduceKafkaMessage (16-35)
- router-tests/testenv/utils.go (1)
  - AwaitChannelWithT (10-19)
- router/pkg/config/config.go (1)
  - EngineExecutionConfiguration (370-398)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: build_test
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: build_test
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan (nonroot)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
router-tests/prometheus_event_metrics_test.go (1)
29-69: Strong end-to-end coverage and label assertions across Kafka, NATS, and Redis
Good integration coverage validating both producer and consumer counters, per-backend operation names, and destination/system labels. The per-test Prometheus registry and manual reader scoping reduce cross-test interference. The subscription lifecycle is handled cleanly with WaitForSubscriptionCount/ConnectionCount and client.Close waits.
Also applies to: 147-185, 188-229, 231-311, 317-354, 356-425
Actionable comments posted: 5
🔭 Outside diff range comments (1)
router/pkg/pubsub/kafka/adapter.go (1)
145-161: Close the per-subscription consumer client when the poller exits
Each Subscribe creates a client that is never closed. Ensure `client.Close()` is called when the goroutine finishes to avoid leaking resources.
Apply:
```diff
 	go func() {
 		defer p.closeWg.Done()
+		defer client.Close()

 		err := p.topicPoller(ctx, client, updater, PollerOpts{providerId: event.ProviderID})
```
🧹 Nitpick comments (17)
router/pkg/otel/attributes.go (1)
63-70: Align WG Messaging Attributes with OpenTelemetry SemConv
Per the OpenTelemetry semantic conventions, standard messaging attributes use keys like `messaging.system`, `messaging.operation.name`, `messaging.operation.type`, `messaging.destination.name`, etc., and errors are represented via `error.type` and `error.message`, plus exception event attributes (`exception.type`, `exception.message`, `exception.stacktrace`) on spans.

In router/pkg/otel/attributes.go (lines 63–70), we currently define only vendor-prefixed keys:

```go
const (
	MessagingOperationName   = attribute.Key("wg.messaging.operation.name")
	MessagingSystem          = attribute.Key("wg.messaging.system")
	MessagingError           = attribute.Key("wg.messaging.error")
	MessagingDestinationName = attribute.Key("wg.messaging.destination.name")
	ProviderId               = attribute.Key("wg.provider.id")
)
```
const ( MessagingOperationName = attribute.Key("wg.messaging.operation.name") MessagingSystem = attribute.Key("wg.messaging.system") MessagingError = attribute.Key("wg.messaging.error") MessagingDestinationName = attribute.Key("wg.messaging.destination.name") ProviderId = attribute.Key("wg.provider.id") )Recommendations:
- Emit the standard semconv keys alongside the
wg.*keys. For example:
•attribute.Key("messaging.system")
•attribute.Key("messaging.operation.name")
•attribute.Key("messaging.operation.type")
•attribute.Key("messaging.destination.name")
• …and so on for other relevant attributes.- For errors, prefer setting
error.type(low-cardinality) anderror.message, and record exceptions via the exception semantic event attributes rather than a customwg.messaging.errorkey.- Optionally, introduce a bridging/mapping layer so that downstream consumers querying only standard semconv keys can work without awareness of the
wg.*prefixes.router/pkg/metric/messaging_event_measurements.go (2)
router/pkg/metric/messaging_event_measurements.go (2)

14-21: Tighten counter descriptions for clarity
Descriptions are a bit awkward and include "event" redundantly. Make them clear and user-facing.
```diff
-	messagingSentMessagesOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of messaging event sent messages"),
-	}
+	messagingSentMessagesOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of messages sent by the client"),
+	}

-	messagingConsumedMessagesOptions = []otelmetric.Int64CounterOption{
-		otelmetric.WithDescription("Number of messaging event consumed messages"),
-	}
+	messagingConsumedMessagesOptions = []otelmetric.Int64CounterOption{
+		otelmetric.WithDescription("Number of messages consumed by the client"),
+	}
```
37-43: Keep error messages consistent with the chosen terminology
If we standardize on "consumed" instead of "received", update the error string.
```diff
-	if err != nil {
-		return nil, fmt.Errorf("failed to create received messages counter: %w", err)
-	}
+	if err != nil {
+		return nil, fmt.Errorf("failed to create consumed messages counter: %w", err)
+	}
```

router/pkg/metric/oltp_messaging_event_metric_store.go (2)
16-21: Remove the unused logger field or put it to use
`logger` is stored but never used. Either remove it to reduce noise or add debug logs when creating instruments/flushing.
```diff
 type otlpMessagingEventMetrics struct {
 	instruments   *eventInstruments
 	meterProvider *metric.MeterProvider
-	logger        *zap.Logger
 	meter         otelmetric.Meter
 }

 func newOtlpMessagingEventMetrics(logger *zap.Logger, meterProvider *metric.MeterProvider) (*otlpMessagingEventMetrics, error) {
@@
 	return &otlpMessagingEventMetrics{
 		instruments:   instruments,
 		meterProvider: meterProvider,
-		logger:        logger,
 		meter:         meter,
 	}, nil
 }
```

If you plan to add logs shortly, keeping it is fine; otherwise consider dropping it.
50-52: Clarify Flush intent (tests vs prod)
ForceFlush can be expensive with some SDKs/exporters. If it is intended primarily for tests/shutdown paths, add a brief comment to discourage hot-path use.
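A minimal sketch of the kind of note suggested, assuming the store keeps its `*sdkmetric.MeterProvider` field — the struct name is abbreviated for the sketch and the wording is illustrative:

```go
package metric

import (
	"context"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

type otlpEventMetricsSketch struct {
	meterProvider *sdkmetric.MeterProvider
}

// Flush forces an export of buffered metric data. ForceFlush can be expensive
// with some SDK/exporter combinations, so prefer calling it from tests and
// shutdown paths rather than on hot request paths.
func (e *otlpEventMetricsSketch) Flush(ctx context.Context) error {
	return e.meterProvider.ForceFlush(ctx)
}
```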
router/core/graph_server.go (1)
763-768: Redundant nil check (optional)Given we initialize with NewNoopEventMetricStore(), this cannot be nil. Safe to leave as-is; optional to remove for brevity.
router-tests/telemetry/messaging_event_metrics_test.go (2)
29-31: Rename test to remove “Flaky” for clarity.The name suggests instability; consider a neutral name like TestMessagingEventMetrics to avoid confusion in CI dashboards.
Apply this change:
```diff
-func TestFlakyEventMetrics(t *testing.T) {
+func TestMessagingEventMetrics(t *testing.T) {
```
224-275: NATS request path coverage is valuable; consider adding a consumer-side check (optional)
You assert messaging.event.sent.messages for the request. If feasible, also add a received-messages assertion on the reply side to cover both directions. Optional, not blocking.
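If the reply-side check is added, a minimal assertion sketch — the metric name `messaging.event.consumed.messages` is an assumption based on the consumed-messages counter defined in messaging_event_measurements.go, and the helper is hypothetical:

```go
package telemetry

import (
	"testing"

	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

// assertReplyConsumed (hypothetical) checks that a consumed-messages datapoint
// was recorded for the reply side of a NATS request.
func assertReplyConsumed(t *testing.T, sm metricdata.ScopeMetrics) {
	t.Helper()
	for _, m := range sm.Metrics {
		if m.Name == "messaging.event.consumed.messages" { // assumed name
			sum, ok := m.Data.(metricdata.Sum[int64])
			require.True(t, ok)
			require.NotEmpty(t, sum.DataPoints)
			return
		}
	}
	t.Fatal("expected a consumed-messages metric for the request reply side")
}
```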
router-tests/prometheus_messaging_event_metrics_test.go (2)
26-28: Rename the test to remove "Flaky" for clarity
Same suggestion as for the OTLP test suite: neutral naming reads better in CI and issue triage.
Apply this change:
```diff
-func TestFlakyEventMetrics(t *testing.T) {
+func TestPrometheusMessagingEventMetrics(t *testing.T) {
```
206-253: NATS request: complete Prometheus coverage for the send path
Optional: you could assert that only a single timeseries exists (require.Len == 1 already does) and consider adding a receive-side metric assertion if one is surfaced.
router/pkg/pubsub/redis/adapter.go (1)
114-118: Don’t terminate the subscription loop on a nil messageCurrent behavior logs “skipping” but returns, which stops the goroutine. Prefer continuing the loop to avoid tearing down the subscription due to a single nil message.
Apply:
```diff
-	if msg == nil {
-		log.Debug("empty message received on subscription update, skipping")
-		return
-	}
+	if msg == nil {
+		log.Debug("empty message received on subscription update, skipping")
+		continue
+	}
```
24-27: Optional: standardize producer operation name to “publish” for consistencyKafka uses “produce” while Redis/NATS use “publish”. If you want uniform Prom series and dashboards, consider renaming to “publish”. Not a blocker if you deliberately keep broker-native verbs.
Apply:
```diff
-const (
-	kafkaReceive = "receive"
-	kafkaProduce = "produce"
-)
+const (
+	kafkaReceive = "receive"
+	kafkaPublish = "publish"
+)
```

And update the usage:
```diff
-		OperationName: kafkaProduce,
+		OperationName: kafkaPublish,
```
102-108: Use the subscription context for metric emission
Using ctx instead of p.ctx ties the metric to the subscription lifecycle and supports finer-grained cancellation/attribute propagation (mirroring the select paths).
Apply:
```diff
-	p.messagingEventMetricStore.Consume(p.ctx, metric.MessagingEvent{
+	p.messagingEventMetricStore.Consume(ctx, metric.MessagingEvent{
```
24-31: Go naming: rename ProviderId to ProviderIDGo prefers initialisms to be capitalized (ID vs Id). Consider renaming the field; the attribute key can remain wg.provider.id.
Apply:
```diff
 type MessagingEvent struct {
-	ProviderId      string       // The id of the provider defined in the configuration
+	ProviderID      string       // The id of the provider defined in the configuration
 	OperationName   string       // The operation name that is specific to the messaging system
 	MessagingSystem ProviderType // The messaging system type that are supported
 	Error           bool         // Indicates if the operation resulted in an error or not (true or false)
 	DestinationName string       // The name of the destination queue / topic / channel
 }
```

Follow-ups: update call sites and attribute construction accordingly.
Would you like me to prepare a repo-wide codemod patch to rename the field and update references?
88-105: DRY: unify attribute construction for Produce/Consume
Produce and Consume duplicate the attribute building. Extract it into a helper to reduce drift and simplify future additions.
Apply within these methods:
```diff
 func (e *MessagingEventMetrics) Produce(ctx context.Context, event MessagingEvent) {
-	attrs := []attribute.KeyValue{
-		otel.MessagingOperationName.String(event.OperationName),
-		otel.MessagingSystem.String(string(event.MessagingSystem)),
-		otel.MessagingError.Bool(event.Error),
-	}
-	if event.ProviderId != "" {
-		attrs = append(attrs, otel.ProviderId.String(event.ProviderId))
-	}
-	if event.DestinationName != "" {
-		attrs = append(attrs, otel.MessagingDestinationName.String(event.DestinationName))
-	}
-	opt := e.withAttrs(attrs...)
+	opt := e.withAttrs(e.eventAttrs(event)...)
 	for _, provider := range e.providers {
 		provider.Produce(ctx, opt)
 	}
 }

 func (e *MessagingEventMetrics) Consume(ctx context.Context, event MessagingEvent) {
-	attrs := []attribute.KeyValue{
-		otel.MessagingOperationName.String(event.OperationName),
-		otel.MessagingSystem.String(string(event.MessagingSystem)),
-		otel.MessagingError.Bool(event.Error),
-	}
-	if event.ProviderId != "" {
-		attrs = append(attrs, otel.ProviderId.String(event.ProviderId))
-	}
-	if event.DestinationName != "" {
-		attrs = append(attrs, otel.MessagingDestinationName.String(event.DestinationName))
-	}
-	opt := e.withAttrs(attrs...)
+	opt := e.withAttrs(e.eventAttrs(event)...)
 	for _, provider := range e.providers {
 		provider.Consume(ctx, opt)
 	}
 }
```

Add this helper (outside of the selected lines):
```go
// eventAttrs builds the per-event attributes.
func (e *MessagingEventMetrics) eventAttrs(event MessagingEvent) []attribute.KeyValue {
	attrs := []attribute.KeyValue{
		otel.MessagingOperationName.String(event.OperationName),
		otel.MessagingSystem.String(string(event.MessagingSystem)),
		otel.MessagingError.Bool(event.Error),
	}
	// Keep optional labels sparse to reduce cardinality.
	if event.ProviderId != "" {
		attrs = append(attrs, otel.ProviderId.String(event.ProviderId))
	}
	if event.DestinationName != "" {
		attrs = append(attrs, otel.MessagingDestinationName.String(event.DestinationName))
	}
	return attrs
}
```

Also applies to: 107-125
router/pkg/pubsub/nats/adapter.go (2)
144-149: Use the subscription context for metric emission (durable path)
Prefer ctx over p.ctx so the emission aligns with the subscription scope and cancellation.
Apply:
```diff
-	p.messagingEventMetricStore.Consume(p.ctx, metric.MessagingEvent{
+	p.messagingEventMetricStore.Consume(ctx, metric.MessagingEvent{
```
187-192: Use the subscription context for metric emission (chan subscribe path)
Same rationale as the durable path.
Apply:
```diff
-	p.messagingEventMetricStore.Consume(p.ctx, metric.MessagingEvent{
+	p.messagingEventMetricStore.Consume(ctx, metric.MessagingEvent{
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- demo/pkg/subgraphs/subgraphs.go (2 hunks)
- router-tests/prometheus_messaging_event_metrics_test.go (1 hunks)
- router-tests/telemetry/messaging_event_metrics_test.go (1 hunks)
- router-tests/testenv/testenv.go (5 hunks)
- router/core/factoryresolver.go (3 hunks)
- router/core/graph_server.go (5 hunks)
- router/core/router.go (2 hunks)
- router/pkg/config/config.go (2 hunks)
- router/pkg/config/config.schema.json (2 hunks)
- router/pkg/config/testdata/config_defaults.json (2 hunks)
- router/pkg/config/testdata/config_full.json (2 hunks)
- router/pkg/metric/config.go (2 hunks)
- router/pkg/metric/messaging_event_measurements.go (1 hunks)
- router/pkg/metric/messaging_event_metric_store.go (1 hunks)
- router/pkg/metric/noop_event_metrics.go (1 hunks)
- router/pkg/metric/oltp_messaging_event_metric_store.go (1 hunks)
- router/pkg/metric/prom_messaging_event_metric_store.go (1 hunks)
- router/pkg/otel/attributes.go (2 hunks)
- router/pkg/pubsub/datasource/provider.go (2 hunks)
- router/pkg/pubsub/kafka/adapter.go (7 hunks)
- router/pkg/pubsub/nats/adapter.go (9 hunks)
- router/pkg/pubsub/pubsub.go (7 hunks)
- router/pkg/pubsub/redis/adapter.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- router/pkg/config/testdata/config_full.json
- demo/pkg/subgraphs/subgraphs.go
- router/pkg/pubsub/datasource/provider.go
- router/pkg/metric/config.go
- router/pkg/metric/noop_event_metrics.go
- router/pkg/config/config.go
- router/pkg/config/testdata/config_defaults.json
- router/core/router.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
- router/pkg/metric/oltp_messaging_event_metric_store.go
- router/pkg/metric/prom_messaging_event_metric_store.go
- router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:46:00.905Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
- router/pkg/metric/oltp_messaging_event_metric_store.go
- router/pkg/metric/prom_messaging_event_metric_store.go
- router/core/graph_server.go
- router/pkg/metric/messaging_event_metric_store.go
- router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:45:57.468Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
- router/pkg/metric/oltp_messaging_event_metric_store.go
- router/pkg/metric/prom_messaging_event_metric_store.go
- router/core/graph_server.go
- router/pkg/metric/messaging_event_metric_store.go
- router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T16:47:03.713Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.713Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
- router/pkg/metric/oltp_messaging_event_metric_store.go
- router/pkg/pubsub/kafka/adapter.go
🧬 Code Graph Analysis (10)
router-tests/telemetry/messaging_event_metrics_test.go (5)
- router-tests/events/event_helpers.go (3)
  - EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-81)
- router-tests/utils.go (2)
  - GetMetricScopeByName (88-95), GetMetricByName (79-86)
- demo/cmd/subscriptiontest/main.go (1)
  - Data (32-34)
- router/pkg/otel/attributes.go (5)
  - MessagingOperationName (65-65), MessagingSystem (66-66), MessagingDestinationName (68-68), ProviderId (69-69), MessagingError (67-67)
- router-tests/testenv/utils.go (1)
  - AwaitChannelWithT (10-19)
router-tests/prometheus_messaging_event_metrics_test.go (4)
router-tests/testenv/testenv.go (3)
Run(105-122)MetricOptions(263-277)Environment(1727-1763)router-tests/events/event_helpers.go (3)
EnsureTopicExists(37-54)ProduceKafkaMessage(16-35)ProduceRedisMessage(56-81)router-tests/testenv/utils.go (1)
AwaitChannelWithT(10-19)router/pkg/config/config.go (1)
EngineExecutionConfiguration(370-398)
router/core/graph_server.go (5)
router/pkg/metric/messaging_event_metric_store.go (3)
MessagingEventMetricStore(41-47)MessagingEventMetrics(50-54)NewMessagingEventMetricStore(56-81)router/pkg/metric/noop_metrics.go (1)
NewNoopMetrics(63-65)router/pkg/metric/noop_event_metrics.go (1)
NewNoopEventMetricStore(15-15)router/pkg/metric/config.go (1)
OpenTelemetry(69-84)router/pkg/config/config.go (1)
Prometheus(97-111)
router/core/factoryresolver.go (2)
router/pkg/config/config.go (4)
EngineExecutionConfiguration(370-398)HeaderRules(250-255)EventsConfiguration(627-629)SubgraphErrorPropagationConfiguration(746-758)router/pkg/metric/messaging_event_metric_store.go (1)
MessagingEventMetricStore(41-47)
router/pkg/metric/messaging_event_metric_store.go (3)
router/pkg/otel/attributes.go (5)
ProviderId(69-69)MessagingSystem(66-66)MessagingOperationName(65-65)MessagingError(67-67)MessagingDestinationName(68-68)router/pkg/config/config.go (2)
Config(979-1053)Prometheus(97-111)router/pkg/metric/config.go (2)
Config(107-131)OpenTelemetry(69-84)
router/pkg/pubsub/pubsub.go (6)
router/pkg/metric/messaging_event_metric_store.go (1)
MessagingEventMetricStore(41-47)router/pkg/pubsub/datasource/provider.go (3)
Provider(18-24)ProviderBuilder(27-34)ProviderOpts(36-38)router/pkg/metric/noop_event_metrics.go (1)
NewNoopEventMetricStore(15-15)router/pkg/pubsub/kafka/provider_builder.go (1)
ProviderBuilder(21-27)router/pkg/pubsub/nats/provider_builder.go (1)
ProviderBuilder(18-24)router/pkg/pubsub/redis/provider_builder.go (1)
ProviderBuilder(16-22)
router/pkg/pubsub/nats/adapter.go (5)
router/pkg/metric/messaging_event_metric_store.go (3)
MessagingEventMetricStore(41-47)MessagingEvent(25-31)ProviderTypeNats(20-20)router/pkg/pubsub/kafka/adapter.go (2)
ProviderAdapter(42-50)Adapter(30-35)router/pkg/pubsub/nats/engine_datasource.go (1)
PublishAndRequestEventConfiguration(28-32)router/pkg/pubsub/datasource/error.go (2)
Error(3-6)NewError(12-17)router/pkg/pubsub/datasource/provider.go (1)
ProviderOpts(36-38)
router/pkg/pubsub/redis/adapter.go (3)
router/pkg/pubsub/datasource/provider.go (1)
ProviderOpts(36-38)router/pkg/metric/messaging_event_metric_store.go (3)
MessagingEventMetricStore(41-47)MessagingEvent(25-31)ProviderTypeRedis(21-21)router/internal/persistedoperation/operationstorage/redis/rdcloser.go (1)
RDCloser(14-16)
router/pkg/pubsub/kafka/adapter.go (5)
router/pkg/metric/messaging_event_metric_store.go (3)
MessagingEventMetricStore(41-47)MessagingEvent(25-31)ProviderTypeKafka(19-19)router/pkg/otel/attributes.go (2)
ProviderId(69-69)MessagingSystem(66-66)router/pkg/pubsub/datasource/error.go (2)
Error(3-6)NewError(12-17)router/pkg/pubsub/datasource/provider.go (1)
ProviderOpts(36-38)router/pkg/metric/noop_event_metrics.go (1)
NewNoopEventMetricStore(15-15)
router-tests/testenv/testenv.go (5)
router/pkg/config/config.go (1)
CircuitBreaker(213-224)router/pkg/metric/messaging_event_metric_store.go (2)
MessagingEventMetrics(50-54)MessagingEventMetricStore(41-47)router/pkg/pubsub/nats/adapter.go (1)
NewAdapter(356-372)router/pkg/pubsub/datasource/provider.go (1)
ProviderOpts(36-38)router/pkg/metric/noop_event_metrics.go (1)
NewNoopEventMetricStore(15-15)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: build-router
- GitHub Check: image_scan
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: build_test
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_test
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (42)
router/pkg/metric/oltp_messaging_event_metric_store.go (1)
42-48: LGTM: produce/consume increment with flexible attributes. Incrementing by 1 and accepting AddOptions enables the store wrapper to attach attributes consistently. Good separation of concerns.
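For readers new to this pattern, a minimal sketch of a provider method that owns only the instrument and takes attributes as `AddOption`s; the type names here are illustrative, not the PR's exact code:

```go
package metric

import (
	"context"

	otelmetric "go.opentelemetry.io/otel/metric"
)

// sentCounter is an illustrative provider: it owns only the instrument.
type sentCounter struct {
	counter otelmetric.Int64Counter
}

// Produce increments by one; the wrapping store decides which base and
// per-event attributes to attach via opts.
func (s *sentCounter) Produce(ctx context.Context, opts ...otelmetric.AddOption) {
	s.counter.Add(ctx, 1, opts...)
}
```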
router/core/graph_server.go (3)
521-522: LGTM: graphMux gains MessagingEventMetricStore. The field addition is aligned with how other metric stores are hung off graphMux.
783-786: Good defaulting to the Noop store (null object pattern). Initializing messagingEventMetricStore to a no-op avoids nil checks across the codebase and aligns with existing patterns.
883-895: LGTM: conditional wiring of messaging event metrics. Creation is gated behind exporter-specific toggles and uses base attributes/meter providers. Error propagation is correct.
router/pkg/config/config.schema.json (1)
1069-1073: LGTM: OTLP messaging event metrics flag added. The naming matches the code (MessagingEventMetrics) and the description is clear.
router-tests/testenv/testenv.go (5)
13-13: New datasource import is appropriate for ProviderOpts wiring. Importing the datasource package here is necessary to pass ProviderOpts to the NATS adapter. Looks good.
273-277: Event-metric toggles added to MetricOptions are well-scoped. Adding EnableOTLPEventMetrics and EnablePrometheusEventMetrics keeps the testenv interface consistent with other metric toggles and makes enabling/disabling explicit per backend.
1508-1512: Prometheus config: event metrics and exclusions are correctly propagated. Wiring MessagingEventMetrics and the exclusion lists (metrics, labels, scope info) into PrometheusConfig is consistent with how other metric features are toggled. No issues.
1530-1535: OTLP config: event metrics flag properly plumbed. Setting Enabled and propagating RouterRuntime, GraphqlCache, MessagingEventMetrics, and ConnectionStats from test options when a manual MetricReader is present matches the pattern used elsewhere.
2827-2829: All NewAdapter calls updated to include ProviderOpts. Verified that every invocation of NewAdapter now passes the new providerOpts argument. No call sites were found using the old signature.
- router/pkg/pubsub/nats/provider_builder.go
- router-tests/testenv/testenv.go
- demo/pkg/subgraphs/subgraphs.go

router-tests/telemetry/messaging_event_metrics_test.go (6)
55-83: Kafka publish metric assertions are precise and aligned with conventions. Validating scope, metric name, attributes (operation/system/destination/provider/error), and count covers the critical surface. Good use of a manual reader to deterministically collect datapoints.
86-166: End-to-end subscription path validation (Kafka) is solid. The test synchronizes on subscription establishment, triggers a message, asserts the payload, and validates the received.messages counter and attributes. This provides good signal against regressions.
169-222: NATS publish path instrumentation looks correct. Checking publish operation semantics for NATS with the expected attribute set keeps parity with Kafka and Redis coverage. LGTM.
277-370: NATS subscribe path: stability tweak and assertions are appropriate. Lowering WebSocketClientReadTimeout and explicitly waiting for subscription/connection counts helps reduce flakes. Attribute checks are consistent with the publish/request tests. Good.
373-424: Redis publish path assertions mirror other backends well. The label set and counter value assertions look correct and consistent. LGTM.
426-505: Redis subscribe path: complete and consistent. Good synchronization, payload validation, and metric assertions. No issues spotted.
router/pkg/metric/prom_messaging_event_metric_store.go (1)
16-21: Instrument names verified and correct. The newMessagingEventInstruments function registers counters named messaging.event.sent.messages and messaging.event.received.messages. The Prometheus exporter will convert these to messaging_event_sent_messages_total and messaging_event_received_messages_total, matching the tests. No changes required.
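For context on that conversion, a brief sketch of the mapping. The scope name and exporter setup below are assumptions; the rule itself (dots become underscores, and monotonic counters gain a `_total` suffix) is standard OTel Prometheus exporter behavior:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
)

func main() {
	// Assumes a Prometheus exporter was installed as the global MeterProvider.
	meter := otel.Meter("cosmo.router.events") // scope name is illustrative

	sent, err := meter.Int64Counter("messaging.event.sent.messages")
	if err != nil {
		log.Fatal(err)
	}
	sent.Add(context.Background(), 1)

	// The Prometheus exporter exposes this on /metrics roughly as:
	//   messaging_event_sent_messages_total{...} 1
}
```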
router/core/factoryresolver.go (3)
7-7: Metric import alias addition is correct. Bringing in rmetric for the new store wiring is appropriate here.
210-216: RouterEngineConfiguration: the new MessagingEventMetricStore field is well-placed. Placing the store alongside Events and other execution config keeps the loader's dependencies cohesive.
472-480: All BuildProvidersAndDataSources calls match the updated signature
Both the resolver and all tests now pass the store (or nil) before the logger, matching:

```go
BuildProvidersAndDataSources(
    ctx,                        // 1
    config.EventsConfiguration, // 2
    store,                      // 3 ← MessagingEventMetricStore or nil
    logger,                     // 4
    dsConfs,                    // 5
    hostName,                   // 6
    listenAddr,                 // 7
)
```

Verified locations:
- router/core/factoryresolver.go (line 472)
- router/pkg/pubsub/pubsub_test.go (all three tests)
No further changes required.
router-tests/prometheus_messaging_event_metrics_test.go (7)
35-75: Kafka publish: Prometheus metric and label assertions are on point.Validating metric family name, label set, and counter value demonstrates the end-to-end exposition is correct.
81-153: Kafka subscribe: solid E2E and Prom assertions.Subscription establishment, message production, and subsequent registry scrape/verify pattern is good. LGTM.
162-204: NATS publish Prometheus assertions mirror OTLP coverage well.Checks are consistent and comprehensive for the label set. Looks good.
255-341: NATS subscribe: stable pattern with explicit waits and full label verification.This matches the approach in OTLP tests and should be robust in CI.
347-390: Redis publish: label/value assertions are correct.Good parity with Kafka and NATS tests. No issues.
392-467: Redis subscribe: end-to-end and Prom exposition verification are consistent.Solid assertions and cleanup. LGTM.
51-75: Helper functions confirmed. The tests in router-tests/prometheus_messaging_event_metrics_test.go use findMetricFamilyByName and findMetricLabelByName, which are defined in router-tests/prometheus_test.go (lines 5527-5536). No changes needed.
router/pkg/pubsub/redis/adapter.go (5)
16-19: Good: backend-specific operation name constants defined. Defining redisPublish/redisReceive improves readability and reduces typos across emit sites.
33-42: Wiring metrics via ProviderOpts is correct (null-object pattern assumed). Accepting ProviderOpts and storing opts.MessagingEventMetricStore aligns with the new build flow. Given the repo uses a no-op store when metrics are disabled, keeping this field without extra nil checks is fine.
119-124: Consume metrics include the required attributes. Emits ProviderId, operation, system, and destination. This satisfies the shared label set and matches the store's attribute handling.
163-169: Produce error path metrics look good. Captures ProviderId, operation, system, destination, and sets Error=true. Matches the store schema.
173-178: Produce success path metrics look good. Same attribute set without the error flag. Consistent with the error-path emission.
router/pkg/pubsub/kafka/adapter.go (3)
197-206: Produce error path metrics look good. Captures ProviderId, operation, system, destination, and sets Error=true. Matches the store schema and Kafka adapter semantics.
210-215: Produce success path metrics look good. Same attribute set without the error flag. Consistent with the error-path emission.
257-277: Good: no-op fallback when the providerOpts store is nil. This guarantees messagingEventMetricStore is always usable without nil checks (aligns with the project's null-object pattern).
router/pkg/metric/messaging_event_metric_store.go (1)
140-149: LGTM: Shutdown performs a flush; provider shutdown is orchestrated elsewhere. Per project conventions, meter providers are shut down at the router level, so keeping Shutdown to just a Flush is correct.
router/pkg/pubsub/pubsub.go (2)
55-59: Good: defaulting to a no-op store ensures adapters don't need nil checks. This maintains a consistent contract across all providers and adapters.
114-120: Correct propagation of MessagingEventMetricStore through the builder pipeline. Forwarding the store into each provider via ProviderOpts is aligned with the new interfaces and keeps wiring centralized.
Also applies to: 140-142
router/pkg/pubsub/nats/adapter.go (4)
21-25: Good: explicit operation name constants. natsRequest/natsPublish/natsReceive reduce typos and improve observability consistency within the backend.
221-252: Publish metrics (error/success) look correct. Emits ProviderId, operation, system, destination, and toggles Error as expected. Mirrors the Redis/Kafka patterns.
270-289: Request metrics (error/success) look correct. Emits the same attribute set as Publish. The comment clarifying that writer errors aren't counted is reasonable.
356-371: Adapter construction: metrics store wiring aligns with ProviderOpts. Wires the store from providerOpts; consistent with the rest of the providers.
Actionable comments posted: 0
♻️ Duplicate comments (1)
router/pkg/metric/prom_messaging_event_metric_store.go (1)
50-52: Add a no-op Shutdown to align with lifecycle and interface expectations. Per repository conventions, meter providers are shut down centrally at the router level. This provider should expose a no-op Shutdown(ctx) error to conform to the expected interface and avoid subtle breakages.
Apply this diff:
```diff
 func (p *promMessagingEventMetrics) Flush(ctx context.Context) error {
 	return p.meterProvider.ForceFlush(ctx)
 }
+
+// Shutdown is intentionally a no-op; meter providers are shut down at the router level.
+func (p *promMessagingEventMetrics) Shutdown(ctx context.Context) error {
+	return nil
+}
```
🧹 Nitpick comments (3)
router-tests/testenv/testenv.go (1)
2827-2829: Confirm intent: using the Noop store in subgraph NATS adapters. Given the tests validate router-level event metrics, it's fine for demo subgraph adapters to use a no-op store. If you later want to assert adapter-level producer metrics from demo subgraphs, consider threading a real MessagingEventMetricStore here when event metrics are enabled in the test env.
Would you like me to propose a small extension to SubgraphOptions and test env config to optionally inject a real metrics store into these adapters for deeper end-to-end coverage?
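A rough shape of that extension, purely hypothetical: the field and helper below do not exist in this PR's testenv API, and only ProviderOpts and NewNoopEventMetricStore are confirmed names.

```go
// Hypothetical testenv extension.
type SubgraphOptions struct {
	// ...existing fields elided...

	// MessagingEventMetricStore, when set, replaces the no-op store in the
	// demo subgraph's NATS adapter.
	MessagingEventMetricStore metric.MessagingEventMetricStore
}

// providerOptsFor falls back to the null-object default used today.
func providerOptsFor(opts *SubgraphOptions) datasource.ProviderOpts {
	store := opts.MessagingEventMetricStore
	if store == nil {
		store = metric.NewNoopEventMetricStore() // keep the null-object default
	}
	return datasource.ProviderOpts{MessagingEventMetricStore: store}
}
```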
router-tests/prometheus_messaging_event_metrics_test.go (2)
26-27: Rename the test to reflect its scope and avoid "flaky" in the name. Naming a test "Flaky" sets the wrong expectation. Consider a neutral, descriptive name.
Apply this diff:
```diff
-func TestFlakyEventMetrics(t *testing.T) {
+func TestPrometheusMessagingEventMetrics(t *testing.T) {
```
29-466: Reduce duplication with small helper(s) for metric assertions. There is repeated code across the publish/subscribe variants for:
- Gathering metrics
- Finding a family
- Validating common labels like wg_messaging_error, wg_messaging_system, destination suffix, wg_provider_id
- Checking the counter value
Extracting a helper would improve readability and future maintainability.
I can sketch a helper like assertMessagingCounter(t, reg, family, expectedOp, system, destSuffix, provider, expectedVal) if helpful.
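A sketch of that helper, written against the prometheus client model and reusing the package's existing findMetricFamilyByName/findMetricLabelByName helpers (their exact signatures are assumed here, and the label names mirror the assertions in these tests):

```go
func assertMessagingCounter(
	t *testing.T,
	reg *prometheus.Registry,
	family, expectedOp, system, destSuffix, provider string,
	expectedVal float64,
) {
	t.Helper()

	mf, err := reg.Gather()
	require.NoError(t, err)

	fam := findMetricFamilyByName(mf, family)
	require.NotNil(t, fam, "metric family %s not found", family)

	metrics := fam.GetMetric()
	require.Len(t, metrics, 1)

	labels := metrics[0].GetLabel()
	require.Equal(t, expectedOp, findMetricLabelByName(labels, "wg_messaging_operation_name").GetValue())
	require.Equal(t, system, findMetricLabelByName(labels, "wg_messaging_system").GetValue())
	require.True(t, strings.HasSuffix(findMetricLabelByName(labels, "wg_messaging_destination_name").GetValue(), destSuffix))
	require.Equal(t, provider, findMetricLabelByName(labels, "wg_provider_id").GetValue())
	require.InDelta(t, expectedVal, metrics[0].GetCounter().GetValue(), 0)
}
```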
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (11)
- demo/pkg/subgraphs/subgraphs.go (2 hunks)
- router-tests/prometheus_messaging_event_metrics_test.go (1 hunks)
- router-tests/telemetry/messaging_event_metrics_test.go (1 hunks)
- router-tests/testenv/testenv.go (5 hunks)
- router/core/plan_generator.go (2 hunks)
- router/pkg/config/config.schema.json (2 hunks)
- router/pkg/metric/messaging_event_measurements.go (1 hunks)
- router/pkg/metric/messaging_event_metric_store.go (1 hunks)
- router/pkg/metric/oltp_messaging_event_metric_store.go (1 hunks)
- router/pkg/metric/prom_messaging_event_metric_store.go (1 hunks)
- router/pkg/otel/attributes.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- demo/pkg/subgraphs/subgraphs.go
- router/pkg/metric/messaging_event_measurements.go
- router-tests/telemetry/messaging_event_metrics_test.go
- router/pkg/metric/messaging_event_metric_store.go
- router/pkg/otel/attributes.go
- router/pkg/metric/oltp_messaging_event_metric_store.go
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
📚 Learning: 2025-08-14T16:47:03.713Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.713Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
- router/core/plan_generator.go
- router/pkg/metric/prom_messaging_event_metric_store.go
📚 Learning: 2025-08-14T17:45:57.468Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
- router/pkg/metric/prom_messaging_event_metric_store.go
- router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:46:00.905Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
- router/pkg/metric/prom_messaging_event_metric_store.go
- router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router-tests/testenv/testenv.go
🧬 Code Graph Analysis (2)
router/core/plan_generator.go (3)
- router/core/factoryresolver.go (1): RouterEngineConfiguration (210-216)
- router/pkg/metric/messaging_event_metric_store.go (1): MessagingEventMetricStore (41-47)
- router/pkg/metric/noop_event_metrics.go (1): NewNoopEventMetricStore (15-15)

router-tests/testenv/testenv.go (4)
- router/pkg/metric/messaging_event_metric_store.go (2): MessagingEventMetrics (50-54), MessagingEventMetricStore (41-47)
- router/pkg/pubsub/nats/adapter.go (1): NewAdapter (356-372)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (36-38)
- router/pkg/metric/noop_event_metrics.go (1): NewNoopEventMetricStore (15-15)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./events)
- GitHub Check: image_scan (nonroot)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: build_test
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (16)
router/pkg/config/config.schema.json (2)
1069-1073: Consistency fix looks good: unified key and description for messaging event metrics (OTLP). The property name and description now align with the code and the Prometheus schema. This resolves the previously noted mismatch.
1179-1183: Approve the unified messaging_event_metrics schema. All references to the old key have been removed and no stale occurrences were found in markdown docs or the router code. Ready to merge.
router/core/plan_generator.go (2)
10-11: Importing the metric package is appropriate for default no-op wiring. The import is used below to set a default store; no unused-import issues.
258-260: Good default: initialize MessagingEventMetricStore with a no-op. This adheres to the null-object pattern used across the codebase and avoids nil checks downstream.
router/pkg/metric/prom_messaging_event_metric_store.go (1)
42-48: No interface mismatch: promMessagingEventMetrics implements the provider interface. All Prom and OTLP metric types (promMessagingEventMetrics and otlpMessagingEventMetrics) define:
- Produce(ctx context.Context, opts ...otelmetric.AddOption)
- Consume(ctx context.Context, opts ...otelmetric.AddOption)
- Flush(ctx context.Context) error
These signatures exactly match the MessagingEventMetricProvider interface in messaging_event_metric_store.go (lines 34-39). The MessagingEventMetricStore interface (lines 41-44) is implemented by MessagingEventMetrics (which takes a MessagingEvent), so there is no breaking mismatch here. No changes required.
router-tests/testenv/testenv.go (4)
66-67: The new import is correct for ProviderOpts wiring. Needed for passing MessagingEventMetricStore into adapters.
273-277: MetricOptions extended with event metrics flags: LGTM. The additions are clear and symmetric across the OTLP/Prom backends.
1507-1512: Prometheus config now exposes MessagingEventMetrics: wiring looks correct. The flag is plumbed through to PrometheusConfig.MessagingEventMetrics and coexists with other exclusions. No ordering issues observed.
1530-1535: OTLP config receives the MessagingEventMetrics flag: LGTM. Consistent naming and placement alongside other toggles.
router-tests/prometheus_messaging_event_metrics_test.go (7)
32-75: Kafka publish flow assertions are precise and correct.
- Validates counter family, labels (operation/system/destination/provider), and value.
- Checks wg_messaging_error = "false".
Solid coverage for the success path.
78-153: Kafka subscribe flow is well-synchronized and asserts expected labels. Use of WaitForSubscriptionCount and channel-based Await keeps flakiness low. Label/value assertions are consistent with conventions.
159-204: NATS publish path mirrors the Kafka assertions correctly. The operation label "publish" and the destination naming check look good.
206-253: NATS request path is a nice addition; label/value checks look right. Covers an additional operation mode and validates the "request" label.
255-341: NATS subscribe flow is carefully coordinated; assertions are correct.
- The temporary read timeout tweak is a pragmatic stabilizer.
- Ensures provider “default” for the default connection.
All good.
347-390: Redis publish flow is consistent with other backends. Assertions on labels and the counter value are appropriate.
392-466: Redis subscribe flow follows the same robust pattern. Label checks and counter validation are consistent.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/pkg/pubsub/kafka/adapter.go (1)
56-71: Subscription cancel not respected; PollRecords should stop when ctx is canceled. topicPoller passes p.ctx to client.PollRecords, so if the subscription ctx is canceled while the adapter context remains active, PollRecords may block indefinitely until new records arrive or adapter shutdown, leaking goroutines and delaying unsubscribe.
Use a merged context that cancels when either p.ctx or ctx is done.
Apply this diff:
```diff
-func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, pollerOpts PollerOpts) error {
-	for {
-		select {
-		case <-p.ctx.Done(): // Close the poller if the application context was canceled
-			return p.ctx.Err()
-		case <-ctx.Done(): // Close the poller if the subscription context was canceled
-			return ctx.Err()
-		default:
-			// Try to fetch max records from any subscribed topics
-			fetches := client.PollRecords(p.ctx, 10_000)
+func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, pollerOpts PollerOpts) error {
+	// pollCtx cancels when either the adapter context or the subscription context is canceled.
+	pollCtx, cancel := context.WithCancel(p.ctx)
+	defer cancel()
+	go func() {
+		select {
+		case <-ctx.Done():
+			cancel()
+		case <-pollCtx.Done():
+			// adapter shutdown
+		}
+	}()
+	for {
+		select {
+		case <-pollCtx.Done():
+			return pollCtx.Err()
+		default:
+			// Try to fetch max records from any subscribed topics
+			fetches := client.PollRecords(pollCtx, 10_000)
@@
-			p.messagingEventMetricStore.Consume(p.ctx, metric.MessagingEvent{
+			p.messagingEventMetricStore.Consume(pollCtx, metric.MessagingEvent{
```

Also applies to: 98-109
🧹 Nitpick comments (8)
router-tests/telemetry/messaging_event_metrics_test.go (3)
27-27: Minor: consolidate WaitTimeout into a shared test util. WaitTimeout is duplicated across test files. Consider placing a single const in a shared testutils package (e.g., router-tests/testenv or router-tests/utils) and reusing it here to avoid drift; a minimal sketch follows.
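A minimal version of that consolidation; the file location and the 30-second value are placeholders, not taken from the existing tests:

```go
// router-tests/testenv/const.go (hypothetical placement)
package testenv

import "time"

// WaitTimeout bounds how long event-metric tests wait for async collection.
const WaitTimeout = 30 * time.Second
```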
65-76: Also assert attribute presence for core keys. For operation/system/destination, assert the "found" bool returned by Attributes.Value(...) before comparing strings, to surface clearer failures when attributes are absent (expanded into a compilable sketch after this item).
Example pattern:
- operation, ok := attrs.Value(otelattrs.WgMessagingOperationName); require.True(t, ok)
- system, ok := attrs.Value(otelattrs.WgMessagingSystem); require.True(t, ok)
- destination, ok := attrs.Value(otelattrs.WgMessagingDestinationName); require.True(t, ok)
This is a robustness improvement; apply similarly to each block.
Also applies to: 142-153, 201-213, 253-265, 338-350, 400-412, 476-488
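Expanded into compilable form, the pattern might read as follows. This is a sketch: `attrs` is assumed to be the `attribute.Set` of the collected datapoint, and the expected values are illustrative:

```go
operation, ok := attrs.Value(otelattrs.WgMessagingOperationName)
require.True(t, ok, "missing wg.messaging.operation.name attribute")
require.Equal(t, "produce", operation.AsString()) // expected value is illustrative

system, ok := attrs.Value(otelattrs.WgMessagingSystem)
require.True(t, ok, "missing wg.messaging.system attribute")
require.Equal(t, "kafka", system.AsString())

destination, ok := attrs.Value(otelattrs.WgMessagingDestinationName)
require.True(t, ok, "missing wg.messaging.destination.name attribute")
require.NotEmpty(t, destination.AsString())
```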
29-36: Parallelism layering may increase flakiness; consider dialing it back. You mark the top test and each nested subtest with t.Parallel(). Given shared external deps (brokers), heavy parallelism can increase nondeterminism. Consider parallelizing only the backend groups (kafka/nats/redis) or only the leaf tests.
Would you like me to generate a quick test plan to run these subtests serially per backend to compare stability?
router-tests/prometheus_messaging_event_metrics_test.go (2)
24-24: Minor: consolidate WaitTimeout into a shared test util. Same as the OTLP tests; consider centralizing WaitTimeout to avoid duplication.
159-172: Avoid blind assumptions on labels; assert presence first. Before reading label.GetValue(), consider asserting that the label exists (non-nil) for all core labels like wg_messaging_operation_name, wg_messaging_system, wg_messaging_destination_name, and wg_provider_id to produce clearer failures (one possible helper is sketched after this item).
Also applies to: 210-220, 259-268, 351-362, 396-405
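One way to make that explicit, sketched against the prometheus client model; the helper name is hypothetical:

```go
import (
	"testing"

	dto "github.com/prometheus/client_model/go"
	"github.com/stretchr/testify/require"
)

// requireLabelValue fails with a clear message instead of panicking on a
// nil label dereference.
func requireLabelValue(t *testing.T, labels []*dto.LabelPair, name string) string {
	t.Helper()
	for _, l := range labels {
		if l.GetName() == name {
			return l.GetValue()
		}
	}
	require.Failf(t, "label not found", "label %q missing from metric", name)
	return ""
}
```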
router/pkg/metric/messaging_event_metric_store.go (2)
50-54: Nit: the logger field is unused in the struct. logger is only needed during provider construction. Consider removing it from the struct to reduce state footprint, unless you plan to log within Produce/Consume later.
```diff
 type MessagingEventMetrics struct {
-	baseAttributes []attribute.KeyValue
-	logger         *zap.Logger
-	providers      []MessagingEventMetricProvider
+	baseAttributes []attribute.KeyValue
+	providers      []MessagingEventMetricProvider
 }

 func NewMessagingEventMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *metric.MeterProvider, metricsConfig *Config) (*MessagingEventMetrics, error) {
 @@
 	store := &MessagingEventMetrics{
 		baseAttributes: baseAttributes,
-		logger:         logger,
 		providers:      providers,
 	}
```
59-66: Optional: guard against nil providers if configs are misaligned. If metricsConfig enables a backend but its MeterProvider is nil due to upstream misconfiguration, newOtlpMessagingEventMetrics/newPromMessagingEventMetrics may panic. Consider a short guard with a helpful error (see the sketch after this item).
Also applies to: 67-74
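A short guard along those lines, as a sketch only: the enable-flag name and the error wording are assumptions, not code from this PR:

```go
if metricsConfig.OpenTelemetry.MessagingEventMetrics { // flag name assumed
	if otelProvider == nil {
		return nil, fmt.Errorf("messaging event metrics enabled for OTLP, but no OTLP meter provider is configured")
	}
	otlpMetrics, err := newOtlpMessagingEventMetrics(logger, otelProvider)
	if err != nil {
		return nil, fmt.Errorf("failed to create otlp event metrics: %w", err)
	}
	providers = append(providers, otlpMetrics)
}
```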
router/pkg/pubsub/kafka/adapter.go (1)
178-193: Minor: logging full message payloads may expose PII. log.Debug("publish", zap.ByteString("data", event.Data)) can leak message contents under debug logs. Consider logging the size or a hash instead, unless you're certain messages don't contain sensitive data.
```diff
-	log.Debug("publish", zap.ByteString("data", event.Data))
+	log.Debug("publish", zap.Int("payload_bytes", len(event.Data)))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
- router-tests/prometheus_messaging_event_metrics_test.go (1 hunks)
- router-tests/telemetry/messaging_event_metrics_test.go (1 hunks)
- router/pkg/metric/messaging_event_measurements.go (1 hunks)
- router/pkg/metric/messaging_event_metric_store.go (1 hunks)
- router/pkg/otel/attributes.go (2 hunks)
- router/pkg/pubsub/kafka/adapter.go (7 hunks)
- router/pkg/pubsub/nats/adapter.go (9 hunks)
- router/pkg/pubsub/redis/adapter.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- router/pkg/metric/messaging_event_measurements.go
- router/pkg/otel/attributes.go
- router/pkg/pubsub/nats/adapter.go
- router/pkg/pubsub/redis/adapter.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T16:47:03.713Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.713Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
router/pkg/pubsub/kafka/adapter.go
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/pkg/metric/messaging_event_metric_store.go
📚 Learning: 2025-08-14T17:45:57.468Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
router/pkg/metric/messaging_event_metric_store.go
📚 Learning: 2025-08-14T17:46:00.905Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
router/pkg/metric/messaging_event_metric_store.go
🧬 Code Graph Analysis (4)
router/pkg/pubsub/kafka/adapter.go (5)
- router/pkg/metric/messaging_event_metric_store.go (3): MessagingEventMetricStore (41-47), MessagingEvent (25-31), ProviderTypeKafka (19-19)
- router/pkg/pubsub/nats/adapter.go (1): ProviderAdapter (42-54)
- router/pkg/pubsub/redis/adapter.go (2): ProviderAdapter (56-65), NewProviderAdapter (33-54)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (36-38)
- router/pkg/metric/noop_event_metrics.go (1): NewNoopEventMetricStore (15-15)

router/pkg/metric/messaging_event_metric_store.go (3)
- router/pkg/config/config.go (2): Config (979-1053), Prometheus (97-111)
- router/pkg/metric/config.go (2): Config (107-131), OpenTelemetry (69-84)
- router/pkg/otel/attributes.go (5): WgMessagingOperationName (65-65), WgMessagingSystem (66-66), WgErrorType (70-70), WgProviderId (69-69), WgMessagingDestinationName (68-68)

router-tests/telemetry/messaging_event_metrics_test.go (6)
- router-tests/prometheus_messaging_event_metrics_test.go (2): TestFlakyEventMetrics (26-468), WaitTimeout (24-24)
- router-tests/testenv/testenv.go (4): Run (105-122), Config (284-340), MetricOptions (263-277), Environment (1727-1763)
- router-tests/events/event_helpers.go (3): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-81)
- router-tests/utils.go (2): GetMetricScopeByName (88-95), GetMetricByName (79-86)
- router/pkg/otel/attributes.go (5): WgMessagingOperationName (65-65), WgMessagingSystem (66-66), WgMessagingDestinationName (68-68), WgProviderId (69-69), WgErrorType (70-70)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)

router-tests/prometheus_messaging_event_metrics_test.go (5)
- router-tests/telemetry/messaging_event_metrics_test.go (1): TestFlakyEventMetrics (29-500)
- router-tests/testenv/testenv.go (6): Run (105-122), Config (284-340), ConfigWithEdfsKafkaJSONTemplate (85-85), MetricOptions (263-277), Environment (1727-1763), GraphQLRequest (1895-1903)
- router-tests/events/event_helpers.go (3): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-81)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)
- router/pkg/config/config.go (1): EngineExecutionConfiguration (370-398)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_test
- GitHub Check: build_test
- GitHub Check: Analyze (go)
🔇 Additional comments (6)
router-tests/prometheus_messaging_event_metrics_test.go (1)
51-75: Helper functions already present in the router-tests package. The findMetricFamilyByName and findMetricLabelByName helpers are defined in router-tests/prometheus_test.go and live in the same router-tests package as prometheus_messaging_event_metrics_test.go. No additional imports or definitions are needed here.
router/pkg/metric/messaging_event_metric_store.go (3)
56-81: LGTM: pluggable providers and construction flow look sound. Conditional wiring of OTLP/Prom providers is clean and errors are surfaced early. The null-object pattern downstream keeps call sites simple.
83-87: Attribute construction matches the tests and keeps cardinality under control. Only sets provider, destination, and error.type when present, and centralizes base attributes via withAttrs. Nice.
Also applies to: 88-107, 109-129
131-142: Good: explicit Flush/Shutdown with error aggregation. Using errors.Join preserves all provider errors and avoids partial masking. Consistent with the central lifecycle design (the essential shape is sketched below).
Also applies to: 144-153
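Reduced to its essentials, the aggregation shape being praised looks like this; the anonymous interface literal is illustrative:

```go
// flushAll reports every provider failure instead of stopping at the first.
func flushAll(ctx context.Context, providers []interface{ Flush(context.Context) error }) error {
	var err error
	for _, p := range providers {
		if ferr := p.Flush(ctx); ferr != nil {
			err = errors.Join(err, ferr) // Join skips nil operands
		}
	}
	return err // nil when every Flush succeeded
}
```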
router/pkg/pubsub/kafka/adapter.go (2)
174-215: LGTM: the publish path emits success/error metrics with the expected attributes. On error you emit ErrorType "publish_error"; on success you omit it. OperationName is "produce" for Kafka, which matches the tests.
257-277: LGTM: the no-op metric store fallback keeps adapter code simple. Choosing a noop store when provider opts are absent avoids nil checks at call sites and aligns with the project's pattern.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
router/pkg/pubsub/pubsub_test.go (1)
66-66: Fix the mock signature: add the providerOpts argument. BuildProvider now accepts providerOpts; this expectation will not match and will fail.
```diff
-mockBuilder.On("BuildProvider", natsEventSources[0]).Return(mockPubSubProvider, nil)
+mockBuilder.On("BuildProvider", natsEventSources[0], mock.Anything).Return(mockPubSubProvider, nil)
```

router/pkg/pubsub/kafka/adapter.go (1)
56-113: Fix poller cancellation and close the read client to avoid resource leaks
- PollRecords currently uses p.ctx. If the subscription ctx (ctx) is canceled while p.ctx remains active, the poller can block longer than necessary.
- The per-subscription client created in Subscribe is never closed, leaking connections.
Use the subscription ctx in PollRecords and ensure client.Close() is called when the poller exits.
Apply:
```diff
-func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, pollerOpts PollerOpts) error {
+func (p *ProviderAdapter) topicPoller(ctx context.Context, client *kgo.Client, updater resolve.SubscriptionUpdater, pollerOpts PollerOpts) error {
+	// ensure per-subscription client is closed
+	defer client.Close()
 	for {
@@
-			// Try to fetch max records from any subscribed topics
-			fetches := client.PollRecords(p.ctx, 10_000)
+			// Try to fetch max records from any subscribed topics
+			fetches := client.PollRecords(ctx, 10_000)
@@
-			p.streamMetricStore.Consume(p.ctx, metric.StreamsEvent{
+			p.streamMetricStore.Consume(ctx, metric.StreamsEvent{
 				ProviderId:          pollerOpts.providerId,
 				StreamOperationName: kafkaReceive,
 				ProviderType:        metric.ProviderTypeKafka,
 				DestinationName:     r.Topic,
 			})
```
♻️ Duplicate comments (1)
router/pkg/otel/attributes.go (1)
65-70: Use correct Go acronym casing: WgProviderID (not WgProviderId). Matches existing conventions like WgFederatedGraphID and prior guidance.
Apply this diff:
```diff
-	WgProviderId = attribute.Key("wg.provider.id")
+	WgProviderID = attribute.Key("wg.provider.id")
```

Note: update all call sites accordingly.
🧹 Nitpick comments (21)
router/pkg/metric/stream_measurements.go (1)
15-21: Tighten the descriptions and use consistent terminology ("received" vs "consumed"). The descriptions read awkwardly and mix terms with the metric name. Prefer succinct, consistent phrasing.
Apply this diff:
```diff
-messagingSentMessagesOptions = []otelmetric.Int64CounterOption{
-	otelmetric.WithDescription("Number of stream sent messages"),
-}
+messagingSentMessagesOptions = []otelmetric.Int64CounterOption{
+	otelmetric.WithDescription("Number of messages sent"),
+}
…
-messagingConsumedMessagesOptions = []otelmetric.Int64CounterOption{
-	otelmetric.WithDescription("Number of stream consumed messages"),
-}
+messagingConsumedMessagesOptions = []otelmetric.Int64CounterOption{
+	otelmetric.WithDescription("Number of messages received"),
+}
```
763-767: Nit: redundant nil check for streamMetricStore. gm.streamMetricStore is initialized with a Noop in buildGraphMux, so the nil guard isn't needed. Safe to keep, but it adds noise.
router/pkg/metric/prom_stream_metric_store.go (1)
11-14: Optional: unify meter naming across providers. OTLP uses "cosmo.router.streams", Prom uses "cosmo.router.streams.prometheus". Consider a consistent scheme (either suffix the provider for both or for neither) to make source filtering predictable in pipelines.
router/pkg/metric/noop_stream_metrics.go (1)
7-15: Add a compile-time interface assertion and brief doc comments. A small safeguard and docs improve maintainability and catch drift if the interface changes.
```diff
 type NoopStreamMetricStore struct{}

+// Ensure NoopStreamMetricStore implements StreamMetricStore at compile time.
+var _ StreamMetricStore = (*NoopStreamMetricStore)(nil)
+
-func (n *NoopStreamMetricStore) Produce(ctx context.Context, event StreamsEvent) {}
+// Produce is a no-op.
+func (n *NoopStreamMetricStore) Produce(ctx context.Context, event StreamsEvent) {}
-func (n *NoopStreamMetricStore) Consume(ctx context.Context, event StreamsEvent) {}
+// Consume is a no-op.
+func (n *NoopStreamMetricStore) Consume(ctx context.Context, event StreamsEvent) {}
-func (n *NoopStreamMetricStore) Flush(ctx context.Context) error    { return nil }
-func (n *NoopStreamMetricStore) Shutdown(ctx context.Context) error { return nil }
+// Flush is a no-op.
+func (n *NoopStreamMetricStore) Flush(ctx context.Context) error { return nil }
+// Shutdown is a no-op.
+func (n *NoopStreamMetricStore) Shutdown(ctx context.Context) error { return nil }
-func NewNoopStreamMetricStore() *NoopStreamMetricStore { return &NoopStreamMetricStore{} }
+// NewNoopStreamMetricStore returns a StreamMetricStore that discards all metrics.
+func NewNoopStreamMetricStore() *NoopStreamMetricStore { return &NoopStreamMetricStore{} }
```
296-296: Prefer a Noop StreamMetricStore over nil to follow the null-object pattern. Tests currently pass nil for streamMetricStore in BuildProvidersAndDataSources. Aligning with the repo's null-object pattern (using NewNoopStreamMetricStore) avoids nil handling across the stack and reduces surprises if defaults change.
```diff
-}, nil, zap.NewNop(), dsConfs, "host", "addr")
+}, rmetric.NewNoopStreamMetricStore(), zap.NewNop(), dsConfs, "host", "addr")
```

Apply the same change at Lines 296, 349, and 402.
Also applies to: 349-349, 402-402
router-tests/prometheus_stream_metrics_test.go (2)
54-56: Guard against missing metric families before dereferencing. If findMetricFamilyByName returns nil, dereferencing will panic. Add an explicit NotNil assertion for clearer failures (the helper's likely contract is sketched after this item).
```diff
-	family := findMetricFamilyByName(mf, "streams_sent_messages_total")
+	family := findMetricFamilyByName(mf, "streams_sent_messages_total")
+	require.NotNil(t, family, "expected metric family streams_sent_messages_total to be present")

-	family := findMetricFamilyByName(mf, "streams_received_messages_total")
+	family := findMetricFamilyByName(mf, "streams_received_messages_total")
+	require.NotNil(t, family, "expected metric family streams_received_messages_total to be present")
```

Apply in each location indicated by the selected ranges.
Also applies to: 127-130, 183-186, 232-235, 312-314, 369-372, 441-444
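For reference, the likely contract of that lookup helper; returning nil when the family is absent is precisely why the NotNil assertion matters. This is a sketch only, since the real implementation lives in prometheus_test.go:

```go
import dto "github.com/prometheus/client_model/go"

func findMetricFamilyByName(mf []*dto.MetricFamily, name string) *dto.MetricFamily {
	for _, m := range mf {
		if m.GetName() == name {
			return m // found
		}
	}
	return nil // caller must check before dereferencing
}
```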
35-37: Optional: drop MetricReader to keep the tests Prometheus-only and reduce overhead. Providing a ManualReader enables OTLP as well; it's not needed when only Prometheus metrics are asserted. Dropping it simplifies setup and may speed up tests.
```diff
-	metricReader := metric.NewManualReader()
 	promRegistry := prometheus.NewRegistry()
...
-	MetricReader:       metricReader,
+	// MetricReader omitted: Prometheus-only test
 	PrometheusRegistry: promRegistry,
```

Also applies to: 81-83, 162-164, 210-212, 259-261, 351-353, 396-398
router/core/factoryresolver.go (2)
216-217: Expose StreamMetricStore on RouterEngineConfiguration. This cleanly threads the store through the engine. Consider defaulting to a no-op store where the struct is constructed, to reduce the nil surface area.
473-481: Default to NoopStreamMetricStore when none is provided. This keeps with the null-object pattern used elsewhere and prevents downstream nil checks.
```diff
-	factoryProviders, factoryDataSources, err := pubsub.BuildProvidersAndDataSources(
+	store := routerEngineConfig.StreamMetricStore
+	if store == nil {
+		store = rmetric.NewNoopStreamMetricStore()
+	}
+	factoryProviders, factoryDataSources, err := pubsub.BuildProvidersAndDataSources(
 		l.ctx,
 		routerEngineConfig.Events,
-		routerEngineConfig.StreamMetricStore,
+		store,
 		l.logger,
 		pubSubDS,
 		l.resolver.InstanceData().HostName,
 		l.resolver.InstanceData().ListenAddress,
 	)
```
125-129: Don't terminate the subscription loop on nil messages. Returning on msg == nil ends the goroutine; it is better to continue and wait for the next message. Channel closure is already handled via !ok.
Apply:
```diff
-	if msg == nil {
-		log.Debug("empty message received on subscription update, skipping")
-		return
-	}
+	if msg == nil {
+		log.Debug("empty message received on subscription update, skipping")
+		continue
+	}
```
29-31: Consider renaming the test to avoid "Flaky" in the name. The test appears deterministic; the name might confuse readers or tooling that tracks flaky tests.
Apply:
```diff
-func TestFlakyEventMetrics(t *testing.T) {
+func TestEventMetrics(t *testing.T) {
```
170-219: NATS tests cover the publish, request, and subscribe paths thoroughly. Great coverage across all operations, labels, and counts. Consider a follow-up negative-path test for request/publish errors to assert that wg.error.type is present (a sketch follows below).
If helpful, I can add a failing publish/request (e.g., to a non-existent subject or by injecting a client error) and extend assertions for WgErrorType.
Also applies to: 221-271, 273-366
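Taking up that offer, the assertion tail of such a negative-path test could look roughly like this. It is a sketch under several assumptions: the failure is triggered beforehand by some injected client error, the OTLP metric is named streams.sent.messages under the cosmo.router.streams scope, and GetMetricScopeByName/GetMetricByName are the existing router-tests helpers whose exact signatures are assumed:

```go
// after forcing a failing publish (mechanism elided):
var rm metricdata.ResourceMetrics
require.NoError(t, metricReader.Collect(context.Background(), &rm))

scope := GetMetricScopeByName(rm.ScopeMetrics, "cosmo.router.streams") // helper from router-tests/utils.go
require.NotNil(t, scope)

sent := GetMetricByName(scope, "streams.sent.messages") // metric name assumed
require.NotNil(t, sent)

sum, ok := sent.Data.(metricdata.Sum[int64])
require.True(t, ok)

errType, ok := sum.DataPoints[0].Attributes.Value(otel.WgErrorType)
require.True(t, ok, "expected wg.error.type on the failed publish datapoint")
require.NotEmpty(t, errType.AsString()) // e.g. "publish_error"
```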
router/pkg/pubsub/nats/adapter.go (2)
144-149: Prefer the subscription context when emitting consume metrics (durable path). Using ctx ties metric lifetimes to the subscription, mirroring shutdown behavior and avoiding emissions after cancellation.
Apply:
```diff
-	p.streamMetricStore.Consume(p.ctx, metric.StreamsEvent{
+	p.streamMetricStore.Consume(ctx, metric.StreamsEvent{
 		ProviderId:          event.ProviderID,
 		StreamOperationName: natsReceive,
 		ProviderType:        metric.ProviderTypeNats,
 		DestinationName:     msg.Subject(),
 	})
```
187-193: Same as above for the channel subscription path: use ctx over p.ctx. Keeps emission scoped to the subscription lifecycle.
Apply:
```diff
-	p.streamMetricStore.Consume(p.ctx, metric.StreamsEvent{
+	p.streamMetricStore.Consume(ctx, metric.StreamsEvent{
 		ProviderId:          event.ProviderID,
 		StreamOperationName: natsReceive,
 		ProviderType:        metric.ProviderTypeNats,
 		DestinationName:     msg.Subject,
 	})
```
126-134: Minor: use a set for provider ID membership checks. The current code does O(n^2) slices.Contains scans. Replace with maps for clarity and linear time; negligible at small sizes, but cleaner.
Apply:
```diff
-	usedProviderIds := []string{}
-	for _, dsConf := range dsConfs {
-		for _, event := range dsConf.events {
-			if !slices.Contains(usedProviderIds, event.GetEngineEventConfiguration().GetProviderId()) {
-				usedProviderIds = append(usedProviderIds, event.GetEngineEventConfiguration().GetProviderId())
-			}
-		}
-	}
+	usedProviderIds := make(map[string]struct{})
+	for _, dsConf := range dsConfs {
+		for _, event := range dsConf.events {
+			usedProviderIds[event.GetEngineEventConfiguration().GetProviderId()] = struct{}{}
+		}
+	}
@@
-	providerIds := []string{}
+	providerIds := make(map[string]struct{})
 	for _, providerData := range providersData {
-		if !slices.Contains(usedProviderIds, providerData.GetID()) {
+		if _, ok := usedProviderIds[providerData.GetID()]; !ok {
 			continue
 		}
 		provider, err := builder.BuildProvider(providerData, pubsub_datasource.ProviderOpts{
 			StreamMetricStore: store,
 		})
 		if err != nil {
 			return nil, nil, err
 		}
 		pubSubProviders = append(pubSubProviders, provider)
-		providerIds = append(providerIds, provider.ID())
+		providerIds[provider.ID()] = struct{}{}
 	}
@@
-	for _, providerId := range usedProviderIds {
-		if !slices.Contains(providerIds, providerId) {
+	for providerId := range usedProviderIds {
+		if _, ok := providerIds[providerId]; !ok {
 			return pubSubProviders, nil, &ProviderNotDefinedError{
 				ProviderID:     providerId,
 				ProviderTypeID: builder.TypeID(),
 			}
 		}
 	}
```

Also applies to: 136-149, 151-159
router/pkg/metric/stream_metric_store.go (6)
10-11: Alias sdk/metric as sdkmetric to avoid confusion with the local package metric. The codebase elsewhere (e.g., config.go) aliases OpenTelemetry's sdk/metric as sdkmetric. Using metric here is confusing since this file is in package metric. Aligning the alias improves readability and consistency.
Apply:
- "go.opentelemetry.io/otel/sdk/metric" + sdkmetric "go.opentelemetry.io/otel/sdk/metric" @@ -func NewStreamMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *metric.MeterProvider, metricsConfig *Config) (*StreamMetrics, error) { +func NewStreamMetricStore(logger *zap.Logger, baseAttributes []attribute.KeyValue, otelProvider, promProvider *sdkmetric.MeterProvider, metricsConfig *Config) (*StreamMetrics, error) {Also applies to: 56-56
83-86: DRY up attribute construction and preallocate to reduce allocations. Produce and Consume duplicate the attribute building. Factor it into a helper; also preallocate the combined slice to avoid intermediate allocations in hot paths.
ProduceandConsumeduplicate attribute building. Factor it into a helper; also preallocate the combined slice to avoid intermediate allocations in hot paths.func (e *StreamMetrics) withAttrs(attrs ...attribute.KeyValue) otelmetric.AddOption { - copied := append([]attribute.KeyValue{}, e.baseAttributes...) - return otelmetric.WithAttributes(append(copied, attrs...)...) + copied := make([]attribute.KeyValue, 0, len(e.baseAttributes)+len(attrs)) + copied = append(copied, e.baseAttributes...) + copied = append(copied, attrs...) + return otelmetric.WithAttributes(copied...) } +func (e *StreamMetrics) addOptionFor(event StreamsEvent) otelmetric.AddOption { + attrs := []attribute.KeyValue{ + otel.WgStreamOperationName.String(event.StreamOperationName), + otel.WgProviderType.String(string(event.ProviderType)), + } + if event.ErrorType != "" { + attrs = append(attrs, otel.WgErrorType.String(event.ErrorType)) + } + if event.ProviderId != "" { + attrs = append(attrs, otel.WgProviderId.String(event.ProviderId)) + } + if event.DestinationName != "" { + attrs = append(attrs, otel.WgDestinationName.String(event.DestinationName)) + } + return e.withAttrs(attrs...) +} + func (e *StreamMetrics) Produce(ctx context.Context, event StreamsEvent) { - attrs := []attribute.KeyValue{ - otel.WgStreamOperationName.String(event.StreamOperationName), - otel.WgProviderType.String(string(event.ProviderType)), - } - if event.ErrorType != "" { - attrs = append(attrs, otel.WgErrorType.String(event.ErrorType)) - } - if event.ProviderId != "" { - attrs = append(attrs, otel.WgProviderId.String(event.ProviderId)) - } - if event.DestinationName != "" { - attrs = append(attrs, otel.WgDestinationName.String(event.DestinationName)) - } - opt := e.withAttrs(attrs...) + opt := e.addOptionFor(event) for _, provider := range e.providers { provider.Produce(ctx, opt) } } func (e *StreamMetrics) Consume(ctx context.Context, event StreamsEvent) { - attrs := []attribute.KeyValue{ - otel.WgStreamOperationName.String(event.StreamOperationName), - otel.WgProviderType.String(string(event.ProviderType)), - } - if event.ErrorType != "" { - attrs = append(attrs, otel.WgErrorType.String(event.ErrorType)) - } - if event.ProviderId != "" { - attrs = append(attrs, otel.WgProviderId.String(event.ProviderId)) - } - if event.DestinationName != "" { - attrs = append(attrs, otel.WgDestinationName.String(event.DestinationName)) - } - - opt := e.withAttrs(attrs...) + opt := e.addOptionFor(event) for _, provider := range e.providers { provider.Consume(ctx, opt) } }Also applies to: 88-107, 109-129
73-79: Log when no providers are configured, to aid debugging. If both OTLP and Prometheus streams are disabled, the store becomes a no-op. A debug log helps explain why metrics aren't observed.
```diff
 providers := make([]StreamMetricProvider, 0)
@@
 	if metricsConfig.Prometheus.Streams {
 		promMetrics, err := newPromStreamEventMetrics(logger, promProvider)
 		if err != nil {
 			return nil, fmt.Errorf("failed to create prometheus stream event metrics: %w", err)
 		}
 		providers = append(providers, promMetrics)
 	}
+	if len(providers) == 0 {
+		logger.Debug("no stream metrics providers configured; StreamMetrics will be a no-op")
+	}
+
 	store := &StreamMetrics{
 		baseAttributes: baseAttributes,
 		logger:         logger,
 		providers:      providers,
 	}
```
135-139: Neutral variable naming in the flush loop. Use a provider-agnostic name instead of errOtlp in a loop over all providers.

```diff
-	for _, provider := range e.providers {
-		if errOtlp := provider.Flush(ctx); errOtlp != nil {
-			err = errors.Join(err, fmt.Errorf("failed to flush metrics: %w", errOtlp))
-		}
-	}
+	for _, provider := range e.providers {
+		if flushErr := provider.Flush(ctx); flushErr != nil {
+			err = errors.Join(err, fmt.Errorf("failed to flush metrics: %w", flushErr))
+		}
+	}
```
144-146: Clarify the Shutdown docstring to match behavior and project conventions. Per repo conventions (the router manages the meter provider lifecycle), this store's Shutdown only flushes. Update the comment to avoid implying it stops observers.
```diff
-// Shutdown flushes the metrics and stops observers if any.
+// Shutdown flushes the metrics. Provider lifecycle (including stopping observers, if any)
+// is managed at the router level.
```
26-31: Polish the field comments for grammar and clarity. Minor wording tweaks.
```diff
 type StreamsEvent struct {
-	ProviderId          string       // The id of the provider defined in the configuration
-	StreamOperationName string       // The stream operation name that is specific to the messaging system
-	ProviderType        ProviderType // The messaging system type that are supported
-	ErrorType           string       // Optional error type, e.g., "publish_error" or "receive_error". If empty, the attribute is not set
-	DestinationName     string       // The name of the destination queue / topic / channel
+	ProviderId          string       // The provider id defined in the configuration.
+	StreamOperationName string       // The operation name specific to the messaging system.
+	ProviderType        ProviderType // The messaging system type being used.
+	ErrorType           string       // Optional error type (e.g., "publish_error", "receive_error"). If empty, the attribute is not set.
+	DestinationName     string       // The destination name (queue/topic/channel).
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (25)
- demo/pkg/subgraphs/subgraphs.go (2 hunks)
- router-tests/prometheus_stream_metrics_test.go (1 hunks)
- router-tests/telemetry/stream_metrics_test.go (1 hunks)
- router-tests/testenv/testenv.go (5 hunks)
- router/core/factoryresolver.go (3 hunks)
- router/core/graph_server.go (5 hunks)
- router/core/plan_generator.go (3 hunks)
- router/core/router.go (2 hunks)
- router/pkg/config/config.go (2 hunks)
- router/pkg/config/config.schema.json (2 hunks)
- router/pkg/config/testdata/config_defaults.json (2 hunks)
- router/pkg/config/testdata/config_full.json (2 hunks)
- router/pkg/metric/config.go (2 hunks)
- router/pkg/metric/noop_stream_metrics.go (1 hunks)
- router/pkg/metric/oltp_stream_metric_store.go (1 hunks)
- router/pkg/metric/prom_stream_metric_store.go (1 hunks)
- router/pkg/metric/stream_measurements.go (1 hunks)
- router/pkg/metric/stream_metric_store.go (1 hunks)
- router/pkg/otel/attributes.go (2 hunks)
- router/pkg/pubsub/datasource/provider.go (2 hunks)
- router/pkg/pubsub/kafka/adapter.go (7 hunks)
- router/pkg/pubsub/nats/adapter.go (9 hunks)
- router/pkg/pubsub/pubsub.go (7 hunks)
- router/pkg/pubsub/pubsub_test.go (8 hunks)
- router/pkg/pubsub/redis/adapter.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- router/pkg/pubsub/datasource/provider.go
- router/pkg/config/testdata/config_full.json
- router/pkg/metric/config.go
- router/pkg/config/testdata/config_defaults.json
- router/pkg/config/config.go
- router/core/router.go
- demo/pkg/subgraphs/subgraphs.go
- router/core/plan_generator.go
- router/pkg/config/config.schema.json
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-14T16:47:03.713Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.713Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
- router/pkg/metric/noop_stream_metrics.go
- router/pkg/metric/oltp_stream_metric_store.go
- router/pkg/metric/prom_stream_metric_store.go
- router/pkg/pubsub/kafka/adapter.go
📚 Learning: 2025-08-14T17:46:00.905Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.905Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
- router/pkg/metric/noop_stream_metrics.go
- router/pkg/metric/stream_metric_store.go
- router/pkg/metric/oltp_stream_metric_store.go
- router/pkg/metric/prom_stream_metric_store.go
- router-tests/testenv/testenv.go
- router/core/graph_server.go
📚 Learning: 2025-08-14T17:45:57.468Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.468Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
- router/pkg/metric/stream_metric_store.go
- router/pkg/metric/oltp_stream_metric_store.go
- router/pkg/metric/prom_stream_metric_store.go
- router-tests/testenv/testenv.go
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
- router/pkg/metric/stream_metric_store.go
- router/pkg/metric/oltp_stream_metric_store.go
- router-tests/testenv/testenv.go
🧬 Code Graph Analysis (12)
router/pkg/metric/noop_stream_metrics.go (1)
- router/pkg/metric/stream_metric_store.go (1): StreamsEvent (25-31)

router-tests/prometheus_stream_metrics_test.go (4)
- router-tests/telemetry/stream_metrics_test.go (1): TestFlakyEventMetrics (29-500)
- router-tests/testenv/testenv.go (4): Run (105-122), Config (284-340), MetricOptions (263-277), Environment (1727-1763)
- router-tests/events/event_helpers.go (3): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-81)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)

router/pkg/metric/stream_metric_store.go (2)
- router/pkg/metric/config.go (2): Config (107-131), OpenTelemetry (69-84)
- router/pkg/otel/attributes.go (5): WgStreamOperationName (65-65), WgProviderType (66-66), WgErrorType (69-69), WgProviderId (68-68), WgDestinationName (67-67)

router/core/factoryresolver.go (1)
- router/pkg/metric/stream_metric_store.go (1): StreamMetricStore (41-47)

router-tests/telemetry/stream_metrics_test.go (5)
- router-tests/prometheus_stream_metrics_test.go (1): TestFlakyEventMetrics (26-468)
- router-tests/testenv/testenv.go (4): Run (105-122), Config (284-340), MetricOptions (263-277), Environment (1727-1763)
- router-tests/events/event_helpers.go (3): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-81)
- router-tests/utils.go (2): GetMetricScopeByName (88-95), GetMetricByName (79-86)
- router/pkg/otel/attributes.go (5): WgStreamOperationName (65-65), WgProviderType (66-66), WgDestinationName (67-67), WgProviderId (68-68), WgErrorType (69-69)

router/pkg/pubsub/redis/adapter.go (5)
- router/pkg/pubsub/kafka/adapter.go (3): NewProviderAdapter (257-278), Adapter (30-35), ProviderAdapter (42-50)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (37-39)
- router/pkg/metric/stream_metric_store.go (4): StreamMetricStore (41-47), StreamsEvent (25-31), ProviderType (16-16), ProviderTypeRedis (21-21)
- router/pkg/metric/noop_stream_metrics.go (1): NewNoopStreamMetricStore (15-15)
- router/internal/persistedoperation/operationstorage/redis/rdcloser.go (1): RDCloser (14-16)

router/pkg/pubsub/kafka/adapter.go (5)
- router/pkg/metric/stream_metric_store.go (4): StreamMetricStore (41-47), StreamsEvent (25-31), ProviderType (16-16), ProviderTypeKafka (19-19)
- router/pkg/pubsub/nats/adapter.go (1): ProviderAdapter (42-54)
- router/pkg/pubsub/redis/adapter.go (2): ProviderAdapter (56-65), NewProviderAdapter (33-54)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (37-39)
- router/pkg/metric/noop_stream_metrics.go (1): NewNoopStreamMetricStore (15-15)

router/pkg/pubsub/pubsub.go (4)
- router/pkg/config/config.go (1): EventsConfiguration (627-629)
- router/pkg/metric/stream_metric_store.go (1): StreamMetricStore (41-47)
- router/pkg/pubsub/datasource/provider.go (3): Provider (19-25), ProviderBuilder (28-35), ProviderOpts (37-39)
- router/pkg/metric/noop_stream_metrics.go (1): NewNoopStreamMetricStore (15-15)

router-tests/testenv/testenv.go (4)
- router/pkg/pubsub/nats/adapter.go (1): NewAdapter (356-379)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (37-39)
- router/pkg/metric/stream_metric_store.go (1): StreamMetricStore (41-47)
- router/pkg/metric/noop_stream_metrics.go (1): NewNoopStreamMetricStore (15-15)

router/pkg/pubsub/pubsub_test.go (2)
- router/pkg/metric/noop_stream_metrics.go (1): NewNoopStreamMetricStore (15-15)
- demo/pkg/subgraphs/subgraphs.go (1): New (207-280)

router/pkg/pubsub/nats/adapter.go (4)
- router/pkg/metric/stream_metric_store.go (4): StreamMetricStore (41-47), StreamsEvent (25-31), ProviderType (16-16), ProviderTypeNats (20-20)
- router/pkg/pubsub/nats/engine_datasource.go (1): PublishAndRequestEventConfiguration (28-32)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (37-39)
- router/pkg/metric/noop_stream_metrics.go (1): NewNoopStreamMetricStore (15-15)

router/core/graph_server.go (5)
- router/pkg/metric/stream_metric_store.go (2): StreamMetricStore (41-47), NewStreamMetricStore (56-81)
- router/pkg/metric/noop_metrics.go (1): NewNoopMetrics (63-65)
- router/pkg/metric/noop_stream_metrics.go (1): NewNoopStreamMetricStore (15-15)
- router/pkg/metric/config.go (1): OpenTelemetry (69-84)
- router/pkg/config/config.go (1): Prometheus (97-111)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: integration_test (./events)
- GitHub Check: image_scan (nonroot)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: build_test
- GitHub Check: image_scan
- GitHub Check: build-router
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (34)
router/pkg/metric/oltp_stream_metric_store.go (1)
50-52: No-op Shutdown on StreamMetricProvider isn’t required
The StreamMetricProvider interface only defines Produce, Consume, and Flush; it does not include Shutdown. As a result, adding a no-op Shutdown method to otlpStreamEventMetrics (or promStreamEventMetrics) would neither satisfy any interface nor be called by the store. The StreamMetrics store itself implements Shutdown(ctx) by flushing all providers, which aligns with the central shutdown of meter providers at the router level.

Likely an incorrect or invalid review comment.
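For orientation, a minimal sketch of the provider/store split this comment describes, using the method names it lists. The signatures and the embedding of the provider interface in the store are assumptions, not the verbatim source:

```go
package metric

import "context"

// StreamMetricProvider matches the method set named in the comment above.
type StreamMetricProvider interface {
	Produce(ctx context.Context, event StreamsEvent)
	Consume(ctx context.Context, event StreamsEvent)
	Flush(ctx context.Context) error
}

// StreamMetricStore adds Shutdown, which flushes all providers; the meter
// providers themselves are shut down centrally by the router.
type StreamMetricStore interface {
	StreamMetricProvider // assumed embedding; the real store may declare methods directly
	Shutdown(ctx context.Context) error
}
```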
router/pkg/otel/attributes.go (1)
63-70: Attribute keys diverge from PR's documented labels; confirm final naming
PR text mentions:
- wg.messaging.operation.name
- wg.messaging.system
- wg.messaging.destination.name
- wg.provider.id
- wg.error (boolean)
Current additions are:
- wg.stream.operation.name
- wg.provider.type
- wg.destination.name
- wg.provider.id
- wg.error.type
This inconsistency will confuse users and break documented examples. Decide on one scheme (prefer a single “stream” or “messaging” prefix, and boolean wg.error vs error.type).
Would you like a follow-up diff to align keys and the call sites across the metrics store and adapters once you confirm the chosen naming?
router/core/graph_server.go (3)
783-786: Good defaulting to Noop stores
Initializing both metricStore and streamMetricStore to no-op avoids scattered nil checks and simplifies shutdown paths.
883-895: Stream store wiring looks correct; verify config gating matches intent
Creation is gated on OpenTelemetry.Streams || Prometheus.Streams, independent of the top-level metricsEnabled. If intended to allow enabling streams independently, this is good. Otherwise, consider also gating on metricsEnabled.
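A minimal sketch of the gating decision described here, with an invented helper name and an injected constructor; only the boolean condition mirrors the comment:

```go
// selectStreamMetricStore constructs a real store only when at least one
// exporter has streams enabled; otherwise it falls back to the noop store.
// The constructor is injected so the sketch stays independent of the real
// NewStreamMetricStore signature.
func selectStreamMetricStore(otelStreams, promStreams bool, construct func() (StreamMetricStore, error)) StreamMetricStore {
	if !otelStreams && !promStreams {
		return NewNoopStreamMetricStore()
	}
	store, err := construct()
	if err != nil {
		// Metric setup failures should not block core router functionality.
		return NewNoopStreamMetricStore()
	}
	return store
}
```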
1135-1140: StreamMetricStore set in all RouterEngineConfiguration literals
Verified that every literal includes the new field:
- router/core/plan_generator.go (line 258): StreamMetricStore: metric.NewNoopStreamMetricStore()
- router/core/graph_server.go (lines 1135-1140): StreamMetricStore: gm.streamMetricStore

No missing instances found.
router/pkg/metric/noop_stream_metrics.go (1)
7-15: Solid null-object implementation; fits the codebase pattern
NoopStreamMetricStore cleanly satisfies the StreamMetricStore contract with no side effects. Matches the router-wide approach of central provider lifecycle management and no-op Shutdown/Flush.
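A compact illustration of that null-object shape, assuming the StreamsEvent type from the diff above and the interface sketch earlier; the real file may differ in details:

```go
package metric

import "context"

// NoopStreamMetricStore: every method is a harmless no-op, so call sites
// never need nil checks when stream metrics are disabled.
type NoopStreamMetricStore struct{}

func NewNoopStreamMetricStore() *NoopStreamMetricStore { return &NoopStreamMetricStore{} }

func (*NoopStreamMetricStore) Produce(ctx context.Context, event StreamsEvent) {}
func (*NoopStreamMetricStore) Consume(ctx context.Context, event StreamsEvent) {}
func (*NoopStreamMetricStore) Flush(ctx context.Context) error                 { return nil }
func (*NoopStreamMetricStore) Shutdown(ctx context.Context) error              { return nil }
```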
router/pkg/pubsub/pubsub_test.go (3)
70-71: Correct: pass StreamMetricStore into build()
Good call to thread rmetric.NewNoopStreamMetricStore() into build for deterministic tests.
123-124: Correct: error-path mock updated to include providerOpts
The expectation now matches the updated BuildProvider signature.
241-243: Correct: OK-path mock updated to include providerOpts
Expectation and line break formatting look good.
router-tests/testenv/testenv.go (5)
66-68: Correct imports for ProviderOpts and metrics
New imports are used below to inject a no-op stream metric store into NATS adapters during tests.
273-277: Well-scoped flags for enabling stream metrics
EnableOTLPEventMetrics and EnablePrometheusEventMetrics integrate cleanly with the existing MetricOptions without coupling.
1494-1516: Prometheus config: proper wiring of Streams flag and exclusions
The Streams toggle flows into PrometheusConfig; consistent with other metric groups and exclusions.
1519-1547: OTLP config: proper wiring of Streams flag and test reader
Streams toggle and TestReader integration are consistent with the rest of the telemetry setup.
2827-2829: Good: inject NoopStreamMetricStore into NATS adapters in tests
This matches the codebase's null-object pattern for metric stores; avoids nil checks and keeps adapters simple. Referencing your prior approach for event metrics is consistent.
router-tests/prometheus_stream_metrics_test.go (2)
24-24: Test timeout is reasonable
30s affords enough time for external brokers in CI; keep as-is unless flakes are observed.
259-268: Nice: shorter read timeout improves NATS subscribe determinism
Tuning WebSocketClientReadTimeout to 1s helps these subscription tests complete reliably.
router/core/factoryresolver.go (1)
11-12: Import alias is appropriate
Using rmetric keeps the StreamMetricStore type reference clear and avoids collision.
router/pkg/pubsub/kafka/adapter.go (4)
24-27: Operation-name constants are clear and aligned with tests
Defining kafkaReceive/kafkaProduce centrally is good. Kafka uses "produce" (as asserted in tests) while Redis/NATS use "publish"/"request".
143-161: Goroutine wiring and provider ID propagation look good
Spawning a poller per subscription and threading providerId via PollerOpts is clean and keeps metrics attributable per provider.
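A generic sketch of the poller-per-subscription pattern being praised. PollerOpts here is a stand-in, and only the idea of threading the provider id through for metric attribution comes from the comment:

```go
// PollerOpts is a simplified stand-in for the adapter's real options type.
type PollerOpts struct {
	ProviderID string // threaded through so emitted events can be attributed per provider
}

// startPoller runs one goroutine per subscription, fetching records until the
// context is cancelled. fetch and handle are injected to keep the sketch generic.
func startPoller(ctx context.Context, opts PollerOpts, fetch func(context.Context) ([][]byte, error), handle func(providerID string, record []byte)) {
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			default:
			}
			records, err := fetch(ctx)
			if err != nil {
				continue // a real adapter would log and record a receive error here
			}
			for _, r := range records {
				handle(opts.ProviderID, r)
			}
		}
	}()
}
```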
197-215: Produce-path metrics are complete (provider labels and error type included)
Both success and error paths emit StreamOperationName and include ProviderType, ProviderId, DestinationName; error path sets "publish_error". Matches OTLP/Prom expectations and tests.
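To make the described labeling concrete, a hypothetical helper; the adapter's real code is structured differently, and only the field values follow the comment and the StreamsEvent diff above:

```go
// buildProduceEvent labels a Kafka produce operation. ErrorType stays empty on
// success, so the error attribute is not set on the resulting metric.
func buildProduceEvent(providerID, topic string, publishErr error) StreamsEvent {
	event := StreamsEvent{
		ProviderId:          providerID,
		ProviderType:        ProviderTypeKafka,
		StreamOperationName: "produce",
		DestinationName:     topic,
	}
	if publishErr != nil {
		event.ErrorType = "publish_error"
	}
	return event
}
```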
257-277: Null-object fallback for StreamMetricStore is correct and consistent
Defaulting to NewNoopStreamMetricStore when no store is provided aligns with the codebase's pattern (no nil checks at call sites).
router/pkg/pubsub/redis/adapter.go (3)
33-54: ProviderOpts plumbing and Noop metric store fallback are solid
Constructor threads StreamMetricStore through and defaults to Noop, ensuring a non-nil store everywhere.
130-136: Consume-path metrics include provider and destination attributes
Emitting ProviderId, ProviderType=redis, operation=receive, and DestinationName=channel is correct and matches tests.
174-180: Publish-path metrics are complete and consistent
Error path uses ErrorType="publish_error"; both paths include ProviderId/Type and DestinationName. Good coverage.
Also applies to: 184-189
router-tests/telemetry/stream_metrics_test.go (3)
38-82: Kafka publish assertions are precise and aligned with instrumentation
Validates operation="produce", provider/type/destination labels, and correct count aggregation. Nice.
88-164: Kafka subscribe test synchronizes well before collecting metrics
Waiting for subscription, producing one message, then Collect-ing is a good pattern to reduce flakiness.
371-418: Redis publish/subscribe tests are consistent with the metric spec
Attributes and counts match the store's semantics. Good use of testenv helpers.
Also applies to: 420-499
router/pkg/pubsub/nats/adapter.go (3)
221-255: Publish metrics include provider labels and error classification
Both success and error paths set ProviderId/Type, destination, and appropriate operation; error path uses "publish_error". Matches tests.
257-299: Request metrics are complete; error type is specific
Operation="request" and "request_error" classification is useful; provider/destination labels included. Good.
356-379: Constructor uses Noop store fallback and scopes logger to "nats"
Consistent with other providers; avoids nil checks downstream.
router/pkg/pubsub/pubsub.go (2)
56-60: Good: default to Noop stream metric store
Guarding against nil here ensures provider builders always receive a non-nil store, matching the project's null-object pattern.
73-76: Passing the store through build() per provider is the right abstraction
Centralizes metric plumbing without leaking concerns to data source planning.
Also applies to: 89-92, 105-108
router/pkg/metric/stream_metric_store.go (2)
90-92: Verify attribute set for cardinality and conventions.
- Destination name and provider id as labels can increase cardinality. Ensure this aligns with your Prom/OTLP backends and that Config.CardinalityLimit is set appropriately where applicable.
- Confirm wg.stream.operation.name is the intended attribute (product notes mentioned omitting operation name). If you plan to exclude it, implement that upstream or gate it via config.

Also applies to: 96-101, 110-113, 118-122
33-39: Interfaces look clean and future-proof.
Separation between provider and store (with only Flush/Shutdown on the store) aligns with the router-level provider lifecycle. Good abstraction.
Also applies to: 41-47
Actionable comments posted: 1
🧹 Nitpick comments (5)
router-tests/telemetry/stream_metrics_test.go (5)
60-62: Assert type assertions for metric sums to fail fast with a clear message
Currently the code discards the ok result of the type assertion. If the metric type changes, tests will fail with a vague "len=0" rather than an explicit type mismatch. Assert ok to get actionable failures.

Apply this diff (pattern repeated across all occurrences):

```diff
- sum, _ := metricEntry.Data.(metricdata.Sum[int64])
- require.Len(t, sum.DataPoints, 1)
+ sum, ok := metricEntry.Data.(metricdata.Sum[int64])
+ require.True(t, ok, "expected router stream metric %q to be Sum[int64], got %T", metricEntry.Name, metricEntry.Data)
+ require.Len(t, sum.DataPoints, 1)
```

Also applies to: 137-140, 197-200, 249-252, 333-337, 395-399, 471-475
281-283: Increase WebSocket client read timeout to reduce CI flakiness
A 1s WebSocket client read timeout can be tight under CI load. Bumping this slightly gives headroom without materially slowing the test.

```diff
- ModifyEngineExecutionConfiguration: func(ec *config.EngineExecutionConfiguration) { ec.WebSocketClientReadTimeout = time.Second },
+ ModifyEngineExecutionConfiguration: func(ec *config.EngineExecutionConfiguration) { ec.WebSocketClientReadTimeout = 3 * time.Second },
```
101-109: Prefer typed assertions over raw JSON string comparisons for subscription payloads
You pass a typed selection struct to the subscription but only assert on the raw JSON bytes. Decoding into a typed struct (or asserting on typed fields) yields clearer diffs and compile-time safety. At minimum, align the ID type with GraphQL Int.
Apply this small type improvement:
```diff
 employeeUpdatedMyKafka struct {
-	ID float64 `graphql:"id"`
+	ID int `graphql:"id"`
 	Details struct {
 		Forename string `graphql:"forename"`
 		Surname  string `graphql:"surname"`
 	} `graphql:"details"`
 } `graphql:"employeeUpdatedMyKafka(employeeID: 3)"`
```

```diff
 employeeUpdated struct {
-	ID float64 `graphql:"id"`
+	ID int `graphql:"id"`
 	Details struct {
 		Forename string `graphql:"forename"`
 		Surname  string `graphql:"surname"`
 	} `graphql:"details"`
 } `graphql:"employeeUpdated(employeeID: 3)"`
```

```diff
 employeeUpdates struct {
-	ID float64 `graphql:"id"`
+	ID int `graphql:"id"`
 	Details struct {
 		Forename string `graphql:"forename"`
 		Surname  string `graphql:"surname"`
 	} `graphql:"details"`
 } `graphql:"employeeUpdates"`
```
Also applies to: 284-292, 433-441
52-82: Validate metric temporality/monotonicity for completeness (optional)
You already validate the counter values and attributes. As a nice-to-have, also assert that the metric is monotonic and has cumulative temporality to match OTel semantic expectations for counters.
Example (apply similarly in other sections):
```go
require.True(t, sum.IsMonotonic, "expected counter to be monotonic")
require.Equal(t, metricdata.CumulativeTemporality, sum.Temporality)
```
1-500: Reduce duplication with a small assertion helper (optional)
The scope/metric lookup, type assertion, attribute checks, and value assertions are repeated across backends. A helper would make failures uniform and the tests shorter.
Example helper (place in router-tests/telemetry/assert_helpers.go):
```go
package telemetry

import (
	"strings"
	"testing"

	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

type StreamMetricExpect struct {
	MetricName     string
	Operation      string
	ProviderType   string
	DestinationSfx string
	ProviderID     string
	ExpectedValue  int64
}

func AssertOTLPStreamCounter(t *testing.T, rm metricdata.ResourceMetrics, exp StreamMetricExpect) {
	t.Helper()
	scope := integration.GetMetricScopeByName(rm.ScopeMetrics, "cosmo.router.streams")
	require.NotNil(t, scope)
	metricEntry := integration.GetMetricByName(scope, exp.MetricName)
	require.NotNil(t, metricEntry)
	sum, ok := metricEntry.Data.(metricdata.Sum[int64])
	require.True(t, ok, "expected %q to be Sum[int64], got %T", metricEntry.Name, metricEntry.Data)
	require.Len(t, sum.DataPoints, 1)
	dp := sum.DataPoints[0]
	attrs := dp.Attributes
	op, _ := attrs.Value(otelattrs.WgStreamOperationName)
	sys, _ := attrs.Value(otelattrs.WgProviderType)
	dst, _ := attrs.Value(otelattrs.WgDestinationName)
	provider, hasProvider := attrs.Value(otelattrs.WgProviderId)
	_, hasErr := attrs.Value(otelattrs.WgErrorType)
	require.Equal(t, exp.Operation, op.AsString())
	require.Equal(t, exp.ProviderType, sys.AsString())
	require.True(t, strings.HasSuffix(dst.AsString(), exp.DestinationSfx), dst.AsString())
	require.True(t, hasProvider)
	require.Equal(t, exp.ProviderID, provider.AsString())
	require.False(t, hasErr)
	require.Equal(t, exp.ExpectedValue, dp.Value)
}
```

Use in tests:
```go
rm := metricdata.ResourceMetrics{}
require.NoError(t, metricReader.Collect(context.Background(), &rm))
AssertOTLPStreamCounter(t, rm, StreamMetricExpect{
	MetricName:     "router.streams.sent.messages",
	Operation:      "produce",
	ProviderType:   "kafka",
	DestinationSfx: "employeeUpdated",
	ProviderID:     "my-kafka",
	ExpectedValue:  2,
})
```

I can push a refactor PR on top of this if helpful.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- router-tests/prometheus_stream_metrics_test.go (1 hunks)
- router-tests/telemetry/stream_metrics_test.go (1 hunks)
- router/pkg/metric/stream_measurements.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- router/pkg/metric/stream_measurements.go
- router-tests/prometheus_stream_metrics_test.go
🧰 Additional context used
🧬 Code Graph Analysis (1)
router-tests/telemetry/stream_metrics_test.go (6)
- router-tests/prometheus_stream_metrics_test.go (2): TestFlakyEventMetrics (26-468), WaitTimeout (24-24)
- router-tests/testenv/testenv.go (4): Run (105-122), Config (284-340), MetricOptions (263-277), Environment (1727-1763)
- router-tests/events/event_helpers.go (3): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-81)
- router-tests/utils.go (2): GetMetricScopeByName (88-95), GetMetricByName (79-86)
- router/pkg/otel/attributes.go (5): WgStreamOperationName (65-65), WgProviderType (66-66), WgDestinationName (67-67), WgProviderId (68-68), WgErrorType (69-69)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
- GitHub Check: build_push_image
- GitHub Check: image_scan
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: build_test
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./telemetry)
StarpTech left a comment:
LGTM
Actionable comments posted: 0
♻️ Duplicate comments (2)
router-tests/prometheus_stream_metrics_test.go (1)
32-154: Fix race condition: Kafka subtests share topics.
The "publish" and "subscribe" Kafka subtests run in parallel but operate on the same topic ("employeeUpdated"). Since EnsureTopicExists deletes and recreates the topic, concurrent execution causes race conditions.

Remove parallel execution from the Kafka subtests:

```diff
 t.Run("publish", func(t *testing.T) {
-	t.Parallel()
+	// Not parallel: shares topic with subscribe test

 t.Run("subscribe", func(t *testing.T) {
-	t.Parallel()
+	// Not parallel: shares topic with publish test
```

router-tests/telemetry/stream_metrics_test.go (1)
35-37: Fix race condition: Kafka subtests share topics.
Same issue as in the Prometheus test: the Kafka subtests run in parallel but share the same topic.
Remove parallel execution from the Kafka subtests:
t.Run("publish", func(t *testing.T) { - t.Parallel() + // Not parallel: shares topic with subscribe test t.Run("subscribe", func(t *testing.T) { - t.Parallel() + // Not parallel: shares topic with publish test
🧹 Nitpick comments (3)
router-tests/prometheus_stream_metrics_test.go (2)
26-26: Consider renaming test to reflect its purpose.
The test name TestFlakyEventMetrics suggests it's testing flaky behavior, but it appears to be testing normal stream metrics functionality. Consider renaming to something more descriptive like TestPrometheusStreamMetrics.

```diff
-func TestFlakyEventMetrics(t *testing.T) {
+func TestPrometheusStreamMetrics(t *testing.T) {
```
54-75: Add helper functions for metric assertions.
The metric assertion logic is repeated throughout the test with only minor variations. Consider extracting helper functions to reduce duplication and improve readability.
Add these helper functions at the beginning of the file:
```go
func findMetricFamilyByName(mf []*dto.MetricFamily, name string) *dto.MetricFamily {
	for _, family := range mf {
		if family.GetName() == name {
			return family
		}
	}
	return nil
}

func findMetricLabelByName(metrics []*dto.Metric, name string) *dto.LabelPair {
	if len(metrics) == 0 {
		return nil
	}
	for _, label := range metrics[0].GetLabel() {
		if label.GetName() == name {
			return label
		}
	}
	return nil
}
```

Then use them consistently throughout the test.
router-tests/telemetry/stream_metrics_test.go (1)
29-29: Consider renaming test to reflect its purpose.
Similar to the Prometheus test, the name TestFlakyEventMetrics is misleading. Consider renaming to TestOTLPStreamMetrics for clarity.

```diff
-func TestFlakyEventMetrics(t *testing.T) {
+func TestOTLPStreamMetrics(t *testing.T) {
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- router-tests/prometheus_stream_metrics_test.go (1 hunks)
- router-tests/telemetry/stream_metrics_test.go (1 hunks)
- router-tests/testenv/testenv.go (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-14T17:22:41.620Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.620Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router-tests/testenv/testenv.go
🧬 Code Graph Analysis (2)
router-tests/prometheus_stream_metrics_test.go (3)
- router-tests/testenv/testenv.go (7): Run (105-122), Config (284-340), ConfigWithEdfsKafkaJSONTemplate (85-85), MetricOptions (263-277), Environment (1727-1763), ConfigWithEdfsNatsJSONTemplate (87-87), ConfigWithEdfsRedisJSONTemplate (89-89)
- router-tests/events/event_helpers.go (3): EnsureTopicExists (37-54), ProduceKafkaMessage (16-35), ProduceRedisMessage (56-81)
- router-tests/testenv/utils.go (1): AwaitChannelWithT (10-19)

router-tests/testenv/testenv.go (4)
- router/pkg/pubsub/nats/adapter.go (1): NewAdapter (356-379)
- router/pkg/pubsub/datasource/provider.go (1): ProviderOpts (37-39)
- router/pkg/metric/stream_metric_store.go (1): StreamMetricStore (41-47)
- router/pkg/metric/noop_stream_metrics.go (1): NewNoopStreamMetricStore (15-15)
🔇 Additional comments (4)
router-tests/testenv/testenv.go (4)
273-276: LGTM! Stream metrics configuration added correctly.
The addition of EnableOTLPStreamMetrics and EnablePrometheusStreamMetrics fields to the MetricOptions struct properly extends the test environment to support stream metrics for both telemetry exporters.
1510-1510: LGTM! Prometheus stream metrics wired correctly.
The Streams field is properly configured from EnablePrometheusStreamMetrics.
1533-1533: LGTM! OTLP stream metrics wired correctly.
The Streams field is properly configured from EnableOTLPStreamMetrics.
2827-2829: LGTM! NATS adapter initialization with stream metrics.
The NATS adapter is properly initialized with the ProviderOpts containing the no-op stream metric store for testing.
This PR contains code for event metrics for edfs. We use generic produce and consume metrics, as documented by the OpenTelemetry semantic conventions: https://opentelemetry.io/docs/specs/semconv/messaging/messaging-metrics/#producer-metrics
router.streams.sent.messages: counter of how many messages have been sent
router.streams.received.messages: counter of how many messages have been received
Both metrics share common attributes.
The following is an example (extra common attributes like federated_graph_id, etc. have been removed):
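(The original example did not survive extraction; the following is an illustrative reconstruction. Attribute keys follow those added in router/pkg/otel/attributes.go, and the values mirror the test expectations in this PR rather than a captured export.)

```text
router.streams.sent.messages = 2
  wg.provider.id           = my-kafka
  wg.provider.type         = kafka
  wg.stream.operation.name = produce
  wg.destination.name      = employeeUpdated

router.streams.received.messages = 1
  wg.provider.id           = my-kafka
  wg.provider.type         = kafka
  wg.stream.operation.name = receive
  wg.destination.name      = employeeUpdated
```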
Note that this PR does not include the GraphQL operation name as an attribute. cc: @StarpTech
Summary by CodeRabbit
New Features
Configuration
Improvements
Tests
Checklist