
APM service map metrics: cardinality explosion, Prometheus compatibility, and late-span data loss #6710

@vamsimanohar

Description

Summary

The otel_apm_service_map processor has several issues with metric generation when writing to Prometheus-compatible sinks (e.g., Amazon Managed Prometheus).

Background: How Prometheus expects metrics

  • Remote write is type-agnostic. Every sample is just labels + (timestamp, value). Prometheus TSDB does not know if a sample is gauge, counter, or histogram.
  • Counters must be cumulative. rate() works by diffing consecutive values. If the value drops (e.g., 15 → 5), Prometheus treats it as a counter reset and applies incorrect compensation.
  • Same-timestamp writes are rejected. Two samples for the same series with the same timestamp but different values cause the second write to be rejected with ErrDuplicateSampleForTimestamp; identical values are silently accepted as a no-op.
  • isMonotonic only exists on Sum metrics. Gauge, Histogram, Summary do not have this flag. It distinguishes counters (only go up) from non-monotonic sums (can go up and down).

Issue 1: randomKey UUID causes unbounded cardinality explosion

Every metric includes a random UUID label (ApmServiceMapMetricsUtil.java:277):

labelsWithRandomKey.put("randomKey", UUID.randomUUID().toString());

This creates ~120,000 new throwaway series per hour (5 hosts × 20 services × 10 operations × 4 metrics × 30 windows/hr). TSDB costs grow without bound, and rate()/increase() cannot work since each series has only one data point.

randomKey is currently masking Issues 2, 3, and 4. Since every write has unique labels, Prometheus never sees consecutive values or conflicting writes. These issues will surface once randomKey is removed.

Related PR: #6672 (replaces randomKey with stable trace_processor_host_id)
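A stable per-process identifier keeps cardinality bounded: the label value is computed once and reused across windows, so each host contributes one series per metric instead of one per write. A minimal sketch of the idea (class and method names here are hypothetical; the actual change is in #6672):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

public class StableHostLabel {
    // Resolved once per process, unlike UUID.randomUUID() per metric emission.
    private static final String HOST_ID = resolveHostId();

    private static String resolveHostId() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }

    public static Map<String, String> withHostLabel(Map<String, String> labels) {
        Map<String, String> out = new HashMap<>(labels);
        // Same value on every write -> one series per host, not one per sample.
        out.put("trace_processor_host_id", HOST_ID);
        return out;
    }
}
```

With a constant label value, consecutive samples land on the same series, which is exactly what exposes Issues 2–4 below.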

Issue 2: Delta temporality is not supported by Prometheus

All metrics use AGGREGATION_TEMPORALITY_DELTA with isMonotonic(true) (ApmServiceMapMetricsUtil.java:284-286). This says "I am a counter that only reports increments per window" — but Prometheus expects counters to be cumulative running totals.

Once series become stable (after removing randomKey), Prometheus will see values like 15 → 5 between windows and treat them as counter resets, producing incorrect rates.

If the intent is per-window counts, consider using Gauge type or non-monotonic Sum instead.
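To see why delta samples break rate math, here is a simplified model of Prometheus-style counter-reset handling (the real rate()/increase() also extrapolates over the query range; this sketch only shows the reset compensation):

```java
public class CounterResetDemo {
    // Simplified reset handling: when a sample drops, Prometheus assumes the
    // counter reset to zero and counts the new value in full.
    public static double increase(double[] samples) {
        double total = 0;
        for (int i = 1; i < samples.length; i++) {
            double diff = samples[i] - samples[i - 1];
            total += (diff >= 0) ? diff : samples[i]; // drop => assumed reset
        }
        return total;
    }

    public static void main(String[] args) {
        // Delta temporality: per-window counts 10, 15, 5.
        double[] deltas = {10, 15, 5};
        // Cumulative temporality for the same traffic: 10, 25, 30.
        double[] cumulative = {10, 25, 30};
        // The drop from 15 to 5 is misread as a reset: computed increase is 10.
        System.out.println(increase(deltas));
        // The cumulative series yields the correct increase since the first
        // sample: (25 - 10) + (30 - 25) = 20.
        System.out.println(increase(cumulative));
    }
}
```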

Issue 3: Late-arriving spans cause silent data loss

Only occurs once randomKey is removed. When a late span produces a metric with the same labels + timestamp as one already written, Prometheus rejects the second write (head_append.go:658-669):

  • Same timestamp, same value: silently accepted (no-op)
  • Same timestamp, different value: rejected (ErrDuplicateSampleForTimestamp)

At low late-span rates (0.01–0.1%), data loss is under 0.1%. At 5%+ late rates (backpressure), high-traffic operations could see ~4.5% undercount.

Issue 4: No multi-host protection by default

Without randomKey, multiple Data Prepper hosts processing the same service + operation produce identical metric series. Prometheus sees interleaved values from different hosts, causing counter resets and incorrect aggregations.

Related PR: #6672 (adds trace_processor_host_id)

Suggested approaches

  1. Replace randomKey with a stable host identifier (addressed by #6672, "Remove randomKey label from APM service map metrics")

  2. Evaluate temporality and metric type — if per-window counts are the intent, consider Gauge or non-monotonic Sum. If true counters are desired, switch to cumulative temporality with running totals.

  3. To make these true Prometheus counters, the processor would need to maintain a running total in memory across flush windows instead of resetting each window. For example, instead of emitting delta values 10, 15, 5 across three windows, emit cumulative values 10, 25, 30. This requires:

    • A Map<MetricKey, Long> running total that persists across window rotations
    • Accumulating each window's count: runningTotal.merge(key, windowCount, Long::sum)
    • Emitting the running total as the value with AGGREGATION_TEMPORALITY_CUMULATIVE
    • Setting startTimestamp to when the key was first created (not equal to timestamp)

    The tradeoff is memory — the running total map grows with cardinality and never shrinks unless an expiration mechanism is added (similar to metrics_expiration in the OTel Collector spanmetrics connector). On restart, all totals are lost and Prometheus sees a counter reset — this is expected and handled correctly by rate().

  4. Consider using processor timestamp instead of span endTime — in healthy pipelines where 95%+ of spans arrive on time, the processor timestamp is within seconds of the span's endTime, so time attribution is nearly identical. Late spans would get attributed to processing time instead of their actual time, but these are the spans that would otherwise be rejected and lost entirely. This is the same approach used by the OTel Collector spanmetrics connector, which uses clock.Now() at flush time for all metric timestamps.

  5. Consolidate error/fault into a label on the request metric instead of emitting separate error and fault metrics. Currently the processor emits three separate Sum metrics per key: request, error, and fault. This triples the number of time series. Instead, add a status label to a single request metric:

    Current approach (3 separate series per key):

    request{service="orders", operation="POST /checkout"} 100
    error{service="orders", operation="POST /checkout"} 3
    fault{service="orders", operation="POST /checkout"} 1
    

    Suggested approach (1 metric with status label):

    request{service="orders", operation="POST /checkout", status="ok"} 96
    request{service="orders", operation="POST /checkout", status="error"} 3
    request{service="orders", operation="POST /checkout", status="fault"} 1
    

    This reduces series count by ~3x and allows simpler queries:

    • Total requests: sum by(service, operation)(request)
    • Error rate: request{status="error"} / ignoring(status) sum by(service, operation)(request)
    • All non-ok: request{status!="ok"}
  6. Document the late-span rejection behavior and expected data-loss characteristics.
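The running-total accumulation described in approach 3 could be sketched as follows (class name and key format are hypothetical, and the expiration mechanism mentioned above is not shown):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CumulativeTotals {
    // Persists across window rotations; grows with key cardinality until an
    // expiration mechanism evicts idle keys.
    private final Map<String, Long> runningTotal = new ConcurrentHashMap<>();
    private final Map<String, Long> firstSeen = new ConcurrentHashMap<>();

    // Called at each flush with the count observed in the closing window.
    // Returns the cumulative value to emit with AGGREGATION_TEMPORALITY_CUMULATIVE.
    public long accumulate(String key, long windowCount, long nowMillis) {
        firstSeen.putIfAbsent(key, nowMillis); // startTimestamp for the series
        return runningTotal.merge(key, windowCount, Long::sum);
    }

    public long getStartTimestamp(String key) {
        return firstSeen.get(key);
    }

    public static void main(String[] args) {
        CumulativeTotals totals = new CumulativeTotals();
        // Per-window counts 10, 15, 5 become cumulative 10, 25, 30.
        System.out.println(totals.accumulate("orders|POST /checkout", 10, 1_000));
        System.out.println(totals.accumulate("orders|POST /checkout", 15, 61_000));
        System.out.println(totals.accumulate("orders|POST /checkout", 5, 121_000));
    }
}
```

On restart the map is empty, so the next emission starts over from the window count; Prometheus sees this as an ordinary counter reset.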
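The status-label consolidation from approach 5 could look like the sketch below. The classification rule (4xx → error, 5xx → fault) is an assumption borrowed from common APM conventions, not necessarily how the processor actually distinguishes errors from faults:

```java
public class StatusLabel {
    enum Outcome { OK, ERROR, FAULT }

    // Assumed classification: 5xx = fault (server side), 4xx = error (client
    // side), everything else = ok.
    static Outcome classify(int httpStatusCode) {
        if (httpStatusCode >= 500) return Outcome.FAULT;
        if (httpStatusCode >= 400) return Outcome.ERROR;
        return Outcome.OK;
    }

    // Value for the single `status` label on the consolidated request metric.
    public static String statusLabel(int httpStatusCode) {
        return classify(httpStatusCode).name().toLowerCase();
    }
}
```

One request series per status value replaces the three parallel request/error/fault metrics, cutting series count roughly 3x.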

Environment

  • Data Prepper version: latest main branch
  • Sink: Amazon Managed Prometheus / any Prometheus-compatible TSDB
