# APM service map metrics: cardinality explosion, Prometheus compatibility, and late-span data loss #6710
## Summary
The `otel_apm_service_map` processor has several issues with metric generation when writing to Prometheus-compatible sinks (e.g., Amazon Managed Prometheus).
## Background: what Prometheus expects from metrics
- Remote write is type-agnostic. Every sample is just labels + (timestamp, value). Prometheus TSDB does not know whether a sample is a gauge, counter, or histogram.
- Counters must be cumulative. `rate()` works by diffing consecutive values. If the value drops (e.g., 15 → 5), Prometheus treats it as a counter reset and applies incorrect compensation.
- Same-timestamp writes are rejected. Two samples for the same series with the same timestamp but different values → the second write is rejected with `ErrDuplicateSampleForTimestamp` (source). Identical values are silently accepted as a no-op.
- `isMonotonic` only exists on Sum metrics. Gauge, Histogram, and Summary do not have this flag. It distinguishes counters (which only go up) from non-monotonic sums (which can go up and down).
## Issue 1: `randomKey` UUID causes unbounded cardinality explosion
Every metric includes a random UUID label (ApmServiceMapMetricsUtil.java:277):

```java
labelsWithRandomKey.put("randomKey", UUID.randomUUID().toString());
```

This creates ~120,000 new throwaway series per hour (5 hosts × 20 services × 10 operations × 4 metrics × 30 windows/hr). TSDB costs grow without bound, and `rate()`/`increase()` cannot work since each series has only one data point.
`randomKey` is currently masking Issues 2, 3, and 4. Since every write has unique labels, Prometheus never sees consecutive values or conflicting writes. These issues will surface once `randomKey` is removed.
Related PR: #6672 (replaces `randomKey` with a stable `trace_processor_host_id`)
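For illustration, the stable identifier could be resolved once at startup and reused for every metric, so the label set never churns. A minimal sketch, assuming a hostname-based identifier (the actual derivation in #6672 may differ, and `labels` stands in for the processor's label map):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Resolved once and reused across all flush windows, keeping series stable.
static String resolveHostId() {
    try {
        // Assumption: the hostname is a stable, unique per-host identifier.
        return InetAddress.getLocalHost().getHostName();
    } catch (UnknownHostException e) {
        return "unknown-host"; // keep the label present even if lookup fails
    }
}

// In place of the random UUID:
// labels.put("trace_processor_host_id", resolveHostId());
```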
## Issue 2: Delta temporality is not supported by Prometheus
All metrics use `AGGREGATION_TEMPORALITY_DELTA` with `isMonotonic(true)` (ApmServiceMapMetricsUtil.java:284-286). This says "I am a counter that only reports increments per window", but Prometheus expects counters to be cumulative running totals.

Once series become stable (after removing `randomKey`), Prometheus will see values like 15 → 5 between windows and treat them as counter resets, producing incorrect rates.

If the intent is per-window counts, consider using a Gauge or a non-monotonic Sum instead.
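In OTLP terms, the two consistent options look roughly like this (sketched with the `opentelemetry-proto` Java bindings; Data Prepper's internal metric model may differ):

```java
import io.opentelemetry.proto.metrics.v1.AggregationTemporality;
import io.opentelemetry.proto.metrics.v1.Gauge;
import io.opentelemetry.proto.metrics.v1.Metric;
import io.opentelemetry.proto.metrics.v1.NumberDataPoint;
import io.opentelemetry.proto.metrics.v1.Sum;

// Option A: per-window count as a Gauge. Each sample is an independent
// value, so a drop (15 -> 5) is just a lower reading, not a counter reset.
Metric perWindow = Metric.newBuilder()
        .setName("request")
        .setGauge(Gauge.newBuilder()
                .addDataPoints(NumberDataPoint.newBuilder().setAsInt(5)))
        .build();

// Option B: a true Prometheus counter: monotonic, cumulative, carrying a
// running total (see "Suggested approaches" below for what that requires).
Metric cumulative = Metric.newBuilder()
        .setName("request")
        .setSum(Sum.newBuilder()
                .setIsMonotonic(true)
                .setAggregationTemporality(
                        AggregationTemporality.AGGREGATION_TEMPORALITY_CUMULATIVE)
                .addDataPoints(NumberDataPoint.newBuilder().setAsInt(30)))
        .build();
```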
## Issue 3: Late-arriving spans cause silent data loss
This only occurs once `randomKey` is removed. When a late span produces a metric with the same labels + timestamp as one already written, Prometheus rejects the second write (head_append.go:658-669):

- Same timestamp, same value: silently accepted (no-op)
- Same timestamp, different value: rejected with `ErrDuplicateSampleForTimestamp`
At low late-span rates (0.01–0.1%), data loss is under 0.1%. At 5%+ late rates (backpressure), high-traffic operations could see ~4.5% undercount.
## Issue 4: No multi-host protection by default
Without `randomKey`, multiple Data Prepper hosts processing the same service + operation produce identical metric series. Prometheus sees interleaved values from different hosts, causing counter resets and incorrect aggregations.
Related PR: #6672 (adds `trace_processor_host_id`)
## Suggested approaches
- Replace `randomKey` with a stable host identifier (addressed by #6672, "Remove randomKey label from APM service map metrics").

- Evaluate temporality and metric type: if per-window counts are the intent, consider a Gauge or a non-monotonic Sum. If true counters are desired, switch to cumulative temporality with running totals.
- To make these true Prometheus counters, the processor would need to maintain a running total in memory across flush windows instead of resetting each window. For example, instead of emitting delta values `10, 15, 5` across three windows, emit cumulative values `10, 25, 30`. This requires:

  - A `Map<MetricKey, Long>` running total that persists across window rotations
  - Accumulating each window's count: `runningTotal.merge(key, windowCount, Long::sum)`
  - Emitting the running total as the value, with `AGGREGATION_TEMPORALITY_CUMULATIVE`
  - Setting `startTimestamp` to when the key was first created (not equal to `timestamp`)

  The tradeoff is memory: the running-total map grows with cardinality and never shrinks unless an expiration mechanism is added (similar to `metrics_expiration` in the OTel Collector spanmetrics connector). On restart, all totals are lost and Prometheus sees a counter reset; this is expected and handled correctly by `rate()`. A minimal sketch of such an accumulator follows.
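  A sketch, where `MetricKey` is an illustrative stand-in for the processor's per-series key and the expiration mechanism noted above is omitted:

  ```java
  import java.time.Instant;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  class CumulativeCounterState {
      // Illustrative key; the real one would carry every series label.
      record MetricKey(String service, String operation, String metricName) {}

      private final Map<MetricKey, Long> runningTotals = new ConcurrentHashMap<>();
      private final Map<MetricKey, Instant> startTimestamps = new ConcurrentHashMap<>();

      /** Folds one flush window's delta count into the cumulative total. */
      long accumulate(MetricKey key, long windowCount, Instant windowTime) {
          // The first sighting of a series becomes its OTLP startTimestamp.
          startTimestamps.putIfAbsent(key, windowTime);
          // The total only ever grows, so the emitted Sum stays monotonic.
          return runningTotals.merge(key, windowCount, Long::sum);
      }

      Instant startTimestampFor(MetricKey key) {
          return startTimestamps.get(key);
      }
  }
  ```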
- Consider using the processor timestamp instead of the span `endTime`: in healthy pipelines where 95%+ of spans arrive on time, the processor timestamp is within seconds of the span's `endTime`, so time attribution is nearly identical. Late spans would be attributed to processing time instead of their actual time, but these are the spans that would otherwise be rejected and lost entirely. This is the same approach used by the OTel Collector spanmetrics connector, which uses `clock.Now()` at flush time for all metric timestamps. A sketch of the change is below.
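  A sketch, again in OTLP-proto terms (`windowCount` is hypothetical; the processor's actual data-point construction may differ):

  ```java
  // Stamp the data point with the flush-time wall clock instead of the span's
  // endTime, so late spans land in the current window rather than colliding
  // with an already-written (labels, timestamp) pair.
  long nowNanos = java.util.concurrent.TimeUnit.MILLISECONDS
          .toNanos(System.currentTimeMillis());
  NumberDataPoint point = NumberDataPoint.newBuilder()
          .setTimeUnixNano(nowNanos)
          .setAsInt(windowCount)
          .build();
  ```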
- Consolidate error/fault into a label on the request metric instead of emitting separate `error` and `fault` metrics. Currently the processor emits three separate Sum metrics per key: `request`, `error`, and `fault`. This triples the number of time series. Instead, add a `status` label to a single `request` metric.

  Current approach (3 separate series per key):

  ```
  request{service="orders", operation="POST /checkout"} 100
  error{service="orders", operation="POST /checkout"} 3
  fault{service="orders", operation="POST /checkout"} 1
  ```

  Suggested approach (1 metric with a status label):

  ```
  request{service="orders", operation="POST /checkout", status="ok"} 96
  request{service="orders", operation="POST /checkout", status="error"} 3
  request{service="orders", operation="POST /checkout", status="fault"} 1
  ```

  This reduces the series count by ~3x and allows simpler queries:

  - Total requests: `sum by(service, operation)(request)`
  - Error rate: `request{status="error"} / ignoring(status) sum by(service, operation)(request)`
  - All non-ok: `request{status!="ok"}`
- Document the late-span rejection behavior and expected data-loss characteristics.
## Environment
- Data Prepper version: latest main branch
- Sink: Amazon Managed Prometheus / any Prometheus-compatible TSDB