
APM service map metrics: cardinality explosion, Prometheus compatibility, and late-span data loss #6710

@vamsimanohar

Description

Summary

The otel_apm_service_map processor has several issues with metric generation when writing to Prometheus-compatible sinks (e.g., Amazon Managed Prometheus).

Background: How Prometheus expects metrics

  • Remote write is type-agnostic. Every sample is just labels + (timestamp, value). Prometheus TSDB does not know if a sample is gauge, counter, or histogram.
  • Counters must be cumulative. rate() works by diffing consecutive values. If the value drops (e.g., 15 → 5), Prometheus treats it as a counter reset and applies incorrect compensation.
  • Same-timestamp writes are rejected. Two samples for the same series with the same timestamp but different values cause the second write to be rejected with ErrDuplicateSampleForTimestamp; identical values are silently accepted as a no-op.
  • isMonotonic only exists on Sum metrics. Gauge, Histogram, Summary do not have this flag. It distinguishes counters (only go up) from non-monotonic sums (can go up and down).

Issue 1: randomKey UUID causes unbounded cardinality explosion

Every metric includes a random UUID label (ApmServiceMapMetricsUtil.java:277):

labelsWithRandomKey.put("randomKey", UUID.randomUUID().toString());

This creates ~120,000 new throwaway series per hour (5 hosts × 20 services × 10 operations × 4 metrics × 30 windows/hr). TSDB costs grow without bound, and rate()/increase() cannot work since each series has only one data point.

randomKey is currently masking Issues 2, 3, and 4. Since every write has unique labels, Prometheus never sees consecutive values or conflicting writes. These issues will surface once randomKey is removed.

Related PR: #6672 (replaces randomKey with stable trace_processor_host_id)
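A stable per-process identifier keeps cardinality bounded: the label value is computed once and reused across windows, so each host contributes one series per metric instead of one per write. A minimal sketch of the idea (class and method names here are hypothetical; the actual change is in #6672):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

public class StableHostLabel {
    // Resolved once per process, unlike UUID.randomUUID() per metric emission.
    private static final String HOST_ID = resolveHostId();

    private static String resolveHostId() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }

    public static Map<String, String> withHostLabel(Map<String, String> labels) {
        Map<String, String> out = new HashMap<>(labels);
        // Same value on every write -> one series per host, not one per sample.
        out.put("trace_processor_host_id", HOST_ID);
        return out;
    }
}
```

With a constant label value, consecutive samples land on the same series, which is exactly what exposes Issues 2–4 below.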

Issue 2: Delta temporality is not supported by Prometheus

All metrics use AGGREGATION_TEMPORALITY_DELTA with isMonotonic(true) (ApmServiceMapMetricsUtil.java:284-286). This says "I am a counter that only reports increments per window" — but Prometheus expects counters to be cumulative running totals.

Once series become stable (after removing randomKey), Prometheus will see values like 15 → 5 between windows and treat them as counter resets, producing incorrect rates.

If the intent is per-window counts, consider using Gauge type or non-monotonic Sum instead.
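To see why delta samples break rate math, here is a simplified model of Prometheus-style counter-reset handling (the real rate()/increase() also extrapolates over the query range; this sketch only shows the reset compensation):

```java
public class CounterResetDemo {
    // Simplified reset handling: when a sample drops, Prometheus assumes the
    // counter reset to zero and counts the new value in full.
    public static double increase(double[] samples) {
        double total = 0;
        for (int i = 1; i < samples.length; i++) {
            double diff = samples[i] - samples[i - 1];
            total += (diff >= 0) ? diff : samples[i]; // drop => assumed reset
        }
        return total;
    }

    public static void main(String[] args) {
        // Delta temporality: per-window counts 10, 15, 5.
        double[] deltas = {10, 15, 5};
        // Cumulative temporality for the same traffic: 10, 25, 30.
        double[] cumulative = {10, 25, 30};
        // The drop from 15 to 5 is misread as a reset: computed increase is 10.
        System.out.println(increase(deltas));
        // The cumulative series yields the correct increase since the first
        // sample: (25 - 10) + (30 - 25) = 20.
        System.out.println(increase(cumulative));
    }
}
```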

Issue 3: Late-arriving spans cause silent data loss

Only occurs once randomKey is removed. When a late span produces a metric with the same labels + timestamp as one already written, Prometheus rejects the second write (head_append.go:658-669):

  • Same timestamp, same value: silently accepted (no-op)
  • Same timestamp, different value: rejected (ErrDuplicateSampleForTimestamp)

At low late-span rates (0.01–0.1%), data loss is under 0.1%. At 5%+ late rates (backpressure), high-traffic operations could see ~4.5% undercount.

Issue 4: No multi-host protection by default

Without randomKey, multiple Data Prepper hosts processing the same service + operation produce identical metric series. Prometheus sees interleaved values from different hosts, causing counter resets and incorrect aggregations.

Related PR: #6672 (adds trace_processor_host_id)

Suggested approaches

  1. Replace randomKey with a stable host identifier (addressed by #6672, "Remove randomKey label from APM service map metrics")

  2. Evaluate temporality and metric type — if per-window counts are the intent, consider Gauge or non-monotonic Sum. If true counters are desired, switch to cumulative temporality with running totals.

  3. To make these true Prometheus counters, the processor would need to maintain a running total in memory across flush windows instead of resetting each window. For example, instead of emitting delta values 10, 15, 5 across three windows, emit cumulative values 10, 25, 30. This requires:

    • A Map<MetricKey, Long> running total that persists across window rotations
    • Accumulating each window's count: runningTotal.merge(key, windowCount, Long::sum)
    • Emitting the running total as the value with AGGREGATION_TEMPORALITY_CUMULATIVE
    • Setting startTimestamp to when the key was first created (not equal to timestamp)

    The tradeoff is memory — the running total map grows with cardinality and never shrinks unless an expiration mechanism is added (similar to metrics_expiration in the OTel Collector spanmetrics connector). On restart, all totals are lost and Prometheus sees a counter reset — this is expected and handled correctly by rate().

  4. Consider using processor timestamp instead of span endTime — in healthy pipelines where 95%+ of spans arrive on time, the processor timestamp is within seconds of the span's endTime, so time attribution is nearly identical. Late spans would get attributed to processing time instead of their actual time, but these are the spans that would otherwise be rejected and lost entirely. This is the same approach used by the OTel Collector spanmetrics connector, which uses clock.Now() at flush time for all metric timestamps.

  5. Consolidate error/fault into a label on the request metric instead of emitting separate error and fault metrics. Currently the processor emits three separate Sum metrics per key: request, error, and fault. This triples the number of time series. Instead, add a status label to a single request metric:

    Current approach (3 separate series per key):

    request{service="orders", operation="POST /checkout"} 100
    error{service="orders", operation="POST /checkout"} 3
    fault{service="orders", operation="POST /checkout"} 1
    

    Suggested approach (1 metric with status label):

    request{service="orders", operation="POST /checkout", status="ok"} 96
    request{service="orders", operation="POST /checkout", status="error"} 3
    request{service="orders", operation="POST /checkout", status="fault"} 1
    

    This reduces series count by ~3x and allows simpler queries:

    • Total requests: sum by(service, operation)(request)
    • Error rate: request{status="error"} / ignoring(status) sum by(service, operation)(request)
    • All non-ok: request{status!="ok"}
  6. Document the late-span rejection behavior and expected data-loss characteristics.
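The running-total accumulation described in approach 3 could be sketched as follows (class name and key format are hypothetical, and the expiration mechanism mentioned above is not shown):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CumulativeTotals {
    // Persists across window rotations; grows with key cardinality until an
    // expiration mechanism evicts idle keys.
    private final Map<String, Long> runningTotal = new ConcurrentHashMap<>();
    private final Map<String, Long> firstSeen = new ConcurrentHashMap<>();

    // Called at each flush with the count observed in the closing window.
    // Returns the cumulative value to emit with AGGREGATION_TEMPORALITY_CUMULATIVE.
    public long accumulate(String key, long windowCount, long nowMillis) {
        firstSeen.putIfAbsent(key, nowMillis); // startTimestamp for the series
        return runningTotal.merge(key, windowCount, Long::sum);
    }

    public long getStartTimestamp(String key) {
        return firstSeen.get(key);
    }

    public static void main(String[] args) {
        CumulativeTotals totals = new CumulativeTotals();
        // Per-window counts 10, 15, 5 become cumulative 10, 25, 30.
        System.out.println(totals.accumulate("orders|POST /checkout", 10, 1_000));
        System.out.println(totals.accumulate("orders|POST /checkout", 15, 61_000));
        System.out.println(totals.accumulate("orders|POST /checkout", 5, 121_000));
    }
}
```

On restart the map is empty, so the next emission starts over from the window count; Prometheus sees this as an ordinary counter reset.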
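The status-label consolidation from approach 5 could look like the sketch below. The classification rule (4xx → error, 5xx → fault) is an assumption borrowed from common APM conventions, not necessarily how the processor actually distinguishes errors from faults:

```java
public class StatusLabel {
    enum Outcome { OK, ERROR, FAULT }

    // Assumed classification: 5xx = fault (server side), 4xx = error (client
    // side), everything else = ok.
    static Outcome classify(int httpStatusCode) {
        if (httpStatusCode >= 500) return Outcome.FAULT;
        if (httpStatusCode >= 400) return Outcome.ERROR;
        return Outcome.OK;
    }

    // Value for the single `status` label on the consolidated request metric.
    public static String statusLabel(int httpStatusCode) {
        return classify(httpStatusCode).name().toLowerCase();
    }
}
```

One request series per status value replaces the three parallel request/error/fault metrics, cutting series count roughly 3x.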

Environment

  • Data Prepper version: latest main branch
  • Sink: Amazon Managed Prometheus / any Prometheus-compatible TSDB
