Prometheus scrape via `prometheusreceiver`

Tracecore scrapes Prometheus-format endpoints via the upstream prometheusreceiver. This is the adoption shape for every vendor GPU exporter per RFC-0013 §2 (Adoption matrix): NVIDIA dcgm-exporter, AMD ROCm/device-metrics-exporter, Intel intel/xpumanager, Habana Prometheus Metric Exporter — and for the Kueue scheduler's metrics endpoint. Replaces the in-tree dcgm and kueue receivers per RFC-0013 §7 (Deletion list — v0.1.0).

Three OTTL transform processors run in series over the scraped metrics:

transform/gpu_vendor stamps the customer-stable gpu.vendor resource attribute (RFC-0013 §3) so dashboards survive a future swap between vendor exporters.
transform/dcgm_to_hw_semconv projects the raw DCGM_FI_* namespace onto the customer-stable hw.gpu.* / hw.errors namespace declared in docs/proposals/semconv-hw-gpu-extensions.md so the next-cycle pattern detectors (issue #260 patterns #1 NVLink, #3 HBM ECC, #4 thermal throttle, #5 PCIe AER, #10 CUDA OOM) read one vendor-neutral wire format. Per docs/rfcs/0014-metrics-to-logs-pattern-input.md the verdict-emission half extends patterndetectorprocessor with processor.WithMetrics — shipped for cuda_oom (#10) via #437 / PR #461, with sibling consumers for patterns #1 / #3 / #4 / #5 pending under #260. The transform below is the load-bearing wire-format contract that the metrics-path consumer reads.
transform/ib_to_hw_semconv projects node_exporter --collector.infiniband's node_infiniband_port_state_id onto the customer-stable hw.network.ib.* namespace (docs/ATTRIBUTES.md §hw.network.*, alpha) so pattern #2's link-flap detector reads the same vendor-neutral shape whether the underlying source is node_exporter, a Mellanox-specific exporter, or journald-kernel.md's mlx5_core stream. Same RFC-0014 metrics-path consumer dependency as the DCGM transform (cuda_oom #10 shipped via PR #461; IB link-flap sibling consumer pending).

Config

# docs/integrations/examples/prometheus-scrape.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter
          scrape_interval: 15s
          scrape_timeout: 10s
          metrics_path: /metrics
          fallback_scrape_protocol: PrometheusText1.0.0
          static_configs:
            - targets:
                - REPLACE_WITH_DCGM_EXPORTER_TARGET

processors:
  transform/gpu_vendor:
    metric_statements:
      - context: datapoint
        statements:
          - set(resource.attributes["gpu.vendor"], "nvidia") where IsMatch(metric.name, "^DCGM_")
          - set(resource.attributes["gpu.vendor"], "amd") where IsMatch(metric.name, "^amdsmi_")
          - set(resource.attributes["gpu.vendor"], "intel") where IsMatch(metric.name, "^xpum_")
          - set(resource.attributes["gpu.vendor"], "habana") where IsMatch(metric.name, "^habanalabs_")
  batch:
    send_batch_size: 8192
    timeout: 10s

exporters:
  otlphttp:
    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
    compression: gzip
    timeout: 10s

service:
  pipelines:
    metrics/scrape:
      receivers: [prometheus]
      processors: [transform/gpu_vendor, batch]
      exporters: [otlphttp]

Validate with the in-tree binary:

./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml

Exit 0 means the config parses, every scrape target URL is well-formed, and the OTTL statements type-check against the metric-datapoint context.

Install dcgm-exporter

The recipe above scrapes NVIDIA's upstream dcgm-exporter (Apache-2.0). Install it from the canonical Helm repo:

helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-monitoring --create-namespace \
  -f dcgm-exporter-values.yaml

The minimal dcgm-exporter-values.yaml that pairs with tracecore's prometheusreceiver scrape (no Prometheus Operator dependency) is:

# dcgm-exporter-values.yaml — minimal overlay for tracecore scrape
serviceMonitor:
  # tracecore scrapes via prometheusreceiver, not Prometheus Operator
  enabled: false
service:
  type: ClusterIP
  port: 9400
nodeSelector:
  # restrict the DaemonSet to GPU nodes; pair with NVIDIA's node-feature-
  # discovery or device-plugin label. Use the label your cluster stamps.
  nvidia.com/gpu.present: "true"
arguments:
  - "-f"
  - "/etc/dcgm-exporter/default-counters.csv"

To enable the pattern #1 NVLink series (commented out in upstream default-counters.csv — see the NVLink section below), mount a custom counters ConfigMap and point arguments at it via the -m flag per the chart's values.yaml documentation.

The chart renders a ServiceAccount, ConfigMap (default counters), Role + RoleBinding (read the ConfigMap), Service (ClusterIP on :9400), and DaemonSet (containerPort 9400, app.kubernetes.io/name=dcgm-exporter). The Service DNS name is dcgm-exporter.gpu-monitoring.svc; per-pod IPs live behind the app.kubernetes.io/name=dcgm-exporter selector.

Verify dcgm-exporter is scrapable

Confirm the exporter is healthy before pointing tracecore at it. Port-forward one pod and curl /metrics:

kubectl -n gpu-monitoring port-forward \
  $(kubectl -n gpu-monitoring get pod \
      -l app.kubernetes.io/name=dcgm-exporter \
      -o jsonpath='{.items[0].metadata.name}') \
  9400:9400 &
curl -sf http://localhost:9400/metrics | head -5

A healthy response begins with # HELP DCGM_FI_... and # TYPE ... lines in Prometheus text exposition format. Pattern-relevant prefixes to confirm are present:

Prefix	Patterns	Needs custom counters CSV?
`DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL`	#3 HBM ECC	No
`DCGM_FI_DEV_*_VIOLATION`	#4 thermal throttle	No
`DCGM_FI_DEV_FB_{USED,FREE}`	#10 CUDA OOM	No
`DCGM_FI_PROF_PCIE_{TX,RX}_BYTES`	#5 PCIe AER	No (profiling enabled by default)
`DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES`	#1 NVLink	Yes — see NVLink section below

grep for the families you intend to alert on; a missing prefix means the corresponding pattern will never fire even if the recipe's OTTL stanza compiles cleanly. Then point tracecore's REPLACE_WITH_DCGM_EXPORTER_TARGET at either localhost:9400 (per-node DaemonSet shape) or dcgm-exporter.gpu-monitoring.svc:9400 (Deployment shape) and validate per the config section above.
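The prefix check can also be scripted. The following is a minimal sketch, not part of tracecore: `PATTERN_FAMILIES` and `missing_families` are hypothetical names, and the regexes are written from the prefix table above. It assumes the scrape body has already been fetched (for example with the curl command above) and is passed in as a string.

```python
import re

# Hypothetical helper: one regex per metric family, taken from the prefix table.
PATTERN_FAMILIES = {
    "#3 HBM ECC": r"^DCGM_FI_DEV_ECC_(SBE|DBE)_(VOL|AGG)_TOTAL\b",
    "#4 thermal throttle": r"^DCGM_FI_DEV_\w+_VIOLATION\b",
    "#10 CUDA OOM": r"^DCGM_FI_DEV_FB_(USED|FREE)\b",
    "#5 PCIe AER": r"^DCGM_FI_PROF_PCIE_(TX|RX)_BYTES\b",
    "#1 NVLink": r"^DCGM_FI_PROF_NVLINK_L\d+_(TX|RX)_BYTES\b",
}


def missing_families(exposition: str) -> list[str]:
    """Return the pattern families with no sample line in the scrape body."""
    # Skip blank lines and '# HELP' / '# TYPE' comments; only samples count.
    samples = [ln for ln in exposition.splitlines() if ln and not ln.startswith("#")]
    return [
        name
        for name, regex in PATTERN_FAMILIES.items()
        if not any(re.match(regex, ln) for ln in samples)
    ]
```

An empty return value means every family in the table has at least one sample; any name in the list identifies a pattern that cannot fire against this exporter as configured.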

Deployment shape

The right Kubernetes shape depends on the scrape target:

Per-node targets (NVIDIA dcgm-exporter, AMD/Intel/Habana per-node exporters): run tracecore as a DaemonSet and scrape localhost:<port> so each node's exporter is read by the tracecore pod on the same node. No cluster-wide service discovery required.
Cluster-scoped targets (Kueue's controller-manager metrics endpoint, single-replica vendor exporters): run tracecore as a single-replica Deployment and scrape the target's Service. Pair with kubernetes_sd_configs: if the target moves between pods on re-roll; for a stable Service ClusterIP, static_configs: is enough.

Adding authenticated targets (Kueue, vendor exporters)

The example scrapes a static unauthenticated target. For Kueue's controller-manager metrics endpoint (TLS + serviceaccount-token bearer):

        - job_name: kueue
          scheme: https
          scrape_interval: 30s
          metrics_path: /metrics
          fallback_scrape_protocol: PrometheusText1.0.0
          authorization:
            type: Bearer
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            server_name: kueue-controller-manager-metrics-service.kueue-system.svc
          static_configs:
            - targets:
                - kueue-controller-manager-metrics-service.kueue-system.svc:8443

Adjust server_name to match the Service's DNS name. The credentials_file path is the default ServiceAccount projected-token mount; if you use a custom token volume, update the path.

`gpu.vendor` resource-attribute mapping

The OTTL transform routes to a vendor tag based on the metric-name prefix each upstream exporter uses:

`metric.name` prefix	`gpu.vendor`	Upstream exporter
`DCGM_*`	`nvidia`	NVIDIA/dcgm-exporter
`amdsmi_*`	`amd`	ROCm/device-metrics-exporter
`xpum_*`	`intel`	intel/xpumanager
`habanalabs_*`	`habana`	Habana Prometheus Metric Exporter

RFC-0013 §3 pins gpu.vendor as a customer-stable resource attribute, so existing dashboards keyed on gpu.vendor continue to work after a vendor swap.
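For unit-testing dashboards or fixtures outside the collector, the same prefix routing can be mirrored in a few lines. This is an illustrative sketch only: `VENDOR_BY_PREFIX` and `gpu_vendor` are hypothetical names, and the authoritative mapping remains the OTTL statements in transform/gpu_vendor.

```python
from typing import Optional

# Prefix -> gpu.vendor, copied from the mapping table above.
VENDOR_BY_PREFIX = {
    "DCGM_": "nvidia",
    "amdsmi_": "amd",
    "xpum_": "intel",
    "habanalabs_": "habana",
}


def gpu_vendor(metric_name: str) -> Optional[str]:
    """Return the gpu.vendor value for a metric name, or None if no prefix matches."""
    for prefix, vendor in VENDOR_BY_PREFIX.items():
        if metric_name.startswith(prefix):
            return vendor
    return None
```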

`DCGM_FI_` → `hw.gpu.` semconv projection

The second OTTL transform (transform/dcgm_to_hw_semconv in the example YAML) projects every load-bearing DCGM_FI_* series into the customer-stable namespace from docs/proposals/semconv-hw-gpu-extensions.md. The contract is one-direction: a downstream consumer reads only hw.gpu.* / hw.errors, never the raw DCGM names. Per RFC-0014 the pattern detectors built on top of this namespace (issue #260) land as a processor.WithMetrics extension to patterndetectorprocessor — not as an OTTL metrics-to-logs emitter — because OTel-contrib transformprocessor v0.130 cannot emit log records from a metrics pipeline. The cuda_oom (#10) consumer shipped via #437 / PR #461; siblings for #1 / #3 / #4 / #5 are pending under #260.

Resource attribution

dcgm-exporter stamps two cross-version label flavors per series: UUID / gpu_uuid for the NVML UUID, and gpu / GPU for the NVML index. The transform maps either onto the customer-stable resource attribute:

dcgm-exporter label	Resource attribute	Notes
`UUID` or `gpu_uuid`	`hw.id`	NVML UUID; durable join key. The transform prefers `UUID` and falls back to `gpu_uuid` when only the legacy label is present.
`gpu` or `GPU`	`hw.gpu.index`	NVML index; volatile across reboots. Same dual-label preference.
(computed)	`hw.type` = `"gpu"`	Stamped on every `DCGM_` series; gates downstream `hw.` filters against future non-GPU `hw.*` sources.
`pci_bus_id`	`hw.gpu.pci.bdf`	PCI bus-device-function; lifted only on `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` series so pattern #5's escalation can cross-reference dmesg AER lines on the same BDF.

Pattern #1 — NVLink degradation

Per-link Tx/Rx Counter. Each DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES series collapses into one metric name with the link index lifted into a datapoint attribute via OTTL ExtractPatterns:

Raw DCGM series	OTel metric	Datapoint attributes (added)
`DCGM_FI_PROF_NVLINK_L{N}_TX_BYTES` (N ∈ 0..17)	`hw.gpu.nvlink.io` (Counter, unit `By`)	`hw.gpu.nvlink.link={N}`, `network.io.direction=transmit`
`DCGM_FI_PROF_NVLINK_L{N}_RX_BYTES` (N ∈ 0..17)	`hw.gpu.nvlink.io` (Counter, unit `By`)	`hw.gpu.nvlink.link={N}`, `network.io.direction=receive`

The link index lift uses Int(ExtractPatterns(metric.name, "^DCGM_FI_PROF_NVLINK_L(?P<link>\\d+)_(TX|RX)_BYTES$")["link"]) so the resulting attribute is integer-typed (matches the semconv proposal's hw.gpu.nvlink.link: int). Per-link decomposition is the diagnostic-critical surface for pattern #1 silent NVLink degradation; without it the alert query has no group-by axis.
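To check the regular expression outside the collector, the lift can be reproduced with Python's `re` module, which accepts the same `(?P<link>...)` named-group syntax. This is a sketch under assumptions: `project_nvlink` is a hypothetical helper, and the direction capture is given a name (`dir`) here for convenience, whereas the statement above leaves it unnamed.

```python
import re

NVLINK_RE = re.compile(r"^DCGM_FI_PROF_NVLINK_L(?P<link>\d+)_(?P<dir>TX|RX)_BYTES$")
DIRECTION = {"TX": "transmit", "RX": "receive"}


def project_nvlink(metric_name: str):
    """Map a raw DCGM NVLink series name to (OTel metric name, added attributes)."""
    m = NVLINK_RE.match(metric_name)
    if m is None:
        return None  # not an NVLink per-link series
    return "hw.gpu.nvlink.io", {
        "hw.gpu.nvlink.link": int(m.group("link")),  # integer-typed, as in the proposal
        "network.io.direction": DIRECTION[m.group("dir")],
    }
```

All 36 raw series (18 links, two directions) therefore land on one metric name, with the link index available as a group-by axis.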

dcgm-exporter opt-in required. The DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES field IDs (1040..1075) are commented out in dcgm-exporter's upstream default-counters.csv. Operators must mount a custom counters ConfigMap and pass it via the chart's -m <ns>:<configmap> flag (or set arguments[1]=-f=/etc/dcgm-exporter/custom-counters.csv and a matching extraVolumes entry). Without this, the recipe compiles cleanly but emits zero hw.gpu.nvlink.io series — pattern #1 will never fire.

Pattern #3 — Uncorrectable HBM ECC

ECC counters expand into four series (correctable / uncorrectable × volatile / aggregate). Pattern #3 alerts on the uncorrectable volatile row; the rest are evidence context that the runbook references.

Raw DCGM series	OTel metric	Datapoint attributes (added)
`DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`	`hw.errors` (Counter, unit `{error}`)	`error.type=uncorrected`, `error.subtype=double_bit`, `error.persistence=volatile`
`DCGM_FI_DEV_ECC_DBE_AGG_TOTAL`	`hw.errors`	`error.type=uncorrected`, `error.subtype=double_bit`, `error.persistence=aggregate`
`DCGM_FI_DEV_ECC_SBE_VOL_TOTAL`	`hw.errors`	`error.type=corrected`, `error.subtype=single_bit`, `error.persistence=volatile`
`DCGM_FI_DEV_ECC_SBE_AGG_TOTAL`	`hw.errors`	`error.type=corrected`, `error.subtype=single_bit`, `error.persistence=aggregate`

The attribute names match the semconv hw.errors shape (see hw common). Pattern #3 doc consumes the error.persistence=volatile row in its alert query.

Pattern #4 — Thermal throttle cascade

Modern dcgm-exporter emits per-reason throttle counters as discrete DCGM_FI_DEV_*_VIOLATION series. Each maps onto hw.gpu.throttle.duration with a hw.gpu.throttle.reason attribute (semconv proposal §2).

Raw DCGM series	OTel metric	Datapoint attributes (added)
`DCGM_FI_DEV_THERMAL_VIOLATION`	`hw.gpu.throttle.duration` (Counter, unit `s`)	`hw.gpu.throttle.reason=thermal`
`DCGM_FI_DEV_POWER_VIOLATION`	`hw.gpu.throttle.duration`	`hw.gpu.throttle.reason=power`
`DCGM_FI_DEV_SYNC_BOOST_VIOLATION`	`hw.gpu.throttle.duration`	`hw.gpu.throttle.reason=sync_boost`
`DCGM_FI_DEV_BOARD_LIMIT_VIOLATION`	`hw.gpu.throttle.duration`	`hw.gpu.throttle.reason=hw_slowdown`

DCGM_FI_DEV_LOW_UTIL_VIOLATION is intentionally not mapped: the upstream semconv proposal's hw.gpu.throttle.reason enum (thermal, power, sync_boost, hw_slowdown, sw_thermal, display_clock, app_clock_setting) has no value for an "idle / low-utilization" throttle. Mapping it to a value outside the proposal would create forward-incompat drift once the SIG resolves the vocabulary. Tracked at #272 for the upstream proposal extension.

Pattern #4 doc alerts on the reason=thermal row; the other reasons are diagnostic context (power correlates with PSU sag, hw_slowdown is the "GPU has decided to clock itself down" hard signal).

Pattern #5 — PCIe AER cascade

DCGM exposes per-direction PCIe byte counters whose rate collapses when the link renegotiates to a lower generation / width. Pattern #5 watches the rate divergence across the host's GPU set.

Raw DCGM series	OTel metric	Datapoint attributes (added)
`DCGM_FI_PROF_PCIE_TX_BYTES`	`hw.gpu.io` (Counter, unit `By`)	`network.io.direction=transmit`
`DCGM_FI_PROF_PCIE_RX_BYTES`	`hw.gpu.io` (Counter, unit `By`)	`network.io.direction=receive`

The pci_bus_id label is lifted to the resource-level hw.gpu.pci.bdf so pattern #5's escalation matrix can cross-reference dmesg PCIe Bus Error: Corrected lines against the same BDF without joining series. Pattern #5 doc shows the divergence query in PromQL form.

Pattern #10 — CUDA OOM (framebuffer)

DCGM exposes the per-GPU framebuffer state as two Gauge series in bytes. They are the proximate signal pattern #10 joins to a RuntimeError: CUDA out of memory log record so the detector can discriminate true-OOM (<5% free at fault time) from allocator fragmentation (≥5% free at fault time), matching the detector's at-or-above threshold comparison. The OTTL projection lands the customer-stable hw.gpu.memory.{used,free} shape on the metrics pipeline; the total = used + free derivation lives at the bridge layer below.

Raw DCGM series	OTel metric	Datapoint attributes (unchanged)
`DCGM_FI_DEV_FB_USED`	`hw.gpu.memory.used` (Gauge, unit `By`)	none — vendor's `gpu`/`UUID` labels lifted to resource by the section above
`DCGM_FI_DEV_FB_FREE`	`hw.gpu.memory.free` (Gauge, unit `By`)	none — same

hw.gpu.memory.total is intentionally NOT projected at the OTTL metric-statements layer. transformprocessor v0.130 operates on one datapoint at a time within a metric, so there is no cross-series arithmetic that could compute total = used + free on a metrics pipeline (upstream README). The total is computed at the metrics-to-logs bridge layer where the two scalars are already projected onto a single log record; see the bridge-contract section below.

Pattern #10 doc consumes the joined record via module/processor/patterndetectorprocessor/cuda_oom.go's projectFBMemoryRecord (gate: all three of hw.gpu.memory.free, hw.gpu.memory.total, and gpu.id on the same log record).

MIG caveat. On MIG-partitioned GPUs, DCGM_FI_DEV_FB_FREE reports the parent device, not the MIG slice. The detector spec (10-cuda-oom-deceptive.md §Edge cases) gates on hw.gpu.mig.enabled == true to skip MIG hosts until MIG-aware FB metrics are wired. The OTTL projection itself is MIG-safe — it just renames the parent-device series; the detector decides whether the renamed series is meaningful.

Pattern #2 — InfiniBand link flap

Source: node_exporter --collector.infiniband (the upstream Prometheus node-exporter infiniband collector which reads /sys/class/infiniband/<dev>/ports/<n>/phys_state and exposes the IBA-spec phys_state ID as an integer Gauge). Run the collector under tracecore's prometheusreceiver per RFC-0013 §2; the in-tree binary bundles prometheusreceiver so no extra component is required.

Raw node_exporter series	OTel metric	Datapoint attributes (added)
`node_infiniband_port_state_id{device, port}`	`hw.network.ib.port.state` (Gauge, IBA phys_state ID `1=Down` / `2=Init` / `3=Armed` / `4=Active`)	`hw.network.ib.device={device label}`, `hw.network.ib.port.num=Int({port label})`

The detector (module/processor/patterndetectorprocessor/ib_link_flap.go) reads these three attributes off a log record via port.Int() / state.Int() — the Int() cast on the port label is load-bearing because prometheusreceiver exposes Prometheus labels as strings while the projector calls Int() on the pdata Value. The companion series node_infiniband_state{state="<name>"} (string label) is intentionally not mapped: the detector compares against the patterns.IBPortState* integer constants, so the string variant would round-trip wrong.

The metric rename runs last so the where metric.name == "node_infiniband_port_state_id" guards on the attribute-stamp statements above still match the raw exporter name when each statement evaluates. Renaming first would short-circuit the attribute stamps because the second statement's guard would no longer find the original name.
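The ordering constraint can be demonstrated with a small simulation. This is not OTTL and not tracecore code: `apply` is a hypothetical stand-in for sequential statement evaluation, where each statement carries a guard on the metric name and runs only if the guard matches at that moment.

```python
RAW = "node_infiniband_port_state_id"


def apply(statements, metric):
    """Run (guard_name, action) pairs in order against one metric dict."""
    for guard_name, action in statements:
        if metric["name"] == guard_name:  # guard is evaluated at statement time
            action(metric)
    return metric


def stamp_device(m):
    m["attributes"]["hw.network.ib.device"] = m["labels"]["device"]


def stamp_port(m):
    m["attributes"]["hw.network.ib.port.num"] = int(m["labels"]["port"])


def rename(m):
    m["name"] = "hw.network.ib.port.state"


def sample():
    return {"name": RAW, "labels": {"device": "mlx5_0", "port": "1"}, "attributes": {}}


# Rename last: both attribute stamps still see the raw name.
rename_last = apply([(RAW, stamp_device), (RAW, stamp_port), (RAW, rename)], sample())
# Rename first: the later guards no longer match, so no attributes are stamped.
rename_first = apply([(RAW, rename), (RAW, stamp_device), (RAW, stamp_port)], sample())
```

Both runs end with the renamed metric, but only `rename_last` carries the two attributes the detector gates on.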

Pattern #2 doc consumes the joined record via projectIBPortStateRecord (gate: hw.network.ib.port.state AND hw.network.ib.device AND hw.network.ib.port.num on the same log record, plus k8s.node.name on the resource). The metrics→logs emit half follows the RFC-0014 pattern — the cuda_oom (#10) precedent ships an in-tree consumer via #437 / PR #461 (processors.metrics.Metrics + bounded cross-stream buffer); the IB link-flap metrics-path consumer is a pending sibling follow-up. The bridge log-record schema (load-bearing for both the OTTL recipe path and the future in-tree consumer) is pinned in the next section.

Metrics-to-logs bridge contract (patterns #2, #3, #4, #5, #10)

The pattern detectors at module/processor/patterndetectorprocessor read log records as their primary input (processor.WithLogs). The DCGM scrape recipe above produces metric datapoints. Bridging the two at the OTTL layer is upstream-blocked at OTel-contrib v0.130 — no contrib processor or connector emits log records from a metrics pipeline (per RFC-0014).

RFC-0014's resolution path adds processor.WithMetrics directly on patterndetectorprocessor so the processor consumes the metrics pipeline in-tree and joins it against the logs path via a bounded cross-stream buffer. The cuda_oom (#10) consumer shipped via #437 / PR #461; metrics-path consumers for the other patterns (#2 IB link flap, #3 HBM ECC, #4 thermal throttle, #5 PCIe AER) are pending sibling follow-ups. Until those land, the bridge attribute contract below is the load-bearing wire-format an OTTL recipe (when one becomes expressible) OR the in-tree consumer MUST honor — the detector projections gate on this contract today, and any emitter (in-tree consumer or future OTTL recipe) that stamps these attributes fires the pattern end-to-end without changing the detector library.

Pattern #3 — `hw.errors.delta` (issue #273)

The HBM ECC detector gates on a log record carrying:

Attribute	Type	Source
`hw.errors.delta`	int	per-scrape delta of `hw.errors` counter (= `increase(hw_errors[scrape_interval])`)
`gpu.id`	string	PCI BDF resource attr from the DCGM series
`hw.gpu.index`	int (optional)	NVML index from the DCGM series
`error.type`	string	`uncorrected` for the alert row
`error.subtype`	string	`double_bit` for the alert row
`error.persistence`	string	`volatile` for the alert row
`k8s.node.name`	string (resource)	stamped by `k8sattributesprocessor`

The metric datapoint attribute set from the transform/dcgm_to_hw_semconv stanza above already carries error.type / error.subtype / error.persistence on the renamed hw.errors Counter, so the future emitter passes those through unchanged; the only new field is the per-scrape hw.errors.delta integer.
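A per-scrape delta can be sketched as follows. The reset handling and the first-scrape behavior are assumptions chosen for illustration (the source only specifies "per-scrape delta"), and `per_scrape_delta` is a hypothetical name, not a function in the detector library.

```python
from typing import Optional


def per_scrape_delta(previous: Optional[float], current: float) -> int:
    """Increase of a monotonic counter between two consecutive scrapes.

    Assumptions (not from the source): the first scrape yields 0 because there
    is no baseline, and a drop in value is treated as a counter reset, so the
    post-reset value is reported as the increase.
    """
    if previous is None:
        return 0
    if current < previous:
        return int(current)
    return int(current - previous)
```

The result is an integer, matching the `int` type the contract pins for hw.errors.delta.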

Pattern #4 — `hw.gpu.throttle.duration.delta` (issue #282)

The thermal-throttle detector gates on a log record carrying:

Attribute	Type	Source
`hw.gpu.throttle.duration.delta`	int (seconds)	per-scrape delta of `hw.gpu.throttle.duration`
`hw.gpu.throttle.reason`	string	`thermal` for the alert row
`gpu.id`	string	PCI BDF resource attr
`hw.gpu.index`	int (optional)	NVML index
`k8s.node.name`	string (resource)	stamped by `k8sattributesprocessor`

Units pinned to integer seconds because projectThermalThrottleRecord at module/processor/patterndetectorprocessor/patterndetector.go multiplies the delta by time.Second — the wire format MUST agree.

Pattern #5 — `tracecore.alert.pcie_rate_collapse.*` (issue #284)

Layer 2 of the PCIe AER cascade detector gates on a log record carrying:

Attribute	Type	Source
`tracecore.alert.pcie_rate_collapse.bytes_per_second`	double	`rate(hw.gpu.io[5m])` per GPU
`tracecore.alert.pcie_rate_collapse.baseline_bytes_per_second`	double	`quantile(0.5, rate(...)) by (k8s.node.name)`
`tracecore.alert.pcie_rate_collapse.direction`	string	`transmit` / `receive` (falls back to `network.io.direction`)
`gpu.id`	string	PCI BDF resource attr
`k8s.node.name`	string (resource)	stamped by `k8sattributesprocessor`

Namespacing under tracecore.alert.pcie_rate_collapse.* keeps the bridge log shape distinguishable from raw hw.gpu.io scrape samples downstream. Layer 1 (journald-kernel AER stanza) is documented in journald-kernel.md and ships independently of this bridge.
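The two rate attributes can be derived as in the sketch below, which assumes the per-GPU rates for one node and one direction have already been computed upstream. `pcie_bridge_attributes` is a hypothetical helper; the median stands in for the `quantile(0.5, ...)` baseline in the table, and the result is keyed by gpu.id rather than modelling the resource/record split.

```python
from statistics import median


def pcie_bridge_attributes(rates_by_gpu: dict[str, float], direction: str) -> dict[str, dict]:
    """Build the pcie_rate_collapse attributes for every GPU on one node."""
    baseline = float(median(rates_by_gpu.values()))  # node-wide median rate
    prefix = "tracecore.alert.pcie_rate_collapse."
    return {
        gpu_id: {
            prefix + "bytes_per_second": float(rate),
            prefix + "baseline_bytes_per_second": baseline,
            prefix + "direction": direction,
        }
        for gpu_id, rate in rates_by_gpu.items()
    }
```

A GPU whose `bytes_per_second` sits far below the shared baseline is the divergence that Layer 2 looks for; the threshold itself lives in the detector, not in the bridge.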

Pattern #2 — `hw.network.ib.port.state` (issue #393)

The InfiniBand link-flap detector (module/processor/patterndetectorprocessor/ib_link_flap.go::projectIBPortStateRecord) gates on a log record carrying:

Attribute	Type	Source
`hw.network.ib.port.state`	int	last `node_infiniband_port_state_id` Gauge sample for the `(device, port)` tuple at bridge-emit time; IBA phys_state ID (`1=Down`, `2=Init`, `3=Armed`, `4=Active`)
`hw.network.ib.device`	string	`device` label on the source series (e.g. `mlx5_0`)
`hw.network.ib.port.num`	int	`port` label on the source series, cast via OTTL `Int()`
`k8s.node.name`	string (resource)	stamped by `k8sattributesprocessor` on the DaemonSet

The metric datapoint attribute set from the transform/ib_to_hw_semconv stanza above already carries hw.network.ib.device and hw.network.ib.port.num; the future emitter passes those through unchanged. The hw.network.ib.port.state integer lifts directly from the renamed Gauge's datapoint value (one log record per (device, port, scrape) — emit-once-per-state-transition is a detector-side optimization, not a bridge-side gate; the detector's patterns.IBLinkFlapDetector counts transitions internally).

Log-record schema (verdict-input)

# Bridge-emitted log record consumed by patterndetectorprocessor's
# ib_link_flap detector. One log record per (device, port, scrape) —
# the detector counts transitions across consecutive records.
resource:
  attributes:
    k8s.node.name: gpu-node-0007         # str — REQUIRED. Flap predicate is per-node.
log_record:
  timestamp: 2026-06-01T10:04:30Z        # MUST be the scrape timestamp.
  body: ""                               # ignored by the detector.
  attributes:
    hw.network.ib.port.state: 1          # int — REQUIRED. IBA phys_state ID; detector compares against patterns.IBPortState* constants.
    hw.network.ib.device: mlx5_0         # str — REQUIRED. Per-NIC device name; flap predicate is per-device.
    hw.network.ib.port.num: 1            # int — REQUIRED. Port index; flap predicate is per-port (a 2-port HCA tracks each port separately).

Detector consumption

projectIBPortStateRecord (at module/processor/patterndetectorprocessor/ib_link_flap.go) extracts the three scalars and builds a patterns.IBPortStateRecord. The detector emits one verdict per (k8s.node.name, hw.network.ib.device, hw.network.ib.port.num) tuple when transition count within ib_link_flap_window (default 2min) crosses ib_link_flap_min_transitions (default 2). The unit tests TestPatternDetector_IBLinkFlapWiring* pin the canonical wire format above against the live detector.
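The flap predicate can be approximated as below. This is a sketch, not the logic of patterns.IBLinkFlapDetector: the function names are hypothetical, and reading "crosses" as "greater than or equal to" is an assumption. The defaults mirror ib_link_flap_window and ib_link_flap_min_transitions.

```python
from datetime import datetime, timedelta


def count_transitions(samples, now, window=timedelta(minutes=2)):
    """Count state changes among samples that fall inside the window.

    samples: time-ordered (timestamp, state_id) pairs for one
    (k8s.node.name, hw.network.ib.device, hw.network.ib.port.num) tuple.
    """
    recent = [(ts, state) for ts, state in samples if now - ts <= window]
    return sum(1 for (_, a), (_, b) in zip(recent, recent[1:]) if a != b)


def is_flapping(samples, now, window=timedelta(minutes=2), min_transitions=2):
    # Assumption: the verdict fires at >= min_transitions.
    return count_transitions(samples, now, window) >= min_transitions
```

Because the bridge emits one record per scrape, a port that goes Active, Down, Active within three consecutive scrapes produces two transitions and meets the default threshold.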

Pattern #7 — `tracecore.alert.training_step_stalled.*` (issue #365)

The dataloader_hang detector's Layer 2 input is a training-step stall bridge log record derived from the trainer's gen_ai.training.step_duration_seconds Gauge. The detector (module/processor/patterndetectorprocessor/dataloader_hang.go::projectTrainingStepStallRecord) gates on a log record carrying:

Attribute	Type	Source
`tracecore.alert.training_step_stalled.no_progress_seconds`	int (seconds)	wall-clock duration since the last `gen_ai.training.step_duration_seconds` Gauge sample advanced; the bridge fires once when the value crosses `StallThreshold` (default 180s)
`tracecore.alert.training_step_stalled.last_step_ns`	int (optional, unix-ns)	timestamp of the last step-progress sample observed before the stall; falls back to the log record's Timestamp
`gen_ai.training.step`	int	last step index emitted by the trainer — the detector's warmup guard skips `step < 2`
`gen_ai.training.phase`	string (optional)	`train` / `eval` — the detector's eval-phase guard skips `phase == "eval"`
`k8s.pod.name`	string (resource)	training pod identity; stamped by `k8sattributesprocessor` on the trainer side
`k8s.namespace.name`	string (resource)	training pod namespace; same source
`k8s.node.name`	string (resource)	node hosting the training pod; same source

Namespacing under tracecore.alert.training_step_stalled.* keeps the bridge log shape distinguishable from raw gen_ai.training.step_duration_seconds Gauge samples downstream and mirrors the tracecore.alert.pcie_rate_collapse.* naming the pattern #5 bridge contract uses.

Unlike the DCGM-sourced bridges above, the input metric here is not an hw.* series; it is the upstream gen_ai.training.step_duration_seconds Gauge (per OTel GenAI semconv §Metrics, status: development at v0.130). Trainers emit this via OTel auto-instrumentation (opentelemetry-instrumentation-* for the framework in use) or an explicit Meter.create_gauge call inside the training loop. The Gauge can reach tracecore through a prometheusreceiver scrape of the trainer's metrics endpoint (Prometheus-format exposition) or through an OTLP push from the trainer pod; the same bridge attribute contract applies to both topologies.

Why the recipe doesn't ship a bridge stanza today

Same RFC-0014 block as patterns #3 / #4 / #5 / #10: OTTL metric_statements cannot reference log.* paths at OTel-contrib v0.130, and no contrib connector emits log records from a metrics pipeline. The resolution path is either (a) an in-tree processor.WithMetrics extension to patterndetectorprocessor (tracked under #260; cuda_oom #10 shipped via #437 / PR #461; pattern #7 piggybacks on the same plumbing because the attribute contract is purely a wire-format pin) or (b) an upstream metricthresholdconnector contribution.

Log-record schema (verdict-input)

# Bridge-emitted log record consumed by patterndetectorprocessor's
# dataloader_hang detector. One log record per (training pod, stall
# crossing) — NOT one per Gauge sample.
resource:
  attributes:
    k8s.pod.name: trainer-rank-3         # str  — REQUIRED. Pod-scoped join key.
    k8s.namespace.name: training         # str  — required (verdict carries it).
    k8s.node.name: gpu-node-0007         # str  — REQUIRED. Node-scoped storage-event join key.
log_record:
  timestamp: 2026-06-01T10:04:30Z        # MUST be the stall-detection time (last_step + no_progress), not now()
  body: ""                               # ignored by the detector
  attributes:
    tracecore.alert.training_step_stalled.no_progress_seconds: 240   # int (seconds) — REQUIRED. Gate the detector reads.
    tracecore.alert.training_step_stalled.last_step_ns: 1780308030000000000  # int (unix-ns), optional; sharpens the verdict timestamp (2026-06-01T10:00:30Z here).
    gen_ai.training.step: 42             # int — REQUIRED for the warmup guard (step >= 2).
    gen_ai.training.phase: train         # str — optional; "eval" gets skipped by the eval-phase guard.

Detector consumption

projectTrainingStepStallRecord (at module/processor/patterndetectorprocessor/projectors_shared.go) extracts the four required scalars and builds a patterns.TrainingStepStallRecord. The dataloader_hang detector joins each stall against a same-pod dataloader.error_class log record OR a same-node FailedMount / VolumeMountFailure Kubernetes Event within DataLoaderHangCorrelationWindow (default 5min); without a discriminator, no verdict fires (spec §"Detector evaluation rule" — stalls alone are not a hang because patterns #6 stragglers and #11 checkpointer also stall steps). The unit tests TestPatternDetector_DataLoaderHangWiring* (module/processor/patterndetectorprocessor/dataloader_hang_test.go) pin the canonical wire format above against the live detector.
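The guards and the discriminator join can be sketched as follows. The names are hypothetical, and two points are assumptions rather than documented behavior: a record without gen_ai.training.step is treated as warmup, and the correlation window is applied symmetrically around the stall timestamp.

```python
from datetime import datetime, timedelta

STALL_KEY = "tracecore.alert.training_step_stalled.no_progress_seconds"


def passes_guards(attrs: dict) -> bool:
    """Apply the warmup and eval-phase guards to one bridge log record."""
    if STALL_KEY not in attrs:
        return False  # the gate attribute is required
    if attrs.get("gen_ai.training.step", 0) < 2:
        return False  # warmup guard; a missing step is treated as warmup (assumption)
    if attrs.get("gen_ai.training.phase") == "eval":
        return False  # eval-phase guard
    return True


def has_discriminator(stall_ts, discriminator_timestamps, window=timedelta(minutes=5)):
    """True if any dataloader error or mount-failure event falls within the window."""
    return any(abs(stall_ts - ts) <= window for ts in discriminator_timestamps)
```

A verdict corresponds to both functions returning true for the same stall; a stall that passes the guards but has no discriminator stays silent, as the evaluation rule requires.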

Pattern #10 — `hw.gpu.memory.{free,total}` (issue #337)

Status: shipped via #437 / PR #461. patterndetectorprocessor now additionally implements processors.metrics.Metrics (ADR-0001 PR-B): the metrics-path consumer projects hw.gpu.memory.{free,total} pmetric.NumberDataPoints directly into patterns.FBMemoryRecord values, buffers them in a bounded ring keyed on processor component.ID, and the logs-path consumer drains the buffer at CUDA OOM-log time. Operators running dcgm-exporter into the metrics pipeline get full-confidence verdicts WITHOUT configuring the metrics→logs OTTL recipe below. The log-record schema below remains the load-bearing wire contract for the OTTL recipe path (the alternative when an operator can't run the in-tree consumer); the detector (module/processor/patterndetectorprocessor/cuda_oom.go::projectFBMemoryRecord) expects this exact shape and both paths converge on it.

The CUDA OOM detector gates on a log record carrying:

Attribute	Type	Source
`hw.gpu.memory.free`	int (bytes)	last `DCGM_FI_DEV_FB_FREE` Gauge sample for the GPU at bridge-emit time
`hw.gpu.memory.total`	int (bytes)	`DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE` at bridge-emit time (per-GPU sum within one scrape)
`gpu.id`	string	PCI BDF resource attr from the DCGM series (or `hw.gpu.pci.bdf` fallback)
`k8s.node.name`	string (resource)	stamped by `k8sattributesprocessor`

The bridge MUST stamp BOTH hw.gpu.memory.free AND hw.gpu.memory.total on the same log record — the detector's projectFBMemoryRecord gate at module/processor/patterndetectorprocessor/cuda_oom.go short-circuits if either is missing (the fragmentation discriminator needs both numerator and denominator joined to the same GPU / same scrape). One log record per (GPU, scrape) is the load-bearing shape: per-attribute records emitted on separate logs would defeat the join.
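The one-record-per-(GPU, scrape) join can be illustrated with the sketch below. `bridge_records` is a hypothetical function operating on plain tuples rather than pdata, and skipping incomplete scrapes is an assumption about what an emitter should do when one of the two series is missing.

```python
def bridge_records(samples):
    """Join used/free samples into one record per (GPU, scrape).

    samples: iterable of (gpu_id, scrape_ts, metric_name, value_bytes) where
    metric_name is hw.gpu.memory.used or hw.gpu.memory.free.
    """
    grouped = {}
    for gpu_id, ts, name, value in samples:
        grouped.setdefault((gpu_id, ts), {})[name] = int(value)

    records = []
    for (gpu_id, ts), vals in sorted(grouped.items()):
        if "hw.gpu.memory.used" not in vals or "hw.gpu.memory.free" not in vals:
            continue  # assumption: skip rather than emit a partial record
        used, free = vals["hw.gpu.memory.used"], vals["hw.gpu.memory.free"]
        records.append({
            "timestamp": ts,  # the scrape timestamp, not the emit time
            "gpu.id": gpu_id,
            "hw.gpu.memory.free": free,
            "hw.gpu.memory.used": used,
            "hw.gpu.memory.total": used + free,
        })
    return records
```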

Log-record schema (verdict-input)

# Bridge-emitted log record consumed by patterndetectorprocessor's
# cuda_oom detector. Emitted once per (GPU, scrape) — NOT once per
# (GPU, attribute). Field types match the OTel pdata stamps the
# detector reads (Int / Str).
resource:
  attributes:
    k8s.node.name: gpu-node-0001         # str  — stamped by k8sattributesprocessor
    hw.id: GPU-3a4b...                   # str  — NVML UUID (optional join key)
    hw.gpu.pci.bdf: 0000:3b:00.0         # str  — PCI BDF (optional; gpu.id below is the cross-signal join)
log_record:
  timestamp: 2026-06-01T10:00:00Z        # MUST be the scrape timestamp, not now()
  body: ""                               # ignored by the detector — keep empty or set to a debug-friendly summary
  attributes:
    gpu.id: PCI:0000:3b:00               # str  — REQUIRED. Cross-signal join key (same shape as the OOM log record).
    hw.gpu.memory.free: 17179869184      # int  — REQUIRED. Bytes free on the GPU at the scrape.
    hw.gpu.memory.total: 85899345920     # int  — REQUIRED. Bytes total on the GPU (= used + free at the scrape).
    hw.gpu.memory.used: 68719476736      # int  — optional, evidence context. NOT gated on by the detector.
    hw.gpu.index: 3                      # int  — optional, evidence context.
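
To illustrate the one-record-per-(GPU, scrape) shape, here is a small Python sketch of the join a bridge would have to perform. It assumes samples arrive as `(gpu_id, scrape_ts, metric_name, value_bytes)` tuples already converted to bytes; the function name `build_bridge_records` and the tuple layout are hypothetical and not part of Tracecore.

```python
from collections import defaultdict


def build_bridge_records(samples):
    """Join FB_FREE and FB_USED samples into one record per (GPU, scrape)."""
    grouped = defaultdict(dict)
    for gpu_id, scrape_ts, name, value in samples:
        grouped[(gpu_id, scrape_ts)][name] = value

    records = []
    for gpu_id, scrape_ts in sorted(grouped):
        fields = grouped[(gpu_id, scrape_ts)]
        free = fields.get("DCGM_FI_DEV_FB_FREE")
        used = fields.get("DCGM_FI_DEV_FB_USED")
        if free is None or used is None:
            # Without both halves the total cannot be computed; skip rather than guess.
            continue
        records.append({
            "timestamp": scrape_ts,  # the scrape timestamp, not the current time
            "gpu.id": gpu_id,
            "hw.gpu.memory.free": free,
            "hw.gpu.memory.used": used,
            "hw.gpu.memory.total": used + free,
        })
    return records


# Usage: the first GPU has both samples; the second is missing FB_USED and is skipped.
ts = "2026-06-01T10:00:00Z"
records = build_bridge_records([
    ("PCI:0000:3b:00", ts, "DCGM_FI_DEV_FB_FREE", 17179869184),
    ("PCI:0000:3b:00", ts, "DCGM_FI_DEV_FB_USED", 68719476736),
    ("PCI:0000:5e:00", ts, "DCGM_FI_DEV_FB_FREE", 1024),
])
```

Grouping on the (GPU, scrape) pair before emitting is what guarantees that free and total on a record come from the same scrape.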

Detector consumption

projectFBMemoryRecord (at module/processor/patterndetectorprocessor/cuda_oom.go:114) extracts the three required attributes (gpu.id, hw.gpu.memory.free, hw.gpu.memory.total) and builds a patterns.FBMemoryRecord. The free-ratio (FreeBytes / TotalBytes) is then compared against cuda_oom_fb_free_fragmentation_threshold (default 0.05): a ratio at or above the threshold sets cuda_oom.kind=fragmentation, because the allocation failed while memory was still free, and a ratio below it sets cuda_oom.kind=true_oom. The unit test TestPatternDetector_CUDAOOMWiringEmitsFragmentationVerdict pins the canonical wire format above against the live detector.
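
The threshold comparison can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the Go code in cuda_oom.go: the function name `classify_cuda_oom` is hypothetical, and the guard against a non-positive total is an assumption added to keep the sketch safe.

```python
def classify_cuda_oom(free_bytes, total_bytes, threshold=0.05):
    """Return the cuda_oom.kind implied by the free-ratio, or None if unusable."""
    if free_bytes is None or total_bytes is None:
        return None  # mirrors the short-circuit when either attribute is missing
    if total_bytes <= 0:
        return None  # assumption: a non-positive total cannot yield a ratio
    free_ratio = free_bytes / total_bytes
    return "fragmentation" if free_ratio >= threshold else "true_oom"


GIB = 1024 ** 3
# Values from the schema example: 16 GiB free of 80 GiB is a ratio of 0.2.
schema_example = classify_cuda_oom(16 * GIB, 80 * GIB)
# 1 GiB free of 80 GiB is a ratio of 0.0125, below the default threshold.
nearly_full = classify_cuda_oom(1 * GIB, 80 * GIB)
missing_field = classify_cuda_oom(None, 80 * GIB)
```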

Why the recipe doesn't ship a bridge stanza today

OTTL metric_statements cannot reference log.* paths at v0.130 (upstream README). Connectors that change signal type all emit metrics, not logs (countconnector, signaltometricsconnector, spanmetricsconnector). Per RFC-0014 the resolution path is either (a) an in-tree processor.WithMetrics extension to patterndetectorprocessor — shipped for cuda_oom (#10) via #437 / PR #461, with sibling consumers for patterns #2 / #3 / #4 / #5 pending under #260 — or (b) an upstream metricthresholdconnector contribution. The contract above stays stable across either resolution.

`[[adopt-over-build]]` posture

Every statement in transform/dcgm_to_hw_semconv uses upstream OTTL functions only: set, IsMatch, ExtractPatterns, Int. No new OTTL functions are introduced. If a future series cannot be projected with the existing function set, the right response is to propose the missing function upstream to OTel contrib — not to ship a tracecore-specific OTTL extension.

Identity-conflict caveat

OTTL set(metric.name, ...) renames a metric in place; the v0.130 README warns that "Transformation of metrics have the potential to affect the identity of a metric leading to an Identity Crisis." For this recipe the conflict is intentional: 36 input series (18 NVLink links × 2 directions) collapse into one output metric named hw.gpu.nvlink.io with distinct attribute sets per datapoint. Downstream OTel + Prometheus backends merge by (metric.name, attributes) and produce the expected per-link / per-direction series. If your backend rejects the merged shape, follow the upstream guidance to apply the rename inside a separate statement group from any other identity-affecting operation; this recipe already isolates the rename inside its own processor (transform/dcgm_to_hw_semconv).
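
The collapse can be checked with a short Python sketch. It assumes the per-link inputs are named `DCGM_FI_PROF_NVLINK_L<n>_{TX,RX}_BYTES`, and the output attribute keys `link` and `direction` are hypothetical stand-ins for whatever keys the recipe stamps; only the output name hw.gpu.nvlink.io comes from this document.

```python
import re

# Assumed input naming; adjust to the series your exporter actually emits.
_NVLINK = re.compile(r"^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$")


def project_nvlink(name):
    """Map one per-link series name to (metric name, attributes), or None."""
    m = _NVLINK.match(name)
    if m is None:
        return None
    direction = "transmit" if m.group(2) == "TX" else "receive"
    # Hypothetical attribute keys, used here only to show the identity split.
    return "hw.gpu.nvlink.io", {"link": int(m.group(1)), "direction": direction}


inputs = [f"DCGM_FI_PROF_NVLINK_L{i}_{d}_BYTES" for i in range(18) for d in ("TX", "RX")]
projected = [project_nvlink(n) for n in inputs]
names = {name for name, _ in projected}
identities = {(name, tuple(sorted(attrs.items()))) for name, attrs in projected}
```

All inputs share one output name, yet every (name, attributes) pair stays unique, which is why backends that merge on that pair still see one series per link and direction.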

Placeholders

| Placeholder | What to fill in |
| --- | --- |
| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/metrics` is appended automatically per the OTLP/HTTP spec. |
| `REPLACE_WITH_DCGM_EXPORTER_TARGET` | `localhost:9400` for a DaemonSet shape, or the dcgm-exporter Service DNS (`dcgm-exporter.kube-system.svc:9400`) for a Deployment shape. |

Tracecore does not expand environment variables in YAML. Render the literals at deploy time via envsubst, a Helm template, or a Kubernetes secret-injection driver. The :port suffix is mandatory — prometheusreceiver rejects bare hostnames at validate.
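
As a rough illustration of deploy-time rendering, the Python sketch below substitutes the placeholders and refuses to return a config that still contains one. The helper names `render` and `has_port` are hypothetical, the sink URL is an invented example, and the port check is a simplification that handles hostnames and IPv4 addresses only; envsubst or a Helm template does the same job in practice.

```python
import re

_HOST_PORT = re.compile(r"^[^\s:/]+:\d{1,5}$")


def render(template, values):
    """Substitute REPLACE_WITH_* placeholders and reject anything left unrendered."""
    rendered = template
    for key, value in values.items():
        rendered = rendered.replace(key, value)
    leftover = re.findall(r"REPLACE_WITH_[A-Z0-9_]+", rendered)
    if leftover:
        raise ValueError(f"unrendered placeholders: {sorted(set(leftover))}")
    return rendered


def has_port(target):
    """True when the scrape target carries the mandatory :port suffix."""
    return _HOST_PORT.match(target) is not None


# Usage with an invented sink URL and the DaemonSet-shape target.
template = "endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT\ntargets: [REPLACE_WITH_DCGM_EXPORTER_TARGET]\n"
rendered = render(template, {
    "REPLACE_WITH_OTLP_HTTP_ENDPOINT": "https://otlp.example.internal:4318",
    "REPLACE_WITH_DCGM_EXPORTER_TARGET": "localhost:9400",
})
```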

Failure modes

| Symptom | First check |
| --- | --- |
| `scrape_configs.targets[0]: address ... incorrect` at validate | The target still carries the literal `REPLACE_WITH_DCGM_EXPORTER_TARGET` placeholder; the validator now rejects literal placeholders that look like hostnames. Render at deploy time. |
| Scrape returns 200 but no metrics flow | `prometheusreceiver` requires the response to be in Prometheus text exposition format. A response in OTLP-JSON or a vendor-proprietary format is silently dropped. Curl the endpoint and confirm the first line starts with `# HELP`. |
| `gpu.vendor` empty on a known DCGM target | The exporter is on an old release that emits the legacy `dcgm_` prefix (lowercase). Either upgrade the exporter to a `DCGM_`-emitting build or extend the OTTL regex to `^[Dd][Cc][Gg][Mm]_`. |
| `cardinality limit exceeded` from the backend | `prometheusreceiver` does not cap series. Add a `filterprocessor` between `prometheus` and `transform/gpu_vendor` to drop metrics you don't query. Limit dcgm-exporter's `--collectors` input to the families you alert on. |
| Bearer-token target returns 401 | The ServiceAccount is not bound to a role that grants access to the target. For Kueue, the SA needs `nonResourceURLs: ["/metrics"] verbs: ["get"]` via a ClusterRoleBinding. |
| dcgm-exporter pod `CrashLoopBackOff` with `Failed to initialize NVML: Driver/library version mismatch` | The host's NVIDIA driver is older (or newer) than the DCGM library bundled into the dcgm-exporter image. `kubectl -n gpu-monitoring logs -l app.kubernetes.io/name=dcgm-exporter --tail=20` confirms. Align by upgrading the driver via the NVIDIA GPU Operator or `nvidia-driver-daemonset`, or by pinning the chart to a dcgm-exporter image tag that matches your driver minor version. |
| dcgm-exporter pod `CrashLoopBackOff` with `Failed to initialize NVML: Driver Not Loaded` | No NVIDIA driver is installed on the host, which usually means the DaemonSet was scheduled onto a non-GPU node. Tighten `nodeSelector` to a GPU-only label (e.g. `nvidia.com/gpu.present: "true"` from NVIDIA's node-feature-discovery, or your cluster's equivalent). If the GPU nodes themselves lack a driver, install one; the chart cannot start without it. |
| dcgm-exporter pod `Running` but `/metrics` returns 500 / hangs | DCGM cannot reach NVML, usually because the `nvidia-container-toolkit` runtime is not configured (container has no `/dev/nvidiactl`). Verify with `kubectl -n gpu-monitoring exec <pod> -- ls /dev/nvidia*`. The runtime must be the NVIDIA container runtime; see the Quickstart on Kubernetes prerequisites. |
| dcgm-exporter pod `Pending` with `forbidden: ... configmaps` event | The chart's `Role` + `RoleBinding` (`dcgm-exporter-read-cm`) was disabled or the ServiceAccount lost its binding. Re-render with the defaults (`rbac.create=true`, `serviceAccount.create=true`); the pod must be able to read the `exporter-metrics-config-map` to load `default-counters.csv`. |
| `prometheusreceiver` logs `context deadline exceeded` while scraping dcgm-exporter | dcgm-exporter's first scrape after startup can take >10s on hosts with many GPUs because DCGM has to initialize per-device watches. Either raise the recipe's `scrape_timeout` (above) past `15s`, or set the chart's `arguments` to include `-d g` (GPUs only, no GPU-instances enumeration) to shrink the field watch set. The `scrape_interval` should remain ≥ `scrape_timeout` to avoid overlapping scrapes. |
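
The last row's timing rule (keep `scrape_interval` at or above `scrape_timeout`) is easy to check before deploying. The Python sketch below is a hypothetical pre-flight helper, not a Tracecore or Prometheus API, and it only understands single-unit durations such as `15s` or `1m`.

```python
import re

_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Parse a single-unit duration such as '15s' or '1m' into seconds."""
    m = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if m is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]


def timing_ok(scrape_interval, scrape_timeout):
    """True when the timeout does not exceed the interval."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)


# Usage: raising the timeout to 20s is fine at a 30s interval, not at 15s.
raised_timeout_ok = timing_ok("30s", "20s")
overlapping = timing_ok("15s", "20s")
```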

Upstream component docs: receiver/prometheusreceiver, processor/transformprocessor, processor/filterprocessor.
