Prometheus scrape via prometheusreceiver

Tracecore scrapes Prometheus-format endpoints via the upstream prometheusreceiver. This is the adoption shape for every vendor GPU exporter per RFC-0013 §2 (Adoption matrix): NVIDIA dcgm-exporter, AMD ROCm/device-metrics-exporter, Intel intel/xpumanager, Habana Prometheus Metric Exporter — and for the Kueue scheduler's metrics endpoint. Replaces the in-tree dcgm and kueue receivers per RFC-0013 §7 (Deletion list — v0.1.0).

Three OTTL transform processors run in series over the scraped metrics:

  1. transform/gpu_vendor stamps the customer-stable gpu.vendor resource attribute (RFC-0013 §3) so dashboards survive a future swap between vendor exporters.
  2. transform/dcgm_to_hw_semconv projects the raw DCGM_FI_* namespace onto the customer-stable hw.gpu.* / hw.errors namespace declared in docs/proposals/semconv-hw-gpu-extensions.md so the next-cycle pattern detectors (issue #260 patterns #1 NVLink, #3 HBM ECC, #4 thermal throttle, #5 PCIe AER, #10 CUDA OOM) read one vendor-neutral wire format. Per docs/rfcs/0014-metrics-to-logs-pattern-input.md the verdict-emission half extends patterndetectorprocessor with processor.WithMetrics — shipped for cuda_oom (#10) via #437 / PR #461, with sibling consumers for patterns #1 / #3 / #4 / #5 pending under #260. The transform below is the load-bearing wire-format contract that the metrics-path consumer reads.
  3. transform/ib_to_hw_semconv projects node_exporter --collector.infiniband's node_infiniband_port_state_id onto the customer-stable hw.network.ib.* namespace (docs/ATTRIBUTES.md §hw.network.*, alpha) so pattern #2's link-flap detector reads the same vendor-neutral shape whether the underlying source is node_exporter, a Mellanox-specific exporter, or journald-kernel.md's mlx5_core stream. Same RFC-0014 metrics-path consumer dependency as the DCGM transform (cuda_oom #10 shipped via PR #461; IB link-flap sibling consumer pending).

Config

# docs/integrations/examples/prometheus-scrape.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter
          scrape_interval: 15s
          scrape_timeout: 10s
          metrics_path: /metrics
          fallback_scrape_protocol: PrometheusText1.0.0
          static_configs:
            - targets:
                - REPLACE_WITH_DCGM_EXPORTER_TARGET

processors:
  transform/gpu_vendor:
    metric_statements:
      - context: datapoint
        statements:
          - set(resource.attributes["gpu.vendor"], "nvidia") where IsMatch(metric.name, "^DCGM_")
          - set(resource.attributes["gpu.vendor"], "amd") where IsMatch(metric.name, "^amdsmi_")
          - set(resource.attributes["gpu.vendor"], "intel") where IsMatch(metric.name, "^xpum_")
          - set(resource.attributes["gpu.vendor"], "habana") where IsMatch(metric.name, "^habanalabs_")
  batch:
    send_batch_size: 8192
    timeout: 10s

exporters:
  otlphttp:
    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
    compression: gzip
    timeout: 10s

service:
  pipelines:
    metrics/scrape:
      receivers: [prometheus]
      processors: [transform/gpu_vendor, batch]
      exporters: [otlphttp]

Validate with the in-tree binary:

./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml

Exit 0 means the config parses, every scrape target URL is well-formed, and the OTTL statements type-check against the metric-datapoint context.

Install dcgm-exporter

The recipe above scrapes NVIDIA's upstream dcgm-exporter (Apache-2.0). Install it from the canonical Helm repo:

helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-monitoring --create-namespace \
  -f dcgm-exporter-values.yaml

The minimal dcgm-exporter-values.yaml that pairs with tracecore's prometheusreceiver scrape (no Prometheus Operator dependency) is:

# dcgm-exporter-values.yaml — minimal overlay for tracecore scrape
serviceMonitor:
  # tracecore scrapes via prometheusreceiver, not Prometheus Operator
  enabled: false
service:
  type: ClusterIP
  port: 9400
nodeSelector:
  # restrict the DaemonSet to GPU nodes; pair with NVIDIA's node-feature-
  # discovery or device-plugin label. Use the label your cluster stamps.
  nvidia.com/gpu.present: "true"
arguments:
  - "-f"
  - "/etc/dcgm-exporter/default-counters.csv"

To enable the pattern #1 NVLink series (commented out in upstream default-counters.csv — see the NVLink section below), mount a custom counters ConfigMap and point arguments at it via the -m flag per the chart's values.yaml documentation.

The chart renders a ServiceAccount, ConfigMap (default counters), Role + RoleBinding (read the ConfigMap), Service (ClusterIP on :9400), and DaemonSet (containerPort 9400, app.kubernetes.io/name=dcgm-exporter). The Service DNS name is dcgm-exporter.gpu-monitoring.svc; per-pod IPs live behind the app.kubernetes.io/name=dcgm-exporter selector.

Verify dcgm-exporter is scrapable

Confirm the exporter is healthy before pointing tracecore at it. Port-forward one pod and curl /metrics:

kubectl -n gpu-monitoring port-forward \
  $(kubectl -n gpu-monitoring get pod \
      -l app.kubernetes.io/name=dcgm-exporter \
      -o jsonpath='{.items[0].metadata.name}') \
  9400:9400 &
curl -sf http://localhost:9400/metrics | head -5

A healthy response begins with # HELP DCGM_FI_... and # TYPE ... lines in Prometheus text exposition format. Pattern-relevant prefixes to confirm are present:

| Prefix | Patterns | Needs custom counters CSV? |
| --- | --- | --- |
| DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL | #3 HBM ECC | No |
| DCGM_FI_DEV_*_VIOLATION | #4 thermal throttle | No |
| DCGM_FI_DEV_FB_{USED,FREE} | #10 CUDA OOM | No |
| DCGM_FI_PROF_PCIE_{TX,RX}_BYTES | #5 PCIe AER | No (profiling enabled by default) |
| DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES | #1 NVLink | Yes; see NVLink section below |

grep for the families you intend to alert on; a missing prefix means the corresponding pattern will never fire even if the recipe's OTTL stanza compiles cleanly. Then point tracecore's REPLACE_WITH_DCGM_EXPORTER_TARGET at either localhost:9400 (per-node DaemonSet shape) or dcgm-exporter.gpu-monitoring.svc:9400 (Deployment shape) and validate per the config section above.

Deployment shape

The right Kubernetes shape depends on the scrape target:

  • Per-node targets (NVIDIA dcgm-exporter, AMD/Intel/Habana per-node exporters): run tracecore as a DaemonSet and scrape localhost:<port> so each node's exporter is read by the tracecore pod on the same node. No cluster-wide service discovery required.
  • Cluster-scoped targets (Kueue's controller-manager metrics endpoint, single-replica vendor exporters): run tracecore as a single-replica Deployment and scrape the target's Service. Pair with kubernetes_sd_configs: if the target moves between pods on re-roll; for a stable Service ClusterIP, static_configs: is enough.

Adding authenticated targets (Kueue, vendor exporters)

The example scrapes a static unauthenticated target. For Kueue's controller-manager metrics endpoint (TLS + serviceaccount-token bearer):

        - job_name: kueue
          scheme: https
          scrape_interval: 30s
          metrics_path: /metrics
          fallback_scrape_protocol: PrometheusText1.0.0
          authorization:
            type: Bearer
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            server_name: kueue-controller-manager-metrics-service.kueue-system.svc
          static_configs:
            - targets:
                - kueue-controller-manager-metrics-service.kueue-system.svc:8443

Adjust server_name to match the Service's DNS name. The credentials_file path is the default ServiceAccount projected-token mount; if you use a custom token volume, update the path.

gpu.vendor resource-attribute mapping

The OTTL transform routes to a vendor tag based on the metric-name prefix each upstream exporter uses:

| metric.name prefix | gpu.vendor | Upstream exporter |
| --- | --- | --- |
| DCGM_* | nvidia | NVIDIA/dcgm-exporter |
| amdsmi_* | amd | ROCm/device-metrics-exporter |
| xpum_* | intel | intel/xpumanager |
| habanalabs_* | habana | Habana Prometheus Metric Exporter |

The tag is customer-stable under the RFC-0013 §3 contract, so existing dashboards keyed on gpu.vendor continue to work after a vendor swap.
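The prefix routing in the table can be modelled outside the collector for quick sanity checks. The following is a minimal Python sketch, not the OTTL the processor runs: the rule list mirrors the four statements in transform/gpu_vendor, while the names VENDOR_RULES and gpu_vendor are hypothetical and exist only for illustration.

```python
import re
from typing import Optional

# Prefix -> gpu.vendor, in the same order as the transform/gpu_vendor statements.
VENDOR_RULES = [
    (re.compile(r"^DCGM_"), "nvidia"),
    (re.compile(r"^amdsmi_"), "amd"),
    (re.compile(r"^xpum_"), "intel"),
    (re.compile(r"^habanalabs_"), "habana"),
]


def gpu_vendor(metric_name: str) -> Optional[str]:
    """Return the gpu.vendor tag for a metric name, or None if no rule matches."""
    vendor = None
    for pattern, tag in VENDOR_RULES:
        if pattern.search(metric_name):
            # Statements run in sequence, so a later match would overwrite an
            # earlier one. The four prefixes are disjoint, so at most one fires.
            vendor = tag
    return vendor
```

A metric that matches none of the prefixes (for example a Kueue or node_exporter series) gets no tag in this model.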

DCGM_FI_* → hw.gpu.* semconv projection

The second OTTL transform (transform/dcgm_to_hw_semconv in the example YAML) projects every load-bearing DCGM_FI_* series into the customer-stable namespace from docs/proposals/semconv-hw-gpu-extensions.md. The contract is one-direction: a downstream consumer reads only hw.gpu.* / hw.errors, never the raw DCGM names. Per RFC-0014 the pattern detectors built on top of this namespace (issue #260) land as a processor.WithMetrics extension to patterndetectorprocessor — not as an OTTL metrics-to-logs emitter — because OTel-contrib transformprocessor v0.130 cannot emit log records from a metrics pipeline. The cuda_oom (#10) consumer shipped via #437 / PR #461; siblings for #1 / #3 / #4 / #5 are pending under #260.

Resource attribution

dcgm-exporter stamps two cross-version label flavors per series: UUID / gpu_uuid for the NVML UUID, and gpu / GPU for the NVML index. The transform maps either onto the customer-stable resource attribute:

| dcgm-exporter label | Resource attribute | Notes |
| --- | --- | --- |
| UUID or gpu_uuid | hw.id | NVML UUID; durable join key. The transform prefers UUID and falls back to gpu_uuid when only the legacy label is present. |
| gpu or GPU | hw.gpu.index | NVML index; volatile across reboots. Same dual-label preference. |
| (computed) | hw.type = "gpu" | Stamped on every DCGM_* series; gates downstream hw.* filters against future non-GPU hw.* sources. |
| pci_bus_id | hw.gpu.pci.bdf | PCI bus-device-function; lifted only on DCGM_FI_PROF_PCIE_{TX,RX}_BYTES series so pattern #5's escalation can cross-reference dmesg AER lines on the same BDF. |

Pattern #1 — NVLink degradation

Per-link Tx/Rx Counter. Each DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES series collapses into one metric name with the link index lifted into a datapoint attribute via OTTL ExtractPatterns:

| Raw DCGM series | OTel metric | Datapoint attributes (added) |
| --- | --- | --- |
| DCGM_FI_PROF_NVLINK_L{N}_TX_BYTES (N ∈ 0..17) | hw.gpu.nvlink.io (Counter, unit By) | hw.gpu.nvlink.link={N}, network.io.direction=transmit |
| DCGM_FI_PROF_NVLINK_L{N}_RX_BYTES (N ∈ 0..17) | hw.gpu.nvlink.io (Counter, unit By) | hw.gpu.nvlink.link={N}, network.io.direction=receive |

The link index lift uses Int(ExtractPatterns(metric.name, "^DCGM_FI_PROF_NVLINK_L(?P<link>\\d+)_(TX|RX)_BYTES$")["link"]) so the resulting attribute is integer-typed (matches the semconv proposal's hw.gpu.nvlink.link: int). Per-link decomposition is the diagnostic-critical surface for pattern #1 silent NVLink degradation; without it the alert query has no group-by axis.
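To check the regex against sample metric names without running a collector, the lift can be reproduced in Python, whose re module accepts the same (?P<link>...) named-group syntax. This is a sketch under assumptions: project_nvlink is a hypothetical helper, and the named dir group is added here for convenience (the pattern quoted above leaves the TX|RX group unnamed).

```python
import re
from typing import Optional, Tuple

# Same pattern as the OTTL ExtractPatterns call, with the direction group
# named "dir" for this illustration only.
NVLINK_RE = re.compile(r"^DCGM_FI_PROF_NVLINK_L(?P<link>\d+)_(?P<dir>TX|RX)_BYTES$")
DIRECTION = {"TX": "transmit", "RX": "receive"}


def project_nvlink(metric_name: str) -> Optional[Tuple[str, dict]]:
    """Map a raw per-link series name to (OTel metric name, added attributes)."""
    match = NVLINK_RE.match(metric_name)
    if match is None:
        return None  # not an NVLink byte counter; leave the series untouched
    attributes = {
        # int() mirrors the OTTL Int() cast so the attribute is integer-typed.
        "hw.gpu.nvlink.link": int(match.group("link")),
        "network.io.direction": DIRECTION[match.group("dir")],
    }
    return "hw.gpu.nvlink.io", attributes
```

Every matching input name maps to the same output metric name; only the two attributes distinguish the resulting datapoints.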

dcgm-exporter opt-in required. The DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES field IDs (1040..1075) are commented out in dcgm-exporter's upstream default-counters.csv. Operators must mount a custom counters ConfigMap and pass it via the chart's -m <ns>:<configmap> flag (or set arguments[1]=-f=/etc/dcgm-exporter/custom-counters.csv and a matching extraVolumes entry). Without this, the recipe compiles cleanly but emits zero hw.gpu.nvlink.io series — pattern #1 will never fire.

Pattern #3 — Uncorrectable HBM ECC

ECC counters expand into four series (correctable / uncorrectable × volatile / aggregate). Pattern #3 alerts on the uncorrectable volatile row; the rest are evidence context that the runbook references.

| Raw DCGM series | OTel metric | Datapoint attributes (added) |
| --- | --- | --- |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | hw.errors (Counter, unit {error}) | error.type=uncorrected, error.subtype=double_bit, error.persistence=volatile |
| DCGM_FI_DEV_ECC_DBE_AGG_TOTAL | hw.errors | error.type=uncorrected, error.subtype=double_bit, error.persistence=aggregate |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | hw.errors | error.type=corrected, error.subtype=single_bit, error.persistence=volatile |
| DCGM_FI_DEV_ECC_SBE_AGG_TOTAL | hw.errors | error.type=corrected, error.subtype=single_bit, error.persistence=aggregate |

The attribute names match the semconv hw.errors shape (see hw common). Pattern #3 doc consumes the error.persistence=volatile row in its alert query.

Pattern #4 — Thermal throttle cascade

Modern dcgm-exporter emits per-reason throttle counters as discrete DCGM_FI_DEV_*_VIOLATION series. Each maps onto hw.gpu.throttle.duration with a hw.gpu.throttle.reason attribute (semconv proposal §2).

| Raw DCGM series | OTel metric | Datapoint attributes (added) |
| --- | --- | --- |
| DCGM_FI_DEV_THERMAL_VIOLATION | hw.gpu.throttle.duration (Counter, unit s) | hw.gpu.throttle.reason=thermal |
| DCGM_FI_DEV_POWER_VIOLATION | hw.gpu.throttle.duration | hw.gpu.throttle.reason=power |
| DCGM_FI_DEV_SYNC_BOOST_VIOLATION | hw.gpu.throttle.duration | hw.gpu.throttle.reason=sync_boost |
| DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | hw.gpu.throttle.duration | hw.gpu.throttle.reason=hw_slowdown |

DCGM_FI_DEV_LOW_UTIL_VIOLATION is intentionally not mapped: the upstream semconv proposal's hw.gpu.throttle.reason enum (thermal, power, sync_boost, hw_slowdown, sw_thermal, display_clock, app_clock_setting) has no value for an "idle / low-utilization" throttle. Mapping it to a value outside the proposal would create forward-incompat drift once the SIG resolves the vocabulary. Tracked at #272 for the upstream proposal extension.

Pattern #4 doc alerts on the reason=thermal row; the other reasons are diagnostic context (power correlates with PSU sag, hw_slowdown is the "GPU has decided to clock itself down" hard signal).

Pattern #5 — PCIe AER cascade

DCGM exposes per-direction PCIe byte counters whose rate collapses when the link renegotiates to a lower generation / width. Pattern #5 watches the rate divergence across the host's GPU set.

| Raw DCGM series | OTel metric | Datapoint attributes (added) |
| --- | --- | --- |
| DCGM_FI_PROF_PCIE_TX_BYTES | hw.gpu.io (Counter, unit By) | network.io.direction=transmit |
| DCGM_FI_PROF_PCIE_RX_BYTES | hw.gpu.io (Counter, unit By) | network.io.direction=receive |

The pci_bus_id label is lifted to the resource-level hw.gpu.pci.bdf so pattern #5's escalation matrix can cross-reference dmesg PCIe Bus Error: Corrected lines against the same BDF without joining series. Pattern #5 doc shows the divergence query in PromQL form.

Pattern #10 — CUDA OOM (framebuffer)

DCGM exposes the per-GPU framebuffer state as two Gauge series in bytes. They are the proximate signal pattern #10 joins to a RuntimeError: CUDA out of memory log record so the detector can discriminate true-OOM (≤5% free at fault time) from allocator fragmentation (>5% free at fault time). The OTTL projection lands the customer-stable hw.gpu.memory.{used,free} shape on the metrics pipeline; the total = used + free derivation lives at the bridge layer below.

| Raw DCGM series | OTel metric | Datapoint attributes (unchanged) |
| --- | --- | --- |
| DCGM_FI_DEV_FB_USED | hw.gpu.memory.used (Gauge, unit By) | none; vendor's gpu/UUID labels lifted to resource by the section above |
| DCGM_FI_DEV_FB_FREE | hw.gpu.memory.free (Gauge, unit By) | none; same |

hw.gpu.memory.total is intentionally NOT projected at the OTTL metric-statements layer. transformprocessor v0.130 operates on one datapoint at a time within a metric, so there is no cross-series arithmetic that could compute total = used + free on a metrics pipeline (upstream README). The total is computed at the metrics-to-logs bridge layer where the two scalars are already projected onto a single log record; see the bridge-contract section below.

Pattern #10 doc consumes the joined record via module/processor/patterndetectorprocessor/cuda_oom.go's projectFBMemoryRecord (gate: hw.gpu.memory.free, hw.gpu.memory.total, and gpu.id must all be present on the same log record).

MIG caveat. On MIG-partitioned GPUs, DCGM_FI_DEV_FB_FREE reports the parent device, not the MIG slice. The detector spec (10-cuda-oom-deceptive.md §Edge cases) gates on hw.gpu.mig.enabled == true to skip MIG hosts until MIG-aware FB metrics are wired. The OTTL projection itself is MIG-safe — it just renames the parent-device series; the detector decides whether the renamed series is meaningful.

Pattern #2 — InfiniBand link flap

Source: node_exporter --collector.infiniband (the upstream Prometheus node-exporter infiniband collector which reads /sys/class/infiniband/<dev>/ports/<n>/phys_state and exposes the IBA-spec phys_state ID as an integer Gauge). Run the collector under tracecore's prometheusreceiver per RFC-0013 §2; the in-tree binary bundles prometheusreceiver so no extra component is required.

| Raw node_exporter series | OTel metric | Datapoint attributes (added) |
| --- | --- | --- |
| node_infiniband_port_state_id{device, port} | hw.network.ib.port.state (Gauge, IBA phys_state ID 1=Down / 2=Init / 3=Armed / 4=Active) | hw.network.ib.device={device label}, hw.network.ib.port.num=Int({port label}) |

The detector (module/processor/patterndetectorprocessor/ib_link_flap.go) reads these three attributes off a log record via port.Int() / state.Int() — the Int() cast on the port label is load-bearing because prometheusreceiver exposes Prometheus labels as strings while the projector calls Int() on the pdata Value. The companion series node_infiniband_state{state="<name>"} (string label) is intentionally not mapped: the detector compares against the patterns.IBPortState* integer constants, so the string variant would round-trip wrong.

The metric rename runs last so the where metric.name == "node_infiniband_port_state_id" guards on the attribute-stamp statements above still match the raw exporter name when each statement evaluates. Renaming first would short-circuit the attribute stamps because the second statement's guard would no longer find the original name.
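The ordering hazard is easy to reproduce in a toy model. The sketch below is plain Python, not OTTL: each function stands in for one guarded statement, a dict stands in for the datapoint, and all function names are hypothetical. It assumes, as described above, that every statement is guarded on the raw exporter name.

```python
RAW_NAME = "node_infiniband_port_state_id"


def stamp_device(dp: dict) -> None:
    # Guarded attribute stamp: only fires while the raw name is still in place.
    if dp["name"] == RAW_NAME:
        dp["attributes"]["hw.network.ib.device"] = dp["attributes"]["device"]


def stamp_port(dp: dict) -> None:
    # Prometheus labels arrive as strings, so the port is cast to int.
    if dp["name"] == RAW_NAME:
        dp["attributes"]["hw.network.ib.port.num"] = int(dp["attributes"]["port"])


def rename(dp: dict) -> None:
    if dp["name"] == RAW_NAME:
        dp["name"] = "hw.network.ib.port.state"


def run(statements, dp: dict) -> dict:
    """Apply statements in order, the way a statement list is evaluated."""
    for statement in statements:
        statement(dp)
    return dp


def sample() -> dict:
    return {"name": RAW_NAME, "attributes": {"device": "mlx5_0", "port": "1"}}


rename_last = run([stamp_device, stamp_port, rename], sample())
rename_first = run([rename, stamp_device, stamp_port], sample())
```

With the rename last, both attributes are stamped before the name changes. With the rename first, the metric is renamed but neither guard matches afterwards, so the renamed series carries no hw.network.ib.* attributes.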

Pattern #2 doc consumes the joined record via projectIBPortStateRecord (gate: hw.network.ib.port.state AND hw.network.ib.device AND hw.network.ib.port.num on the same log record, plus k8s.node.name on the resource). The metrics→logs emit half follows the RFC-0014 pattern — the cuda_oom (#10) precedent ships an in-tree consumer via #437 / PR #461 (processors.metrics.Metrics + bounded cross-stream buffer); the IB link-flap metrics-path consumer is a pending sibling follow-up. The bridge log-record schema (load-bearing for both the OTTL recipe path and the future in-tree consumer) is pinned in the next section.

Metrics-to-logs bridge contract (patterns #2, #3, #4, #5, #10)

The pattern detectors at module/processor/patterndetectorprocessor read log records as their primary input (processor.WithLogs). The DCGM scrape recipe above produces metric datapoints. Bridging the two at the OTTL layer is upstream-blocked at OTel-contrib v0.130 — no contrib processor or connector emits log records from a metrics pipeline (per RFC-0014).

RFC-0014's resolution path adds processor.WithMetrics directly on patterndetectorprocessor so the processor consumes the metrics pipeline in-tree and joins it against the logs path via a bounded cross-stream buffer. The cuda_oom (#10) consumer shipped via #437 / PR #461; metrics-path consumers for the other patterns (#2 IB link flap, #3 HBM ECC, #4 thermal throttle, #5 PCIe AER) are pending sibling follow-ups. Until those land, the bridge attribute contract below is the load-bearing wire-format an OTTL recipe (when one becomes expressible) OR the in-tree consumer MUST honor — the detector projections gate on this contract today, and any emitter (in-tree consumer or future OTTL recipe) that stamps these attributes fires the pattern end-to-end without changing the detector library.

Pattern #3 — hw.errors.delta (issue #273)

The HBM ECC detector gates on a log record carrying:

| Attribute | Type | Source |
| --- | --- | --- |
| hw.errors.delta | int | per-scrape delta of hw.errors counter (= increase(hw_errors[scrape_interval])) |
| gpu.id | string | PCI BDF resource attr from the DCGM series |
| hw.gpu.index | int (optional) | NVML index from the DCGM series |
| error.type | string | uncorrected for the alert row |
| error.subtype | string | double_bit for the alert row |
| error.persistence | string | volatile for the alert row |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor |

The metric datapoint attribute set from the transform/dcgm_to_hw_semconv stanza above already carries error.type / error.subtype / error.persistence on the renamed hw.errors Counter, so the future emitter passes those through unchanged; the only new field is the per-scrape hw.errors.delta integer.
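As an illustration of what "per-scrape delta" means for a cumulative counter, here is a small Python sketch. It is not the in-tree emitter: per_scrape_deltas is a hypothetical name, and the reset handling (a drop in the counter is treated as a restart from zero, in the spirit of how counter increases are usually computed) is an assumption rather than documented behavior.

```python
from typing import List


def per_scrape_deltas(samples: List[int]) -> List[int]:
    """Turn cumulative hw.errors samples (in scrape order) into per-scrape deltas."""
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            # Assumption: the counter reset (e.g. driver reload), so everything
            # counted since the reset is attributed to this scrape.
            deltas.append(current)
    return deltas
```

For the volatile double-bit row, a sequence such as 0, 0, 2, 2, 5 yields deltas 0, 2, 0, 3; each value would be stamped as an integer hw.errors.delta on its own bridge record.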

Pattern #4 — hw.gpu.throttle.duration.delta (issue #282)

The thermal-throttle detector gates on a log record carrying:

| Attribute | Type | Source |
| --- | --- | --- |
| hw.gpu.throttle.duration.delta | int (seconds) | per-scrape delta of hw.gpu.throttle.duration |
| hw.gpu.throttle.reason | string | thermal for the alert row |
| gpu.id | string | PCI BDF resource attr |
| hw.gpu.index | int (optional) | NVML index |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor |

Units pinned to integer seconds because projectThermalThrottleRecord at module/processor/patterndetectorprocessor/patterndetector.go multiplies the delta by time.Second — the wire format MUST agree.

Pattern #5 — tracecore.alert.pcie_rate_collapse.* (issue #284)

Layer 2 of the PCIe AER cascade detector gates on a log record carrying:

| Attribute | Type | Source |
| --- | --- | --- |
| tracecore.alert.pcie_rate_collapse.bytes_per_second | double | rate(hw.gpu.io[5m]) per GPU |
| tracecore.alert.pcie_rate_collapse.baseline_bytes_per_second | double | quantile(0.5, rate(...)) by (k8s.node.name) |
| tracecore.alert.pcie_rate_collapse.direction | string | transmit / receive (falls back to network.io.direction) |
| gpu.id | string | PCI BDF resource attr |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor |

Namespacing under tracecore.alert.pcie_rate_collapse.* keeps the bridge log shape distinguishable from raw hw.gpu.io scrape samples downstream. Layer 1 (journald-kernel AER stanza) is documented in journald-kernel.md and ships independently of this bridge.
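The two rate attributes can be approximated offline to see how a collapsed link stands out against its node-local peers. This Python sketch is illustrative only: simple_rate is a plain slope between two samples (PromQL's rate() additionally extrapolates and handles resets), the baseline is the median of the per-GPU rates on one node, and both helper names are hypothetical. The collapse threshold itself is not specified in this document, so the sketch stops at producing the attributes.

```python
from statistics import median
from typing import Dict, Tuple


def simple_rate(first: Tuple[float, float], last: Tuple[float, float]) -> float:
    """Bytes per second between two (unix_seconds, counter_bytes) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def bridge_attributes(per_gpu_rates: Dict[str, float], gpu_id: str, direction: str) -> dict:
    """Build the pattern #5 bridge attributes for one GPU on one node."""
    prefix = "tracecore.alert.pcie_rate_collapse."
    return {
        "gpu.id": gpu_id,
        prefix + "bytes_per_second": per_gpu_rates[gpu_id],
        # Median across the node's GPUs, standing in for quantile(0.5, ...).
        prefix + "baseline_bytes_per_second": median(per_gpu_rates.values()),
        prefix + "direction": direction,
    }
```

With four GPUs running at 10, 90, 100, and 100 bytes per second, the baseline is 95 and the first GPU is the outlier.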

Pattern #2 — hw.network.ib.port.state (issue #393)

The InfiniBand link-flap detector (module/processor/patterndetectorprocessor/ib_link_flap.go::projectIBPortStateRecord) gates on a log record carrying:

| Attribute | Type | Source |
| --- | --- | --- |
| hw.network.ib.port.state | int | last node_infiniband_port_state_id Gauge sample for the (device, port) tuple at bridge-emit time; IBA phys_state ID (1=Down, 2=Init, 3=Armed, 4=Active) |
| hw.network.ib.device | string | device label on the source series (e.g. mlx5_0) |
| hw.network.ib.port.num | int | port label on the source series, cast via OTTL Int() |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor on the DaemonSet |

The metric datapoint attribute set from the transform/ib_to_hw_semconv stanza above already carries hw.network.ib.device and hw.network.ib.port.num; the future emitter passes those through unchanged. The hw.network.ib.port.state integer lifts directly from the renamed Gauge's datapoint value (one log record per (device, port, scrape) — emit-once-per-state-transition is a detector-side optimization, not a bridge-side gate; the detector's patterns.IBLinkFlapDetector counts transitions internally).

Log-record schema (verdict-input)

# Bridge-emitted log record consumed by patterndetectorprocessor's
# ib_link_flap detector. One log record per (device, port, scrape) —
# the detector counts transitions across consecutive records.
resource:
  attributes:
    k8s.node.name: gpu-node-0007         # str — REQUIRED. Flap predicate is per-node.
log_record:
  timestamp: 2026-06-01T10:04:30Z        # MUST be the scrape timestamp.
  body: ""                               # ignored by the detector.
  attributes:
    hw.network.ib.port.state: 1          # int — REQUIRED. IBA phys_state ID; detector compares against patterns.IBPortState* constants.
    hw.network.ib.device: mlx5_0         # str — REQUIRED. Per-NIC device name; flap predicate is per-device.
    hw.network.ib.port.num: 1            # int — REQUIRED. Port index; flap predicate is per-port (a 2-port HCA tracks each port separately).

Detector consumption

projectIBPortStateRecord (at module/processor/patterndetectorprocessor/ib_link_flap.go) extracts the three scalars and builds a patterns.IBPortStateRecord. The detector emits one verdict per (k8s.node.name, hw.network.ib.device, hw.network.ib.port.num) tuple when transition count within ib_link_flap_window (default 2min) crosses ib_link_flap_min_transitions (default 2). The unit tests TestPatternDetector_IBLinkFlapWiring* pin the canonical wire format above against the live detector.
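The transition-counting rule can be sketched in a few lines of Python. This is a model of the behavior described above, not the Go detector: it assumes "crosses" means the count reaches at least ib_link_flap_min_transitions, that the window is measured back from the evaluation time, and that records are already grouped per (node, device, port). The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Record = Tuple[datetime, int]  # (scrape timestamp, IBA phys_state ID)


def count_transitions(records: List[Record], window: timedelta, now: datetime) -> int:
    """Count state changes between consecutive records inside the window."""
    recent = [(ts, state) for ts, state in records if now - ts <= window]
    return sum(1 for (_, a), (_, b) in zip(recent, recent[1:]) if a != b)


def is_flapping(
    records: List[Record],
    now: datetime,
    window: timedelta = timedelta(minutes=2),  # ib_link_flap_window default
    min_transitions: int = 2,                  # ib_link_flap_min_transitions default
) -> bool:
    return count_transitions(records, window, now) >= min_transitions
```

An Active, Down, Active sequence across three consecutive 15 s scrapes counts two transitions and trips the defaults; a single Active to Down change does not.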

Pattern #7 — tracecore.alert.training_step_stalled.* (issue #365)

The dataloader_hang detector's Layer 2 input is a training-step stall bridge log record derived from the trainer's gen_ai.training.step_duration_seconds Gauge. The detector (module/processor/patterndetectorprocessor/dataloader_hang.go::projectTrainingStepStallRecord) gates on a log record carrying:

| Attribute | Type | Source |
| --- | --- | --- |
| tracecore.alert.training_step_stalled.no_progress_seconds | int (seconds) | wall-clock duration since the last gen_ai.training.step_duration_seconds Gauge sample advanced; the bridge fires once when the value crosses StallThreshold (default 180s) |
| tracecore.alert.training_step_stalled.last_step_ns | int (optional, unix-ns) | timestamp of the last step-progress sample observed before the stall; falls back to the log record's Timestamp |
| gen_ai.training.step | int | last step index emitted by the trainer; the detector's warmup guard skips step < 2 |
| gen_ai.training.phase | string (optional) | train / eval; the detector's eval-phase guard skips phase == "eval" |
| k8s.pod.name | string (resource) | training pod identity; stamped by k8sattributesprocessor on the trainer side |
| k8s.namespace.name | string (resource) | training pod namespace; same source |
| k8s.node.name | string (resource) | node hosting the training pod; same source |

Namespacing under tracecore.alert.training_step_stalled.* keeps the bridge log shape distinguishable from raw gen_ai.training.step_duration_seconds Gauge samples downstream and mirrors the tracecore.alert.pcie_rate_collapse.* naming the pattern #5 bridge contract uses.

Unlike the DCGM-sourced bridges above, the input metric here is not an hw.* series — it is the upstream gen_ai.training.step_duration_seconds Gauge (per OTel GenAI semconv §Metrics, status: development at v0.130). Trainers emit this via OTel auto-instrumentation (opentelemetry-instrumentation-* for the framework in use) or an explicit Meter.create_gauge call inside the training loop. The recipe assumes the Gauge arrives via an OTLP push from the trainer pod — prometheusreceiver is one valid scrape path (Prometheus-format exposition of the Gauge by the trainer's metrics endpoint), but the same bridge attribute contract applies to OTLP-push topologies.

Why the recipe doesn't ship a bridge stanza today

Same RFC-0014 block as patterns #3 / #4 / #5 / #10: OTTL metric_statements cannot reference log.* paths at OTel-contrib v0.130, and no contrib connector emits log records from a metrics pipeline. The resolution path is either (a) an in-tree processor.WithMetrics extension to patterndetectorprocessor (tracked under #260; cuda_oom #10 shipped via #437 / PR #461; pattern #7 piggybacks on the same plumbing because the attribute contract is purely a wire-format pin) or (b) an upstream metricthresholdconnector contribution.

Log-record schema (verdict-input)

# Bridge-emitted log record consumed by patterndetectorprocessor's
# dataloader_hang detector. One log record per (training pod, stall
# crossing) — NOT one per Gauge sample.
resource:
  attributes:
    k8s.pod.name: trainer-rank-3         # str  — REQUIRED. Pod-scoped join key.
    k8s.namespace.name: training         # str  — required (verdict carries it).
    k8s.node.name: gpu-node-0007         # str  — REQUIRED. Node-scoped storage-event join key.
log_record:
  timestamp: 2026-06-01T10:04:30Z        # MUST be the stall-detection time (last_step + no_progress), not now()
  body: ""                               # ignored by the detector
  attributes:
    tracecore.alert.training_step_stalled.no_progress_seconds: 240   # int (seconds) — REQUIRED. Gate the detector reads.
    tracecore.alert.training_step_stalled.last_step_ns: 1717236030000000000  # int (unix-ns) — optional, sharpens the verdict timestamp.
    gen_ai.training.step: 42             # int — REQUIRED for the warmup guard (step >= 2).
    gen_ai.training.phase: train         # str — optional; "eval" gets skipped by the eval-phase guard.

Detector consumption

projectTrainingStepStallRecord (at module/processor/patterndetectorprocessor/projectors_shared.go) extracts the four required scalars and builds a patterns.TrainingStepStallRecord. The dataloader_hang detector joins each stall against a same-pod dataloader.error_class log record OR a same-node FailedMount / VolumeMountFailure Kubernetes Event within DataLoaderHangCorrelationWindow (default 5min); without a discriminator, no verdict fires (spec §"Detector evaluation rule" — stalls alone are not a hang because patterns #6 stragglers and #11 checkpointer also stall steps). The unit tests TestPatternDetector_DataLoaderHangWiring* (module/processor/patterndetectorprocessor/dataloader_hang_test.go) pin the canonical wire format above against the live detector.
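The guard-then-join rule can be summarized in a short Python sketch. It is a simplified model rather than the Go detector: the helper names are hypothetical, the correlation window is assumed to be symmetric around the stall time, and the discriminator timestamps are assumed to be pre-filtered to the same pod (dataloader.error_class records) or the same node (FailedMount / VolumeMountFailure Events).

```python
from datetime import datetime, timedelta
from typing import Iterable


def passes_guards(attrs: dict) -> bool:
    """Warmup and eval-phase guards on the stall bridge record."""
    if attrs.get("gen_ai.training.phase") == "eval":
        return False  # eval-phase guard
    return attrs.get("gen_ai.training.step", 0) >= 2  # warmup guard


def has_discriminator(
    stall_time: datetime,
    discriminator_times: Iterable[datetime],
    window: timedelta = timedelta(minutes=5),  # DataLoaderHangCorrelationWindow default
) -> bool:
    # Assumption: a discriminator counts if it falls within the window on
    # either side of the stall.
    return any(abs(stall_time - ts) <= window for ts in discriminator_times)


def should_emit(attrs: dict, stall_time: datetime, discriminator_times: Iterable[datetime]) -> bool:
    return passes_guards(attrs) and has_discriminator(stall_time, discriminator_times)
```

A stall with no matching discriminator yields no verdict in this model, which matches the rule that stalls alone are not a hang.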

Pattern #10 — hw.gpu.memory.{free,total} (issue #337)

Status: shipped via #437 / PR #461. patterndetectorprocessor now additionally implements processors.metrics.Metrics (ADR-0001 PR-B): the metrics-path consumer projects hw.gpu.memory.{free,total} pmetric.NumberDataPoints directly into patterns.FBMemoryRecord values, buffers them in a bounded ring keyed on processor component.ID, and the logs-path consumer drains the buffer at CUDA OOM-log time. Operators running dcgm-exporter into the metrics pipeline get full-confidence verdicts WITHOUT configuring the metrics→logs OTTL recipe below. The log-record schema below remains the load-bearing wire contract for the OTTL recipe path (the alternative when an operator can't run the in-tree consumer); the detector (module/processor/patterndetectorprocessor/cuda_oom.go::projectFBMemoryRecord) expects this exact shape and both paths converge on it.

The CUDA OOM detector gates on a log record carrying:

| Attribute | Type | Source |
| --- | --- | --- |
| hw.gpu.memory.free | int (bytes) | last DCGM_FI_DEV_FB_FREE Gauge sample for the GPU at bridge-emit time |
| hw.gpu.memory.total | int (bytes) | DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE at bridge-emit time (per-GPU sum within one scrape) |
| gpu.id | string | PCI BDF resource attr from the DCGM series (or hw.gpu.pci.bdf fallback) |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor |

The bridge MUST stamp BOTH hw.gpu.memory.free AND hw.gpu.memory.total on the same log record — the detector's projectFBMemoryRecord gate at module/processor/patterndetectorprocessor/cuda_oom.go short-circuits if either is missing (the fragmentation discriminator needs both numerator and denominator joined to the same GPU / same scrape). One log record per (GPU, scrape) is the load-bearing shape: per-attribute records emitted on separate logs would defeat the join.

Log-record schema (verdict-input)

# Bridge-emitted log record consumed by patterndetectorprocessor's
# cuda_oom detector. Emitted once per (GPU, scrape) — NOT once per
# (GPU, attribute). Field types match the OTel pdata stamps the
# detector reads (Int / Str).
resource:
  attributes:
    k8s.node.name: gpu-node-0001         # str  — stamped by k8sattributesprocessor
    hw.id: GPU-3a4b...                   # str  — NVML UUID (optional join key)
    hw.gpu.pci.bdf: 0000:3b:00.0         # str  — PCI BDF (optional; gpu.id below is the cross-signal join)
log_record:
  timestamp: 2026-06-01T10:00:00Z        # MUST be the scrape timestamp, not now()
  body: ""                               # ignored by the detector — keep empty or set to a debug-friendly summary
  attributes:
    gpu.id: PCI:0000:3b:00               # str  — REQUIRED. Cross-signal join key (same shape as the OOM log record).
    hw.gpu.memory.free: 17179869184      # int  — REQUIRED. Bytes free on the GPU at the scrape.
    hw.gpu.memory.total: 85899345920     # int  — REQUIRED. Bytes total on the GPU (= used + free at the scrape).
    hw.gpu.memory.used: 68719476736      # int  — optional, evidence context. NOT gated on by the detector.
    hw.gpu.index: 3                      # int  — optional, evidence context.
Detector consumption

projectFBMemoryRecord (at module/processor/patterndetectorprocessor/cuda_oom.go:114) extracts the three required scalars and builds a patterns.FBMemoryRecord. The free ratio (FreeBytes / TotalBytes) is then compared against cuda_oom_fb_free_fragmentation_threshold (default 0.05): a ratio at or above the threshold yields cuda_oom.kind=fragmentation, and a ratio below it yields true_oom. The unit test TestPatternDetector_CUDAOOMWiringEmitsFragmentationVerdict pins the canonical wire format above against the live detector.
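
The gate and the threshold comparison described above can be sketched in a few lines. This is an illustrative Python model, not the Go code in cuda_oom.go: the names `project_fb_memory_record` and `classify` are hypothetical stand-ins, and rejecting a non-positive total is an assumption added here to avoid a division by zero.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Default taken from the text (cuda_oom_fb_free_fragmentation_threshold).
DEFAULT_FRAGMENTATION_THRESHOLD = 0.05


@dataclass(frozen=True)
class FBMemoryRecord:
    gpu_id: str
    free_bytes: int
    total_bytes: int


def _is_int(value: object) -> bool:
    # bool is a subclass of int in Python; exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def project_fb_memory_record(attrs: Mapping[str, object]) -> Optional[FBMemoryRecord]:
    """Return a record only when all three required attributes are present."""
    gpu_id = attrs.get("gpu.id")
    free = attrs.get("hw.gpu.memory.free")
    total = attrs.get("hw.gpu.memory.total")
    if not isinstance(gpu_id, str) or not gpu_id:
        return None
    if not _is_int(free) or not _is_int(total):
        return None
    if total <= 0:  # assumption: a zero denominator cannot be classified
        return None
    return FBMemoryRecord(gpu_id=gpu_id, free_bytes=free, total_bytes=total)


def classify(record: FBMemoryRecord,
             threshold: float = DEFAULT_FRAGMENTATION_THRESHOLD) -> str:
    """Free ratio at or above the threshold means fragmentation, else true_oom."""
    ratio = record.free_bytes / record.total_bytes
    return "fragmentation" if ratio >= threshold else "true_oom"
```

With the values from the schema above (16 GiB free of 80 GiB total) the free ratio is 0.2, so the sketch classifies the OOM as fragmentation. A record missing either memory attribute never reaches the comparison.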

Why the recipe doesn't ship a bridge stanza today

OTTL metric_statements cannot reference log.* paths at v0.130 (upstream README). Connectors that change signal type all emit metrics, not logs (countconnector, signaltometricsconnector, spanmetricsconnector). Per RFC-0014 the resolution path is either (a) an in-tree processor.WithMetrics extension to patterndetectorprocessor — shipped for cuda_oom (#10) via #437 / PR #461, with sibling consumers for patterns #2 / #3 / #4 / #5 pending under #260 — or (b) an upstream metricthresholdconnector contribution. The contract above stays stable across either resolution.

[[adopt-over-build]] posture

Every statement in transform/dcgm_to_hw_semconv uses upstream OTTL functions only: set, IsMatch, ExtractPatterns, Int. No new OTTL functions are introduced. If a future series cannot be projected with the existing function set, the right response is to propose the missing function upstream to OTel contrib — not to ship a tracecore-specific OTTL extension.

Identity-conflict caveat

OTTL set(metric.name, ...) renames a metric in place; the v0.130 README warns that "Transformation of metrics have the potential to affect the identity of a metric leading to an Identity Crisis." For this recipe the conflict is intentional: 36 input series (18 NVLink links × 2 directions) collapse into one output metric named hw.gpu.nvlink.io with distinct attribute sets per datapoint. Downstream OTel + Prometheus backends merge by (metric.name, attributes) and produce the expected per-link / per-direction series. If your backend rejects the merged shape, follow the upstream guidance to apply the rename inside a separate statement group from any other identity-affecting operation; this recipe already isolates the rename inside its own processor (transform/dcgm_to_hw_semconv).
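
The collapse can be modeled outside the collector to check that the merged metric keeps 36 distinct identities. The sketch below is Python, not OTTL. It assumes the per-link inputs follow the `DCGM_FI_PROF_NVLINK_L<n>_<TX|RX>_BYTES` naming pattern, and the attribute keys `link` and `direction` are hypothetical placeholders for whatever the semconv proposal actually declares.

```python
import re
from typing import Optional, Tuple

# Assumed input naming pattern; adjust to the series your exporter emits.
_NVLINK = re.compile(r"^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$")

Identity = Tuple[str, Tuple[Tuple[str, object], ...]]


def project(name: str) -> Optional[Identity]:
    """Map one per-link series name to (metric name, sorted attribute pairs)."""
    match = _NVLINK.match(name)
    if match is None:
        return None
    link = int(match.group(1))
    direction = "transmit" if match.group(2) == "TX" else "receive"
    # Hypothetical attribute keys; the merged identity is name + attributes.
    return ("hw.gpu.nvlink.io", (("direction", direction), ("link", link)))


# 18 links x 2 directions = 36 input series.
inputs = [
    f"DCGM_FI_PROF_NVLINK_L{link}_{direction}_BYTES"
    for link in range(18)
    for direction in ("TX", "RX")
]
identities = {project(name) for name in inputs}
metric_names = {identity[0] for identity in identities if identity is not None}
```

All 36 inputs land on the single name hw.gpu.nvlink.io, while the set of (name, attributes) pairs still has 36 members, which is the property a backend relies on when it merges by identity.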

Placeholders

| Placeholder | What to fill in |
| --- | --- |
| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/metrics` is appended automatically per the OTLP/HTTP spec. |
| `REPLACE_WITH_DCGM_EXPORTER_TARGET` | `localhost:9400` for a DaemonSet shape, or the dcgm-exporter Service DNS (`dcgm-exporter.kube-system.svc:9400`) for a Deployment shape. |

Tracecore does not expand environment variables in YAML. Render the literals at deploy time via envsubst, a Helm template, or a Kubernetes secret-injection driver. The :port suffix is mandatory — prometheusreceiver rejects bare hostnames at validate.
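
If none of those tools is at hand, deploy-time rendering can be sketched as a small script. The Python below is a hypothetical helper, not part of tracecore. It substitutes `REPLACE_WITH_*` literals, fails loudly on any placeholder left without a value, and runs a local host:port check that only approximates the receiver's own validation (bare IPv6 literals are not handled).

```python
import re
from typing import Mapping

_PLACEHOLDER = re.compile(r"REPLACE_WITH_[A-Z0-9_]+")


def render(template: str, values: Mapping[str, str]) -> str:
    """Substitute every REPLACE_WITH_* literal; raise if one has no value."""
    def substitute(match: "re.Match[str]") -> str:
        key = match.group(0)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key}")
        return values[key]

    return _PLACEHOLDER.sub(substitute, template)


def has_port(target: str) -> bool:
    """Local pre-deploy check that a scrape target looks like host:port."""
    host, sep, port = target.rpartition(":")
    if not host or sep != ":":
        return False
    if not (port.isascii() and port.isdigit()):
        return False
    return 0 < int(port) <= 65535


template = "targets: [REPLACE_WITH_DCGM_EXPORTER_TARGET]"
rendered = render(template, {"REPLACE_WITH_DCGM_EXPORTER_TARGET": "localhost:9400"})
```

Running `has_port` on each target before rendering catches a bare Service DNS name such as `dcgm-exporter.kube-system.svc` before the collector's validate step does.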

Failure modes

| Symptom | First check |
| --- | --- |
| `scrape_configs.targets[0]: address ... incorrect` at validate | The target placeholder still carries `REPLACE_WITH_DCGM_EXPORTER_TARGET`; the validator now rejects literal placeholders that look like hostnames. Render at deploy time. |
| Scrape returns 200 but no metrics flow | prometheusreceiver requires the response to be in Prometheus text exposition format. A target that returns OTLP-JSON or a vendor-proprietary format is silently dropped. Curl the endpoint and confirm the first line starts with `# HELP`. |
| `gpu.vendor` empty on a known DCGM target | The exporter is on an old release that emits the legacy lowercase `dcgm_*` prefix. Either upgrade the exporter to a `DCGM_*`-emitting build or extend the OTTL regex to `^[Dd][Cc][Gg][Mm]_`. |
| `cardinality limit exceeded` from the backend | prometheusreceiver does not cap series. Add a filterprocessor between prometheus and transform/gpu_vendor to drop metrics you don't query. Cap dcgm-exporter's `--collectors` flag to the families you alert on. |
| Bearer-token target returns 401 | The ServiceAccount lacks the binding to the target's RBAC. For Kueue, the SA needs `nonResourceURLs: ["/metrics"]` `verbs: ["get"]` via a ClusterRoleBinding. |
| dcgm-exporter pod CrashLoopBackOff with `Failed to initialize NVML: Driver/library version mismatch` | The host's NVIDIA driver is older (or newer) than the DCGM library bundled into the dcgm-exporter image. `kubectl -n gpu-monitoring logs -l app.kubernetes.io/name=dcgm-exporter --tail=20` confirms. Align by upgrading the driver via the NVIDIA GPU Operator or nvidia-driver-daemonset, or by pinning the chart to a dcgm-exporter image tag that matches your driver minor version. |
| dcgm-exporter pod CrashLoopBackOff with `Failed to initialize NVML: Driver Not Loaded` | No NVIDIA driver is installed on the host, typically because the DaemonSet was scheduled onto a non-GPU node. Tighten nodeSelector to a GPU-only label (e.g. `nvidia.com/gpu.present: "true"` from NVIDIA's node-feature-discovery, or your cluster's equivalent). If the GPU nodes themselves lack a driver, install one first; the chart will not start without it. |
| dcgm-exporter pod Running but /metrics returns 500 / hangs | DCGM cannot reach NVML, usually because the nvidia-container-toolkit runtime is not configured (the container has no /dev/nvidiactl). Verify with `kubectl -n gpu-monitoring exec <pod> -- ls /dev/nvidia*`. The runtime must be the NVIDIA container runtime; see the Quickstart on Kubernetes prerequisites. |
| dcgm-exporter pod Pending with `forbidden: ... configmaps` event | The chart's Role + RoleBinding (dcgm-exporter-read-cm) was disabled or the ServiceAccount lost its binding. Re-render with the defaults (`rbac.create=true`, `serviceAccount.create=true`); the pod must be able to read the exporter-metrics-config-map to load default-counters.csv. |
| prometheusreceiver logs `context deadline exceeded` while scraping dcgm-exporter | dcgm-exporter's first scrape after startup can take >10s on hosts with many GPUs because DCGM has to initialize per-device watches. Either raise the recipe's scrape_timeout (above) past 15s, or set the chart's arguments to include `-d g` (GPUs only, no GPU-instances enumeration) to shrink the field watch set. The scrape_interval should remain ≥ scrape_timeout to avoid overlapping scrapes. |
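
For the empty gpu.vendor case, the widened pattern can be checked against sample metric names before the OTTL statement is edited. The Python sketch below uses the standard `re` module, which agrees with Go's regexp syntax for this simple pattern. The sample names are illustrative assumptions, not a list taken from any exporter release.

```python
import re

# Current uppercase-only match versus the widened pattern from the table.
STRICT = re.compile(r"^DCGM_")
WIDENED = re.compile(r"^[Dd][Cc][Gg][Mm]_")

# Illustrative metric names (assumptions): one modern, one legacy lowercase,
# and one unrelated series that must not be stamped as a DCGM metric.
samples = ["DCGM_FI_DEV_FB_FREE", "dcgm_fb_free", "node_infiniband_port_state_id"]

strict_hits = [name for name in samples if STRICT.match(name)]
widened_hits = [name for name in samples if WIDENED.match(name)]
```

The widened pattern picks up the legacy lowercase name while still ignoring non-DCGM series, so the vendor stamp does not leak onto other scrape targets.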

Upstream component docs: receiver/prometheusreceiver, processor/transformprocessor, processor/filterprocessor.