Tracecore scrapes Prometheus-format endpoints via the upstream
prometheusreceiver. This is the adoption shape for every vendor
GPU exporter per
RFC-0013 §2 (Adoption matrix):
NVIDIA dcgm-exporter, AMD ROCm/device-metrics-exporter, Intel
intel/xpumanager, and the Habana Prometheus Metric Exporter. It is also the
shape for the Kueue scheduler's metrics endpoint. This recipe replaces the in-tree dcgm
and kueue receivers per RFC-0013 §7 (Deletion list — v0.1.0).
Three OTTL transform processors run in series over the scraped
metrics:
- transform/gpu_vendor stamps the customer-stable gpu.vendor resource attribute (RFC-0013 §3) so dashboards survive a future swap between vendor exporters.
- transform/dcgm_to_hw_semconv projects the raw DCGM_FI_* namespace onto the customer-stable hw.gpu.* / hw.errors namespace declared in docs/proposals/semconv-hw-gpu-extensions.md so the next-cycle pattern detectors (issue #260 patterns #1 NVLink, #3 HBM ECC, #4 thermal throttle, #5 PCIe AER, #10 CUDA OOM) read one vendor-neutral wire format. Per docs/rfcs/0014-metrics-to-logs-pattern-input.md the verdict-emission half extends patterndetectorprocessor with processor.WithMetrics, shipped for cuda_oom (#10) via #437 / PR #461, with sibling consumers for patterns #1 / #3 / #4 / #5 pending under #260. The transform below is the load-bearing wire-format contract that the metrics-path consumer reads.
- transform/ib_to_hw_semconv projects node_exporter --collector.infiniband's node_infiniband_port_state_id onto the customer-stable hw.network.ib.* namespace (docs/ATTRIBUTES.md §hw.network.*, alpha) so pattern #2's link-flap detector reads the same vendor-neutral shape whether the underlying source is node_exporter, a Mellanox-specific exporter, or journald-kernel.md's mlx5_core stream. Same RFC-0014 metrics-path consumer dependency as the DCGM transform (cuda_oom #10 shipped via PR #461; IB link-flap sibling consumer pending).
# docs/integrations/examples/prometheus-scrape.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter
          scrape_interval: 15s
          scrape_timeout: 10s
          metrics_path: /metrics
          fallback_scrape_protocol: PrometheusText1.0.0
          static_configs:
            - targets:
                - REPLACE_WITH_DCGM_EXPORTER_TARGET
processors:
  transform/gpu_vendor:
    metric_statements:
      - context: datapoint
        statements:
          - set(resource.attributes["gpu.vendor"], "nvidia") where IsMatch(metric.name, "^DCGM_")
          - set(resource.attributes["gpu.vendor"], "amd") where IsMatch(metric.name, "^amdsmi_")
          - set(resource.attributes["gpu.vendor"], "intel") where IsMatch(metric.name, "^xpum_")
          - set(resource.attributes["gpu.vendor"], "habana") where IsMatch(metric.name, "^habanalabs_")
  batch:
    send_batch_size: 8192
    timeout: 10s
exporters:
  otlphttp:
    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
    compression: gzip
    timeout: 10s
service:
  pipelines:
    metrics/scrape:
      receivers: [prometheus]
      processors: [transform/gpu_vendor, batch]
      exporters: [otlphttp]

Validate with the in-tree binary:

./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml

Exit 0 means the config parses, every scrape target URL is well-formed, and the OTTL statements type-check against the metric-datapoint context.
The recipe above scrapes NVIDIA's upstream
dcgm-exporter (Apache-2.0).
Install it from the canonical Helm repo:
helm repo add gpu-helm-charts \
  https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-monitoring --create-namespace \
  -f dcgm-exporter-values.yaml

The minimal dcgm-exporter-values.yaml that pairs with tracecore's
prometheusreceiver scrape (no Prometheus Operator dependency) is:
# dcgm-exporter-values.yaml: minimal overlay for tracecore scrape
serviceMonitor:
  # tracecore scrapes via prometheusreceiver, not Prometheus Operator
  enabled: false
service:
  type: ClusterIP
  port: 9400
nodeSelector:
  # restrict the DaemonSet to GPU nodes; pair with NVIDIA's node-feature-
  # discovery or device-plugin label. Use the label your cluster stamps.
  nvidia.com/gpu.present: "true"
arguments:
  - "-f"
  - "/etc/dcgm-exporter/default-counters.csv"

To enable the pattern #1 NVLink series (commented out in upstream
default-counters.csv; see the NVLink section below), mount a custom
counters ConfigMap and point arguments at it via the -m flag per
the chart's values.yaml documentation.
The chart renders a ServiceAccount, ConfigMap (default counters),
Role + RoleBinding (read the ConfigMap), Service (ClusterIP on
:9400), and DaemonSet (containerPort 9400,
app.kubernetes.io/name=dcgm-exporter). The Service DNS name is
dcgm-exporter.gpu-monitoring.svc; per-pod IPs live behind the
app.kubernetes.io/name=dcgm-exporter selector.
Confirm the exporter is healthy before pointing tracecore at it.
Port-forward one pod and curl /metrics:
kubectl -n gpu-monitoring port-forward \
  $(kubectl -n gpu-monitoring get pod \
      -l app.kubernetes.io/name=dcgm-exporter \
      -o jsonpath='{.items[0].metadata.name}') \
  9400:9400 &
curl -sf http://localhost:9400/metrics | head -5

A healthy response begins with # HELP DCGM_FI_... and # TYPE ...
lines in Prometheus text exposition format. Pattern-relevant prefixes
to confirm are present:
| Prefix | Patterns | Needs custom counters CSV? |
|---|---|---|
| DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL | #3 HBM ECC | No |
| DCGM_FI_DEV_*_VIOLATION | #4 thermal throttle | No |
| DCGM_FI_DEV_FB_{USED,FREE} | #10 CUDA OOM | No |
| DCGM_FI_PROF_PCIE_{TX,RX}_BYTES | #5 PCIe AER | No (profiling enabled by default) |
| DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES | #1 NVLink | Yes (see NVLink section below) |
grep for the families you intend to alert on; a missing prefix
means the corresponding pattern will never fire even if the recipe's
OTTL stanza compiles cleanly. Then point tracecore's
REPLACE_WITH_DCGM_EXPORTER_TARGET at either localhost:9400 (per-
node DaemonSet shape) or dcgm-exporter.gpu-monitoring.svc:9400
(Deployment shape) and validate per the config section above.
The right Kubernetes shape depends on the scrape target:
- Per-node targets (NVIDIA dcgm-exporter, AMD/Intel/Habana per-node exporters): run tracecore as a DaemonSet and scrape localhost:<port> so each node's exporter is read by the tracecore pod on the same node. No cluster-wide service discovery required.
- Cluster-scoped targets (Kueue's controller-manager metrics endpoint, single-replica vendor exporters): run tracecore as a single-replica Deployment and scrape the target's Service. Pair with kubernetes_sd_configs: if the target moves between pods on re-roll; for a stable Service ClusterIP, static_configs: is enough.
The example scrapes a static unauthenticated target. For Kueue's controller-manager metrics endpoint (TLS + serviceaccount-token bearer):
- job_name: kueue
  scheme: https
  scrape_interval: 30s
  metrics_path: /metrics
  fallback_scrape_protocol: PrometheusText1.0.0
  authorization:
    type: Bearer
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    server_name: kueue-controller-manager-metrics-service.kueue-system.svc
  static_configs:
    - targets:
        - kueue-controller-manager-metrics-service.kueue-system.svc:8443

Adjust server_name to match the Service's DNS name. The
credentials_file path is the default ServiceAccount projected-token
mount; if you use a custom token volume, update the path.
The OTTL transform routes to a vendor tag based on the metric-name prefix each upstream exporter uses:
| metric.name prefix | gpu.vendor | Upstream exporter |
|---|---|---|
| DCGM_* | nvidia | NVIDIA/dcgm-exporter |
| amdsmi_* | amd | ROCm/device-metrics-exporter |
| xpum_* | intel | intel/xpumanager |
| habanalabs_* | habana | Habana Prometheus Metric Exporter |
The tag is customer-stable under the RFC-0013 §3
contract, so existing dashboards keyed on gpu.vendor continue to
work after a vendor swap.
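The prefix routing can be mirrored outside the collector, for example to check which vendor tag a scraped series would receive. The following is a minimal sketch, not tracecore code: the function name `gpu_vendor_for` is hypothetical, and only the four prefixes and vendor values come from the table above.

```python
import re
from typing import Optional

# Prefix patterns and vendor values copied from the mapping table above.
VENDOR_PREFIXES = [
    (re.compile(r"^DCGM_"), "nvidia"),
    (re.compile(r"^amdsmi_"), "amd"),
    (re.compile(r"^xpum_"), "intel"),
    (re.compile(r"^habanalabs_"), "habana"),
]


def gpu_vendor_for(metric_name: str) -> Optional[str]:
    """Return the gpu.vendor value for a metric name (hypothetical helper).

    Returns None when no prefix matches, which corresponds to the transform
    leaving the resource attribute unset.
    """
    for pattern, vendor in VENDOR_PREFIXES:
        if pattern.search(metric_name):
            return vendor
    return None
```

Series from other exporters (for example node_exporter) fall through to `None`, so a dashboard filter on gpu.vendor only ever sees the four values listed in the table.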
The second OTTL transform (transform/dcgm_to_hw_semconv in the
example YAML) projects every load-bearing DCGM_FI_* series into
the customer-stable namespace from
docs/proposals/semconv-hw-gpu-extensions.md.
The contract is one-direction: a downstream consumer reads only
hw.gpu.* / hw.errors, never the raw DCGM names. Per
RFC-0014 the
pattern detectors built on top of this namespace (issue #260)
land as a processor.WithMetrics extension to
patterndetectorprocessor — not as an OTTL metrics-to-logs
emitter — because OTel-contrib transformprocessor v0.130 cannot
emit log records from a metrics pipeline. The cuda_oom (#10)
consumer shipped via
#437 /
PR #461; siblings for #1 / #3 / #4 / #5 are pending under #260.
dcgm-exporter stamps two cross-version label flavors per series:
UUID / gpu_uuid for the NVML UUID, and gpu / GPU for the
NVML index. The transform maps either onto the customer-stable
resource attribute:
| dcgm-exporter label | Resource attribute | Notes |
|---|---|---|
| UUID or gpu_uuid | hw.id | NVML UUID; durable join key. The transform prefers UUID and falls back to gpu_uuid when only the legacy label is present. |
| gpu or GPU | hw.gpu.index | NVML index; volatile across reboots. Same dual-label preference. |
| (computed) | hw.type = "gpu" | Stamped on every DCGM_* series; gates downstream hw.* filters against future non-GPU hw.* sources. |
| pci_bus_id | hw.gpu.pci.bdf | PCI bus-device-function; lifted only on DCGM_FI_PROF_PCIE_{TX,RX}_BYTES series so pattern #5's escalation can cross-reference dmesg AER lines on the same BDF. |
Per-link Tx/Rx Counter. Each DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES
series collapses into one metric name with the link index lifted
into a datapoint attribute via OTTL ExtractPatterns:
| Raw DCGM series | OTel metric | Datapoint attributes (added) |
|---|---|---|
| DCGM_FI_PROF_NVLINK_L{N}_TX_BYTES (N ∈ 0..17) | hw.gpu.nvlink.io (Counter, unit By) | hw.gpu.nvlink.link={N}, network.io.direction=transmit |
| DCGM_FI_PROF_NVLINK_L{N}_RX_BYTES (N ∈ 0..17) | hw.gpu.nvlink.io (Counter, unit By) | hw.gpu.nvlink.link={N}, network.io.direction=receive |
The link index lift uses Int(ExtractPatterns(metric.name, "^DCGM_FI_PROF_NVLINK_L(?P<link>\\d+)_(TX|RX)_BYTES$")["link"])
so the resulting attribute is integer-typed (matches the semconv
proposal's hw.gpu.nvlink.link: int). Per-link decomposition is
the diagnostic-critical surface for
pattern #1 silent NVLink degradation;
without it the alert query has no group-by axis.
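To make the collapse concrete, here is a small sketch of the same projection in Python. It is illustrative only: `project_nvlink` is a hypothetical name, and the regex adds a second named group (`dir`) that the OTTL pattern above does not have, so one function can return both attributes.

```python
import re
from typing import Optional

# Same shape as the OTTL pattern above, with an extra named group for the
# direction (an addition made for this sketch only).
NVLINK_RE = re.compile(r"^DCGM_FI_PROF_NVLINK_L(?P<link>\d+)_(?P<dir>TX|RX)_BYTES$")
DIRECTIONS = {"TX": "transmit", "RX": "receive"}


def project_nvlink(metric_name: str) -> Optional[dict]:
    """Map a raw per-link series name to the collapsed metric and attributes."""
    match = NVLINK_RE.match(metric_name)
    if match is None:
        return None  # not an NVLink per-link series
    return {
        "metric": "hw.gpu.nvlink.io",
        "hw.gpu.nvlink.link": int(match.group("link")),  # integer-typed, as in the proposal
        "network.io.direction": DIRECTIONS[match.group("dir")],
    }
```

All 36 raw names map to the same metric name, and the (link, direction) pair keeps each datapoint distinct.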
dcgm-exporter opt-in required. The
DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES field IDs (1040..1075) are commented out in dcgm-exporter's upstream default-counters.csv. Operators must mount a custom counters ConfigMap and pass it via the chart's -m <ns>:<configmap> flag (or set arguments[1]=-f=/etc/dcgm-exporter/custom-counters.csv and a matching extraVolumes entry). Without this, the recipe compiles cleanly but emits zero hw.gpu.nvlink.io series, so pattern #1 will never fire.
ECC counters expand into four series (correctable / uncorrectable × volatile / aggregate). Pattern #3 alerts on the uncorrectable volatile row; the rest are evidence context that the runbook references.
| Raw DCGM series | OTel metric | Datapoint attributes (added) |
|---|---|---|
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | hw.errors (Counter, unit {error}) | error.type=uncorrected, error.subtype=double_bit, error.persistence=volatile |
| DCGM_FI_DEV_ECC_DBE_AGG_TOTAL | hw.errors | error.type=uncorrected, error.subtype=double_bit, error.persistence=aggregate |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | hw.errors | error.type=corrected, error.subtype=single_bit, error.persistence=volatile |
| DCGM_FI_DEV_ECC_SBE_AGG_TOTAL | hw.errors | error.type=corrected, error.subtype=single_bit, error.persistence=aggregate |
The attribute names match the semconv hw.errors shape (see
hw common).
Pattern #3 doc consumes the
error.persistence=volatile row in its alert query.
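The two-by-two expansion in the table can be expressed as a single name parse. This is a sketch for illustration, with hypothetical helper names (`ecc_attributes`, `is_pattern3_alert_row`); the attribute values are the ones listed in the table, and the alert-row predicate follows the statement above that pattern #3 alerts on the uncorrectable volatile row.

```python
import re
from typing import Optional

ECC_RE = re.compile(r"^DCGM_FI_DEV_ECC_(?P<bits>SBE|DBE)_(?P<scope>VOL|AGG)_TOTAL$")
ERROR_KIND = {
    "SBE": ("corrected", "single_bit"),
    "DBE": ("uncorrected", "double_bit"),
}
PERSISTENCE = {"VOL": "volatile", "AGG": "aggregate"}


def ecc_attributes(metric_name: str) -> Optional[dict]:
    """Return the hw.errors datapoint attributes for a raw ECC series name."""
    match = ECC_RE.match(metric_name)
    if match is None:
        return None
    error_type, subtype = ERROR_KIND[match.group("bits")]
    return {
        "error.type": error_type,
        "error.subtype": subtype,
        "error.persistence": PERSISTENCE[match.group("scope")],
    }


def is_pattern3_alert_row(attrs: dict) -> bool:
    """True only for the uncorrectable volatile row; the rest is evidence context."""
    return attrs["error.type"] == "uncorrected" and attrs["error.persistence"] == "volatile"
```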
Modern dcgm-exporter emits per-reason throttle counters as
discrete DCGM_FI_DEV_*_VIOLATION series. Each maps onto
hw.gpu.throttle.duration with a hw.gpu.throttle.reason
attribute (semconv proposal §2).
| Raw DCGM series | OTel metric | Datapoint attributes (added) |
|---|---|---|
| DCGM_FI_DEV_THERMAL_VIOLATION | hw.gpu.throttle.duration (Counter, unit s) | hw.gpu.throttle.reason=thermal |
| DCGM_FI_DEV_POWER_VIOLATION | hw.gpu.throttle.duration | hw.gpu.throttle.reason=power |
| DCGM_FI_DEV_SYNC_BOOST_VIOLATION | hw.gpu.throttle.duration | hw.gpu.throttle.reason=sync_boost |
| DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | hw.gpu.throttle.duration | hw.gpu.throttle.reason=hw_slowdown |
DCGM_FI_DEV_LOW_UTIL_VIOLATION is intentionally not mapped:
the upstream semconv proposal's hw.gpu.throttle.reason enum
(thermal, power, sync_boost, hw_slowdown, sw_thermal,
display_clock, app_clock_setting) has no value for an
"idle / low-utilization" throttle. Mapping it to a value outside
the proposal would create forward-incompat drift once the SIG
resolves the vocabulary. Tracked at
#272 for the
upstream proposal extension.
Pattern #4 doc alerts
on the reason=thermal row; the other reasons are diagnostic
context (power correlates with PSU sag, hw_slowdown is the
"GPU has decided to clock itself down" hard signal).
DCGM exposes per-direction PCIe byte counters whose rate collapses when the link renegotiates to a lower generation / width. Pattern #5 watches the rate divergence across the host's GPU set.
| Raw DCGM series | OTel metric | Datapoint attributes (added) |
|---|---|---|
| DCGM_FI_PROF_PCIE_TX_BYTES | hw.gpu.io (Counter, unit By) | network.io.direction=transmit |
| DCGM_FI_PROF_PCIE_RX_BYTES | hw.gpu.io (Counter, unit By) | network.io.direction=receive |
The pci_bus_id label is lifted to the resource-level
hw.gpu.pci.bdf so pattern #5's escalation matrix can cross-
reference dmesg PCIe Bus Error: Corrected lines against the
same BDF without joining series. Pattern #5 doc
shows the divergence query in PromQL form.
DCGM exposes the per-GPU framebuffer state as two Gauge series
in bytes. They are the proximate signal pattern #10 joins to a
RuntimeError: CUDA out of memory log record so the detector
can discriminate true-OOM (≤5% free at fault time) from
allocator fragmentation (>5% free at fault time). The OTTL
projection lands the customer-stable hw.gpu.memory.{used,free}
shape on the metrics pipeline; the total = used + free
derivation lives at the bridge layer below.
| Raw DCGM series | OTel metric | Datapoint attributes (unchanged) |
|---|---|---|
| DCGM_FI_DEV_FB_USED | hw.gpu.memory.used (Gauge, unit By) | none; vendor's gpu / UUID labels are lifted to resource by the section above |
| DCGM_FI_DEV_FB_FREE | hw.gpu.memory.free (Gauge, unit By) | none; same |
hw.gpu.memory.total is intentionally NOT projected at the OTTL
metric-statements layer. transformprocessor v0.130 operates
one datapoint at a time within a metric — there is no cross-
series arithmetic that could compute total = used + free on a
metrics pipeline (upstream README).
The total is computed at the metrics-to-logs bridge layer where
the two scalars are already projected onto a single log record;
see the bridge-contract section below.
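As an illustration of why the sum belongs at a layer that sees both series at once, the sketch below joins used and free samples per (GPU, scrape) and derives the total. It is not the in-tree consumer: `fb_memory_records` and the tuple layout of `samples` are assumptions made for this example.

```python
def fb_memory_records(samples):
    """Join used/free Gauge samples into one record per (GPU, scrape).

    samples: iterable of (gpu_id, scrape_ts, metric_name, value_bytes) tuples,
    a hypothetical flattened view of the projected metrics.
    """
    grouped = {}
    for gpu_id, scrape_ts, metric_name, value in samples:
        grouped.setdefault((gpu_id, scrape_ts), {})[metric_name] = value

    records = []
    for (gpu_id, scrape_ts), values in sorted(grouped.items()):
        used = values.get("hw.gpu.memory.used")
        free = values.get("hw.gpu.memory.free")
        if used is None or free is None:
            continue  # an incomplete pair cannot yield a trustworthy total
        records.append({
            "gpu.id": gpu_id,
            "timestamp": scrape_ts,
            "hw.gpu.memory.free": free,
            "hw.gpu.memory.used": used,
            "hw.gpu.memory.total": used + free,
        })
    return records
```

A GPU that reports only one of the two series in a scrape produces no record, which matches the requirement that free and total arrive together.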
Pattern #10 doc consumes
the joined record via module/processor/patterndetectorprocessor/cuda_oom.go's
projectFBMemoryRecord (gate: hw.gpu.memory.free, hw.gpu.memory.total, and
gpu.id must all be present on the same log record).
MIG caveat. On MIG-partitioned GPUs,
DCGM_FI_DEV_FB_FREE reports the parent device, not the MIG slice. The detector spec (10-cuda-oom-deceptive.md §Edge cases) gates on hw.gpu.mig.enabled == true to skip MIG hosts until MIG-aware FB metrics are wired. The OTTL projection itself is MIG-safe: it just renames the parent-device series; the detector decides whether the renamed series is meaningful.
Source: node_exporter --collector.infiniband (the upstream Prometheus
node-exporter infiniband collector
which reads /sys/class/infiniband/<dev>/ports/<n>/phys_state and
exposes the IBA-spec phys_state ID as an integer Gauge). Run the
collector under tracecore's prometheusreceiver per
RFC-0013 §2;
the in-tree binary bundles prometheusreceiver so no extra
component is required.
| Raw node_exporter series | OTel metric | Datapoint attributes (added) |
|---|---|---|
| node_infiniband_port_state_id{device, port} | hw.network.ib.port.state (Gauge, IBA phys_state ID 1=Down / 2=Init / 3=Armed / 4=Active) | hw.network.ib.device={device label}, hw.network.ib.port.num=Int({port label}) |
The detector
(module/processor/patterndetectorprocessor/ib_link_flap.go)
reads these three attributes off a log record via port.Int() /
state.Int() — the Int() cast on the port label is load-bearing
because prometheusreceiver exposes Prometheus labels as strings
while the projector calls Int() on the pdata Value. The companion
series node_infiniband_state{state="<name>"} (string label) is
intentionally not mapped: the detector compares against the
patterns.IBPortState* integer constants, so the string variant
would round-trip wrong.
The metric rename runs last so the where metric.name == "node_infiniband_port_state_id" guards on the attribute-stamp
statements above still match the raw exporter name when each
statement evaluates. Renaming first would short-circuit the
attribute stamps because the second statement's guard would no
longer find the original name.
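The ordering constraint can be reproduced with a toy model of guarded statements. This is a simulation written for this page, not the OTTL stanza itself: each Python function stands in for one statement with a `where metric.name == ...` guard, and the dictionary layout is hypothetical.

```python
RAW_NAME = "node_infiniband_port_state_id"
STABLE_NAME = "hw.network.ib.port.state"


def stamp_device(dp):
    # Guarded attribute stamp: only applies while the raw name is still present.
    if dp["metric.name"] == RAW_NAME:
        dp["attributes"]["hw.network.ib.device"] = dp["labels"]["device"]


def stamp_port(dp):
    # Prometheus labels arrive as strings, hence the int() cast.
    if dp["metric.name"] == RAW_NAME:
        dp["attributes"]["hw.network.ib.port.num"] = int(dp["labels"]["port"])


def rename(dp):
    if dp["metric.name"] == RAW_NAME:
        dp["metric.name"] = STABLE_NAME


def run(statements, labels):
    """Apply statements in order to a fresh datapoint and return it."""
    dp = {"metric.name": RAW_NAME, "labels": dict(labels), "attributes": {}}
    for statement in statements:
        statement(dp)
    return dp


labels = {"device": "mlx5_0", "port": "1"}
rename_last = run([stamp_device, stamp_port, rename], labels)
rename_first = run([rename, stamp_device, stamp_port], labels)
```

With the rename last, both attributes are stamped before the name changes. With the rename first, both guards fail and the datapoint carries the new name but no attributes.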
Pattern #2 doc consumes the
joined record via
projectIBPortStateRecord
(gate: hw.network.ib.port.state AND hw.network.ib.device AND
hw.network.ib.port.num on the same log record, plus
k8s.node.name on the resource). The metrics→logs emit half follows
the RFC-0014 pattern — the cuda_oom (#10) precedent ships an in-tree
consumer via #437
/ PR #461 (processors.metrics.Metrics + bounded cross-stream
buffer); the IB link-flap metrics-path consumer is a pending sibling
follow-up. The bridge log-record schema (load-bearing for both the
OTTL recipe path and the future in-tree consumer) is pinned in the
next section.
The pattern detectors at module/processor/patterndetectorprocessor
read log records as their primary input (processor.WithLogs).
The DCGM scrape recipe above produces metric datapoints. Bridging
the two at the OTTL layer is upstream-blocked at OTel-contrib v0.130
— no contrib processor or connector emits log records from a metrics
pipeline (per
RFC-0014).
RFC-0014's resolution path adds processor.WithMetrics directly on
patterndetectorprocessor so the processor consumes the metrics
pipeline in-tree and joins it against the logs path via a bounded
cross-stream buffer. The cuda_oom (#10) consumer shipped via
#437 / PR #461;
metrics-path consumers for the other patterns (#2 IB link flap,
#3 HBM ECC, #4 thermal throttle, #5 PCIe AER) are pending sibling
follow-ups. Until those land, the bridge attribute contract below is
the load-bearing wire-format an OTTL recipe (when one becomes
expressible) OR the in-tree consumer MUST honor — the detector
projections gate on this contract today, and any emitter (in-tree
consumer or future OTTL recipe) that stamps these attributes fires
the pattern end-to-end without changing the detector library.
Pattern #3 — hw.errors.delta (issue #273)
The HBM ECC detector gates on a log record carrying:
| Attribute | Type | Source |
|---|---|---|
| hw.errors.delta | int | per-scrape delta of hw.errors counter (= increase(hw_errors[scrape_interval])) |
| gpu.id | string | PCI BDF resource attr from the DCGM series |
| hw.gpu.index | int (optional) | NVML index from the DCGM series |
| error.type | string | uncorrected for the alert row |
| error.subtype | string | double_bit for the alert row |
| error.persistence | string | volatile for the alert row |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor |
The metric datapoint attribute set from the transform/dcgm_to_hw_semconv
stanza above already carries error.type / error.subtype /
error.persistence on the renamed hw.errors Counter, so the future
emitter passes those through unchanged; the only new field is the
per-scrape hw.errors.delta integer.
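A per-scrape delta over a cumulative counter needs a rule for the first sample and for counter resets. The sketch below shows one reasonable choice; the reset handling (treat a decrease as a restart from zero, in the spirit of PromQL's increase()) is an assumption of this example and is not specified by the contract above. `counter_delta` is a hypothetical name.

```python
from typing import Optional


def counter_delta(previous: Optional[int], current: int) -> int:
    """Per-scrape delta of a cumulative counter (illustrative rules).

    - No previous sample: report 0, since there is no baseline yet.
    - Counter went down: assume a reset to zero and count from there.
    - Otherwise: plain difference between consecutive scrapes.
    """
    if previous is None:
        return 0
    if current < previous:
        return current
    return current - previous
```

For example, consecutive samples 5, 8, 2 yield deltas 0, 3, 2 under these rules.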
Pattern #4 — hw.gpu.throttle.duration.delta (issue #282)
The thermal-throttle detector gates on a log record carrying:
| Attribute | Type | Source |
|---|---|---|
| hw.gpu.throttle.duration.delta | int (seconds) | per-scrape delta of hw.gpu.throttle.duration |
| hw.gpu.throttle.reason | string | thermal for the alert row |
| gpu.id | string | PCI BDF resource attr |
| hw.gpu.index | int (optional) | NVML index |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor |
Units pinned to integer seconds because projectThermalThrottleRecord
at module/processor/patterndetectorprocessor/patterndetector.go
multiplies the delta by time.Second — the wire format MUST agree.
Pattern #5 — tracecore.alert.pcie_rate_collapse.* (issue #284)
Layer 2 of the PCIe AER cascade detector gates on a log record carrying:
| Attribute | Type | Source |
|---|---|---|
| tracecore.alert.pcie_rate_collapse.bytes_per_second | double | rate(hw.gpu.io[5m]) per GPU |
| tracecore.alert.pcie_rate_collapse.baseline_bytes_per_second | double | quantile(0.5, rate(...)) by (k8s.node.name) |
| tracecore.alert.pcie_rate_collapse.direction | string | transmit / receive (falls back to network.io.direction) |
| gpu.id | string | PCI BDF resource attr |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor |
Namespacing under tracecore.alert.pcie_rate_collapse.* keeps the
bridge log shape distinguishable from raw hw.gpu.io scrape samples
downstream. Layer 1 (journald-kernel AER stanza) is documented in
journald-kernel.md and ships independently
of this bridge.
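The per-GPU rate and per-node baseline pair can be sketched as follows. This is an illustration of the attribute shape only: `pcie_rate_records` is a hypothetical helper, the input is assumed to be already-computed rates for the GPUs of one node and one direction, and `statistics.median` stands in for the quantile(0.5, ...) aggregation. No collapse threshold is applied here because the contract above does not define one.

```python
from statistics import median

PREFIX = "tracecore.alert.pcie_rate_collapse."


def pcie_rate_records(rates_by_gpu, direction):
    """Build one bridge-style attribute set per GPU on a single node.

    rates_by_gpu: {gpu.id: bytes per second} for one node and one direction.
    """
    if not rates_by_gpu:
        return []
    baseline = float(median(rates_by_gpu.values()))  # node-wide median rate
    return [
        {
            "gpu.id": gpu_id,
            PREFIX + "bytes_per_second": float(rate),
            PREFIX + "baseline_bytes_per_second": baseline,
            PREFIX + "direction": direction,
        }
        for gpu_id, rate in sorted(rates_by_gpu.items())
    ]
```

Carrying the baseline on every record lets a consumer compare a GPU against its node peers without a second lookup.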
Pattern #2 — hw.network.ib.port.state (issue #393)
The InfiniBand link-flap detector
(module/processor/patterndetectorprocessor/ib_link_flap.go::projectIBPortStateRecord)
gates on a log record carrying:
| Attribute | Type | Source |
|---|---|---|
| hw.network.ib.port.state | int | last node_infiniband_port_state_id Gauge sample for the (device, port) tuple at bridge-emit time; IBA phys_state ID (1=Down, 2=Init, 3=Armed, 4=Active) |
| hw.network.ib.device | string | device label on the source series (e.g. mlx5_0) |
| hw.network.ib.port.num | int | port label on the source series, cast via OTTL Int() |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor on the DaemonSet |
The metric datapoint attribute set from the transform/ib_to_hw_semconv
stanza above already carries hw.network.ib.device and
hw.network.ib.port.num; the future emitter passes those through
unchanged. The hw.network.ib.port.state integer lifts directly from
the renamed Gauge's datapoint value (one log record per (device, port, scrape) — emit-once-per-state-transition is a detector-side
optimization, not a bridge-side gate; the detector's
patterns.IBLinkFlapDetector
counts transitions internally).
# Bridge-emitted log record consumed by patterndetectorprocessor's
# ib_link_flap detector. One log record per (device, port, scrape);
# the detector counts transitions across consecutive records.
resource:
  attributes:
    k8s.node.name: gpu-node-0007  # str, REQUIRED. Flap predicate is per-node.
log_record:
  timestamp: 2026-06-01T10:04:30Z  # MUST be the scrape timestamp.
  body: ""  # ignored by the detector.
  attributes:
    hw.network.ib.port.state: 1  # int, REQUIRED. IBA phys_state ID; detector compares against patterns.IBPortState* constants.
    hw.network.ib.device: mlx5_0  # str, REQUIRED. Per-NIC device name; flap predicate is per-device.
    hw.network.ib.port.num: 1  # int, REQUIRED. Port index; flap predicate is per-port (a 2-port HCA tracks each port separately).

projectIBPortStateRecord (at
module/processor/patterndetectorprocessor/ib_link_flap.go) extracts
the three scalars and builds a patterns.IBPortStateRecord. The
detector emits one verdict per (k8s.node.name, hw.network.ib.device, hw.network.ib.port.num) tuple when transition count within
ib_link_flap_window (default 2min) crosses
ib_link_flap_min_transitions (default 2). The unit tests
TestPatternDetector_IBLinkFlapWiring*
pin the canonical wire format above against the live detector.
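A windowed transition count of the kind described above can be sketched like this. It is a simplified model, not patterns.IBLinkFlapDetector: the class name is hypothetical, "crosses" is read as "greater than or equal to", and the sketch reports the condition on every observation rather than emitting a single verdict per tuple.

```python
from collections import deque


class LinkFlapCounter:
    """Count port-state transitions per (node, device, port) in a sliding window."""

    def __init__(self, window_s=120, min_transitions=2):
        self.window_s = window_s                # mirrors ib_link_flap_window (2min)
        self.min_transitions = min_transitions  # mirrors ib_link_flap_min_transitions
        self._last_state = {}
        self._transitions = {}

    def observe(self, key, ts, state):
        """Record one sample; return True while the flap threshold is met."""
        previous = self._last_state.get(key)
        self._last_state[key] = state
        times = self._transitions.setdefault(key, deque())
        if previous is not None and previous != state:
            times.append(ts)  # a state change between consecutive records
        while times and ts - times[0] > self.window_s:
            times.popleft()   # drop transitions that fell out of the window
        return len(times) >= self.min_transitions
```

With 15s scrapes, an Active, Down, Active sequence on one port produces two transitions within the window and meets the default threshold; a single Down that stays Down does not.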
Pattern #7 — tracecore.alert.training_step_stalled.* (issue #365)
The dataloader_hang detector's Layer 2 input is a training-step
stall bridge log record derived from the trainer's
gen_ai.training.step_duration_seconds Gauge. The detector
(module/processor/patterndetectorprocessor/dataloader_hang.go::projectTrainingStepStallRecord)
gates on a log record carrying:
| Attribute | Type | Source |
|---|---|---|
| tracecore.alert.training_step_stalled.no_progress_seconds | int (seconds) | wall-clock duration since the last gen_ai.training.step_duration_seconds Gauge sample advanced; the bridge fires once when the value crosses StallThreshold (default 180s) |
| tracecore.alert.training_step_stalled.last_step_ns | int (optional, unix-ns) | timestamp of the last step-progress sample observed before the stall; falls back to the log record's Timestamp |
| gen_ai.training.step | int | last step index emitted by the trainer; the detector's warmup guard skips step < 2 |
| gen_ai.training.phase | string (optional) | train / eval; the detector's eval-phase guard skips phase == "eval" |
| k8s.pod.name | string (resource) | training pod identity; stamped by k8sattributesprocessor on the trainer side |
| k8s.namespace.name | string (resource) | training pod namespace; same source |
| k8s.node.name | string (resource) | node hosting the training pod; same source |
Namespacing under tracecore.alert.training_step_stalled.* keeps the
bridge log shape distinguishable from raw
gen_ai.training.step_duration_seconds Gauge samples downstream and
mirrors the
tracecore.alert.pcie_rate_collapse.*
naming the pattern #5 bridge contract uses.
Unlike the DCGM-sourced bridges above, the input metric here is
not an hw.* series — it is the upstream
gen_ai.training.step_duration_seconds Gauge (per
OTel GenAI semconv §Metrics,
status: development at v0.130). Trainers emit this via OTel
auto-instrumentation (opentelemetry-instrumentation-* for the
framework in use) or an explicit Meter.create_gauge call inside the
training loop. The recipe assumes the Gauge arrives via an OTLP
push from the trainer pod — prometheusreceiver is one valid scrape
path (Prometheus-format exposition of the Gauge by the trainer's
metrics endpoint), but the same bridge attribute contract applies to
OTLP-push topologies.
Same RFC-0014 block as patterns #3 / #4 / #5 / #10: OTTL
metric_statements cannot reference log.* paths at OTel-contrib
v0.130, and no contrib connector emits log records from a metrics
pipeline. The resolution path is either (a) an in-tree
processor.WithMetrics extension to patterndetectorprocessor
(tracked under #260;
cuda_oom #10 shipped via
#437 /
PR #461; pattern #7 piggybacks on the same plumbing because the
attribute contract is purely a wire-format pin) or (b) an upstream
metricthresholdconnector contribution.
# Bridge-emitted log record consumed by patterndetectorprocessor's
# dataloader_hang detector. One log record per (training pod, stall
# crossing), NOT one per Gauge sample.
resource:
  attributes:
    k8s.pod.name: trainer-rank-3  # str, REQUIRED. Pod-scoped join key.
    k8s.namespace.name: training  # str, required (verdict carries it).
    k8s.node.name: gpu-node-0007  # str, REQUIRED. Node-scoped storage-event join key.
log_record:
  timestamp: 2026-06-01T10:04:30Z  # MUST be the stall-detection time (last_step + no_progress), not now()
  body: ""  # ignored by the detector
  attributes:
    tracecore.alert.training_step_stalled.no_progress_seconds: 240  # int (seconds), REQUIRED. Gate the detector reads.
    tracecore.alert.training_step_stalled.last_step_ns: 1717236030000000000  # int (unix-ns), optional, sharpens the verdict timestamp.
    gen_ai.training.step: 42  # int, REQUIRED for the warmup guard (step >= 2).
    gen_ai.training.phase: train  # str, optional; "eval" gets skipped by the eval-phase guard.

projectTrainingStepStallRecord (at
module/processor/patterndetectorprocessor/projectors_shared.go)
extracts the four required scalars and builds a
patterns.TrainingStepStallRecord. The dataloader_hang detector
joins each stall against a same-pod
dataloader.error_class
log record OR a same-node FailedMount / VolumeMountFailure
Kubernetes Event within
DataLoaderHangCorrelationWindow (default 5min); without a
discriminator, no verdict fires (spec §"Detector evaluation rule"
— stalls alone are not a hang because patterns #6 stragglers and
#11 checkpointer also stall steps). The unit tests
TestPatternDetector_DataLoaderHangWiring*
(module/processor/patterndetectorprocessor/dataloader_hang_test.go)
pin the canonical wire format above against the live detector.
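The evaluation rule (guards first, then a discriminator inside the correlation window) can be sketched as a single predicate. This is a simplified model, not the detector: `dataloader_hang_fires` is a hypothetical name, timestamps are plain seconds, the discriminators (same-pod error log or same-node mount-failure Event) are reduced to a list of timestamps, and treating the window as symmetric around the stall is an assumption.

```python
def dataloader_hang_fires(stall, discriminator_times, window_s=300):
    """Decide whether a stall record would produce a verdict (illustrative).

    stall: dict with "timestamp" (seconds), "gen_ai.training.step", and an
    optional "gen_ai.training.phase".
    discriminator_times: timestamps of matching discriminator events.
    """
    if stall["gen_ai.training.step"] < 2:
        return False  # warmup guard
    if stall.get("gen_ai.training.phase") == "eval":
        return False  # eval-phase guard
    # A stall alone is not a hang; require a discriminator within the window.
    return any(abs(t - stall["timestamp"]) <= window_s for t in discriminator_times)
```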
Pattern #10 — hw.gpu.memory.{free,total} (issue #337)
Status: shipped via #437 / PR #461. patterndetectorprocessor now additionally implements processors.metrics.Metrics (ADR-0001 PR-B): the metrics-path consumer projects hw.gpu.memory.{free,total} pmetric.NumberDataPoints directly into patterns.FBMemoryRecord values, buffers them in a bounded ring keyed on processor component.ID, and the logs-path consumer drains the buffer at CUDA OOM-log time. Operators running dcgm-exporter into the metrics pipeline get full-confidence verdicts WITHOUT configuring the metrics→logs OTTL recipe below. The log-record schema below remains the load-bearing wire contract for the OTTL recipe path (the alternative when an operator can't run the in-tree consumer); the detector (module/processor/patterndetectorprocessor/cuda_oom.go::projectFBMemoryRecord) expects this exact shape and both paths converge on it.
The CUDA OOM detector gates on a log record carrying:
| Attribute | Type | Source |
|---|---|---|
| hw.gpu.memory.free | int (bytes) | last DCGM_FI_DEV_FB_FREE Gauge sample for the GPU at bridge-emit time |
| hw.gpu.memory.total | int (bytes) | DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE at bridge-emit time (per-GPU sum within one scrape) |
| gpu.id | string | PCI BDF resource attr from the DCGM series (or hw.gpu.pci.bdf fallback) |
| k8s.node.name | string (resource) | stamped by k8sattributesprocessor |
The bridge MUST stamp BOTH hw.gpu.memory.free AND
hw.gpu.memory.total on the same log record — the detector's
projectFBMemoryRecord gate at module/processor/patterndetectorprocessor/cuda_oom.go
short-circuits if either is missing (the fragmentation discriminator
needs both numerator and denominator joined to the same GPU /
same scrape). One log record per (GPU, scrape) is the load-bearing
shape: per-attribute records emitted on separate logs would defeat
the join.
# Bridge-emitted log record consumed by patterndetectorprocessor's
# cuda_oom detector. Emitted once per (GPU, scrape), NOT once per
# (GPU, attribute). Field types match the OTel pdata stamps the
# detector reads (Int / Str).
resource:
  attributes:
    k8s.node.name: gpu-node-0001  # str, stamped by k8sattributesprocessor
    hw.id: GPU-3a4b...  # str, NVML UUID (optional join key)
    hw.gpu.pci.bdf: 0000:3b:00.0  # str, PCI BDF (optional; gpu.id below is the cross-signal join)
log_record:
  timestamp: 2026-06-01T10:00:00Z  # MUST be the scrape timestamp, not now()
  body: ""  # ignored by the detector; keep empty or set to a debug-friendly summary
  attributes:
    gpu.id: PCI:0000:3b:00  # str, REQUIRED. Cross-signal join key (same shape as the OOM log record).
    hw.gpu.memory.free: 17179869184  # int, REQUIRED. Bytes free on the GPU at the scrape.
    hw.gpu.memory.total: 85899345920  # int, REQUIRED. Bytes total on the GPU (= used + free at the scrape).
    hw.gpu.memory.used: 68719476736  # int, optional, evidence context. NOT gated on by the detector.
    hw.gpu.index: 3  # int, optional, evidence context.

projectFBMemoryRecord (at module/processor/patterndetectorprocessor/cuda_oom.go:114)
extracts the three required scalars and builds a patterns.FBMemoryRecord.
The free-ratio (FreeBytes / TotalBytes) is then compared against
cuda_oom_fb_free_fragmentation_threshold (default 0.05) — a
ratio at-or-above threshold flips the verdict to
cuda_oom.kind=fragmentation, below flips to true_oom. The
unit test TestPatternDetector_CUDAOOMWiringEmitsFragmentationVerdict
pins the canonical wire format above against the live detector.
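The threshold comparison is small enough to show in full. The sketch below follows the rule as stated in the preceding paragraph (ratio at or above the threshold is fragmentation, below is true OOM); `classify_cuda_oom` is a hypothetical name, and returning `None` for a non-positive total is an assumption of this example.

```python
from typing import Optional


def classify_cuda_oom(free_bytes: int, total_bytes: int, threshold: float = 0.05) -> Optional[str]:
    """Classify an OOM by the free ratio at fault time (illustrative).

    threshold mirrors cuda_oom_fb_free_fragmentation_threshold (default 0.05).
    """
    if total_bytes <= 0:
        return None  # no usable denominator; assumption of this sketch
    free_ratio = free_bytes / total_bytes
    return "fragmentation" if free_ratio >= threshold else "true_oom"
```

With the sample record above (16 GiB free of 80 GiB total) the ratio is 0.2, so the sketch returns `fragmentation`; a GPU with 1 GiB free of 80 GiB (ratio 0.0125) returns `true_oom`.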
OTTL metric_statements cannot reference log.* paths at v0.130
(upstream README).
Connectors that change signal type all emit metrics, not logs
(countconnector, signaltometricsconnector, spanmetricsconnector).
Per RFC-0014 the resolution path is either (a) an in-tree
processor.WithMetrics extension to patterndetectorprocessor (shipped
for cuda_oom (#10) via #437 / PR #461, with sibling consumers for
patterns #2 / #3 / #4 / #5 pending under #260) or (b) an upstream
metricthresholdconnector contribution.
The contract above stays stable across either resolution.
Every statement in transform/dcgm_to_hw_semconv uses upstream
OTTL functions only: set, IsMatch, ExtractPatterns, Int.
No new OTTL functions are introduced. If a future series cannot
be projected with the existing function set, the right response
is to propose the missing function upstream to OTel contrib —
not to ship a tracecore-specific OTTL extension.
OTTL set(metric.name, ...) renames a metric in place; the v0.130
README warns that "Transformation of metrics have the potential to
affect the identity of a metric leading to an Identity Crisis."
For this recipe the conflict is intentional: 36 input series
(18 NVLink links × 2 directions) collapse into one output metric
named hw.gpu.nvlink.io with distinct attribute sets per
datapoint. Downstream OTel + Prometheus backends merge by
(metric.name, attributes) and produce the expected per-link /
per-direction series. If your backend rejects the merged shape,
follow the upstream guidance to apply the rename inside a
separate statement group from any other identity-affecting
operation; this recipe already isolates the rename inside its
own processor (transform/dcgm_to_hw_semconv).
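To make the collapse concrete, the fragment below is a minimal sketch of the rename pattern, not a copy of the recipe's statements, which remain authoritative. It assumes the per-link input series are named DCGM_FI_PROF_NVLINK_L<N>_TX_BYTES / DCGM_FI_PROF_NVLINK_L<N>_RX_BYTES. The processor name and the two attribute keys (nvlink.link, nvlink.direction) are hypothetical and chosen only for illustration.

```yaml
processors:
  transform/nvlink_rename_sketch:   # hypothetical name, for illustration only
    error_mode: ignore
    metric_statements:
      # Group 1: copy the link index and direction out of the metric name
      # into datapoint attributes while the original name is still present.
      - context: datapoint
        statements:
          - set(attributes["nvlink.link"], ExtractPatterns(metric.name, "^DCGM_FI_PROF_NVLINK_L(?P<link>\\d+)_(?P<dir>[TR]X)_BYTES$")["link"]) where IsMatch(metric.name, "^DCGM_FI_PROF_NVLINK_L\\d+_[TR]X_BYTES$")
          - set(attributes["nvlink.direction"], ExtractPatterns(metric.name, "^DCGM_FI_PROF_NVLINK_L(?P<link>\\d+)_(?P<dir>[TR]X)_BYTES$")["dir"]) where IsMatch(metric.name, "^DCGM_FI_PROF_NVLINK_L\\d+_[TR]X_BYTES$")
      # Group 2: rename in place. Every matching input series now shares one
      # metric name and is told apart only by the attributes set in group 1.
      - context: metric
        statements:
          - set(name, "hw.gpu.nvlink.io") where IsMatch(name, "^DCGM_FI_PROF_NVLINK_L\\d+_[TR]X_BYTES$")
```

The attributes are written before the rename because the link index and direction exist only in the original metric name. Once the name is overwritten, that information can no longer be recovered from the datapoint.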
| Placeholder | What to fill in |
|---|---|
| REPLACE_WITH_OTLP_HTTP_ENDPOINT | The OTLP/HTTP base URL of your sink. /v1/metrics is appended automatically per the OTLP/HTTP spec. |
| REPLACE_WITH_DCGM_EXPORTER_TARGET | localhost:9400 for a DaemonSet shape, or the dcgm-exporter Service DNS (dcgm-exporter.kube-system.svc:9400) for a Deployment shape. |
Tracecore does not expand environment variables in YAML. Render the
literals at deploy time via envsubst, a Helm template, or a
Kubernetes secret-injection driver. The :port suffix is mandatory
— prometheusreceiver rejects bare hostnames at validate.
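As a rough illustration of what the rendered literals might look like after deploy-time substitution, the fragment below fills in both placeholders. The otlp.example.com URL is a made-up stand-in for your sink, and the static_configs and otlphttp keys are assumptions about how the recipe wires the target and the exporter; adjust them to match the actual example file.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter
          scrape_interval: 15s
          static_configs:
            # Was REPLACE_WITH_DCGM_EXPORTER_TARGET. Must be host:port,
            # never a bare hostname.
            - targets: ["dcgm-exporter.kube-system.svc:9400"]

exporters:
  otlphttp:
    # Was REPLACE_WITH_OTLP_HTTP_ENDPOINT. Base URL only; the exporter
    # appends /v1/metrics itself. The hostname is a placeholder.
    endpoint: https://otlp.example.com:4318
```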
| Symptom | First check |
|---|---|
| scrape_configs.targets[0]: address ... incorrect at validate | The target placeholder still carries REPLACE_WITH_DCGM_EXPORTER_TARGET; the validator now rejects literal placeholders that look like hostnames. Render at deploy time. |
| Scrape returns 200 but no metrics flow | prometheusreceiver requires the response to be in Prometheus text exposition format. A response in OTLP-JSON or a vendor-proprietary format is silently dropped. Run curl against the endpoint and confirm the first line starts with # HELP. |
| gpu.vendor empty on a known DCGM target | The exporter is on an old release that emits the legacy dcgm_* prefix (lowercase). Either upgrade the exporter to a DCGM_*-emitting build or extend the OTTL regex to ^[Dd][Cc][Gg][Mm]_. |
| cardinality limit exceeded from the backend | prometheusreceiver does not cap series. Add a filterprocessor between prometheus and transform/gpu_vendor to drop metrics you don't query. Cap dcgm-exporter's --collectors flag to the families you alert on. |
| Bearer-token target returns 401 | The ServiceAccount is not bound to a role that grants access to the target. For Kueue, the SA needs a rule with nonResourceURLs: ["/metrics"] and verbs: ["get"], bound via a ClusterRoleBinding. |
| dcgm-exporter pod CrashLoopBackOff with Failed to initialize NVML: Driver/library version mismatch | The host's NVIDIA driver is older (or newer) than the DCGM library bundled into the dcgm-exporter image. Confirm with kubectl -n gpu-monitoring logs -l app.kubernetes.io/name=dcgm-exporter --tail=20. Align by upgrading the driver via the NVIDIA GPU Operator or nvidia-driver-daemonset, or by pinning the chart to a dcgm-exporter image tag that matches your driver minor version. |
| dcgm-exporter pod CrashLoopBackOff with Failed to initialize NVML: Driver Not Loaded | No NVIDIA driver is installed on the host, which usually means the DaemonSet was scheduled onto a non-GPU node. Tighten nodeSelector to a GPU-only label (e.g. nvidia.com/gpu.present: "true" from NVIDIA's node-feature-discovery, or your cluster's equivalent). If the GPU nodes themselves lack a driver, install one first; the chart cannot start without it. |
| dcgm-exporter pod Running but /metrics returns 500 / hangs | DCGM cannot reach NVML, usually because the nvidia-container-toolkit runtime is not configured (the container has no /dev/nvidiactl). Verify with kubectl -n gpu-monitoring exec <pod> -- ls /dev/nvidia*. The runtime must be the NVIDIA container runtime; see the Quickstart on Kubernetes prerequisites. |
| dcgm-exporter pod Pending with forbidden: ... configmaps event | The chart's Role + RoleBinding (dcgm-exporter-read-cm) was disabled or the ServiceAccount lost its binding. Re-render with the defaults (rbac.create=true, serviceAccount.create=true); the pod must be able to read the exporter-metrics-config-map to load default-counters.csv. |
| prometheusreceiver logs context deadline exceeded while scraping dcgm-exporter | dcgm-exporter's first scrape after startup can take >10s on hosts with many GPUs because DCGM has to initialize per-device watches. Either raise the recipe's scrape_timeout (above) past 15s, or set the chart's arguments to include -d g (GPUs only, no GPU-instances enumeration) to shrink the field watch set. Keep scrape_interval ≥ scrape_timeout to avoid overlapping scrapes; with the 15s interval shown in the recipe, raising the timeout past 15s means raising the interval as well. |
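Two of the fixes in the table above are collector-side configuration changes, sketched together below under stated assumptions. The 30s / 20s values are illustrative, not recommendations. The filter/drop_unqueried name, the dropped metric families (encoder/decoder utilisation), and the otlphttp exporter name are hypothetical; substitute the series you do not query and the exporter your recipe actually uses.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter
          # Interval raised together with the timeout so that
          # scrape_interval stays >= scrape_timeout.
          scrape_interval: 30s
          scrape_timeout: 20s
          # (target configuration unchanged and omitted here)

processors:
  filter/drop_unqueried:            # hypothetical name
    error_mode: ignore
    metrics:
      metric:
        # A metric matching any listed condition is DROPPED.
        # The pattern below is an arbitrary example.
        - IsMatch(name, "^DCGM_FI_DEV_(ENC|DEC)_UTIL$")

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      # The filter sits between the receiver and transform/gpu_vendor,
      # so it matches on the raw DCGM_* names.
      processors: [filter/drop_unqueried, transform/gpu_vendor, transform/dcgm_to_hw_semconv, transform/ib_to_hw_semconv]
      exporters: [otlphttp]
```

Placing the filter first keeps the drop conditions written against the exporter's raw names, so they do not need to change if the hw.gpu.* projection is later extended.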
Upstream component docs:
receiver/prometheusreceiver,
processor/transformprocessor,
processor/filterprocessor.