Tracecore tails container stdout/stderr files under /var/log/pods/
on each node using the upstream filelogreceiver with a container
parser stanza. The k8sattributesprocessor enriches each record with
pod, namespace, and workload identity; the file_storage extension
checkpoints read offsets across restarts so log lines are not
re-shipped on rollouts. An OTTL transform/dataloader_errors stanza
projects per-driver PyTorch DataLoader error vocabulary (FUSE, S3,
Lustre, multiprocessing queue, worker-killed) onto the customer-stable
dataloader.error_class / dataloader.worker_pid attributes that
pattern #7's detector consumes.
A sibling transform/cuda_oom stanza projects PyTorch's
RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit> line
onto the customer-stable cuda_oom.tried_alloc_bytes (Int, bytes;
unit-normalized) + cuda_oom.gpu_index (Int) attributes that
pattern #10's detector consumes.
This recipe replaces the in-tree containerstdout receiver, which is
scheduled for deletion at v0.2.0 per RFC-0013 §migration PR-K and §7
(Deletion list).

```yaml
# docs/integrations/examples/filelog-container.yaml
extensions:
  file_storage/checkpoints:
    directory: /var/lib/tracecore/filelog
    create_directory: true
    timeout: 1s
    compaction:
      directory: /var/lib/tracecore/filelog
      on_start: true
      on_rebound: true

receivers:
  filelog/container:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/*/tracecore/*.log
    start_at: end
    include_file_path: true
    include_file_name: false
    storage: file_storage/checkpoints
    operators:
      - id: container-parser
        type: container
        format: auto
        add_metadata_from_filepath: true
      - id: severity-parser
        type: severity_parser
        parse_from: attributes.stream
        mapping:
          error: stderr
          info: stdout
        if: 'attributes.stream != nil'

processors:
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.node.name
        - k8s.container.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: resource_attribute
            name: k8s.namespace.name
          - from: resource_attribute
            name: k8s.pod.name
  batch:
    send_batch_size: 8192
    timeout: 5s
    send_batch_max_size: 16384

exporters:
  otlphttp:
    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
    compression: gzip
    timeout: 10s

service:
  extensions: [file_storage/checkpoints]
  pipelines:
    logs/container:
      receivers: [filelog/container]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]
```

Validate with the in-tree binary:

```sh
./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml
```

Exit 0 means the config parses, every component name resolves,
create_directory: true will materialize the checkpoint directory at
boot, and the OTLP endpoint URL is well-formed once the placeholder
has been rendered.
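
To make the include / exclude pair in the config above concrete, here is a minimal Python sketch of which paths the receiver would tail. It is a model, not the receiver's implementation: it assumes that `*` never crosses a `/` (so matching is done segment by segment), that kubelet lays files out as `/var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log`, and that the collector's own container is named `tracecore`. The helper names `segment_match` and `is_tailed` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Globs copied from the filelog/container receiver in the config above.
INCLUDE = ["/var/log/pods/*/*/*.log"]
EXCLUDE = ["/var/log/pods/*/tracecore/*.log"]


def segment_match(path: str, pattern: str) -> bool:
    """Match one path segment at a time so '*' never spans a '/'."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(p, g) for p, g in zip(path_parts, pattern_parts))


def is_tailed(path: str) -> bool:
    """True when the path matches an include glob and no exclude glob."""
    included = any(segment_match(path, g) for g in INCLUDE)
    excluded = any(segment_match(path, g) for g in EXCLUDE)
    return included and not excluded


if __name__ == "__main__":
    for sample in (
        "/var/log/pods/ml_trainer-0_8f2c/trainer/0.log",
        "/var/log/pods/obs_tracecore-x7k_91ab/tracecore/0.log",
    ):
        print(sample, is_tailed(sample))
```

Under these assumptions the exclude glob keeps the collector from tailing its own output, which would otherwise feed its log lines back into the pipeline.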
Run tracecore as a DaemonSet so every node-local log file is read
by the pod scheduled on that node. The DaemonSet pod template must
mount three host paths read-write or read-only as listed:

- `/var/log/pods` (read-only): kubelet writes per-container log files here. Required by `filelog/container::include`.
- `/var/log/containers` (read-only): symlink farm kubelet maintains for the container parser's `add_metadata_from_filepath` to resolve pod / namespace / container names without an API call.
- `/var/lib/tracecore/filelog` (read-write): `file_storage` checkpoints. `create_directory: true` removes the need for an initContainer to pre-create the path.

The DaemonSet ServiceAccount needs
pods get,list,watch + nodes get for k8sattributesprocessor.
The transform/dataloader_errors processor projects per-driver
PyTorch DataLoader error vocabulary onto the customer-stable
dataloader.error_class +
dataloader.worker_pid attributes that
pattern #7's detector (projectDataLoaderErrorRecord
at module/processor/patterndetectorprocessor/dataloader_hang.go)
consumes. The recipe uses per-driver stanzas rather than one omnibus
regex because the operator runbook (pattern #7 §"Edge cases /
false-positive guards" and the verdict's discriminator scalar)
branches on the storage class. A single regex would collapse FUSE /
S3 / Lustre into one "DataLoader error" class and lose the
storage-vs-runtime routing signal.

| `dataloader.error_class` | Body substring(s) gated on | Runbook branch |
|---|---|---|
| DataLoader worker killed | `DataLoader worker (pid N) is killed` (PyTorch) | Inspect worker exit cause (`dmesg` for OOM on the node, kubelet logs for `OOMKilled`); raise `num_workers` timeouts; verify `persistent_workers=True` is not masking the dead worker. |
| FUSE transport error | `Transport endpoint is not connected`, `Software caused connection abort` | Inspect FUSE mount health (`mountpoint -q /path`); restart the CSI driver pod on the affected node; check kernel `dmesg` for FUSE driver crashes. |
| S3 throttle | `SlowDown`, `503 Service Unavailable`, `Please reduce your request rate` | Raise S3 endpoint capacity or rebalance prefixes; consider a CDN / front-cache; check the gateway's per-IP rate-limit configuration. |
| Stale file handle | `Stale file handle` (kernel `ESTALE` from Lustre / GPFS / NFS-v4) | `lfs check servers` (Lustre); `mmlsmount` + `mmhealth` (GPFS); `exportfs -r` + remount (NFS). |
| DataLoader queue empty | `queue.Empty`, `_queue.Empty`, `multiprocessing.queues.Empty` | A worker died silently without the kernel-level signal trail. Check the worker function for uncaught exceptions; the discriminator stays `worker_killed` because the runbook still points at the DataLoader runtime, not storage. |
| Connection reset by peer | `Connection reset by peer` (catch-all, only if no prior stanza matched) | Per-shard fetcher service that the trainer dials directly has died. Check that service's logs; consider a circuit-breaker so the DataLoader fails fast. |

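
The branch ordering in the table can be modelled in a few lines of Python. This is a sketch of the matching logic only, not the OTTL the processor runs: it assumes the branches are evaluated top to bottom with the first match winning, which is one way to honour the "only if no prior stanza matched" rule on the catch-all row. The `BRANCHES` list and the `classify` helper are hypothetical names, and the class labels are copied from the table.

```python
import re
from typing import Optional

# (error_class, body patterns) in table order; the first matching branch wins.
BRANCHES = [
    ("DataLoader worker killed", [r"DataLoader worker \(pid \d+\) is killed"]),
    ("FUSE transport error", [r"Transport endpoint is not connected",
                              r"Software caused connection abort"]),
    ("S3 throttle", [r"SlowDown", r"503 Service Unavailable",
                     r"Please reduce your request rate"]),
    ("Stale file handle", [r"Stale file handle"]),
    ("DataLoader queue empty", [r"queue\.Empty", r"_queue\.Empty",
                                r"multiprocessing\.queues\.Empty"]),
    # Catch-all: reached only when no storage- or runtime-specific branch matched.
    ("Connection reset by peer", [r"Connection reset by peer"]),
]


def classify(body: str) -> Optional[str]:
    """Return the dataloader.error_class for a log body, or None if no branch matches."""
    for error_class, patterns in BRANCHES:
        if any(re.search(p, body) for p in patterns):
            return error_class
    return None


if __name__ == "__main__":
    print(classify("OSError: [Errno 107] Transport endpoint is not connected"))
    print(classify("step 100 loss 0.31"))
```

A line that matches no branch returns `None`, mirroring the silent fall-through where `dataloader.error_class` stays unset.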
The dataloader.worker_pid int extraction runs independently of the
class branches: any line matching DataLoader worker (pid N) is killed
stamps the pid so the detector can correlate against dmesg / kubelet
OOMKilled events. The pid is optional input — the detector projects
it onto the verdict only when present (see DataLoaderErrorRecord.WorkerPID
at module/pkg/patterns/dataloader_hang.go).
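
The independent pid extraction can be sketched the same way. This is an illustrative Python model rather than the OTTL statement itself: `stamp_worker_pid` is a hypothetical helper, and a plain dict stands in for the log record's attribute map.

```python
import re
from typing import Dict

# Captures N from "DataLoader worker (pid N) is killed".
PID_RE = re.compile(r"DataLoader worker \(pid (\d+)\) is killed")


def stamp_worker_pid(body: str, attributes: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of attributes with dataloader.worker_pid set when the body carries a pid."""
    out = dict(attributes)
    match = PID_RE.search(body)
    if match:
        out["dataloader.worker_pid"] = int(match.group(1))
    return out


if __name__ == "__main__":
    line = "RuntimeError: DataLoader worker (pid 4211) is killed by signal: Killed."
    print(stamp_worker_pid(line, {}))
```

Because the extraction does not consult `dataloader.error_class`, a record that carries no pid simply passes through unchanged.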
Per-driver vs. omnibus. Pattern #7 spec Open Q#3 asked whether to use a single FUSE regex or per-driver stanzas; the answer landed as per-driver, because the runbook branches on storage class and an omnibus regex would erase that signal. New error classes (e.g. a future Ceph-class driver) extend the table here, not by widening an existing regex.
The transform/cuda_oom processor projects PyTorch's canonical
out-of-memory stderr line — RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free. — onto the customer-stable
cuda_oom.tried_alloc_bytes +
cuda_oom.gpu_index attributes that
pattern #10's detector
(projectCUDAOOMLogRecord at
module/processor/patterndetectorprocessor/cuda_oom.go) consumes.
The detector's projection gate is BOTH cuda_oom.tried_alloc_bytes
AND gpu.id (PCI BDF per
RFC-0013 §3);
this stanza stamps the bytes scalar and the human-visible GPU index
off the body. gpu.id is not stamped here — the CUDA-runtime
ordinal cuda_oom.gpu_index is a CUDA enumeration index, not a PCI
BDF. Two operator-configurable paths populate gpu.id on the log
resource so the detector's resource-attr fallback reads it:

| `gpu.id` source path | When to use |
|---|---|
| `k8sattributesprocessor` + `nvidia.com/gpu` device-plugin resource | The trainer pod requests one GPU via `resources.limits.nvidia.com/gpu: 1`. The NVIDIA device plugin annotates the pod with the allocated PCI BDF (`nvidia.com/gpu-PCIDeviceBusID` since device-plugin v0.16). Extend `k8sattributes::extract::annotations` to lift this annotation onto the log resource as `gpu.id`. Cheapest path: already in the cluster's GPU scheduling fabric. |
| DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index` | Multi-GPU pods (one container ↔ N GPUs) where the device-plugin annotation is the per-pod list, not the per-OOM GPU. Scrape the DCGM exporter's `DCGM_FI_DEV_PCI_BUSID` series, materialize a per-host `{gpu_index → BDF}` lookup, then add a sibling OTTL stanza that joins `cuda_oom.gpu_index` against the table to stamp `gpu.id`. Sibling to the pattern-2 / pattern-10 DCGM recipe. |

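
The second path is essentially a dictionary join. The Python sketch below illustrates the idea under stated assumptions: the `{gpu_index → BDF}` table has already been materialized for the host, the BDF strings shown are made-up sample values, and `stamp_gpu_id` is a hypothetical helper. A single dict stands in for the record; in the real pipeline `gpu.id` lives on the log resource.

```python
from typing import Dict

# Hypothetical per-host lookup built from the DCGM exporter's PCI bus-id series.
INDEX_TO_BDF: Dict[int, str] = {
    0: "00000000:1B:00.0",
    1: "00000000:3D:00.0",
}


def stamp_gpu_id(attributes: Dict[str, object],
                 index_to_bdf: Dict[int, str]) -> Dict[str, object]:
    """Return a copy with gpu.id set when cuda_oom.gpu_index resolves to a BDF."""
    out = dict(attributes)
    index = out.get("cuda_oom.gpu_index")
    if isinstance(index, int) and index in index_to_bdf:
        out["gpu.id"] = index_to_bdf[index]
    return out


if __name__ == "__main__":
    print(stamp_gpu_id({"cuda_oom.gpu_index": 1}, INDEX_TO_BDF))
```

An index missing from the table leaves `gpu.id` unset, so the detector's projection gate drops the record instead of attributing the OOM to the wrong device.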
The recipe uses four per-unit-prefix branches (KiB / MiB / GiB / TiB)
because OTTL has no capture-group-conditional dispatch — the
multiplier must be a literal int64 per stanza. The body match
captures whole (digits before the decimal) and frac (two digits
after) and computes
Int(whole) * UNIT + Int(frac) * (UNIT / 100). PyTorch's
format_size always emits %.2f, so the 2-digit frac capture is
exhaustive. The integer divide by 100 floors away less than one byte
per hundredth, so the arithmetic itself loses under 100 bytes in every
branch; the dominant error is the two-decimal rendering, which
resolves to 0.01 of the unit (about 10.7 MB at GiB scale, roughly
0.01% of a 99.99 GiB alloc and far below the detector's 5%
fragmentation threshold).

| Body shape | Captured | Stamped attributes |
|---|---|---|
| `CUDA out of memory. Tried to allocate \d+\.\d{2} KiB` | `whole`, `frac` (2 digits) | `cuda_oom.tried_alloc_bytes = whole*1024 + frac*10` |
| `CUDA out of memory. Tried to allocate \d+\.\d{2} MiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1048576 + frac*10485` |
| `CUDA out of memory. Tried to allocate \d+\.\d{2} GiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1073741824 + frac*10737418` |
| `CUDA out of memory. Tried to allocate \d+\.\d{2} TiB` | `whole`, `frac` | `cuda_oom.tried_alloc_bytes = whole*1099511627776 + frac*10995116277` |
| `... GPU \d+ has a total capacity` | `idx` | `cuda_oom.gpu_index = idx` |

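
As a worked check of the arithmetic in the table, the Python sketch below reproduces `whole * UNIT + frac * (UNIT / 100)` with integer division. It is a model of the recipe, not the OTTL: Python can dispatch on the captured unit through a dict, so the four literal-multiplier branches collapse into one lookup here. The names `tried_alloc_bytes` and `gpu_index` are hypothetical helpers.

```python
import re
from typing import Optional

UNIT = {"KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3, "TiB": 1024 ** 4}

ALLOC_RE = re.compile(
    r"CUDA out of memory\. Tried to allocate (\d+)\.(\d{2}) (KiB|MiB|GiB|TiB)"
)
GPU_RE = re.compile(r"GPU (\d+) has a total capacity")


def tried_alloc_bytes(body: str) -> Optional[int]:
    """Byte count per the table's formula, or None when the body shape does not match."""
    match = ALLOC_RE.search(body)
    if match is None:
        return None  # fail open: no stamp rather than a wrong value
    whole, frac, unit = int(match.group(1)), int(match.group(2)), UNIT[match.group(3)]
    return whole * unit + frac * (unit // 100)


def gpu_index(body: str) -> Optional[int]:
    """CUDA enumeration index from the 'GPU N has a total capacity' clause."""
    match = GPU_RE.search(body)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    line = ("RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. "
            "GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free.")
    print(tried_alloc_bytes(line), gpu_index(line))
```

For `1.50 MiB` the formula gives 1572826 bytes against an exact 1572864, a 38-byte shortfall from the floored `UNIT / 100` multiplier. A body rendered with four decimals does not match `\d{2}` followed by the unit, so the function returns `None`.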
The where IsMatch(body, "CUDA out of memory\. Tried to allocate")
guard is tight on the OOM-summary line, so generic CUDA errors
(an illegal memory access was encountered, NCCL watchdog timeouts,
DataLoader worker (pid N) is killed) do not trip the stanza —
keeping the detector quiet on non-OOM stderr noise.
Multi-line tracebacks. A PyTorch OOM emits the summary line alongside a Python traceback (File "train.py", line 42, in ...). The container parser emits each newline-delimited log line as its own log record; only the summary line matches the regex above, so the detector sees exactly one stamp per OOM event regardless of traceback depth. This is pattern #10 spec Open Q#2's answer.
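
The one-stamp-per-OOM property is easy to check with a small Python model. It assumes each newline-delimited line becomes its own record, as described above, and applies the same guard pattern to every record; `count_oom_stamps` and the sample stderr text are hypothetical.

```python
import re

# Same guard as the where-clause on the cuda_oom stanzas.
GUARD = re.compile(r"CUDA out of memory\. Tried to allocate")

SAMPLE_STDERR = """Traceback (most recent call last):
  File "train.py", line 42, in <module>
    loss.backward()
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free.
RuntimeError: CUDA error: an illegal memory access was encountered
"""


def count_oom_stamps(stderr_text: str) -> int:
    """Count the per-line records that would pass the OOM guard."""
    records = stderr_text.splitlines()
    return sum(1 for body in records if GUARD.search(body))


if __name__ == "__main__":
    print(count_oom_stamps(SAMPLE_STDERR))
```

Only the summary line passes the guard; the traceback frames and the unrelated illegal-memory-access line are ignored.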

| Placeholder | What to fill in |
|---|---|
| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/logs` is appended automatically per the OTLP/HTTP spec; do not include it. |

Tracecore does not expand environment variables in YAML. Render the
literal endpoint at deploy time via envsubst, a Helm template, or a
Kubernetes secret-injection driver. The placeholder is loud (the
exporter rejects it on first dispatch) so a misconfigured rollout
fails immediately instead of silently dropping logs.
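
If none of those tools is at hand, the render step can be sketched in Python. This is a minimal illustration, not part of Tracecore: `render_endpoint` is a hypothetical helper, and the two checks (an http/https URL with a host, and no trailing `/v1/logs`) are assumptions derived from the placeholder table above.

```python
from urllib.parse import urlsplit

PLACEHOLDER = "REPLACE_WITH_OTLP_HTTP_ENDPOINT"


def render_endpoint(config_text: str, endpoint: str) -> str:
    """Substitute the placeholder with a literal base URL, refusing obviously bad input."""
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an http(s) base URL: {endpoint!r}")
    if parts.path.rstrip("/").endswith("/v1/logs"):
        raise ValueError("omit /v1/logs; the exporter appends it")
    if PLACEHOLDER not in config_text:
        raise ValueError("placeholder not found in config text")
    return config_text.replace(PLACEHOLDER, endpoint)


if __name__ == "__main__":
    template = "endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT\n"
    print(render_endpoint(template, "https://otlp.example.com:4318"), end="")
```

Failing at render time keeps a malformed endpoint out of the rollout entirely, so the problem surfaces before the exporter's first dispatch.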

| Symptom | First check |
|---|---|
| `directory must exist` at validate | Remove a stray operator override of `file_storage::directory`; `create_directory: true` only fires for the path declared in the same extension stanza. |
| Logs flow but `k8s.pod.name` is empty | The DaemonSet ServiceAccount is missing `pods get,list,watch`. Check `kubectl auth can-i list pods --as system:serviceaccount:<ns>:<sa>`. |
| Duplicate log lines after a restart | `start_at: end` ships only NEW lines; if you see duplicates the checkpoint directory is on an emptyDir or hostPath that got recreated. Move it to a node-local persistent path under `/var/lib/`. |
| `failed to open /var/log/pods/...` | The DaemonSet pod is missing the `/var/log/pods` hostPath mount, or the path is mounted read-write (kubelet's `--root-dir` overrides shift this on some distros). Mount read-only at the kubelet's actual root. |
| High-cardinality label explosion | The container parser surfaces every label from `app.kubernetes.io/name` plus whatever you add under `extract::labels`. Audit the list against the receiving backend's cardinality budget before adding more. |
| Pattern #7 verdict never fires despite known DataLoader stalls | The `transform/dataloader_errors` stanzas gate on substring matches against the container body. If your trainer wraps DataLoader errors (e.g. a custom logger that prefixes with JSON), the body shape changes. Confirm via `kubectl logs --container= --previous 2>&1 \| …` |
| `dataloader.error_class` empty on a known error line | The OTTL stanza fell through silently: the body substring did not match any branch. Add a row to the table above and a matching `set(attributes["dataloader.error_class"], ...)` statement. The detector's projection gate requires the attribute, so a missing class drops the discriminator. |
| Pattern #10 verdict never fires despite a known CUDA OOM | The `transform/cuda_oom` stanzas gate on substring matches against the container body. Confirm via `kubectl logs <trainer-pod> --container=<c> --previous 2>&1 \| grep -E 'CUDA out of memory\. Tried to allocate'`. If the trainer wraps PyTorch errors (custom logger, JSON envelope), the body shape changes; extend the `IsMatch` predicates to match the wrapper format. Also check that `gpu.id` is being stamped onto the log resource via one of the two paths in the `cuda_oom.*` section: a missing `gpu.id` drops the projection at `cuda_oom.go`'s gate and the detector stays quiet. |
| `cuda_oom.tried_alloc_bytes` stamped with a wildly wrong magnitude | A unit-prefix branch was modified without updating its multiplier, or the body shape drifted from `%.2f`. PyTorch's `format_size` has used `%.2f` for the entire CUDA-allocator lifetime; if a customer fork emits `%.0f` or `%.4f` the recipe's `\d{2}` capture misses, and the stanza fails open (no stamp) rather than producing a wrong value. Verify against `pytorch/c10/util/Exception.h`'s formatter. |

Upstream component docs:
receiver/filelogreceiver,
receiver/filelogreceiver/internal/parser/container,
processor/k8sattributesprocessor,
processor/transformprocessor,
extension/storage/filestorage.
Self-telemetry counters appear under the standard
otelcol_receiver_* and otelcol_processor_* metric families from
service/telemetry.