Container stdout via `filelogreceiver` + container parser

Tracecore tails container stdout/stderr files under /var/log/pods/ on each node using the upstream filelogreceiver with a container parser stanza. The k8sattributesprocessor enriches each record with pod, namespace, and workload identity; the file_storage extension checkpoints read offsets across restarts so log lines are not re-shipped on rollouts. An OTTL transform/dataloader_errors stanza projects per-driver PyTorch DataLoader error vocabulary (FUSE, S3, Lustre, multiprocessing queue, worker-killed) onto the customer-stable dataloader.error_class / dataloader.worker_pid attributes that pattern #7's detector consumes. A sibling transform/cuda_oom stanza projects PyTorch's RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit> line onto the customer-stable cuda_oom.tried_alloc_bytes (Int, bytes; unit-normalized) + cuda_oom.gpu_index (Int) attributes that pattern #10's detector consumes. Replaces the in-tree containerstdout receiver scheduled for deletion at v0.2.0 per RFC-0013 §migration PR-K and §7 (Deletion list).

Config

# docs/integrations/examples/filelog-container.yaml
extensions:
  file_storage/checkpoints:
    directory: /var/lib/tracecore/filelog
    create_directory: true
    timeout: 1s
    compaction:
      directory: /var/lib/tracecore/filelog
      on_start: true
      on_rebound: true

receivers:
  filelog/container:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/*/tracecore/*.log
    start_at: end
    include_file_path: true
    include_file_name: false
    storage: file_storage/checkpoints
    operators:
      - id: container-parser
        type: container
        format: auto
        add_metadata_from_filepath: true
      - id: severity-parser
        type: severity_parser
        parse_from: attributes.stream
        mapping:
          error: stderr
          info: stdout
        if: 'attributes.stream != nil'

processors:
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.node.name
        - k8s.container.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: resource_attribute
            name: k8s.namespace.name
          - from: resource_attribute
            name: k8s.pod.name
  batch:
    send_batch_size: 8192
    timeout: 5s
    send_batch_max_size: 16384

exporters:
  otlphttp:
    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
    compression: gzip
    timeout: 10s

service:
  extensions: [file_storage/checkpoints]
  pipelines:
    logs/container:
      receivers: [filelog/container]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]

Validate with the in-tree binary:

./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml

Exit 0 means the config parses, every component name resolves, create_directory: true will materialize the checkpoint directory at boot, and the OTLP endpoint URL is well-formed once the placeholder has been rendered.

Deployment shape

Run tracecore as a DaemonSet so every node-local log file is read by the pod scheduled on that node. The DaemonSet pod template must mount three host paths, each with the access mode listed:

/var/log/pods (read-only) — kubelet writes per-container log files here. Required by filelog/container::include.
/var/log/containers (read-only) — symlink farm kubelet maintains for the container parser's add_metadata_from_filepath to resolve pod / namespace / container names without an API call.
/var/lib/tracecore/filelog (read-write) — file_storage checkpoints. create_directory: true removes the need for an initContainer to pre-create the path.

The DaemonSet ServiceAccount needs get, list, and watch on pods, plus get on nodes, for k8sattributesprocessor.
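The mounts and permissions above can be sketched as a manifest. This is a minimal illustration, not a shipped chart: the namespace `observability`, the ServiceAccount name `tracecore`, the ClusterRole name, and the image reference are all hypothetical, and the ClusterRoleBinding and the container's command-line arguments are left out. The container is named `tracecore` so that the `exclude` glob in the config above skips the collector's own logs.

```yaml
# Sketch only. Names, namespace, and image are placeholders (assumptions).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tracecore-filelog
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get"]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: tracecore
  namespace: observability
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: tracecore
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tracecore
    spec:
      serviceAccountName: tracecore
      containers:
        - name: tracecore
          image: example.com/tracecore:dev
          volumeMounts:
            - name: varlogpods
              mountPath: /var/log/pods
              readOnly: true
            - name: varlogcontainers
              mountPath: /var/log/containers
              readOnly: true
            # Checkpoints must be writable and survive pod restarts.
            - name: filelog-checkpoints
              mountPath: /var/lib/tracecore/filelog
      volumes:
        - name: varlogpods
          hostPath:
            path: /var/log/pods
        - name: varlogcontainers
          hostPath:
            path: /var/log/containers
        - name: filelog-checkpoints
          hostPath:
            path: /var/lib/tracecore/filelog
            # Assumption: let the kubelet create the host directory if absent.
            type: DirectoryOrCreate
```

Backing the checkpoint mount with a hostPath rather than an emptyDir keeps read offsets across pod restarts on the same node.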

`dataloader.*` attribute stanza (pattern #7)

The transform/dataloader_errors processor projects per-driver PyTorch DataLoader error vocabulary onto the customer-stable dataloader.error_class + dataloader.worker_pid attributes that pattern #7's detector (projectDataLoaderErrorRecord at module/processor/patterndetectorprocessor/dataloader_hang.go) consumes. The recipe uses per-driver stanzas rather than one omnibus regex because the operator runbook (pattern #7 §"Edge cases / false-positive guards" and the verdict's discriminator scalar) branches on the storage class. A single regex would collapse FUSE / S3 / Lustre into one "DataLoader error" class and lose the storage-vs-runtime routing signal.

`dataloader.error_class`	Body substring(s) gated on	Runbook branch
`DataLoader worker killed`	`DataLoader worker (pid N) is killed` (PyTorch)	Inspect worker exit cause (`dmesg` for OOM on the node, kubelet logs for `OOMKilled`); raise `num_workers` timeouts; verify `persistent_workers=True` is not masking the dead worker.
`FUSE transport error`	`Transport endpoint is not connected`, `Software caused connection abort`	Inspect FUSE mount health (`mountpoint -q /path`); restart the CSI driver pod on the affected node; check kernel `dmesg` for FUSE driver crashes.
`S3 throttle`	`SlowDown`, `503 Service Unavailable`, `Please reduce your request rate`	Raise S3 endpoint capacity or rebalance prefixes; consider a CDN / front-cache; check the gateway's per-IP rate-limit configuration.
`Stale file handle`	`Stale file handle` (kernel `ESTALE` from Lustre / GPFS / NFS-v4)	`lfs check servers` (Lustre); `mmlsmount` + `mmhealth` (GPFS); `exportfs -r` + remount (NFS).
`DataLoader queue empty`	`queue.Empty`, `_queue.Empty`, `multiprocessing.queues.Empty`	A worker died silently without the kernel-level signal trail — check the worker function for uncaught exceptions; the discriminator stays `worker_killed` because the runbook still points at the DataLoader runtime, not storage.
`Connection reset by peer`	`Connection reset by peer` (catch-all, only if no prior stanza matched)	Per-shard fetcher service that the trainer dials directly has died — check that service's logs; consider a circuit-breaker so the DataLoader fails fast.

The dataloader.worker_pid int extraction runs independently of the class branches: any line matching DataLoader worker (pid N) is killed stamps the pid so the detector can correlate against dmesg / kubelet OOMKilled events. The pid is optional input — the detector projects it onto the verdict only when present (see DataLoaderErrorRecord.WorkerPID at module/pkg/patterns/dataloader_hang.go).
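The class branches and the independent pid extraction might look roughly like the following. This is a hedged sketch of what such a stanza could contain, not the shipped `transform/dataloader_errors` recipe: it covers only a subset of the table rows, the regexes are illustrative, and `error_mode: ignore` is an assumption. It relies on the upstream OTTL functions `set`, `IsMatch`, `ExtractPatterns`, `merge_maps`, and `Int`, and uses bracketed character classes so no backslash escaping is needed.

```yaml
processors:
  transform/dataloader_errors:
    error_mode: ignore  # assumption: skip records a statement cannot process
    log_statements:
      - context: log
        statements:
          # One statement per driver vocabulary, so the storage class survives.
          - set(attributes["dataloader.error_class"], "DataLoader worker killed") where IsMatch(body, "DataLoader worker [(]pid [0-9]+[)] is killed")
          - set(attributes["dataloader.error_class"], "FUSE transport error") where IsMatch(body, "Transport endpoint is not connected|Software caused connection abort")
          - set(attributes["dataloader.error_class"], "S3 throttle") where IsMatch(body, "SlowDown|503 Service Unavailable|Please reduce your request rate")
          - set(attributes["dataloader.error_class"], "Stale file handle") where IsMatch(body, "Stale file handle")
          # Catch-all: fires only when no earlier branch stamped a class.
          - set(attributes["dataloader.error_class"], "Connection reset by peer") where attributes["dataloader.error_class"] == nil and IsMatch(body, "Connection reset by peer")
          # Pid extraction runs independently of the class branches.
          - merge_maps(cache, ExtractPatterns(body, "DataLoader worker [(]pid (?P<pid>[0-9]+)[)] is killed"), "upsert") where IsMatch(body, "DataLoader worker [(]pid [0-9]+[)] is killed")
          - set(attributes["dataloader.worker_pid"], Int(cache["pid"])) where cache["pid"] != nil
```

To take effect, the processor would also need to be listed in the `logs/container` pipeline's `processors`. Placing it after `k8sattributes` and before `batch` is an assumption, not something the config above states.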

Per-driver vs. omnibus. Pattern #7 spec Open Q#3 asked whether to use a single FUSE regex or per-driver stanzas; the answer landed as per-driver, because the runbook branches on storage class and an omnibus regex would erase that signal. New error classes (e.g. a future Ceph-class driver) extend the table here, not by widening an existing regex.

`cuda_oom.*` attribute stanza (pattern #10)

The transform/cuda_oom processor projects PyTorch's canonical out-of-memory stderr line — RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free. — onto the customer-stable cuda_oom.tried_alloc_bytes + cuda_oom.gpu_index attributes that pattern #10's detector (projectCUDAOOMLogRecord at module/processor/patterndetectorprocessor/cuda_oom.go) consumes. The detector's projection gate is BOTH cuda_oom.tried_alloc_bytes AND gpu.id (PCI BDF per RFC-0013 §3); this stanza stamps the bytes scalar and the human-visible GPU index off the body. gpu.id is not stamped here — the CUDA-runtime ordinal cuda_oom.gpu_index is a CUDA enumeration index, not a PCI BDF. Two operator-configurable paths populate gpu.id on the log resource so the detector's resource-attr fallback reads it:

`gpu.id` source path	When to use
k8sattributesprocessor + `nvidia.com/gpu` device-plugin resource	The trainer pod requests one GPU via `resources.limits.nvidia.com/gpu: 1`. The NVIDIA device plugin annotates the pod with the allocated PCI BDF (`nvidia.com/gpu-PCIDeviceBusID` since device-plugin v0.16). Extend `k8sattributes::extract::annotations` to lift this annotation onto the log resource as `gpu.id`. Cheapest path — already in the cluster's GPU scheduling fabric.
DCGM BDF-lookup transform indexed by `cuda_oom.gpu_index`	Multi-GPU pods (one container ↔ N GPUs) where the device-plugin annotation is the per-pod list, not the per-OOM GPU. Scrape the DCGM exporter's `DCGM_FI_DEV_PCI_BUSID` series, materialize a per-host `{gpu_index → BDF}` lookup, then add a sibling OTTL stanza that joins `cuda_oom.gpu_index` against the table to stamp `gpu.id`. Sibling to the pattern-2 / pattern-10 DCGM recipe.
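For the first path, the annotation lift could be sketched as an addition to the existing `k8sattributes` block. The annotation key is the one named in the table above. The `tag_name` / `key` / `from` fields are the upstream processor's annotation-extraction schema. Whether the annotation value arrives already in the form the detector expects for `gpu.id` is an assumption to verify on your cluster.

```yaml
processors:
  k8sattributes:
    extract:
      annotations:
        # Lift the device-plugin annotation onto the log resource as gpu.id.
        - tag_name: gpu.id
          key: nvidia.com/gpu-PCIDeviceBusID
          from: pod
```

This only disambiguates the GPU for single-GPU pods. Multi-GPU pods need the DCGM lookup path.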

The recipe uses four per-unit-prefix branches (KiB / MiB / GiB / TiB) because OTTL has no capture-group-conditional dispatch — the multiplier must be a literal int64 per stanza. The body match captures whole (digits before the decimal) and frac (two digits after) and computes Int(whole) * UNIT + Int(frac) * (UNIT / 100). PyTorch's format_size always emits %.2f, so the 2-digit frac capture is exhaustive; the integer-divide-by-100 floor caps precision loss at under 1% of the unit base (max ~10 MB on a 99.99 GiB alloc, three orders of magnitude under the detector's 5% fragmentation threshold).

Body shape	Captured	Stamped attributes
`CUDA out of memory. Tried to allocate \d+\.\d{2} KiB`	`whole`, `frac` (×2 digits)	`cuda_oom.tried_alloc_bytes = whole*1024 + frac*10`
`CUDA out of memory. Tried to allocate \d+\.\d{2} MiB`	`whole`, `frac`	`cuda_oom.tried_alloc_bytes = whole*1048576 + frac*10485`
`CUDA out of memory. Tried to allocate \d+\.\d{2} GiB`	`whole`, `frac`	`cuda_oom.tried_alloc_bytes = whole*1073741824 + frac*10737418`
`CUDA out of memory. Tried to allocate \d+\.\d{2} TiB`	`whole`, `frac`	`cuda_oom.tried_alloc_bytes = whole*1099511627776 + frac*10995116277`
`... GPU \d+ has a total capacity`	`idx`	`cuda_oom.gpu_index = idx`
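One unit branch plus the GPU-index stamp can be sketched as follows. This is an illustration of the technique under stated assumptions, not the shipped `transform/cuda_oom` stanza: it shows only the GiB branch, the regexes use bracketed character classes instead of backslash escapes, `error_mode: ignore` is an assumption, and the per-record `cache` map carries the captures between statements. The comments work the arithmetic for two sample lines.

```yaml
processors:
  transform/cuda_oom:
    error_mode: ignore  # assumption: skip records a statement cannot process
    log_statements:
      - context: log
        statements:
          # GiB branch only; KiB / MiB / TiB repeat it with their own literals.
          - merge_maps(cache, ExtractPatterns(body, "Tried to allocate (?P<whole>[0-9]+)[.](?P<frac>[0-9]{2}) GiB"), "upsert") where IsMatch(body, "CUDA out of memory[.] Tried to allocate [0-9]+[.][0-9]{2} GiB")
          # 1073741824 = 2^30 and 10737418 = floor(2^30 / 100).
          # "2.00 GiB" -> 2 * 1073741824 +  0 * 10737418 = 2147483648
          # "1.50 GiB" -> 1 * 1073741824 + 50 * 10737418 = 1610612724
          #   (12 bytes below the exact 1610612736, from the integer floor)
          - set(attributes["cuda_oom.tried_alloc_bytes"], Int(cache["whole"]) * 1073741824 + Int(cache["frac"]) * 10737418) where cache["whole"] != nil and cache["frac"] != nil
          # GPU ordinal from the same summary line.
          - merge_maps(cache, ExtractPatterns(body, "GPU (?P<idx>[0-9]+) has a total capacity"), "upsert") where IsMatch(body, "CUDA out of memory[.] Tried to allocate") and IsMatch(body, "GPU [0-9]+ has a total capacity")
          - set(attributes["cuda_oom.gpu_index"], Int(cache["idx"])) where cache["idx"] != nil
```

Because the multiplier is a literal in each `set` statement, a line in an unexpected unit matches no branch and is left unstamped rather than mis-scaled.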

The where IsMatch(body, "CUDA out of memory\. Tried to allocate") guard is tight on the OOM-summary line, so generic CUDA errors (an illegal memory access was encountered, NCCL watchdog timeouts, DataLoader worker (pid N) is killed) do not trip the stanza — keeping the detector quiet on non-OOM stderr noise.

Multi-line tracebacks. A PyTorch OOM emits the summary line followed by a Python traceback (File "train.py", line 42, in ...). The container parser flattens each newline-delimited log line into its own log record; only the summary line matches the regex above, so the detector sees exactly one stamp per OOM event regardless of traceback depth. This is pattern #10 spec Open Q#2's answer.

Placeholders

Placeholder	What to fill in
`REPLACE_WITH_OTLP_HTTP_ENDPOINT`	The OTLP/HTTP base URL of your sink. `/v1/logs` is appended automatically per the OTLP/HTTP spec — do not include it.

Tracecore does not expand environment variables in YAML. Render the literal endpoint at deploy time via envsubst, a Helm template, or a Kubernetes secret-injection driver. The placeholder is loud (the exporter rejects it on first dispatch) so a misconfigured rollout fails immediately instead of silently dropping logs.

Failure modes

Symptom	First check
`directory must exist` at validate	Remove a stray operator override of `file_storage::directory` — `create_directory: true` only fires for the path declared in the same extension stanza.
Logs flow but `k8s.pod.name` is empty	The DaemonSet ServiceAccount is missing `pods get,list,watch`. Check `kubectl auth can-i list pods --as system:serviceaccount:<ns>:<sa>`.
Duplicate log lines after a restart	`start_at: end` ships only NEW lines; if you see duplicates the checkpoint directory is on an emptyDir or hostPath that got recreated. Move it to a node-local persistent path under `/var/lib/`.
`failed to open /var/log/pods/...`	The DaemonSet pod is missing the `/var/log/pods` hostPath mount, or the path is mounted read-write (kubelet's `--root-dir` overrides shift this on some distros). Mount read-only at the kubelet's actual root.
High-cardinality label explosion	The k8sattributes processor stamps the `app.kubernetes.io/name` label plus whatever you add under `extract::labels`. Audit the list against the receiving backend's cardinality budget before adding more.
Pattern #7 verdict never fires despite known DataLoader stalls	The `transform/dataloader_errors` stanzas gate on substring matches against the container `body`. If your trainer wraps DataLoader errors (e.g. a custom logger that prefixes with JSON), the body shape changes. Confirm the raw line shape via `kubectl logs <trainer-pod> --container=<c> --previous 2>&1`.
`dataloader.error_class` empty on a known error line	The OTTL stanza fell through silently — the body substring did not match any branch. Add a row to the table above and a matching `set(attributes["dataloader.error_class"], ...)` statement. The detector's projection gate requires the attribute, so a missing class drops the discriminator.
Pattern #10 verdict never fires despite a known CUDA OOM	The `transform/cuda_oom` stanzas gate on substring matches against the container `body`. Confirm via `kubectl logs <trainer-pod> --container=<c> --previous 2>&1 \| grep -E 'CUDA out of memory\. Tried to allocate'`. If the trainer wraps PyTorch errors (custom logger, JSON envelope), the body shape changes — extend the `IsMatch` predicates to match the wrapper format. Also check that `gpu.id` is being stamped onto the log resource via one of the two paths in the `cuda_oom.*` section: a missing `gpu.id` drops the projection at `cuda_oom.go`'s gate and the detector stays quiet.
`cuda_oom.tried_alloc_bytes` stamped with a wildly wrong magnitude	A unit-prefix branch was modified without updating its multiplier, or the body shape drifted from `%.2f`. PyTorch's `format_size` has used `%.2f` for the entire CUDA-allocator lifetime; if a customer fork emits `%.0f` or `%.4f` the recipe's `\d{2}` capture misses, and the stanza fails open (no stamp) rather than producing a wrong value. Verify against `pytorch/c10/util/Exception.h`'s formatter.

Upstream component docs: receiver/filelogreceiver, receiver/filelogreceiver/internal/parser/container, processor/k8sattributesprocessor, processor/transformprocessor, extension/storage/filestorage. Self-telemetry counters appear under the standard otelcol_receiver_* and otelcol_processor_* metric families from service/telemetry.
