Container stdout via filelogreceiver + container parser

Tracecore tails container stdout/stderr files under /var/log/pods/ on each node using the upstream filelogreceiver with a container parser stanza. The k8sattributesprocessor enriches each record with pod, namespace, and workload identity; the file_storage extension checkpoints read offsets across restarts so log lines are not re-shipped on rollouts. An OTTL transform/dataloader_errors stanza projects per-driver PyTorch DataLoader error vocabulary (FUSE, S3, Lustre, multiprocessing queue, worker-killed) onto the customer-stable dataloader.error_class / dataloader.worker_pid attributes that pattern #7's detector consumes. A sibling transform/cuda_oom stanza projects PyTorch's RuntimeError: CUDA out of memory. Tried to allocate X.YY <unit> line onto the customer-stable cuda_oom.tried_alloc_bytes (Int, bytes; unit-normalized) + cuda_oom.gpu_index (Int) attributes that pattern #10's detector consumes. Replaces the in-tree containerstdout receiver scheduled for deletion at v0.2.0 per RFC-0013 §migration PR-K and §7 (Deletion list).

Config

# docs/integrations/examples/filelog-container.yaml
extensions:
  file_storage/checkpoints:
    directory: /var/lib/tracecore/filelog
    create_directory: true
    timeout: 1s
    compaction:
      directory: /var/lib/tracecore/filelog
      on_start: true
      on_rebound: true

receivers:
  filelog/container:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      - /var/log/pods/*/tracecore/*.log
    start_at: end
    include_file_path: true
    include_file_name: false
    storage: file_storage/checkpoints
    operators:
      - id: container-parser
        type: container
        format: auto
        add_metadata_from_filepath: true
      - id: severity-parser
        type: severity_parser
        parse_from: attributes.stream
        mapping:
          error: stderr
          info: stdout
        if: 'attributes.stream != nil'

processors:
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.statefulset.name
        - k8s.daemonset.name
        - k8s.node.name
        - k8s.container.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: resource_attribute
            name: k8s.namespace.name
          - from: resource_attribute
            name: k8s.pod.name
  batch:
    send_batch_size: 8192
    timeout: 5s
    send_batch_max_size: 16384

exporters:
  otlphttp:
    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
    compression: gzip
    timeout: 10s

service:
  extensions: [file_storage/checkpoints]
  pipelines:
    logs/container:
      receivers: [filelog/container]
      processors: [k8sattributes, batch]
      exporters: [otlphttp]

Validate with the in-tree binary:

./_build/tracecore validate --config=docs/integrations/examples/filelog-container.yaml

Exit 0 means the config parses, every component name resolves, create_directory: true will materialize the checkpoint directory at boot, and the OTLP endpoint URL is well-formed once the placeholder has been rendered.

Deployment shape

Run tracecore as a DaemonSet so every node-local log file is read by the pod scheduled on that node. The DaemonSet pod template must mount three host paths, each with the access mode listed:

  • /var/log/pods (read-only) — kubelet writes per-container log files here. Required by filelog/container::include.
  • /var/log/containers (read-only) — symlink farm kubelet maintains for the container parser's add_metadata_from_filepath to resolve pod / namespace / container names without an API call.
  • /var/lib/tracecore/filelog (read-write) — file_storage checkpoints. create_directory: true removes the need for an initContainer to pre-create the path.
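
The three mounts above could be expressed in the DaemonSet pod template roughly as follows. This is a minimal sketch, not the project's shipped manifest: the DaemonSet name, namespace, ServiceAccount name, image placeholder, and volume names are all hypothetical, and the config mount and container args are omitted. The container is named tracecore on the assumption that the receiver's exclude glob (/var/log/pods/*/tracecore/*.log) is meant to skip the collector's own container directory.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: tracecore              # hypothetical name
  namespace: observability     # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: tracecore
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tracecore
    spec:
      serviceAccountName: tracecore          # hypothetical; must carry the RBAC grants
      containers:
        - name: tracecore                    # assumed to line up with the exclude glob
          image: REPLACE_WITH_TRACECORE_IMAGE  # placeholder, not a real image reference
          volumeMounts:
            - name: varlogpods
              mountPath: /var/log/pods
              readOnly: true                 # kubelet-owned log files
            - name: varlogcontainers
              mountPath: /var/log/containers
              readOnly: true                 # symlink farm used for filepath metadata
            - name: filelog-checkpoints
              mountPath: /var/lib/tracecore/filelog   # read-write: file_storage offsets
      volumes:
        - name: varlogpods
          hostPath:
            path: /var/log/pods
        - name: varlogcontainers
          hostPath:
            path: /var/log/containers
        - name: filelog-checkpoints
          hostPath:
            path: /var/lib/tracecore/filelog
            type: DirectoryOrCreate          # keeps checkpoints on the node across pod restarts
```

Backing the checkpoint path with a hostPath under /var/lib/ rather than an emptyDir is what lets the read offsets survive a rollout.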

The DaemonSet ServiceAccount needs pods get,list,watch + nodes get for k8sattributesprocessor.
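
The grants named above could be written as a ClusterRole and binding along these lines. This is a sketch under assumptions: the object names, ServiceAccount name, and namespace are hypothetical, and only the permissions stated in this document are included.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tracecore-k8sattributes      # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]  # pod metadata lookups and watches
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get"]                   # node lookup for k8s.node.name
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tracecore-k8sattributes      # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tracecore-k8sattributes
subjects:
  - kind: ServiceAccount
    name: tracecore                  # hypothetical ServiceAccount
    namespace: observability         # hypothetical namespace
```

Upstream k8sattributesprocessor documentation also lists replicasets in the apps group for some configurations that extract k8s.deployment.name; whether this build needs that grant is not stated here, so check the upstream component docs if the deployment name comes back empty.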

dataloader.* attribute stanza (pattern #7)

The transform/dataloader_errors processor projects per-driver PyTorch DataLoader error vocabulary onto the customer-stable dataloader.error_class + dataloader.worker_pid attributes that pattern #7's detector (projectDataLoaderErrorRecord at module/processor/patterndetectorprocessor/dataloader_hang.go) consumes. The recipe uses per-driver stanzas rather than one omnibus regex because the operator runbook (pattern #7 §"Edge cases / false-positive guards" and the verdict's discriminator scalar) branches on the storage class. A single regex would collapse FUSE / S3 / Lustre into one "DataLoader error" class and lose the storage-vs-runtime routing signal.

| dataloader.error_class | Body substring(s) gated on | Runbook branch |
| --- | --- | --- |
| DataLoader worker killed | DataLoader worker (pid N) is killed (PyTorch) | Inspect worker exit cause (dmesg for OOM on the node, kubelet logs for OOMKilled); raise num_workers timeouts; verify persistent_workers=True is not masking the dead worker. |
| FUSE transport error | Transport endpoint is not connected, Software caused connection abort | Inspect FUSE mount health (mountpoint -q /path); restart the CSI driver pod on the affected node; check kernel dmesg for FUSE driver crashes. |
| S3 throttle | SlowDown, 503 Service Unavailable, Please reduce your request rate | Raise S3 endpoint capacity or rebalance prefixes; consider a CDN / front-cache; check the gateway's per-IP rate-limit configuration. |
| Stale file handle | Stale file handle (kernel ESTALE from Lustre / GPFS / NFS-v4) | lfs check servers (Lustre); mmlsmount + mmhealth (GPFS); exportfs -r + remount (NFS). |
| DataLoader queue empty | queue.Empty, _queue.Empty, multiprocessing.queues.Empty | A worker died silently without the kernel-level signal trail; check the worker function for uncaught exceptions. The discriminator stays worker_killed because the runbook still points at the DataLoader runtime, not storage. |
| Connection reset by peer | Connection reset by peer (catch-all, only if no prior stanza matched) | Per-shard fetcher service that the trainer dials directly has died; check that service's logs and consider a circuit-breaker so the DataLoader fails fast. |

The dataloader.worker_pid int extraction runs independently of the class branches: any line matching DataLoader worker (pid N) is killed stamps the pid so the detector can correlate against dmesg / kubelet OOMKilled events. The pid is optional input — the detector projects it onto the verdict only when present (see DataLoaderErrorRecord.WorkerPID at module/pkg/patterns/dataloader_hang.go).

Per-driver vs. omnibus. Pattern #7 spec Open Q#3 asked whether to use a single FUSE regex or per-driver stanzas; the answer landed as per-driver, because the runbook branches on storage class and an omnibus regex would erase that signal. New error classes (e.g. a future Ceph-class driver) extend the table here, not by widening an existing regex.
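
One way the per-driver branches and the independent pid extraction could be laid out in a transformprocessor stanza is sketched below. It is not the shipped recipe. It assumes the upstream OTTL functions set, IsMatch, ExtractPatterns, merge_maps, and Int, assumes the log body is a plain string, and copies the class strings verbatim from the table's first column; confirm the exact values the detector expects against dataloader_hang.go. Bracketed character classes ([.], [(], [0-9]) are used instead of backslash escapes to sidestep string-escaping differences between YAML and OTTL.

```yaml
processors:
  transform/dataloader_errors:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          # DataLoader runtime: worker killed by a signal.
          - 'set(attributes["dataloader.error_class"], "DataLoader worker killed") where IsMatch(body, "DataLoader worker [(]pid [0-9]+[)] is killed")'
          # Storage-class branches, one statement per driver family.
          - 'set(attributes["dataloader.error_class"], "FUSE transport error") where IsMatch(body, "Transport endpoint is not connected|Software caused connection abort")'
          - 'set(attributes["dataloader.error_class"], "S3 throttle") where IsMatch(body, "SlowDown|503 Service Unavailable|Please reduce your request rate")'
          - 'set(attributes["dataloader.error_class"], "Stale file handle") where IsMatch(body, "Stale file handle")'
          # queues?[.]Empty covers queue.Empty, _queue.Empty, and multiprocessing.queues.Empty.
          - 'set(attributes["dataloader.error_class"], "DataLoader queue empty") where IsMatch(body, "queues?[.]Empty")'
          # Catch-all: only stamps when no earlier branch set a class.
          - 'set(attributes["dataloader.error_class"], "Connection reset by peer") where attributes["dataloader.error_class"] == nil and IsMatch(body, "Connection reset by peer")'
          # pid extraction runs independently of the class branches.
          - 'merge_maps(cache, ExtractPatterns(body, "DataLoader worker [(]pid (?P<pid>[0-9]+)[)] is killed"), "upsert") where IsMatch(body, "DataLoader worker [(]pid [0-9]+[)] is killed")'
          - 'set(attributes["dataloader.worker_pid"], Int(cache["pid"])) where cache["pid"] != nil'
```

A new storage class would add one more set statement ahead of the catch-all. The processor also has to be listed in the logs/container pipeline; its position relative to k8sattributes and batch is not specified in this document.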

cuda_oom.* attribute stanza (pattern #10)

The transform/cuda_oom processor projects PyTorch's canonical out-of-memory stderr line — RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 79.18 GiB of which 16.00 GiB is free. — onto the customer-stable cuda_oom.tried_alloc_bytes + cuda_oom.gpu_index attributes that pattern #10's detector (projectCUDAOOMLogRecord at module/processor/patterndetectorprocessor/cuda_oom.go) consumes. The detector's projection gate is BOTH cuda_oom.tried_alloc_bytes AND gpu.id (PCI BDF per RFC-0013 §3); this stanza stamps the bytes scalar and the human-visible GPU index off the body. gpu.id is not stamped here — the CUDA-runtime ordinal cuda_oom.gpu_index is a CUDA enumeration index, not a PCI BDF. Two operator-configurable paths populate gpu.id on the log resource so the detector's resource-attr fallback reads it:

| gpu.id source path | When to use |
| --- | --- |
| k8sattributesprocessor + nvidia.com/gpu device-plugin resource | The trainer pod requests one GPU via resources.limits.nvidia.com/gpu: 1. The NVIDIA device plugin annotates the pod with the allocated PCI BDF (nvidia.com/gpu-PCIDeviceBusID since device-plugin v0.16). Extend k8sattributes::extract::annotations to lift this annotation onto the log resource as gpu.id. Cheapest path, since it is already in the cluster's GPU scheduling fabric. |
| DCGM BDF-lookup transform indexed by cuda_oom.gpu_index | Multi-GPU pods (one container ↔ N GPUs) where the device-plugin annotation is the per-pod list, not the per-OOM GPU. Scrape the DCGM exporter's DCGM_FI_DEV_PCI_BUSID series, materialize a per-host {gpu_index → BDF} lookup, then add a sibling OTTL stanza that joins cuda_oom.gpu_index against the table to stamp gpu.id. Sibling to the pattern-2 / pattern-10 DCGM recipe. |
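
For the first path, the k8sattributes extension might look like the fragment below. It is a sketch that takes the annotation key exactly as named in the table and assumes it is present on the trainer pod. The annotations block uses the same tag_name / key / from shape as the labels block in the main config.

```yaml
processors:
  k8sattributes:
    extract:
      annotations:
        # Lift the device-plugin annotation onto the log resource as gpu.id.
        # Key copied from the table above; verify it exists on your pods first.
        - tag_name: gpu.id
          key: nvidia.com/gpu-PCIDeviceBusID
          from: pod
```

Merge this into the existing k8sattributes stanza rather than declaring a second processor, so the metadata and labels lists in the main config stay in effect.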

The recipe uses four per-unit-prefix branches (KiB / MiB / GiB / TiB) because OTTL has no capture-group-conditional dispatch; the multiplier must be a literal int64 per stanza. The body match captures whole (digits before the decimal) and frac (two digits after) and computes Int(whole) * UNIT + Int(frac) * (UNIT / 100), with UNIT / 100 pre-floored to an integer literal. PyTorch's format_size always emits %.2f, so the 2-digit frac capture is exhaustive. The two-decimal format limits resolution to 1% of the unit base (about 10 MB for a GiB-denominated alloc such as 99.99 GiB, roughly three orders of magnitude under the detector's 5% fragmentation threshold). The integer floor of UNIT / 100 adds less than one byte of undershoot per hundredth, so fewer than 100 bytes in total.

| Body shape | Captured | Stamped attributes |
| --- | --- | --- |
| CUDA out of memory. Tried to allocate \d+\.\d{2} KiB | whole, frac (×2 digits) | cuda_oom.tried_alloc_bytes = whole*1024 + frac*10 |
| CUDA out of memory. Tried to allocate \d+\.\d{2} MiB | whole, frac | cuda_oom.tried_alloc_bytes = whole*1048576 + frac*10485 |
| CUDA out of memory. Tried to allocate \d+\.\d{2} GiB | whole, frac | cuda_oom.tried_alloc_bytes = whole*1073741824 + frac*10737418 |
| CUDA out of memory. Tried to allocate \d+\.\d{2} TiB | whole, frac | cuda_oom.tried_alloc_bytes = whole*1099511627776 + frac*10995116277 |
| ... GPU \d+ has a total capacity | idx | cuda_oom.gpu_index = idx |

The where IsMatch(body, "CUDA out of memory\. Tried to allocate") guard is tight on the OOM-summary line, so generic CUDA errors (an illegal memory access was encountered, NCCL watchdog timeouts, DataLoader worker (pid N) is killed) do not trip the stanza — keeping the detector quiet on non-OOM stderr noise.
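
The unit branches and the guard could be sketched as the stanza below. It is an illustration, not the shipped recipe. It assumes the upstream OTTL functions merge_maps, ExtractPatterns, IsMatch, set, and Int, assumes OTTL math expressions accept converter results as operands, and assumes the log body is a plain string. Bracketed character classes stand in for backslash escapes to avoid YAML-versus-OTTL escaping differences.

```yaml
processors:
  transform/cuda_oom:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          # Capture whole / frac once; the full-shape guard keeps non-OOM lines out.
          - 'merge_maps(cache, ExtractPatterns(body, "Tried to allocate (?P<whole>[0-9]+)[.](?P<frac>[0-9]{2}) [KMGT]iB"), "upsert") where IsMatch(body, "CUDA out of memory[.] Tried to allocate [0-9]+[.][0-9]{2} [KMGT]iB")'
          # One branch per unit prefix, each with literal multipliers (UNIT and floor(UNIT / 100)).
          - 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(cache["whole"]) * 1024 + Int(cache["frac"]) * 10) where IsMatch(body, "CUDA out of memory[.] Tried to allocate [0-9]+[.][0-9]{2} KiB")'
          - 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(cache["whole"]) * 1048576 + Int(cache["frac"]) * 10485) where IsMatch(body, "CUDA out of memory[.] Tried to allocate [0-9]+[.][0-9]{2} MiB")'
          - 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(cache["whole"]) * 1073741824 + Int(cache["frac"]) * 10737418) where IsMatch(body, "CUDA out of memory[.] Tried to allocate [0-9]+[.][0-9]{2} GiB")'
          - 'set(attributes["cuda_oom.tried_alloc_bytes"], Int(cache["whole"]) * 1099511627776 + Int(cache["frac"]) * 10995116277) where IsMatch(body, "CUDA out of memory[.] Tried to allocate [0-9]+[.][0-9]{2} TiB")'
          # GPU ordinal from the same summary line.
          - 'merge_maps(cache, ExtractPatterns(body, "GPU (?P<idx>[0-9]+) has a total capacity"), "upsert") where IsMatch(body, "CUDA out of memory[.] Tried to allocate") and IsMatch(body, "GPU [0-9]+ has a total capacity")'
          - 'set(attributes["cuda_oom.gpu_index"], Int(cache["idx"])) where cache["idx"] != nil'
          # Worked values:
          #   "2.00 GiB" -> 2 * 1073741824 + 0 * 10737418 = 2147483648
          #   "1.50 MiB" -> 1 * 1048576 + 50 * 10485 = 1572826 (exact value 1572864, 38 bytes under)
```

Because every set is gated on the full Tried to allocate shape, a line whose size is not formatted with two decimals stamps nothing rather than a wrong value.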

Multi-line tracebacks. A PyTorch OOM emits the summary line followed by a Python traceback (File "train.py", line 42, in ...). The container parser flattens each newline-delimited log line into its own log record; only the summary line matches the regex above, so the detector sees exactly one stamp per OOM event regardless of traceback depth. This is pattern #10 spec Open Q#2's answer.

Placeholders

| Placeholder | What to fill in |
| --- | --- |
| REPLACE_WITH_OTLP_HTTP_ENDPOINT | The OTLP/HTTP base URL of your sink. /v1/logs is appended automatically per the OTLP/HTTP spec; do not include it. |

Tracecore does not expand environment variables in YAML. Render the literal endpoint at deploy time via envsubst, a Helm template, or a Kubernetes secret-injection driver. The placeholder is loud (the exporter rejects it on first dispatch) so a misconfigured rollout fails immediately instead of silently dropping logs.

Failure modes

| Symptom | First check |
| --- | --- |
| directory must exist at validate | Remove a stray operator override of file_storage::directory; create_directory: true only fires for the path declared in the same extension stanza. |
| Logs flow but k8s.pod.name is empty | The DaemonSet ServiceAccount is missing pods get,list,watch. Check kubectl auth can-i list pods --as system:serviceaccount:<ns>:<sa>. |
| Duplicate log lines after a restart | start_at: end ships only NEW lines; if you see duplicates the checkpoint directory is on an emptyDir or hostPath that got recreated. Move it to a node-local persistent path under /var/lib/. |
| failed to open /var/log/pods/... | The DaemonSet pod is missing the /var/log/pods hostPath mount, or the path is mounted read-write (kubelet's --root-dir overrides shift this on some distros). Mount read-only at the kubelet's actual root. |
| High-cardinality label explosion | The k8sattributes processor surfaces the app.kubernetes.io/name label plus whatever you add under extract::labels. Audit the list against the receiving backend's cardinality budget before adding more. |
| Pattern #7 verdict never fires despite known DataLoader stalls | The transform/dataloader_errors stanzas gate on substring matches against the container body. If your trainer wraps DataLoader errors (e.g. a custom logger that prefixes with JSON), the body shape changes. Confirm the raw body via kubectl logs <trainer-pod> --container=<c> --previous 2>&1 and extend the IsMatch predicates to match the wrapper format. |
| dataloader.error_class empty on a known error line | The OTTL stanza fell through silently because the body substring did not match any branch. Add a row to the dataloader.* table and a matching set(attributes["dataloader.error_class"], ...) statement. The detector's projection gate requires the attribute, so a missing class drops the discriminator. |
| Pattern #10 verdict never fires despite a known CUDA OOM | The transform/cuda_oom stanzas gate on substring matches against the container body. Confirm via kubectl logs <trainer-pod> --container=<c> --previous 2>&1 \| grep -E 'CUDA out of memory\. Tried to allocate'. If the trainer wraps PyTorch errors (custom logger, JSON envelope), the body shape changes; extend the IsMatch predicates to match the wrapper format. Also check that gpu.id is being stamped onto the log resource via one of the two paths in the cuda_oom.* section: a missing gpu.id drops the projection at cuda_oom.go's gate and the detector stays quiet. |
| cuda_oom.tried_alloc_bytes stamped with a wildly wrong magnitude | A unit-prefix branch was modified without updating its multiplier, or the body shape drifted from %.2f. PyTorch's format_size has used %.2f for the entire CUDA-allocator lifetime; if a customer fork emits %.0f or %.4f the recipe's \d{2} capture misses, and the stanza fails open (no stamp) rather than producing a wrong value. Verify against pytorch/c10/util/Exception.h's formatter. |

Upstream component docs: receiver/filelogreceiver, receiver/filelogreceiver/internal/parser/container, processor/k8sattributesprocessor, processor/transformprocessor, extension/storage/filestorage. Self-telemetry counters appear under the standard otelcol_receiver_* and otelcol_processor_* metric families from service/telemetry.