README.md

Tracecore docs index

One-line purpose + audience per file. The top-level README has the "who is this for?" routing table; this file is the fall-through index for everything in docs/.

New contributors: read rfcs/0013-distro-first-pivot.md first. It sets the binding architectural posture: tracecore is an OCB-assembled OTel Collector distribution plus a pattern library; in-house code is bounded to the four moat scopes in §6; and the bundled Helm-chart OTTL recipe preserves the customer-stable contracts in §3 across the receiver swaps.

Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 external user

Top-level

| File | Audience | Purpose |
| --- | --- | --- |
| STRATEGY.md | 🏛️ | Long-term architectural posture; the OTel-Collector-compatible-by-default principle; accepted divergences with rationale. |
| getting-started.md | 👤 | 5 commands from clone to first OTLP byte; no GPU required. |
| FAILURE-MODES.md | 👤 🛠️ | Runtime-level failure inventory (lifecycle, signals, data flow, shutdown timing, self-telemetry surface). Per-component modes live in each component's RUNBOOK. |
| FOLLOWUPS.md | 🏛️ | Redirect stub. Tracked deferrals now live as per-milestone shards under followups/. |
| FLAKY-TESTS.md | 🛠️ | Known CI flakes + retry policy. |
| HARDWARE-TESTING.md | 🛠️ | GPU testbed setup for receiver authors and CI runner provisioners. |
| STYLE-docs.md | 🛠️ | Documentation style guide: WHY-not-WHAT, CI-enforced invariants, banned phrases. |
| STYLE-errors.md | 🛠️ | Error-message style guide; extends top-level STYLE.md. |
| nps.md | 🏛️ | NPS-style operator-feedback methodology (survey instrument + calculation). |
| adoption-pipeline.md | 🏛️ | Partner outreach funnel + NPS pipeline mechanics + public-counter rules for the O5 hero KPI. |
| reference-environments.md | 🏛️ 👤 | Binding spec for the two NORTHSTARS O2 reference tiers (Minimal CI-runnable; Production-realistic 32-GPU H100). |
| SUPPORT-MATRIX.md | 👤 🏛️ | v1.0-rc1 platform/version/GPU envelope: k8s versions, OTel collector pin, Go, Linux distros, NVIDIA/AMD/Intel tiers, CNI plugins. Every row cites the source-of-truth file or CI gate. |
| reproducibility.md | 🌐 👤 | Third-party verification recipe for published releases: rebuild + diffoscope + cosign + SLSA + SBOM. |
| maintainership.md | 🏛️ 🛠️ | Governance: who has commit access, how RFCs are sponsored, how security issues are handled. |
| ATTRIBUTES.md | 👤 🛠️ | Customer-stable attribute namespace inventory + soft-lock policy. Every pattern.* / tracecore.* / hw.gpu.* / k8s.* / nccl.fr.* / kernelevents.* / gen_ai.training.* key the collector emits or consumes, with stability tags and the v0.4-advisory → v1.0-enforced rename policy. |
| v1-rc1-cut-criteria.md | 🏛️ | Twelve falsifiable rubrics for the v1.0.0-rc1 cut (deriving from NORTHSTARS O1-O7) + Tier-2 GA path-clearing items + out-of-scope deferrals. Authoritative rubric source for MILESTONES.md M22. |
| history/v1-rc1/ | 🏛️ | Archived v1.0-rc1 audit snapshots: governance-gaps + operational-gaps. See history/v1-rc1/README.md. |
| standards-roadmap.md | 🏛️ | NORTHSTARS O4 tracking artifact for the gen_ai.training.* semconv upstream motion. Inventory of upstream + tracecore-emitted training keys, proposal set for PR-1/PR-2, SIG cadence (Tuesdays 09:00 PT), competing-proposal risk (rl.* Issue #88), and cross-ref to in-repo work that depends on each PR landing. |
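
The namespace list in the ATTRIBUTES.md row above can be read as a set of key prefixes. The following is a minimal sketch of a membership check, not the repository's own tooling: the names `INVENTORIED_NAMESPACES` and `in_inventory` are hypothetical, and treating the listed entries as fnmatch-style globs is an assumption made for illustration.

```python
from fnmatch import fnmatchcase

# Namespace entries copied from the ATTRIBUTES.md row in the table above.
# Interpreting them as fnmatch-style globs is an assumption of this sketch.
INVENTORIED_NAMESPACES = (
    "pattern.*",
    "tracecore.*",
    "hw.gpu.*",
    "k8s.*",
    "nccl.fr.*",
    "kernelevents.*",
    "gen_ai.training.*",
)


def in_inventory(key: str) -> bool:
    """Return True if `key` falls under one of the listed namespaces.

    Hypothetical helper: it only checks the prefix, not stability tags
    or the rename policy, which live in ATTRIBUTES.md itself.
    """
    return any(fnmatchcase(key, glob) for glob in INVENTORIED_NAMESPACES)
```

For example, `in_inventory("kernelevents.xid")` and `in_inventory("k8s.event.hint")` would both pass this check, while a key outside the listed namespaces would not.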

Subdirectories

| Path | Audience | Purpose |
| --- | --- | --- |
| rfcs/ | 🏛️ 🛠️ | Architecture decision records. See rfcs/README.md for the status index. |
| patterns/ | 🛠️ 👤 | Root-cause-pattern walkthroughs (shipped detectors: NVLink, HBM ECC, thermal, PCIe AER) plus design specs for the 8 planned v1 patterns (#2, #7-#13). |
| proposals/ | 🏛️ | Drafts pending upstream (semconv extensions, etc.). |
| research/ | 🛠️ | Synthesized findings from reading external sources (OTel collector internals, benchmark baselines). |
| schemas/ | 🛠️ | Receiver schema documents pointed at by emitted SchemaURL. |
| sdk/ | 🌐 👤 | Verdict-consumption SDKs (Python + Go): typed clients for the v1.0-rc1 envelope. Closes v1-rc1 cut criterion 12. |
| examples/ | 👤 | Reference operator artifacts (Prometheus alerts, Grafana dashboard, with-telemetry config). |
| followups/ | 🏛️ | Per-milestone follow-up shards + cross-cutting _needs-prod-data / _needs-gpu buckets. See followups/README.md for filing convention. |
| integrations/ | 👤 | Validated recipes for shipping tracecore output to specific backends. See per-recipe rows below. |
| migration/ | 👤 | Per-minor-release upgrade guides covering every operator-visible break. One file per release boundary; the consolidated v0.x-to-v1.0.md flattens the v0.x → v1.0.0-rc1 hop (cut criterion 11). |
| reference-architectures/ | 👤 | Three operator-facing deployment archs (single-cluster, multi-cluster aggregation, air-gapped); each names scale envelope, required components, values overlay, NetworkPolicy posture, and validation checklist (cut criterion 9). |
| notes/ | 🛠️ 🏛️ | Working notes on process, CI, PR workflow, reviews, conftest, autonomous-run logs. See notes/README.md. |

Integrations

Backend (exporter-side) recipes:

| File | Audience | Purpose |
| --- | --- | --- |
| integrations/otel-backend.md | 👤 | OTLP/HTTP to a generic OpenTelemetry Collector via the upstream otlphttp exporter. |
| integrations/honeycomb.md | 👤 | Direct OTLP/HTTP to Honeycomb via the upstream otlphttp exporter. |
| integrations/datadog.md | 👤 | Datadog via the bundled datadogexporter. |
| integrations/clickhouse-direct.md | 👤 | Self-hosted ClickHouse via the bundled clickhouseexporter. |
| integrations/loki.md | 👤 | Grafana Loki via OTLP/HTTP native ingestion (otlphttp exporter, X-Scope-OrgID tenant header); labels-vs-structured-metadata mapping for pattern.* verdict attributes. |
| integrations/tempo.md | 👤 | Grafana Tempo (OSS, AGPL-3.0) trace backend via the upstream otlphttp exporter. |
| integrations/multi-cluster.md | 👤 | Multi-cluster federation v0 (read-only roll-up): N source clusters stamp cluster.id via OTTL transform, forward OTLP/HTTP to a central aggregation collector that fans out to backends. Production deploys MUST use the bearer-token or mTLS auth shape documented in Cross-cluster authentication; the unauth path is for in-trust-boundary validation only. |
| integrations/cert-manager-mtls.md | 👤 | cert-manager-issued mTLS for multi-cluster OTLP egress: ClusterIssuer + Certificate manifests, Secret mount layout, tls: block wiring under /etc/tracecore/tls/. |

Source (receiver-side) recipes — RFC-0013 §migration PR-J replacements for the deleted in-tree receivers:

| File | Audience | Purpose |
| --- | --- | --- |
| integrations/filelog-container.md | 👤 | Container stdout/stderr tailing via filelogreceiver + container parser + k8sattributesprocessor + file_storage. Replaces containerstdout. |
| integrations/journald-kernel.md | 👤 | Kernel + systemd events via journaldreceiver + filelogreceiver (kmsg) + OTTL transform preserving kernelevents.xid / gpu.id. Replaces kernelevents. |
| integrations/k8sobjects-events.md | 👤 | Kubernetes events via k8sobjectsreceiver + OTTL transform preserving the eleven-entry k8s.event.hint enum. Replaces k8sevents. |
| integrations/prometheus-scrape.md | 👤 | Generic Prometheus scrape via prometheusreceiver (dcgm-exporter, AMD/Intel/Habana exporters, Kueue) + OTTL gpu.vendor normalization. Replaces dcgm and kueue. |

Per-component docs

| Path | Audience | Purpose |
| --- | --- | --- |
| module/receiver/ncclfrreceiver/README.md | 👤 🛠️ | NCCL FlightRecorder receiver + safe-pickle parser scope (RFC-0013 PR-I.1b moved out of components/). |
| module/receiver/ncclfrreceiver/RUNBOOK.md | 👤 | Operator playbook + per-kind triage (incl. pickle deny-boundary). |
| components/receivers/pyspy/README.md | 👤 🛠️ | On-demand Python stack-sampling receiver (faulthandler-based). Scheduled for deletion per RFC-0013 §7. |
| components/receivers/pyspy/RUNBOOK.md | 👤 | Operator playbook + per-kind triage (RFC-0009 degraded modes). Scheduled for deletion per RFC-0013 §7. |

What goes where (for contributors)

  • Why a load-bearing decision was made → an RFC under rfcs/.
  • What a quarterly commitment looks like → MILESTONES.md.
  • Tracked deferrals (with revisit triggers) → the matching per-milestone shard under followups/ (or _needs-prod-data.md / _needs-gpu.md for resource-gated items). See followups/README.md for which shard owns what.
  • A failure mode + the test that pins it → FAILURE-MODES.md (runtime) or the component RUNBOOK (per-component).
  • A pattern that transfers across receivers → in-source as a doc comment on the canonical implementation; the next author copies the code.
  • An external-research summary, measurement baseline, or RFC audit-trail deep-dive → research/. Files here are bounded investigations that fed an RFC and remain useful as that RFC's audit trail. Linked from the RFC, MILESTONES.md entry, or another research file (no orphans). See research/README.md.
  • An anchored micro-lesson on process / CI / PR workflow / reviews / autonomous-run retros → notes/. Entries are short imperative-titled bodies with a mandatory anchor (file path, test name, command, or grep query). Captured via the learn-from-mistakes skill. See notes/README.md.
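
The "no orphans" rule for research/ in the list above lends itself to a mechanical check. Below is a minimal sketch under stated assumptions, not an existing tracecore script: `find_orphans` and `LINK_TARGET` are hypothetical names, links are assumed to be relative markdown links of the form `[text](path)`, and any markdown file under the scanned root counts as a valid referrer (a simplification of "the RFC, MILESTONES.md entry, or another research file").

```python
import re
from pathlib import Path

# Captures the target of a markdown link, stopping at ')', '#', or whitespace.
# Pure in-page anchors such as (#section) do not match.
LINK_TARGET = re.compile(r"\]\(([^)#\s]+)")


def find_orphans(docs_root: Path) -> list[str]:
    """Return names of research/*.md files that no other markdown file links to.

    Hypothetical helper. research/README.md is excluded from the candidates
    because it is the directory index, not an investigation.
    """
    research_dir = docs_root / "research"
    candidates = {p.resolve() for p in research_dir.glob("*.md")}
    candidates.discard((research_dir / "README.md").resolve())

    linked = set()
    for md in docs_root.rglob("*.md"):
        for target in LINK_TARGET.findall(md.read_text(encoding="utf-8")):
            if "://" in target:
                continue  # external URL, not a file in the tree
            resolved = (md.parent / target).resolve()
            if resolved != md.resolve():  # a file linking to itself does not count
                linked.add(resolved)

    return sorted(p.name for p in candidates - linked)
```

Run against a docs checkout as `find_orphans(Path("docs"))`; an empty list means every research file has at least one inbound link under this sketch's simplified notion of a referrer.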

research/ vs notes/ boundary in one line: research/ is topic-scoped deep dives that feed a downstream RFC or proposal; notes/ is anchored process lessons that the next session looks up on demand. If it cites external sources or runs power-analysis arithmetic, it is research; if it ends in "next time, do X / don't do Y" against a grep-able anchor, it is a note.