Grafana Loki

Loki ingests OTLP/HTTP logs natively at /otlp/v1/logs since Loki 3.0 (2024). Tracecore reaches it directly through the upstream otlphttp exporter bundled in the OCB-assembled tracecore distro; no Loki-specific exporter is required, and the deprecated contrib lokiexporter is intentionally not bundled (RFC-0013 §2 adoption matrix). The tenant ID travels in the X-Scope-OrgID header.

Deployment shape:

tracecore (otlphttp exporter) ──▶ Loki distributor (/otlp/v1/logs)

Config

# docs/integrations/examples/loki.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp/loki:
    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
    compression: gzip
    headers:
      X-Scope-OrgID: tracecore

service:
  pipelines:
    logs/loki:
      receivers: [otlp]
      exporters: [otlphttp/loki]

Validate with the in-tree binary:

./tracecore validate --config=docs/integrations/examples/loki.yaml

Endpoint and tenant

  • The endpoint is the Loki distributor's HTTP listener at the path /otlp; the otlphttp exporter appends the OTLP-spec /v1/logs suffix automatically, so the request lands at /otlp/v1/logs. Do not include /v1/logs in the endpoint value: the exporter appends the suffix regardless, so the request would go to /otlp/v1/logs/v1/logs, which is not a path Loki serves.
  • X-Scope-OrgID identifies the tenant when Loki's distributor runs with auth_enabled: true. Single-tenant clusters (auth_enabled: false) accept requests without the header and route them under the synthetic tenant fake; you can drop the headers: block in that case.
  • Loki Operator and Grafana Enterprise Logs (GEL) layer additional multi-tenant auth on top (e.g. mTLS gateways, per-tenant rate limits); those are optional, not required for the basic OSS install.
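The path and header rules above can be sketched in a few lines of Python. This is an illustration only, not tracecore or exporter code: the helper names `logs_url` and `tenant_headers` are hypothetical, and the sketch builds strings and dictionaries without sending anything.

```python
from typing import Dict, Optional

# Signal path the otlphttp exporter appends to the configured `endpoint`.
OTLP_LOGS_SUFFIX = "/v1/logs"


def logs_url(endpoint: str) -> str:
    """Return the URL a log export lands at for a given `endpoint` value."""
    return endpoint.rstrip("/") + OTLP_LOGS_SUFFIX


def tenant_headers(tenant: Optional[str]) -> Dict[str, str]:
    """Return the tenant header, or no header for single-tenant Loki."""
    if not tenant:
        return {}  # auth_enabled: false -> Loki routes under the tenant "fake"
    return {"X-Scope-OrgID": tenant}


base = "http://loki-distributor.observability.svc.cluster.local:3100/otlp"
print(logs_url(base))                # ends in /otlp/v1/logs
print(logs_url(base + "/v1/logs"))   # suffix ends up duplicated
print(tenant_headers("tracecore"))
print(tenant_headers(None))
```

The second `print` shows why the YAML should stop at /otlp: the suffix is appended whether or not it is already present.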

Labels vs. structured metadata (the cardinality footgun)

Loki indexes logs by stream labels and stores everything else as structured metadata (queryable in LogQL, NOT indexed). Label cardinality directly drives index size and query cost; the canonical Loki guidance is to keep label values in the low hundreds per stream.

The distributor's OTLP receiver maps OTLP attributes into three buckets:

| Source | Default mapping | Cardinality risk |
| --- | --- | --- |
| OTLP resource attributes | Index labels (only the ones in default_resource_attributes_as_index_labels) | Bounded; the default list is curated. |
| OTLP scope attributes | Structured metadata | Low; instrumentation-scope is rarely high-cardinality. |
| OTLP log attributes | Structured metadata | Safe by default; high-cardinality keys (e.g. pattern.verdict_json) stay out of the label index. |

The Loki-side defaults at the distributor pick up these resource attributes as stream labels (from default_resource_attributes_as_index_labels):

service.name, service.namespace, deployment.environment, deployment.environment.name, cloud.region, cloud.availability_zone, k8s.cluster.name, k8s.namespace.name, k8s.container.name, container.name, k8s.replicaset.name, k8s.deployment.name, k8s.statefulset.name, k8s.daemonset.name, k8s.cronjob.name, k8s.job.name.
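As a rough model of the default mapping described above, the following Python sketch classifies a single OTLP attribute. It covers only the defaults listed in this section; any otlp_config actions on the Loki side are not modeled, and the function name `classify` is hypothetical.

```python
# Default resource attributes promoted to stream labels (list copied from above).
DEFAULT_INDEX_LABELS = frozenset({
    "service.name", "service.namespace",
    "deployment.environment", "deployment.environment.name",
    "cloud.region", "cloud.availability_zone",
    "k8s.cluster.name", "k8s.namespace.name",
    "k8s.container.name", "container.name",
    "k8s.replicaset.name", "k8s.deployment.name",
    "k8s.statefulset.name", "k8s.daemonset.name",
    "k8s.cronjob.name", "k8s.job.name",
})


def classify(source: str, key: str) -> str:
    """Return 'index_label' or 'structured_metadata' for one attribute.

    `source` is 'resource', 'scope', or 'log'.
    """
    if source not in ("resource", "scope", "log"):
        raise ValueError("unknown attribute source: " + source)
    # Only resource attributes on the default list become stream labels.
    if source == "resource" and key in DEFAULT_INDEX_LABELS:
        return "index_label"
    return "structured_metadata"


print(classify("resource", "service.name"))     # index_label
print(classify("resource", "k8s.node.name"))    # structured_metadata
print(classify("log", "pattern.verdict_json"))  # structured_metadata
```

Note that a log attribute never becomes a label in this model, even when its key matches a name on the default list.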

Operator-side tuning lives in Loki's config, not in tracecore:

# loki.yaml (on the LOKI side, NOT in tracecore)
limits_config:
  allow_structured_metadata: true   # default in Loki 3.0+
  otlp_config:
    resource_attributes:
      attributes_config:
        - action: index_label
          regex: k8s\.node\.name     # opt-in: index by node
    log_attributes:
      - action: structured_metadata
        attributes:
          - pattern.id
          - pattern.headline
          - pattern.remediation
          - pattern.confidence
          - pattern.verdict_json

When OTLP attributes flow into Loki via the native OTLP endpoint (/otlp/v1/logs, this recipe's target), they land as structured metadata with dots translated to underscores at the LogQL surface — no bucket prefix. An attribute pattern.id on a log record is queried as pattern_id; a resource attribute k8s.node.name is queried as k8s_node_name. Verify against Loki upstream's "Format considerations" doc (docs/sources/shared/otel.md); the structured-metadata + dots → underscores normalization is stable since Loki 3.0.
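The naming rule above is simple enough to express as a short Python sketch. It handles only the dot-to-underscore case this section describes; the helper names, the `tracecore` service name, and the `example-pattern` value are placeholders, and the query builder does no escaping of quotes in its inputs.

```python
def logql_key(otlp_attribute: str) -> str:
    """Name of an OTLP attribute on the native OTLP surface: dots become underscores."""
    return otlp_attribute.replace(".", "_")


def verdict_query(service_name: str, pattern_id: str) -> str:
    """Build a LogQL query: stream selector on a label, then a structured-metadata filter."""
    selector = '{%s="%s"}' % (logql_key("service.name"), service_name)
    metadata_filter = '%s="%s"' % (logql_key("pattern.id"), pattern_id)
    return selector + " | " + metadata_filter


print(logql_key("pattern.id"))      # pattern_id
print(logql_key("k8s.node.name"))   # k8s_node_name
print(verdict_query("tracecore", "example-pattern"))
# {service_name="tracecore"} | pattern_id="example-pattern"
```

The stream selector uses service_name because service.name is on the default index-label list; pattern_id is filtered after the pipe because it is structured metadata, not a label.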

Promtail / Grafana Alloy users see different keys. When tracecore logs are routed through a Promtail / Alloy pipeline with a JSON parser stage (| json), OTLP attributes appear as JSON-body fields with the attributes_ / resources_ bucket prefix (e.g. attributes_pattern_id, resources_k8s_node_name). That is the Promtail-extraction surface, NOT the native OTLP surface. This recipe targets the native endpoint; if you must use Promtail/Alloy, add | json to LogQL queries and switch to the prefixed names.
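For comparison, the prefixed names on the Promtail / Alloy surface can be modeled the same way. This Python sketch follows only the two prefixes named in the paragraph above; `promtail_key` is a hypothetical helper, not part of any of these tools.

```python
# Bucket prefixes seen after a JSON parser stage, as described above.
BUCKET_PREFIX = {"log": "attributes_", "resource": "resources_"}


def promtail_key(bucket: str, otlp_attribute: str) -> str:
    """Field name after `| json`: bucket prefix plus the underscored attribute."""
    return BUCKET_PREFIX[bucket] + otlp_attribute.replace(".", "_")


print(promtail_key("log", "pattern.id"))          # attributes_pattern_id
print(promtail_key("resource", "k8s.node.name"))  # resources_k8s_node_name
```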

Tracecore-specific attributes

The patterndetectorprocessor emits verdict records carrying these attributes (defined in module/processor/patterndetectorprocessor/patterndetector.go):

  • pattern.id, pattern.headline, pattern.remediation, pattern.confidence, pattern.verdict_json
  • k8s.pod.name, k8s.pod.namespace, k8s.node.name
  • k8s.event.reason
  • nccl.fr.pg_id, nccl.fr.collective_seq_id, nccl.fr.hanging_ranks_count

All ship as log attributes, so all land in Loki as structured metadata by default. This is the right shape: pattern.verdict_json in particular is per-incident JSON and would explode the label index if promoted. The dashboards consume them as pattern_id, k8s_node_name, etc.: bare underscored names with no bucket prefix, matching the native-OTLP surface (see the See also section below).

Only the resource attributes that accompany a verdict record (that is, the attributes of its enclosing OTLP resource) are candidates for the label index, and the default list above already covers k8s.namespace.name, k8s.cluster.name, service.name, and the rest of the k8s workload axis.

Retention

Retention is configured on the Loki side via compactor.retention_* and limits_config.retention_period (global or per-tenant; per-stream overrides use limits_config.retention_stream). Tracecore does not control retention; the recipe assumes the operator has set a global retention compatible with the verdict signal (roughly 14 to 30 days is typical for incident review; longer for compliance). If the cluster has retention disabled, verdicts accumulate indefinitely until disk fills, so set at least a default retention_period before pointing tracecore at the cluster.

Secret handling

Same shape as the other recipes: render the literal X-Scope-OrgID value at deploy time through envsubst, Helm, or a CSI secret driver if the tenant identifier is sensitive. The example file ships the literal tracecore so tracecore validate succeeds offline. Single-tenant Loki clusters can drop the headers: block entirely.
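A minimal sketch of deploy-time rendering, written in Python with the standard library's string.Template to mimic what envsubst does. The variable name LOKI_TENANT_ID and the tenant value team-a are assumptions made for this example; neither appears in the recipe or in tracecore.

```python
import os
from string import Template

# Exporter block from this recipe, with the tenant replaced by a placeholder.
EXPORTER_TEMPLATE = Template("""\
exporters:
  otlphttp/loki:
    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
    compression: gzip
    headers:
      X-Scope-OrgID: ${LOKI_TENANT_ID}
""")


def render(env=None) -> str:
    """Substitute ${LOKI_TENANT_ID}; raises KeyError if the variable is unset."""
    env = os.environ if env is None else env
    return EXPORTER_TEMPLATE.substitute(env)


# Pass an explicit mapping here so the example does not depend on the shell.
print(render({"LOKI_TENANT_ID": "team-a"}))
```

Failing on an unset variable is deliberate in this sketch: a rendered config with an empty tenant header is harder to diagnose than a failed deploy step.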

Failure modes

| Symptom | First check |
| --- | --- |
| HTTP 401 / 403 from Loki | Auth gateway in front of the distributor is rejecting the request. Confirm the deployed X-Scope-OrgID value matches the gateway's tenant allow-list. |
| HTTP 400 the request body is too large | Tracecore is sending batches above limits_config.ingestion_rate_mb. Lower the batchprocessor flush size or raise the Loki limit. |
| HTTP 400 structured metadata is not allowed | Loki is below 3.0 OR limits_config.allow_structured_metadata is false. Upgrade Loki, or flip the limit. The OTLP receiver always emits structured metadata for non-label attributes. |
| HTTP 429 with Retry-After | Loki's per-tenant ingestion rate-limit is engaged. Either aggregate at tracecore (batchprocessor) before the exporter or raise ingestion_rate_mb / ingestion_burst_size_mb on the Loki side. |
| Verdicts arrive but pattern.id is missing from LogQL | The Loki distributor dropped log attributes per otlp_config.log_attributes. Confirm the operator-side config includes action: structured_metadata for pattern.* (see the labels-vs-metadata section above). |
| Repeated TLS handshake failures | The default trust store covers most managed Lokis. If a corporate proxy MITMs egress, install the proxy CA in the system trust store; do not enable insecure_skip_verify in production. |
| Stream cardinality alerts on the Loki cluster | Confirm no high-cardinality OTLP resource attribute (e.g. service.instance.id) was added to default_resource_attributes_as_index_labels; that list defaults sanely but is the most common operator footgun. |
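The HTTP rows of the failure-mode table can be turned into a small triage helper, sketched here in Python. Matching on fragments of the response body is an assumption: the exact wording of Loki's error messages may differ between versions, and `first_check` is a hypothetical name.

```python
def first_check(status: int, body: str = "") -> str:
    """Map an HTTP response from Loki to the first thing to check."""
    text = body.lower()
    if status in (401, 403):
        return "auth: compare X-Scope-OrgID with the gateway's tenant allow-list"
    if status == 400 and "structured metadata" in text:
        return "metadata: check the Loki version (3.0+) and allow_structured_metadata"
    if status == 400 and "too large" in text:
        return "size: lower the batch flush size or raise the Loki limit"
    if status == 429:
        return "rate: batch before the exporter or raise the Loki ingestion limits"
    return "other: not covered by the HTTP rows of the table"


print(first_check(403))
print(first_check(400, "structured metadata is not allowed"))
print(first_check(400, "the request body is too large"))
print(first_check(429))
print(first_check(500))
```

The TLS, missing-attribute, and cardinality rows are left out because they do not surface as a single HTTP status on the export path.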

See also