v0.2 ModelService observability.traces: OTLP gateway emission

Customers running production AI workloads need per-request observability: prompts, completions, tokens, latency, eval scores, user sessions. That layer is filled today by Langfuse, Langsmith, Phoenix, Helicone, Datadog LLM Observability, and homegrown OTel pipelines. Modelplane sits at the gateway and sees every request, so it's the natural emission point. But building in support for a single app-observability backend would lock out customers who use any of the others.

The right shape is: **the gateway emits OpenTelemetry traces with `gen_ai.*` semantic conventions to an operator-configured OTLP endpoint**. Customers point at Langfuse, Langsmith, Phoenix, a custom OTel collector, or anything else that speaks OTLP. Modelplane stays neutral.

This is the v0.2 companion to [#74 fleet signal bus](https://github.com/modelplaneai/modelplane/issues/74). Both share OTLP as the transport, but carry different signal types for different audiences.

## Two observability planes, one transport

|  | [#74 signal bus](https://github.com/modelplaneai/modelplane/issues/74) | This issue (OTel gateway traces) |
|---|---|---|
| **Signal** | Aggregated metrics, sketches, time series | Per-request spans (`gen_ai.*`) |
| **Volume** | KB/s per cluster | MB/s per active deployment (sampled) |
| **Retention** | 30d sketches in control plane | Hours hot / months cold in user's backend |
| **Audience** | Operators, scheduler, fleet matcher | Application developers, eval pipelines |
| **PII exposure** | Hashes / aggregates only | Full prompts/responses |
| **Backend** | Modelplane control plane | User's choice (Langfuse / Langsmith / Phoenix / custom) |

Both planes run on the same OTLP collector substrate: one fan-out aggregates spans into #74 sketches, the other forwards raw spans to the user's configured backend.
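
As a rough illustration of the two fan-outs, the sketch below sends each finished span down both paths: one folds it into a per-model aggregate, the other queues it unchanged for forwarding. The `Span` and `FanOut` types and the aggregate fields are hypothetical names used only for illustration; the actual #74 sketch format is not defined in this issue.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified stand-in for an OTLP span."""
    name: str
    attributes: dict
    duration_ms: float


def _empty_aggregate() -> dict:
    # Assumed aggregate shape; the real #74 sketches may look different.
    return {"requests": 0, "input_tokens": 0, "latency_ms": 0.0}


@dataclass
class FanOut:
    """Routes every finished span to both planes."""
    # Plane 1: per-model aggregates destined for the control plane.
    aggregates: dict = field(default_factory=lambda: defaultdict(_empty_aggregate))
    # Plane 2: raw spans queued for the user's configured backend.
    forward_queue: list = field(default_factory=list)

    def on_span_end(self, span: Span) -> None:
        model = span.attributes.get("gen_ai.request.model", "unknown")
        agg = self.aggregates[model]
        agg["requests"] += 1
        agg["input_tokens"] += span.attributes.get("gen_ai.usage.input_tokens", 0)
        agg["latency_ms"] += span.duration_ms
        # Only this path carries the full span, prompts included.
        self.forward_queue.append(span)
```

Keeping the aggregate path free of raw attributes is what lets the two planes have different PII exposure while sharing one collector.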

## Shape

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
spec:
  endpoints:
    - selector: { matchLabels: { modelplane.ai/deployment: kimi-k2 } }
  observability:
    traces:
      enabled: true
      endpoint: https://cloud.langfuse.com/api/public/otel    # or langsmith / phoenix / custom otel-collector
      headers:
        Authorization:
          valueFrom: { secretKeyRef: { name: langfuse-creds, key: auth } }
      sampling:
        ratio: 1.0
```
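
One way `sampling.ratio` could be applied is a deterministic decision keyed on the trace ID, similar in spirit to OpenTelemetry's trace-ID-ratio sampler. This issue does not specify the sampling algorithm, so the function below (`should_sample` is a hypothetical name) is only a sketch under that assumption.

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide whether to export a trace, given a ratio in [0.0, 1.0].

    Assumption: the decision depends only on the trace ID, so every span
    of a given trace gets the same answer.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("sampling ratio must be in [0.0, 1.0]")
    # Compare the low 64 bits of the trace ID against a ratio-scaled bound.
    low64 = trace_id & ((1 << 64) - 1)
    return low64 < int(ratio * (1 << 64))
```

With `ratio: 1.0`, as in the example above, every trace passes; with `ratio: 0.0`, none do.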

The gateway emits `gen_ai.completion`, `gen_ai.embedding`, and `gen_ai.tool_call` spans to the configured OTLP endpoint, carrying the standard attributes (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.response.finish_reasons`) plus latency and error status.
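
A minimal sketch of how the attribute set for one completion span might be assembled. The helper name `build_span_attributes` is hypothetical, and mapping the error onto `error.type` is an assumption (this issue only says "error"). Latency is not set as an attribute here because it is assumed to come from the span's start and end timestamps.

```python
from typing import Optional, Sequence


def build_span_attributes(
    system: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: Sequence[str],
    error_type: Optional[str] = None,
) -> dict:
    """Return the gen_ai.* attributes for a single request span."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }
    if error_type is not None:
        # Assumption: the error class is carried in `error.type`.
        attrs["error.type"] = error_type
    return attrs
```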

## Cache-family attribute enrichment

The gateway already reads cache-family signals via #74 for routing decisions. The same information is attached to outgoing spans as attributes, so operators can correlate per-request latency with cache state in their tracing backend:

```
gen_ai.request.model           = "meta-llama/Llama-3.3-70B-Instruct"
gen_ai.usage.input_tokens      = 4231
gen_ai.usage.output_tokens     = 187
modelplane.cache.weights       = "llama-3-3-70b"   # ModelCache ref
modelplane.cache.prefix_warmth = "hot"             # HotPrefixPool hit
modelplane.cache.kv_tier_hit   = "L2"              # KVOffloadTier tier
modelplane.routing.cluster     = "us-east-1"
```
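
The enrichment step could look roughly like the following. The `cache_state` dict and its keys (`weights`, `prefix_warmth`, `kv_tier_hit`) are a hypothetical in-memory view of what the gateway reads from #74; only the `modelplane.*` attribute names come from the example above.

```python
from typing import Optional

# Maps assumed cache-state fields to the span attribute names shown above.
CACHE_ATTRIBUTE_KEYS = {
    "weights": "modelplane.cache.weights",
    "prefix_warmth": "modelplane.cache.prefix_warmth",
    "kv_tier_hit": "modelplane.cache.kv_tier_hit",
}


def enrich_with_cache_state(
    span_attrs: dict, cache_state: dict, cluster: Optional[str] = None
) -> dict:
    """Return a copy of span_attrs with cache and routing attributes added."""
    enriched = dict(span_attrs)
    for name, key in CACHE_ATTRIBUTE_KEYS.items():
        value = cache_state.get(name)
        if value is not None:
            # Skip unknown signals rather than emitting empty attributes.
            enriched[key] = value
    if cluster is not None:
        enriched["modelplane.routing.cluster"] = cluster
    return enriched
```

Returning a copy keeps the base `gen_ai.*` attributes untouched if enrichment is skipped or fails.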

## v0.2 scope

- `ModelService.spec.observability.traces` block with the shape above
- Gateway-side OTel instrumentation with `gen_ai.*` semantic conventions
- Standard attributes: model name, tokens, latency, finish reason, error class
- Cache-family attribute enrichment (consuming [#66 ModelCache](https://github.com/modelplaneai/modelplane/issues/66) / [#72 KVOffloadTier](https://github.com/modelplaneai/modelplane/issues/72) / [#73 HotPrefixPool](https://github.com/modelplaneai/modelplane/issues/73) status)
- Single OTLP endpoint per ModelService; sampling ratio configurable
- Same OTLP collector substrate as [#74](https://github.com/modelplaneai/modelplane/issues/74) — one workload-plane agent, two control-plane outputs
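
The constraints in this scope list (a single endpoint, a configurable ratio) suggest some admission-time validation of the `traces` block. The sketch below operates on the already-parsed block as a plain dict; `validate_traces_config` is a hypothetical helper, and treating a missing ratio as `1.0` is an assumption, not a decided default.

```python
from urllib.parse import urlparse


def validate_traces_config(traces: dict) -> list:
    """Return a list of human-readable errors; empty means valid."""
    errors = []
    if not traces.get("enabled", False):
        return errors  # Nothing to check when tracing is off.

    endpoint = traces.get("endpoint")
    if not isinstance(endpoint, str):
        # v0.2 scope: exactly one OTLP endpoint per ModelService.
        errors.append("endpoint must be a single URL string")
    else:
        parsed = urlparse(endpoint)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("endpoint must be an http(s) URL")

    # Assumption: a missing ratio means "sample everything".
    ratio = traces.get("sampling", {}).get("ratio", 1.0)
    is_number = isinstance(ratio, (int, float)) and not isinstance(ratio, bool)
    if not is_number or not 0.0 <= ratio <= 1.0:
        errors.append("sampling.ratio must be a number in [0.0, 1.0]")
    return errors
```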

## Related

- [#74 Fleet signal bus](https://github.com/modelplaneai/modelplane/issues/74) — parallel primitive sharing OTLP transport
- [#66 ModelCache](https://github.com/modelplaneai/modelplane/issues/66), [#72 KVOffloadTier](https://github.com/modelplaneai/modelplane/issues/72)
- [#71 ModelService routing affinity](https://github.com/modelplaneai/modelplane/issues/71) — routing decisions land on spans as attributes
- [PR #64 design doc](https://github.com/modelplaneai/modelplane/pull/64) — `ModelService` shape

## References

- OpenTelemetry `gen_ai.*` semantic conventions (still stabilizing; aligning with the working group's current draft)
- Langfuse OTel ingestion: https://langfuse.com/docs/opentelemetry/get-started
- Langsmith OTel ingestion: https://docs.smith.langchain.com/observability/how_to_guides/trace_with_opentelemetry
- Arize Phoenix OTel: https://arize.com/docs/phoenix/tracing/llm-traces

