## Problem
There is no standardized observability for Gitclaw agent executions. Users have no visibility into:
- What LLM calls are being made — which provider, model, token usage, cost, latency, finish reason
- What tools are being executed — which tool, duration, success/failure
- Overall session behavior — total cost, total tokens, number of LLM roundtrips vs tool calls
Without this, debugging agent behavior, optimizing cost, and monitoring production agents all require manual logging and guesswork.
## Proposed Solution
Add OpenTelemetry-based instrumentation using a hybrid 3-layer approach:
### Layer 1: HTTP-level interception (LLM calls)

- Use `@opentelemetry/instrumentation-undici` to auto-instrument outbound HTTP calls to LLM providers (OpenAI, Anthropic, Google, Groq, Mistral, xAI, AWS Bedrock)
- A custom `SpanProcessor` detects LLM provider URLs and enriches spans with `gen_ai.*` semantic conventions
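The provider detection inside the custom `SpanProcessor` could look roughly like the sketch below. The hostname table, the `gen_ai.system` values, and the function name are illustrative assumptions, not Gitclaw's actual code; only the detect-and-enrich idea comes from the proposal.

```typescript
// Map known LLM provider hostnames to gen_ai.system values. The table is
// a guess at the obvious API hosts; the real processor may match more.
const PROVIDER_HOSTS: Record<string, string> = {
  "api.openai.com": "openai",
  "api.anthropic.com": "anthropic",
  "generativelanguage.googleapis.com": "gcp.gemini",
  "api.groq.com": "groq",
  "api.mistral.ai": "mistral",
  "api.x.ai": "xai",
};

// Return the gen_ai.system value for an outbound request URL, or null if
// the request is not going to a known LLM provider. Bedrock uses
// per-region hostnames (bedrock-runtime.<region>.amazonaws.com), so it
// is matched by shape rather than by exact host.
function detectProvider(rawUrl: string): string | null {
  const { hostname } = new URL(rawUrl);
  if (hostname.startsWith("bedrock") && hostname.endsWith(".amazonaws.com")) {
    return "aws.bedrock";
  }
  return PROVIDER_HOSTS[hostname] ?? null;
}
```

A processor's `onStart` hook would call something like `detectProvider` with the span's URL attribute and, on a match, set `gen_ai.system` and related attributes before the span is exported.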
### Layer 2: Event-based enrichment (structured LLM data)

- On each `message_end` event from the agent loop, create a `gen_ai.chat` span with:
  - `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cost_usd`
  - `gen_ai.request.model`, `gen_ai.response.finish_reasons`, `gen_ai.system`
- This captures data that isn't available at the raw HTTP level (token counts, cost, stop reason)
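The enrichment step is essentially a flattening of the event payload into span attributes. The event shape below is an assumption for illustration; the real `message_end` payload from the agent loop may differ.

```typescript
// Hypothetical shape of a message_end event; field names are assumptions.
interface MessageEndEvent {
  model: string;
  provider: string;                                      // e.g. "anthropic"
  usage: { input: number; output: number; costUsd: number };
  stopReason: string;                                    // e.g. "tool_use" | "stop"
}

// Flatten the event into the gen_ai.* attributes listed in the proposal,
// ready to be set on a gen_ai.chat span.
function toGenAiAttributes(ev: MessageEndEvent): Record<string, string | number> {
  return {
    "gen_ai.system": ev.provider,
    "gen_ai.request.model": ev.model,
    "gen_ai.usage.input_tokens": ev.usage.input,
    "gen_ai.usage.output_tokens": ev.usage.output,
    "gen_ai.usage.cost_usd": ev.usage.costUsd,
    "gen_ai.response.finish_reasons": ev.stopReason,
  };
}
```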
### Layer 3: Tool call wrapping (application level)

- Every tool execution (built-in, declarative, plugin, SDK-injected) is wrapped in a `gitclaw.tool.execute` span
- Captures: `tool.name`, `tool.call_id`, `tool.duration_ms`, `tool.status`, `tool.error_message`
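A minimal sketch of that wrapper, recording the `tool.*` attributes from the proposal. Real code would start and end an OTel span; here a plain `report` callback stands in so the sketch stays self-contained, and the function name is hypothetical.

```typescript
type ToolAttributes = {
  "tool.name": string;
  "tool.call_id": string;
  "tool.duration_ms": number;
  "tool.status": "success" | "error";
  "tool.error_message"?: string;
};

// Wrap a tool execution: time it, report success/failure attributes, and
// re-throw on error so the agent loop still sees the failure.
async function withToolSpan<T>(
  name: string,
  callId: string,
  run: () => Promise<T>,
  report: (attrs: ToolAttributes) => void,
): Promise<T> {
  const start = performance.now();
  try {
    const result = await run();
    report({
      "tool.name": name,
      "tool.call_id": callId,
      "tool.duration_ms": performance.now() - start,
      "tool.status": "success",
    });
    return result;
  } catch (err) {
    report({
      "tool.name": name,
      "tool.call_id": callId,
      "tool.duration_ms": performance.now() - start,
      "tool.status": "error",
      "tool.error_message": String(err),
    });
    throw err;
  }
}
```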
### Trace shape

```
gitclaw.session (root)
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.usage.input_tokens=1523, output_tokens=200, cost_usd=0.003
│     gen_ai.response.finish_reasons=tool_use
├── gitclaw.tool.execute (cli)
│     tool.name=cli, duration_ms=2340, status=success
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.usage.input_tokens=2100, output_tokens=150
├── gitclaw.tool.execute (write)
│     tool.name=write, duration_ms=12, status=success
├── gen_ai.chat (LLM call → Anthropic)
│     gen_ai.response.finish_reasons=stop
└── session totals: tokens=7073, cost=$0.012, tool_calls=2, llm_calls=3
```
## Key design decisions
- Zero overhead when disabled — `@opentelemetry/api` returns no-op instances by default; no performance impact unless `initTelemetry()` is called
- Opt-in SDK packages — only `@opentelemetry/api` (~50 KB) is a hard dependency; all SDK/exporter packages are optional peer dependencies
- Backend agnostic — exports via OTLP/HTTP, compatible with Jaeger, Grafana Tempo, Datadog, Honeycomb, Axiom, or any OTel Collector
- Plugin authors get access — `tracer` and `meter` are exposed on `GitclawPluginApi` so plugins can emit custom spans/metrics
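For illustration, a plugin emitting a custom span through the exposed tracer might look like this. The minimal interfaces below are stand-ins for the real `@opentelemetry/api` types and the actual `GitclawPluginApi` surface, so the sketch is self-contained; treat every name here as an assumption.

```typescript
// Stand-in shapes for the tracer exposed on the plugin API; the real
// types come from @opentelemetry/api and the Gitclaw SDK.
interface SpanLike {
  setAttribute(key: string, value: string | number): void;
  end(): void;
}
interface TracerLike {
  startSpan(name: string): SpanLike;
}
interface GitclawPluginApi {
  tracer: TracerLike;
}

// A plugin hook emitting a custom span; the span and attribute names
// are invented for this example.
function onMyPluginWork(api: GitclawPluginApi): void {
  const span = api.tracer.startSpan("myplugin.cache.refresh");
  span.setAttribute("myplugin.entries", 128);
  span.end();
}
```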
## Usage

```ts
import { initTelemetry, query } from "gitclaw";

await initTelemetry({
  serviceName: "my-agent",
  exporterEndpoint: "http://localhost:4318",
});

for await (const msg of query({ prompt: "Fix the bug" })) {
  // traces + metrics exported automatically
}
```
## Metrics emitted

| Metric | Type | Description |
| --- | --- | --- |
| `gen_ai.client.token.usage` | Counter | Token consumption by model and type |
| `gen_ai.client.operation.duration` | Histogram | LLM call latency |
| `gitclaw.session.duration_ms` | Histogram | End-to-end session duration |
| `gitclaw.session.cost_usd` | Counter | Session cost by agent and model |
| `gitclaw.tool.calls` | Counter | Tool invocations by name and status |
| `gitclaw.tool.duration_ms` | Histogram | Tool execution latency |
## Alternatives Considered

- Framework-level tracing (instrument every internal operation) — Rejected. Tracing manifest parsing, plugin loading, skill discovery, etc. would carry a high maintenance burden and produce data most users don't need. Internal debugging can use standard logging.
- Custom tracing abstraction — Rejected. OpenTelemetry is the industry standard, vendor-neutral, and already supported by every major observability platform. Building a custom solution would fragment the ecosystem.
- SDK-level wrapping (wrap pi-ai client calls) — Partially adopted. HTTP-level interception is cleaner and survives SDK swaps, but `message_end` event enrichment is still needed for structured data (tokens, cost) that isn't in raw HTTP responses.
## Additional Context

- Follows the OpenTelemetry GenAI Semantic Conventions
- The underlying LLM library (`@mariozechner/pi-ai`) uses Undici as its HTTP client, which is why `@opentelemetry/instrumentation-undici` is used instead of `instrumentation-http`
- Compatible with quick local testing via Jaeger all-in-one: `docker run -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one`