feat: add extensible observability via metrics, tracing, and logging#11
feat: add extensible observability via metrics, tracing, and logging#11
Conversation
appleboy
commented
Feb 15, 2026
- Add zero-overhead observability interfaces for metrics, tracing, and structured logging
- Implement functional options to configure custom metrics collectors, tracers, and loggers
- Integrate metrics, logging, and tracing into the entire HTTP retry request lifecycle
- Provide default no-op implementations to eliminate nil checks when observability is disabled
- Add Slog (structured logging) adapter and supporting tests
- Add example implementations for Prometheus, OpenTelemetry, and Slog in the examples directory
- Add comprehensive documentation for observability features and usage patterns
- Add thorough test coverage for metrics, tracing, and logging integration
- Improve and clarify README to highlight observability capabilities and add new example links
- Add zero-overhead observability interfaces for metrics, tracing, and structured logging - Implement functional options to configure custom metrics collectors, tracers, and loggers - Integrate metrics, logging, and tracing into the entire HTTP retry request lifecycle - Provide default no-op implementations to eliminate nil checks when observability is disabled - Add Slog (structured logging) adapter and supporting tests - Add example implementations for Prometheus, OpenTelemetry, and Slog in the examples directory - Add comprehensive documentation for observability features and usage patterns - Add thorough test coverage for metrics, tracing, and logging integration - Improve and clarify README to highlight observability capabilities and add new example links Signed-off-by: appleboy <appleboy.tw@gmail.com>
There was a problem hiding this comment.
Pull request overview
Adds an extensible observability layer to the retry HTTP client via dependency-free interfaces (metrics, tracing, structured logging), wiring these signals into the request/attempt/retry lifecycle with default no-op implementations and accompanying docs/examples/tests.
Changes:
- Introduce
MetricsCollector,Tracer/Span, andLoggerinterfaces with default no-op implementations. - Integrate metrics/tracing/logging into
Client.DoWithContextand expose configuration viaWithMetrics,WithTracer,WithLogger. - Add documentation, examples, and new tests covering observability behavior.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
retry.go |
Adds observability fields to Client, initializes no-op defaults, and emits metrics/spans/logs across the retry lifecycle. |
option.go |
Adds functional options to configure metrics/tracer/logger implementations. |
metric.go |
Defines metrics interface + no-op collector and a retry-reason classifier. |
tracer.go |
Defines tracing interfaces + no-op tracer/span. |
logger.go |
Defines logging interface + no-op logger. |
logger_slog.go |
Adds log/slog adapter to satisfy the new Logger interface. |
metric_test.go |
Adds tests for metrics integration and retry-reason classification. |
tracer_test.go |
Adds tests validating span creation/end/status and retry events. |
logger_test.go |
Adds tests validating log emission during retries. |
observability_test.go |
Adds end-to-end tests for combined observability, failure path, and default no-op behavior. |
docs/OBSERVABILITY.md |
Adds comprehensive observability guide and integration patterns. |
_example/observability/slog/main.go |
Runnable example showing slog logging integration. |
_example/observability/prometheus/main.go |
Runnable example showing Prometheus-style metrics integration pattern. |
_example/observability/opentelemetry/main.go |
Runnable example showing OpenTelemetry-style tracing integration pattern. |
_example/observability/combined/main.go |
Runnable example combining metrics + tracing + logging. |
README.md |
Updates feature list and links to observability docs/examples. |
Comments suppressed due to low confidence (1)
retry.go:299
- When
ctx.Done()triggers while waiting before a retry, the function returns aRetryErrorimmediately but does not emit any observability signals (noRecordRequestComplete, no request span status update, no error log). This leaves metrics/traces/logs incomplete for canceled/timeouts during backoff. Consider recording a final completion metric (success=false), setting the request span status to error withctx.Err(), and logging the cancellation before returning.
// Wait before retry
select {
case <-ctx.Done():
// Context cancelled - return RetryError with information about the attempts made
statusCode := 0
if resp != nil {
statusCode = resp.StatusCode
}
return nil, &RetryError{
Attempts: attempt,
LastErr: ctx.Err(),
LastStatus: statusCode,
Elapsed: time.Since(startTime),
}
case <-time.After(actualDelay):
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Success - record final metrics (no nil check needed) | ||
| c.metrics.RecordRequestComplete( | ||
| req.Method, | ||
| statusCode, | ||
| time.Since(startTime), | ||
| attempt+1, | ||
| true, | ||
| ) | ||
| c.logger.Debug("request completed", | ||
| "method", req.Method, | ||
| "attempts", attempt+1, | ||
| "duration", time.Since(startTime), | ||
| ) | ||
| requestSpan.SetStatus("ok", "") | ||
|
|
There was a problem hiding this comment.
In the !c.retryableChecker(lastErr, resp) early-return branch, the code records the request as success=true and sets the request span status to ok unconditionally. This branch also covers non-retryable failures (e.g., transport error deemed non-retryable, or non-retryable HTTP status), so observability will incorrectly report failures as successes. Consider deriving success from the actual outcome (e.g., lastErr == nil and/or status code), setting requestSpan status accordingly, and using an appropriate log level/message for non-retryable errors.
| // Success - record final metrics (no nil check needed) | |
| c.metrics.RecordRequestComplete( | |
| req.Method, | |
| statusCode, | |
| time.Since(startTime), | |
| attempt+1, | |
| true, | |
| ) | |
| c.logger.Debug("request completed", | |
| "method", req.Method, | |
| "attempts", attempt+1, | |
| "duration", time.Since(startTime), | |
| ) | |
| requestSpan.SetStatus("ok", "") | |
| // Determine whether this final outcome is a success | |
| success := lastErr == nil && statusCode >= 200 && statusCode < 300 | |
| // Record final metrics (no nil check needed) | |
| c.metrics.RecordRequestComplete( | |
| req.Method, | |
| statusCode, | |
| time.Since(startTime), | |
| attempt+1, | |
| success, | |
| ) | |
| if success { | |
| c.logger.Debug("request completed", | |
| "method", req.Method, | |
| "attempts", attempt+1, | |
| "duration", time.Since(startTime), | |
| ) | |
| requestSpan.SetStatus("ok", "") | |
| } else { | |
| // Non-retryable failure | |
| if lastErr != nil { | |
| c.logger.Debug("request completed with non-retryable error", | |
| "method", req.Method, | |
| "attempts", attempt+1, | |
| "duration", time.Since(startTime), | |
| "error", lastErr, | |
| ) | |
| requestSpan.SetStatus("error", lastErr.Error()) | |
| } else { | |
| c.logger.Debug("request completed with non-retryable HTTP status", | |
| "method", req.Method, | |
| "attempts", attempt+1, | |
| "duration", time.Since(startTime), | |
| "statusCode", statusCode, | |
| ) | |
| requestSpan.SetStatus("error", fmt.Sprintf("non-retryable HTTP status %d", statusCode)) | |
| } | |
| } |
| // Use invalid URL to trigger network error | ||
| req, _ := http.NewRequest(http.MethodGet, "http://invalid-domain-that-does-not-exist.com", nil) | ||
| _, err = client.Do(req) | ||
| if err == nil { | ||
| t.Fatal("Expected network error") | ||
| } |
There was a problem hiding this comment.
This test triggers a “network error” by requesting http://invalid-domain-that-does-not-exist.com, which depends on DNS/network availability and can be flaky in CI environments (offline runners, DNS interception, proxies). Prefer a deterministic local failure (e.g., connect to a closed local listener/unused localhost port) so the test reliably produces a connection error without external dependencies.
| // Start outer span for entire retry operation (no nil check needed) | ||
| ctx, requestSpan := c.tracer.StartSpan(ctx, "http.retry.request", | ||
| Attribute{Key: "http.method", Value: req.Method}, | ||
| Attribute{Key: "http.url", Value: req.URL.String()}, | ||
| Attribute{Key: "retry.max_attempts", Value: c.maxRetries + 1}, | ||
| ) | ||
| defer requestSpan.End() | ||
|
|
||
| // Log request start (no nil check needed) | ||
| c.logger.Debug("starting request", | ||
| "method", req.Method, | ||
| "url", req.URL.String(), | ||
| "max_retries", c.maxRetries, | ||
| ) |
There was a problem hiding this comment.
Even with the default no-op implementations, DoWithContext now always builds tracing attributes and logging fields (e.g., req.URL.String() is computed twice) and performs interface calls on every request. This introduces measurable overhead when observability is “disabled by default”, contradicting the zero/near-zero overhead goal. Consider adding a fast-path that skips all observability work when metrics/tracer/logger are the default no-ops (or introducing an Enabled()/level check), so expensive field construction is avoided unless a real implementation is configured.
- Add extensive tests to verify retry delay logging, callback accuracy, Retry-After header handling, and tracing delay correctness - Refactor retry delay calculation logic into separate helper methods - Improve separation of delay computation, delay modifiers, and retry callback invocation - Update tests to assert precise expected delay values for retries - Make retry delays predictable by disabling jitter in unit tests ref: #11 (comment) Signed-off-by: appleboy <appleboy.tw@gmail.com>
- Add extensive tests to verify retry delay logging, callback accuracy, Retry-After header handling, and tracing delay correctness - Refactor retry delay calculation logic into separate helper methods - Improve separation of delay computation, delay modifiers, and retry callback invocation - Update tests to assert precise expected delay values for retries - Make retry delays predictable by disabling jitter in unit tests ref: #11 (comment) Signed-off-by: appleboy <appleboy.tw@gmail.com>