Skip to content

feat: add extensible observability via metrics, tracing, and logging#11

Merged
appleboy merged 1 commit intomainfrom
observability
Feb 15, 2026
Merged

feat: add extensible observability via metrics, tracing, and logging#11
appleboy merged 1 commit intomainfrom
observability

Conversation

@appleboy
Copy link
Owner

  • Add zero-overhead observability interfaces for metrics, tracing, and structured logging
  • Implement functional options to configure custom metrics collectors, tracers, and loggers
  • Integrate metrics, logging, and tracing into the entire HTTP retry request lifecycle
  • Provide default no-op implementations to eliminate nil checks when observability is disabled
  • Add Slog (structured logging) adapter and supporting tests
  • Add example implementations for Prometheus, OpenTelemetry, and Slog in the examples directory
  • Add comprehensive documentation for observability features and usage patterns
  • Add thorough test coverage for metrics, tracing, and logging integration
  • Improve and clarify README to highlight observability capabilities and add new example links

- Add zero-overhead observability interfaces for metrics, tracing, and structured logging
- Implement functional options to configure custom metrics collectors, tracers, and loggers
- Integrate metrics, logging, and tracing into the entire HTTP retry request lifecycle
- Provide default no-op implementations to eliminate nil checks when observability is disabled
- Add Slog (structured logging) adapter and supporting tests
- Add example implementations for Prometheus, OpenTelemetry, and Slog in the examples directory
- Add comprehensive documentation for observability features and usage patterns
- Add thorough test coverage for metrics, tracing, and logging integration
- Improve and clarify README to highlight observability capabilities and add new example links

Signed-off-by: appleboy <appleboy.tw@gmail.com>
Copilot AI review requested due to automatic review settings February 15, 2026 03:04
@appleboy appleboy merged commit 0c24722 into main Feb 15, 2026
20 of 24 checks passed
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an extensible observability layer to the retry HTTP client via dependency-free interfaces (metrics, tracing, structured logging), wiring these signals into the request/attempt/retry lifecycle with default no-op implementations and accompanying docs/examples/tests.

Changes:

  • Introduce MetricsCollector, Tracer/Span, and Logger interfaces with default no-op implementations.
  • Integrate metrics/tracing/logging into Client.DoWithContext and expose configuration via WithMetrics, WithTracer, WithLogger.
  • Add documentation, examples, and new tests covering observability behavior.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
retry.go Adds observability fields to Client, initializes no-op defaults, and emits metrics/spans/logs across the retry lifecycle.
option.go Adds functional options to configure metrics/tracer/logger implementations.
metric.go Defines metrics interface + no-op collector and a retry-reason classifier.
tracer.go Defines tracing interfaces + no-op tracer/span.
logger.go Defines logging interface + no-op logger.
logger_slog.go Adds log/slog adapter to satisfy the new Logger interface.
metric_test.go Adds tests for metrics integration and retry-reason classification.
tracer_test.go Adds tests validating span creation/end/status and retry events.
logger_test.go Adds tests validating log emission during retries.
observability_test.go Adds end-to-end tests for combined observability, failure path, and default no-op behavior.
docs/OBSERVABILITY.md Adds comprehensive observability guide and integration patterns.
_example/observability/slog/main.go Runnable example showing slog logging integration.
_example/observability/prometheus/main.go Runnable example showing Prometheus-style metrics integration pattern.
_example/observability/opentelemetry/main.go Runnable example showing OpenTelemetry-style tracing integration pattern.
_example/observability/combined/main.go Runnable example combining metrics + tracing + logging.
README.md Updates feature list and links to observability docs/examples.
Comments suppressed due to low confidence (1)

retry.go:299

  • When ctx.Done() triggers while waiting before a retry, the function returns a RetryError immediately but does not emit any observability signals (no RecordRequestComplete, no request span status update, no error log). This leaves metrics/traces/logs incomplete for canceled/timeouts during backoff. Consider recording a final completion metric (success=false), setting the request span status to error with ctx.Err(), and logging the cancellation before returning.
			// Wait before retry
			select {
			case <-ctx.Done():
				// Context cancelled - return RetryError with information about the attempts made
				statusCode := 0
				if resp != nil {
					statusCode = resp.StatusCode
				}
				return nil, &RetryError{
					Attempts:   attempt,
					LastErr:    ctx.Err(),
					LastStatus: statusCode,
					Elapsed:    time.Since(startTime),
				}
			case <-time.After(actualDelay):

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +350 to +364
// Success - record final metrics (no nil check needed)
c.metrics.RecordRequestComplete(
req.Method,
statusCode,
time.Since(startTime),
attempt+1,
true,
)
c.logger.Debug("request completed",
"method", req.Method,
"attempts", attempt+1,
"duration", time.Since(startTime),
)
requestSpan.SetStatus("ok", "")

Copy link

Copilot AI Feb 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the !c.retryableChecker(lastErr, resp) early-return branch, the code records the request as success=true and sets the request span status to ok unconditionally. This branch also covers non-retryable failures (e.g., transport error deemed non-retryable, or non-retryable HTTP status), so observability will incorrectly report failures as successes. Consider deriving success from the actual outcome (e.g., lastErr == nil and/or status code), setting requestSpan status accordingly, and using an appropriate log level/message for non-retryable errors.

Suggested change
// Success - record final metrics (no nil check needed)
c.metrics.RecordRequestComplete(
req.Method,
statusCode,
time.Since(startTime),
attempt+1,
true,
)
c.logger.Debug("request completed",
"method", req.Method,
"attempts", attempt+1,
"duration", time.Since(startTime),
)
requestSpan.SetStatus("ok", "")
// Determine whether this final outcome is a success
success := lastErr == nil && statusCode >= 200 && statusCode < 300
// Record final metrics (no nil check needed)
c.metrics.RecordRequestComplete(
req.Method,
statusCode,
time.Since(startTime),
attempt+1,
success,
)
if success {
c.logger.Debug("request completed",
"method", req.Method,
"attempts", attempt+1,
"duration", time.Since(startTime),
)
requestSpan.SetStatus("ok", "")
} else {
// Non-retryable failure
if lastErr != nil {
c.logger.Debug("request completed with non-retryable error",
"method", req.Method,
"attempts", attempt+1,
"duration", time.Since(startTime),
"error", lastErr,
)
requestSpan.SetStatus("error", lastErr.Error())
} else {
c.logger.Debug("request completed with non-retryable HTTP status",
"method", req.Method,
"attempts", attempt+1,
"duration", time.Since(startTime),
"statusCode", statusCode,
)
requestSpan.SetStatus("error", fmt.Sprintf("non-retryable HTTP status %d", statusCode))
}
}

Copilot uses AI. Check for mistakes.
Comment on lines +156 to +161
// Use invalid URL to trigger network error
req, _ := http.NewRequest(http.MethodGet, "http://invalid-domain-that-does-not-exist.com", nil)
_, err = client.Do(req)
if err == nil {
t.Fatal("Expected network error")
}
Copy link

Copilot AI Feb 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test triggers a “network error” by requesting http://invalid-domain-that-does-not-exist.com, which depends on DNS/network availability and can be flaky in CI environments (offline runners, DNS interception, proxies). Prefer a deterministic local failure (e.g., connect to a closed local listener/unused localhost port) so the test reliably produces a connection error without external dependencies.

Copilot uses AI. Check for mistakes.
Comment on lines +222 to +235
// Start outer span for entire retry operation (no nil check needed)
ctx, requestSpan := c.tracer.StartSpan(ctx, "http.retry.request",
Attribute{Key: "http.method", Value: req.Method},
Attribute{Key: "http.url", Value: req.URL.String()},
Attribute{Key: "retry.max_attempts", Value: c.maxRetries + 1},
)
defer requestSpan.End()

// Log request start (no nil check needed)
c.logger.Debug("starting request",
"method", req.Method,
"url", req.URL.String(),
"max_retries", c.maxRetries,
)
Copy link

Copilot AI Feb 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with the default no-op implementations, DoWithContext now always builds tracing attributes and logging fields (e.g., req.URL.String() is computed twice) and performs interface calls on every request. This introduces measurable overhead when observability is “disabled by default”, contradicting the zero/near-zero overhead goal. Consider adding a fast-path that skips all observability work when metrics/tracer/logger are the default no-ops (or introducing an Enabled()/level check), so expensive field construction is avoided unless a real implementation is configured.

Copilot uses AI. Check for mistakes.
appleboy added a commit that referenced this pull request Feb 15, 2026
- Add extensive tests to verify retry delay logging, callback accuracy, Retry-After header handling, and tracing delay correctness
- Refactor retry delay calculation logic into separate helper methods
- Improve separation of delay computation, delay modifiers, and retry callback invocation
- Update tests to assert precise expected delay values for retries
- Make retry delays predictable by disabling jitter in unit tests

ref: #11 (comment)
Signed-off-by: appleboy <appleboy.tw@gmail.com>
appleboy added a commit that referenced this pull request Feb 15, 2026
- Add extensive tests to verify retry delay logging, callback accuracy, Retry-After header handling, and tracing delay correctness
- Refactor retry delay calculation logic into separate helper methods
- Improve separation of delay computation, delay modifiers, and retry callback invocation
- Update tests to assert precise expected delay values for retries
- Make retry delays predictable by disabling jitter in unit tests

ref: #11 (comment)

Signed-off-by: appleboy <appleboy.tw@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants