@Ayush10 Ayush10 commented Jan 30, 2026

Summary

Closes #2098

This PR introduces a pluggable telemetry framework for Qlib that provides structured metrics collection and workflow tracing. It ships as a foundational module with proof-of-concept instrumentation, designed to be extended incrementally across the codebase.

Architecture

┌─────────────────────────────────────────────────┐
│  Application Code (data pipeline, models, etc.) │
│                                                 │
│  metrics.counter("cache.hits")                  │
│  with tracer.span("data_loading"):              │
│      ...                                        │
└──────────────┬──────────────────────────────────┘
               │
       ┌───────▼───────┐
       │  QlibMetrics   │  counter / gauge / histogram
       │  QlibTracer    │  span context manager + @traced
       └───────┬───────┘
               │  fan-out to all registered backends
       ┌───────┼───────────────┐
       ▼       ▼               ▼
  Logging   InMemory      Custom Backend
  Backend   Backend       (Prometheus, OTel, etc.)
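The fan-out layer in the diagram can be sketched in a few lines. This is a minimal illustration rather than the code in qlib/utils/telemetry.py: the `record` method name, the `MetricEvent` fields, and the `Metrics`, `ListBackend`, and `BrokenBackend` classes are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MetricEvent:
    # Hypothetical field names; the PR's dataclass may differ.
    name: str
    value: float
    kind: str  # "counter", "gauge", or "histogram"
    tags: Dict[str, str] = field(default_factory=dict)


class MetricsBackend(ABC):
    """Assumed backend interface: one method that receives each event."""

    @abstractmethod
    def record(self, event: MetricEvent) -> None:
        ...


class ListBackend(MetricsBackend):
    """Keeps events in a list so they can be inspected."""

    def __init__(self) -> None:
        self.events: List[MetricEvent] = []

    def record(self, event: MetricEvent) -> None:
        self.events.append(event)


class BrokenBackend(MetricsBackend):
    """Always fails, to show that one bad backend does not affect the others."""

    def record(self, event: MetricEvent) -> None:
        raise RuntimeError("export failed")


class Metrics:
    def __init__(self) -> None:
        self._backends: List[MetricsBackend] = []

    def register(self, backend: MetricsBackend) -> None:
        self._backends.append(backend)

    def _emit(self, name: str, value: float, kind: str,
              tags: Optional[Dict[str, str]]) -> None:
        if not self._backends:
            return  # no backend registered: return before building an event
        event = MetricEvent(name, value, kind, dict(tags or {}))
        for backend in self._backends:
            try:
                backend.record(event)
            except Exception:
                pass  # error isolation: a failing backend never reaches the caller

    def counter(self, name: str, value: float = 1.0,
                tags: Optional[Dict[str, str]] = None) -> None:
        self._emit(name, value, "counter", tags)

    def gauge(self, name: str, value: float,
              tags: Optional[Dict[str, str]] = None) -> None:
        self._emit(name, value, "gauge", tags)


metrics = Metrics()
metrics.counter("cache.hits")  # dropped: nothing is registered yet

good = ListBackend()
metrics.register(BrokenBackend())
metrics.register(good)
metrics.counter("cache.hits", tags={"cache": "expression"})
metrics.gauge("memory.rss_mb", 1024.5)
```

After the last two calls, `good.events` holds both events even though `BrokenBackend` raised on each of them.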

Core Components (qlib/utils/telemetry.py)

| Component | Purpose |
| --- | --- |
| MetricEvent / SpanEvent | Typed dataclasses for metric measurements and trace spans |
| MetricsBackend (ABC) | Interface for pluggable export backends |
| QlibMetrics | Singleton metrics collector (counter, gauge, histogram) |
| QlibTracer | Context-manager-based tracer with parent-child span tracking |
| LoggingBackend | Integrates with existing get_module_logger infrastructure |
| InMemoryBackend | For testing, assertions, and programmatic access with summary() |
Design Principles

  • Negligible overhead by default: when no backend is registered, all operations are no-ops
  • Non-invasive: uses decorators and context managers; no changes to function signatures
  • Backward compatible: works alongside the existing TimeInspector and get_module_logger
  • Error-isolated: backend failures never crash the application
  • Thread-safe: lock-protected backends and thread-local span stacks
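The non-invasive and thread-safe principles can be illustrated with a small tracer built on `threading.local` and `contextlib.contextmanager`. This is a sketch under assumptions, not the QlibTracer implementation: the `Tracer` class, the `SpanEvent` fields, and the `finished` list are names chosen for the example.

```python
import functools
import threading
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanEvent:
    # Hypothetical fields; the PR's SpanEvent may differ.
    name: str
    span_id: str
    parent_id: Optional[str]
    tags: Dict[str, str] = field(default_factory=dict)
    duration_s: float = 0.0
    error: Optional[str] = None


class Tracer:
    def __init__(self) -> None:
        self._local = threading.local()  # one span stack per thread
        self._lock = threading.Lock()    # protects the shared result list
        self.finished: List[SpanEvent] = []

    def _stack(self) -> List[SpanEvent]:
        if not hasattr(self._local, "stack"):
            self._local.stack = []
        return self._local.stack

    @contextmanager
    def span(self, name: str, tags: Optional[Dict[str, str]] = None):
        stack = self._stack()
        # The innermost open span on this thread becomes the parent.
        parent_id = stack[-1].span_id if stack else None
        event = SpanEvent(name, uuid.uuid4().hex, parent_id, dict(tags or {}))
        stack.append(event)
        start = time.perf_counter()
        try:
            yield event
        except Exception as exc:
            event.error = repr(exc)  # record the error, then let it propagate
            raise
        finally:
            event.duration_s = time.perf_counter() - start
            stack.pop()
            with self._lock:
                self.finished.append(event)

    def traced(self, name: str):
        # Decorator form: wraps the call in a span, signature unchanged.
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                with self.span(name):
                    return func(*args, **kwargs)
            return wrapper
        return decorator


tracer = Tracer()
with tracer.span("setup_data", tags={"freq": "day"}) as outer:
    with tracer.span("processor") as inner:
        pass


@tracer.traced("model_training")
def train_model(epochs):
    return epochs * 2


result = train_model(3)
```

Because the stack lives in thread-local storage, a span opened on one thread never becomes the parent of a span opened on another; only the list of finished spans is shared, and that is guarded by a lock.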

Proof-of-Concept Instrumentation

Three high-value instrumentation points demonstrate the pattern:

  1. DataHandlerLP.setup_data() — Span tracing + row/column gauge metrics
  2. DataHandlerLP._run_proc_l() — Per-processor span tracing with rows in/out
  3. MemCacheUnit.__getitem__() — Cache hit counter
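As an illustration of the third point, the sketch below counts hits inside a cache's `__getitem__`. `TinyCache` is a hypothetical stand-in and not Qlib's MemCacheUnit, and the `count` helper with its `Counter` only mimics what a call such as `metrics.counter("cache.hits")` would record.

```python
from collections import Counter, OrderedDict

# Stand-in for the metrics collector, so the example runs on its own.
recorded = Counter()


def count(name, value=1):
    recorded[name] += value


class TinyCache:
    """Hypothetical size-limited cache with least-recently-used eviction."""

    def __init__(self, limit=2):
        self._data = OrderedDict()
        self._limit = limit

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._limit:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __getitem__(self, key):
        value = self._data[key]  # raises KeyError on a miss, so no hit is counted
        count("cache.hits")      # the only line added for instrumentation
        self._data.move_to_end(key)
        return value


cache = TinyCache(limit=2)
cache["a"] = 1
cache["b"] = 2
first = cache["a"]
second = cache["a"]

try:
    cache["missing"]
    missed = False
except KeyError:
    missed = True
```

The instrumentation is a single line placed after the lookup succeeds, so a miss leaves the counter untouched and the method's signature and return value are unchanged.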

Usage

from qlib.utils.telemetry import metrics, tracer, enable_inmemory_backend

# Enable a backend (opt-in)
backend = enable_inmemory_backend()

# Record metrics
metrics.counter("cache.hits", 1, tags={"cache": "expression"})
metrics.gauge("memory.rss_mb", 1024.5)

# Trace a workflow
with tracer.span("data_loading", tags={"freq": "day"}):
    data = load_data()

# Use as a decorator
@tracer.traced("model_training")
def train_model():
    ...

# Inspect collected data
print(backend.summary())

Suggested Follow-up Work

This PR is intentionally scoped as a foundation. Subsequent PRs could:

  • Instrument model training (fit()/predict()) and backtesting workflows
  • Add a FileBackend for JSON/CSV metric export
  • Add an OpenTelemetryBackend for production observability
  • Instrument ExpressionCache and DatasetCache for cache hit ratios
  • Add CLI flag or config option to auto-enable logging backend
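As one possible shape for the file-export follow-up, the sketch below appends each metric as a JSON line. Everything here is hypothetical: the class name `JsonLinesBackend`, the `record` signature, and the JSON Lines layout are assumptions, since this PR does not define a file backend.

```python
import json
import os
import tempfile
import threading


class JsonLinesBackend:
    """Hypothetical backend that appends one JSON object per metric."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()

    def record(self, name, value, kind, tags=None):
        line = json.dumps(
            {"name": name, "value": value, "kind": kind, "tags": tags or {}}
        )
        # Hold the lock while appending so concurrent writers do not interleave.
        with self._lock, open(self._path, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")


path = os.path.join(tempfile.mkdtemp(), "metrics.jsonl")
backend = JsonLinesBackend(path)
backend.record("cache.hits", 1, "counter", {"cache": "expression"})
backend.record("memory.rss_mb", 1024.5, "gauge")

# Read the file back to check the round trip.
with open(path, encoding="utf-8") as fh:
    rows = [json.loads(line) for line in fh]
```

Opening the file on every call keeps the sketch short; a real backend would more likely hold the handle open or buffer writes.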

Test Plan

  • 29 new unit tests in tests/test_telemetry.py covering:
    • MetricEvent and SpanEvent defaults
    • QlibMetrics: no-op without backend, counter/gauge/histogram, multiple backends, error isolation
    • QlibTracer: span duration, tags, error recording, nested parent-child spans, histogram emission, @traced decorator, thread safety (10 concurrent threads)
    • InMemoryBackend: filtered queries, clear, summary statistics
    • LoggingBackend: non-raising behavior
    • Module-level singletons and convenience functions
  • All tests pass: python -m pytest tests/test_telemetry.py -v
  • Existing tests unaffected (instrumentation is no-op without backends)

Introduce a pluggable telemetry framework (qlib/utils/telemetry.py) that
provides metrics collection and workflow tracing with zero overhead when
no backend is registered.

Core components:
- QlibMetrics: counter/gauge/histogram with pluggable backends
- QlibTracer: context-manager spans with parent-child tracking
- LoggingBackend: integrates with existing get_module_logger
- InMemoryBackend: for testing and programmatic access

Proof-of-concept instrumentation:
- DataHandlerLP.setup_data: span tracing + row/column gauges
- DataHandlerLP._run_proc_l: per-processor span tracing
- MemCacheUnit: cache hit counter

Includes 29 unit tests covering metrics, tracing, thread safety,
nested spans, error recording, and backend isolation.
@Ayush10 Ayush10 force-pushed the feat/issue-2098-observability branch from 9fcbbd6 to b0f0e1e Compare January 30, 2026 13:22
Development

Successfully merging this pull request may close these issues.

enhance observability with structured logging and metrics
