@zephyrzilla zephyrzilla commented Nov 6, 2025

Summary

This PR introduces a comprehensive metadata tracking system for SyGra that automatically captures execution metrics, token usage, costs, and performance data across all LLM calls and workflow executions. The system provides detailed latency statistics (including percentiles), per-node cost tracking, and multi-level metrics aggregation, requiring zero changes to existing code.

Features implemented:

1. Centralized Metadata Collection System

  • Thread-safe singleton MetadataCollector for tracking all execution metrics
  • Captures execution context (task name, environment, git info, timing)
  • Records aggregate, per-model, and per-node statistics
  • Automatic dataset metadata tracking (source, size, version, hash)
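
A minimal access sketch, assuming only the names given in this PR (get_metadata_collector, get_metadata_summary); the dictionary keys mirror the metadata file structure shown further down:

from sygra.metadata.metadata_collector import get_metadata_collector

collector = get_metadata_collector()        # thread-safe singleton
summary = collector.get_metadata_summary()  # aggregate + per-model + per-node dict

print(summary["execution"]["task_name"])
print(summary["aggregate_statistics"]["tokens"]["total_tokens"])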

2. Latency Statistics

  • Percentile tracking: min, max, mean, median, std_dev, p50, p95, p99
  • Available at both model-level and node-level
  • Helps identify performance outliers and tail latencies
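
For illustration, these fields can be derived with the standard-library statistics module (as the commit message notes); the helper below is a sketch, not the actual SyGra implementation, and the name latency_stats is hypothetical:

import statistics

def latency_stats(latencies):
    # Sketch only: map a list of per-request latencies (seconds) onto the
    # reported fields; assumes at least two samples.
    ordered = sorted(latencies)
    cuts = statistics.quantiles(ordered, n=100)  # 99 percentile cut points
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "std_dev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }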

3. Per-Node Cost Tracking

  • Track costs at the node level, not just model level
  • Automatic cost calculation using model's calculate_cost() method
  • Shows total_cost_usd and average_cost_per_execution per node
  • Helps identify which parts of your workflow are most expensive
  • Only shown when costs are available (graceful degradation)
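
Node-level figures sit under the nodes key of the summary; a short sketch, using the structure shown in the metadata example below:

from sygra.metadata.metadata_collector import get_metadata_collector

node = get_metadata_collector().get_metadata_summary()["nodes"]["summarizer"]
print(node["cost"]["total_cost_usd"])              # e.g. 0.00031
print(node["cost"]["average_cost_per_execution"])  # e.g. 0.000031
print(node["latency_statistics"]["p95"])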

4. Cost Tracking with LangChain Community Integration (langchain-community)

  • Integrated official pricing for multiple models (OpenAI, Azure OpenAI, Anthropic Claude)
  • Automatic cost calculation based on token usage
  • Per-request, per-execution, per-record, and aggregate cost reporting
  • Zero-cost reporting for unsupported models (no stale estimates)
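
For reference, the pricing data comes from langchain-community's OpenAI cost helpers; the sketch below only shows that lookup (assuming a langchain-community version whose pricing table includes gpt-4o-mini), not how SyGra wires it in internally:

from langchain_community.callbacks.openai_info import get_openai_token_cost_for_model

prompt_cost = get_openai_token_cost_for_model("gpt-4o-mini", 220, is_completion=False)
completion_cost = get_openai_token_cost_for_model("gpt-4o-mini", 460, is_completion=True)
print(round(prompt_cost + completion_cost, 6))  # total USD for one request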

5. Automatic Tracking Infrastructure

  • @track_model_request decorator for custom model wrappers
  • MetadataTrackingCallback for LangChain agent LLM calls
  • Captures tokens, latency, response codes, and costs automatically
  • Works with both sync and async execution
  • Integrated into BaseNode for consistent tracking across all node types
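
A hedged sketch of decorating a custom wrapper: the decorator name comes from this PR, but the import path, whether it takes arguments, and the wrapper/method names here are all assumptions:

from sygra.metadata.metadata_collector import track_model_request  # import path assumed

class MyCustomModel:
    @track_model_request  # per this PR, captures tokens, latency, response code, and cost
    async def generate(self, messages):
        # call the backend here; tracking is expected to wrap this call
        ...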

6. Comprehensive Metrics Tracking

  • Token Statistics: Prompt, completion, total tokens with averages
  • Performance Metrics: Latency (total, average, percentiles), throughput (tokens/sec), retry/failure rates
  • Cost Analytics: Total costs, per-request costs, per-node costs, per-record costs
  • Response Tracking: HTTP status code distribution
  • Model Configuration: Captured for reproducibility

7. Timestamp Synchronization

  • Output and metadata files share identical timestamps
  • Format: output_2025-10-30_18-19-07.json -> metadata_..._2025-10-30_18-19-07.json
  • Easy correlation between outputs and metadata
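
A small correlation sketch, assuming the directory layout used in the tests below (a metadata/ folder next to the output files):

from pathlib import Path

out_dir = Path("/tmp/test")
for output_file in sorted(out_dir.glob("output_*.json")):
    ts = output_file.stem.removeprefix("output_")  # e.g. 2025-10-30_18-19-07
    matches = list((out_dir / "metadata").glob(f"metadata_*_{ts}.json"))
    print(output_file.name, "->", matches[0].name if matches else "no metadata file")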

8. Toggle Support

  • Enable/disable via --disable_metadata CLI flag
  • Programmatic control via collector.set_enabled(False)
  • Minimal overhead when disabled
  • Preserves enabled state across resets
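
Programmatic toggling, assuming only the names given in this PR:

from sygra.metadata.metadata_collector import get_metadata_collector

collector = get_metadata_collector()
collector.set_enabled(False)  # stop recording metrics
# ... run workflows without metadata overhead ...
collector.set_enabled(True)   # enabled state is preserved across resets per this PR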

9. Supported Models

  • OpenAI (GPT-4, GPT-4o, GPT-4o-mini)
  • Azure OpenAI (same models, different endpoints)
  • Anthropic Claude (via AWS Bedrock)
  • vLLM (OpenAI-compatible endpoints, token tracking only)
  • TGI (Text Generation Inference with details extraction)

How to Test the Feature

Test 1: Library Usage with Latency Statistics

from sygra import Workflow, DataSource
from sygra.metadata.metadata_collector import get_metadata_collector

graph = Workflow("test_metadata")
graph.source(DataSource.memory([{"text": "Hello"}, {"text": "World"}]))
graph.add_llm_node("summarizer", "gpt-4o-mini") \
    .system_message("Summarize") \
    .user_message("{text}") \
    .output_keys("summary")
graph.add_edge("START", "summarizer")
graph.add_edge("summarizer", "END")

# Run with timestamp
results = graph.run(num_records=2, output_with_ts=True, output_dir="/tmp/test")

# Access metadata
collector = get_metadata_collector()
metadata = collector.get_metadata_summary()

# Check latency statistics
print(f"P95 latency: {metadata['models']['gpt-4o-mini']['performance']['latency_statistics']['p95']}s")
print(f"Std dev: {metadata['models']['gpt-4o-mini']['performance']['latency_statistics']['std_dev']}s")

Expected Result:

  • Output file: test/output_YYYY-MM-DD_HH-MM-SS.json
  • Metadata file: test/metadata/metadata_test_metadata_YYYY-MM-DD_HH-MM-SS.json
  • Timestamps match exactly
  • Metadata contains latency statistics with min, max, mean, median, std_dev, p50, p95, p99
  • Node costs shown if model has cost calculation

Test 2: CLI Usage

poetry run python main.py --task examples.glaive_code_assistant --num_records=10

Expected Result:

  • Metadata file generated in tasks/examples/glaive_code_assistant/metadata/
  • Contains aggregate statistics, per-model metrics, per-node metrics
  • Latency percentiles for each model and node
  • Cost tracking at both model and node levels

Verification (applies to both tests):

  • All latency statistics present (min, max, mean, median, std_dev, p50, p95, p99)
  • Node costs shown when available
  • Cost calculations accurate
  • Token counts match LLM responses

Metadata File Structure

{
  "metadata_version": "1.0.0",
  "generated_at": "2025-11-05T21:57:10.123456",
  
  "execution": {
    "task_name": "tasks.examples.glaive_code_assistant",
    "timing": {
      "start_time": "2025-11-05T21:57:07.899389",
      "end_time": "2025-11-05T21:57:10.657968",
      "duration_seconds": 2.759
    },
    "environment": {
      "python_version": "3.11.12",
      "sygra_version": "1.0.0"
    },
    "git": {
      "commit_hash": "139a535...",
      "branch": "scratch/metadata",
      "is_dirty": false
    }
  },
  
  "aggregate_statistics": {
    "tokens": {
      "total_prompt_tokens": 440,
      "total_completion_tokens": 920,
      "total_tokens": 1360
    },
    "cost": {
      "total_cost_usd": 0.00062,
      "average_cost_per_record": 0.000062
    },
    "requests": {
      "total_requests": 20,
      "total_failures": 0,
      "failure_rate": 0.0
    }
  },
  
  "models": {
    "gpt-4o-mini": {
      "model_type": "OpenAI",
      "performance": {
        "average_latency_seconds": 3.203,
        "tokens_per_second": 21.23,
        "latency_statistics": {
          "min": 2.105,
          "max": 4.821,
          "mean": 3.203,
          "median": 3.150,
          "std_dev": 0.652,
          "p50": 3.150,
          "p95": 4.512,
          "p99": 4.759
        }
      },
      "cost": {
        "total_cost_usd": 0.00062,
        "average_cost_per_request": 0.000031
      }
    }
  },
  
  "nodes": {
    "summarizer": {
      "node_name": "summarizer",
      "node_type": "llm",
      "model_name": "gpt-4o-mini",
      "total_executions": 10,
      "latency_statistics": {
        "min": 2.105,
        "max": 4.821,
        "mean": 3.203,
        "median": 3.150,
        "std_dev": 0.652,
        "p50": 3.150,
        "p95": 4.512,
        "p99": 4.759
      },
      "cost": {
        "total_cost_usd": 0.00031,
        "average_cost_per_execution": 0.000031
      },
      "token_statistics": {
        "total_prompt_tokens": 220,
        "total_completion_tokens": 460,
        "total_tokens": 680
      }
    }
  }
}
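
To consume the exported file after a run (path pattern taken from Test 1 above; the actual file name carries the run's timestamp):

import json
from pathlib import Path

metadata_dir = Path("/tmp/test/metadata")
latest = max(metadata_dir.glob("metadata_*.json"))  # timestamps sort lexicographically
data = json.loads(latest.read_text())

print(data["aggregate_statistics"]["cost"]["total_cost_usd"])
print(data["models"]["gpt-4o-mini"]["performance"]["latency_statistics"]["p99"])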

Checklist

  • Lint fixes and unit testing done
  • End to end task testing
  • Documentation updated

Breaking Changes

None. This is a purely additive feature with full backward compatibility; existing code picks up metadata tracking automatically without any changes.

…ics and cost tracking

- Implemented MetadataCollector singleton for centralized metrics collection
- Added automatic tracking via @track_model_request decorator
- Integrated LangChain callback for agent/tool tracking
- Added comprehensive test suite (94 tests, 100% passing)

Features:
- Multi-level tracking: aggregate, per-model, and per-node metrics
- Token statistics: prompt, completion, total tokens with averages
- Performance metrics: latency (total, average, percentiles), throughput, failure rates
- Latency statistics: min, max, mean, median, std_dev, p50, p95, p99 using Python statistics module
- Cost tracking: per-model, per-node, and aggregate costs in USD
- Response code distribution tracking
- Git context capture (commit hash, branch, dirty status)
- Environment metadata (Python version, SyGra version)
- Thread-safe implementation with proper locking
- Toggle support via --disable_metadata flag
- Automatic JSON export with timestamp synchronization

Supported Models:
- OpenAI (GPT-4, GPT-4 Turbo, GPT-3.5 Turbo, GPT-4o, GPT-4o-mini)
- Azure OpenAI
- Anthropic Claude (via AWS Bedrock)
- vLLM (OpenAI-compatible endpoints)
- TGI (Text Generation Inference)

Documentation:
- Comprehensive feature documentation in docs/features/metadata_tracking.md
- Usage examples and API reference
- Architecture overview
- Token extraction implementation details

Tests:
- test_metadata_collector.py: Core collector functionality (32 tests)
- test_metadata_integration.py: Decorator integration (16 tests)
- test_metadata_end_to_end.py: End-to-end workflows (12 tests)
- test_langchain_callback.py: LangChain integration (6 tests)
- test_metadata_toggle.py: Enable/disable functionality (12 tests)
- test_metadata.py: Additional integration tests (16 tests)
@zephyrzilla zephyrzilla requested a review from a team as a code owner November 6, 2025 04:44
@zephyrzilla zephyrzilla self-assigned this Nov 6, 2025
@zephyrzilla zephyrzilla added the enhancement New feature or request label Nov 6, 2025