@zephyrzilla zephyrzilla commented Nov 6, 2025

Summary

This PR introduces a comprehensive metadata tracking system for SyGra that automatically captures execution metrics, token usage, costs, and performance data across all LLM calls and workflow executions. The system provides detailed latency statistics (including percentiles), per-node cost tracking, and multi-level metrics aggregation, requiring zero changes to existing code.

Features implemented:

1. Centralized Metadata Collection System

  • Thread-safe singleton MetadataCollector for tracking all execution metrics
  • Captures execution context (task name, environment, git info, timing)
  • Records aggregate, per-model, and per-node statistics
  • Automatic dataset metadata tracking (source, size, version, hash)
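
A minimal access sketch, assuming only the names given in this PR (get_metadata_collector, get_metadata_summary); the dictionary keys mirror the metadata file structure shown further down:

from sygra.metadata.metadata_collector import get_metadata_collector

collector = get_metadata_collector()        # thread-safe singleton
summary = collector.get_metadata_summary()  # aggregate + per-model + per-node dict

print(summary["execution"]["task_name"])
print(summary["aggregate_statistics"]["tokens"]["total_tokens"])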

2. Latency Statistics

  • Percentile tracking: min, max, mean, median, std_dev, p50, p95, p99
  • Available at both model-level and node-level
  • Helps identify performance outliers and tail latencies
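
For illustration, these fields can be derived with the standard-library statistics module (as the commit message notes); the helper below is a sketch, not the actual SyGra implementation, and the name latency_stats is hypothetical:

import statistics

def latency_stats(latencies):
    # Sketch only: map a list of per-request latencies (seconds) onto the
    # reported fields; assumes at least two samples.
    ordered = sorted(latencies)
    cuts = statistics.quantiles(ordered, n=100)  # 99 percentile cut points
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "std_dev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }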

3. Per-Node Cost Tracking

  • Track costs at the node level, not just model level
  • Automatic cost calculation using model's calculate_cost() method
  • Shows total_cost_usd and average_cost_per_execution per node
  • Helps identify which parts of your workflow are most expensive
  • Only shown when costs are available (graceful degradation)
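
Node-level figures sit under the nodes key of the summary; a short sketch, using the structure shown in the metadata example below:

from sygra.metadata.metadata_collector import get_metadata_collector

node = get_metadata_collector().get_metadata_summary()["nodes"]["summarizer"]
print(node["cost"]["total_cost_usd"])              # e.g. 0.00031
print(node["cost"]["average_cost_per_execution"])  # e.g. 0.000031
print(node["latency_statistics"]["p95"])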

4. Cost Tracking with LangChain Community Integration (langchain-community)

  • Integrated official pricing for multiple models (OpenAI, Azure OpenAI, Anthropic Claude)
  • Automatic cost calculation based on token usage
  • Per-request, per-execution, per-record, and aggregate cost reporting
  • Zero-cost reporting for unsupported models (no stale estimates)
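
For reference, the pricing data comes from langchain-community's OpenAI cost helpers; the sketch below only shows that lookup (assuming a langchain-community version whose pricing table includes gpt-4o-mini), not how SyGra wires it in internally:

from langchain_community.callbacks.openai_info import get_openai_token_cost_for_model

prompt_cost = get_openai_token_cost_for_model("gpt-4o-mini", 220, is_completion=False)
completion_cost = get_openai_token_cost_for_model("gpt-4o-mini", 460, is_completion=True)
print(round(prompt_cost + completion_cost, 6))  # total USD for one request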

5. Automatic Tracking Infrastructure

  • @track_model_request decorator for custom model wrappers
  • MetadataTrackingCallback for LangChain agent LLM calls
  • Captures tokens, latency, response codes, and costs automatically
  • Works with both sync and async execution
  • Integrated into BaseNode for consistent tracking across all node types
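
A hedged sketch of decorating a custom wrapper: the decorator name comes from this PR, but the import path, whether it takes arguments, and the wrapper/method names here are all assumptions:

from sygra.metadata.metadata_collector import track_model_request  # import path assumed

class MyCustomModel:
    @track_model_request  # per this PR, captures tokens, latency, response code, and cost
    async def generate(self, messages):
        # call the backend here; tracking is expected to wrap this call
        ...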

6. Comprehensive Metrics Tracking

  • Token Statistics: Prompt, completion, total tokens with averages
  • Performance Metrics: Latency (total, average, percentiles), throughput (tokens/sec), retry/failure rates
  • Cost Analytics: Total costs, per-request costs, per-node costs, per-record costs
  • Response Tracking: HTTP status code distribution
  • Model Configuration: Captured for reproducibility

7. Timestamp Synchronization

  • Output and metadata files share identical timestamps
  • Format: output_2025-10-30_18-19-07.json -> metadata_..._2025-10-30_18-19-07.json
  • Easy correlation between outputs and metadata
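
A small correlation sketch, assuming the directory layout used in the tests below (a metadata/ folder next to the output files):

from pathlib import Path

out_dir = Path("/tmp/test")
for output_file in sorted(out_dir.glob("output_*.json")):
    ts = output_file.stem.removeprefix("output_")  # e.g. 2025-10-30_18-19-07
    matches = list((out_dir / "metadata").glob(f"metadata_*_{ts}.json"))
    print(output_file.name, "->", matches[0].name if matches else "no metadata file")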

8. Toggle Support

  • Enable/disable via --disable_metadata CLI flag
  • Programmatic control via collector.set_enabled(False)
  • Minimal overhead when disabled
  • Preserves enabled state across resets
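
Programmatic toggling, assuming only the names given in this PR:

from sygra.metadata.metadata_collector import get_metadata_collector

collector = get_metadata_collector()
collector.set_enabled(False)  # stop recording metrics
# ... run workflows without metadata overhead ...
collector.set_enabled(True)   # enabled state is preserved across resets per this PR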

9. Supported Models

  • OpenAI (GPT-4, GPT-4o, GPT-4o-mini)
  • Azure OpenAI (same models, different endpoints)
  • Anthropic Claude (via AWS Bedrock)
  • vLLM (OpenAI-compatible endpoints, token tracking only)
  • TGI (Text Generation Inference with details extraction)

How to Test the Feature

Test 1: Library Usage with Latency Statistics

from sygra import Workflow, DataSource
from sygra.metadata.metadata_collector import get_metadata_collector

graph = Workflow("test_metadata")
graph.source(DataSource.memory([{"text": "Hello"}, {"text": "World"}]))
graph.add_llm_node("summarizer", "gpt-4o-mini") \
    .system_message("Summarize") \
    .user_message("{text}") \
    .output_keys("summary")
graph.add_edge("START", "summarizer")
graph.add_edge("summarizer", "END")

# Run with timestamp
results = graph.run(num_records=2, output_with_ts=True, output_dir="/tmp/test")

# Access metadata
collector = get_metadata_collector()
metadata = collector.get_metadata_summary()

# Check latency statistics
print(f"P95 latency: {metadata['models']['gpt-4o-mini']['performance']['latency_statistics']['p95']}s")
print(f"Std dev: {metadata['models']['gpt-4o-mini']['performance']['latency_statistics']['std_dev']}s")

Expected Result:

  • Output file: test/output_YYYY-MM-DD_HH-MM-SS.json
  • Metadata file: test/metadata/metadata_test_metadata_YYYY-MM-DD_HH-MM-SS.json
  • Timestamps match exactly
  • Metadata contains latency statistics with min, max, mean, median, std_dev, p50, p95, p99
  • Node costs shown if model has cost calculation

Test 2: CLI Usage

poetry run python main.py --task examples.glaive_code_assistant --num_records=10

Expected Result:

  • Metadata file generated in tasks/examples/glaive_code_assistant/metadata/
  • Contains aggregate statistics, per-model metrics, per-node metrics
  • Latency percentiles for each model and node
  • Cost tracking at both model and node levels

Verification (applies to both tests):

  • All latency statistics present (min, max, mean, median, std_dev, p50, p95, p99)
  • Node costs shown when available
  • Cost calculations accurate
  • Token counts match LLM responses

Metadata File Structure

{
  "metadata_version": "1.0.0",
  "generated_at": "2025-11-05T21:57:10.123456",
  
  "execution": {
    "task_name": "tasks.examples.glaive_code_assistant",
    "timing": {
      "start_time": "2025-11-05T21:57:07.899389",
      "end_time": "2025-11-05T21:57:10.657968",
      "duration_seconds": 2.759
    },
    "environment": {
      "python_version": "3.11.12",
      "sygra_version": "1.0.0"
    },
    "git": {
      "commit_hash": "139a535...",
      "branch": "scratch/metadata",
      "is_dirty": false
    }
  },
  
  "aggregate_statistics": {
    "tokens": {
      "total_prompt_tokens": 440,
      "total_completion_tokens": 920,
      "total_tokens": 1360
    },
    "cost": {
      "total_cost_usd": 0.00062,
      "average_cost_per_record": 0.000062
    },
    "requests": {
      "total_requests": 20,
      "total_failures": 0,
      "failure_rate": 0.0
    }
  },
  
  "models": {
    "gpt-4o-mini": {
      "model_type": "OpenAI",
      "performance": {
        "average_latency_seconds": 3.203,
        "tokens_per_second": 21.23,
        "latency_statistics": {
          "min": 2.105,
          "max": 4.821,
          "mean": 3.203,
          "median": 3.150,
          "std_dev": 0.652,
          "p50": 3.150,
          "p95": 4.512,
          "p99": 4.759
        }
      },
      "cost": {
        "total_cost_usd": 0.00062,
        "average_cost_per_request": 0.000031
      }
    }
  },
  
  "nodes": {
    "summarizer": {
      "node_name": "summarizer",
      "node_type": "llm",
      "model_name": "gpt-4o-mini",
      "total_executions": 10,
      "latency_statistics": {
        "min": 2.105,
        "max": 4.821,
        "mean": 3.203,
        "median": 3.150,
        "std_dev": 0.652,
        "p50": 3.150,
        "p95": 4.512,
        "p99": 4.759
      },
      "cost": {
        "total_cost_usd": 0.00031,
        "average_cost_per_execution": 0.000031
      },
      "token_statistics": {
        "total_prompt_tokens": 220,
        "total_completion_tokens": 460,
        "total_tokens": 680
      }
    }
  }
}
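
To consume the exported file after a run (path pattern taken from Test 1 above; the actual file name carries the run's timestamp):

import json
from pathlib import Path

metadata_dir = Path("/tmp/test/metadata")
latest = max(metadata_dir.glob("metadata_*.json"))  # timestamps sort lexicographically
data = json.loads(latest.read_text())

print(data["aggregate_statistics"]["cost"]["total_cost_usd"])
print(data["models"]["gpt-4o-mini"]["performance"]["latency_statistics"]["p99"])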

Checklist

  • Lint fixes and unit testing done
  • End to end task testing
  • Documentation updated

Breaking Changes

None. This is a purely additive feature with full backward compatibility; existing code picks up metadata tracking automatically without any changes.

…ics and cost tracking

- Implemented MetadataCollector singleton for centralized metrics collection
- Added automatic tracking via @track_model_request decorator
- Integrated LangChain callback for agent/tool tracking
- Added comprehensive test suite (94 tests, 100% passing)

Features:
- Multi-level tracking: aggregate, per-model, and per-node metrics
- Token statistics: prompt, completion, total tokens with averages
- Performance metrics: latency (total, average, percentiles), throughput, failure rates
- Latency statistics: min, max, mean, median, std_dev, p50, p95, p99 using Python statistics module
- Cost tracking: per-model, per-node, and aggregate costs in USD
- Response code distribution tracking
- Git context capture (commit hash, branch, dirty status)
- Environment metadata (Python version, SyGra version)
- Thread-safe implementation with proper locking
- Toggle support via --disable_metadata flag
- Automatic JSON export with timestamp synchronization

Supported Models:
- OpenAI (GPT-4, GPT-4 Turbo, GPT-3.5 Turbo, GPT-4o, GPT-4o-mini)
- Azure OpenAI
- Anthropic Claude (via AWS Bedrock)
- vLLM (OpenAI-compatible endpoints)
- TGI (Text Generation Inference)

Documentation:
- Comprehensive feature documentation in docs/features/metadata_tracking.md
- Usage examples and API reference
- Architecture overview
- Token extraction implementation details

Tests:
- test_metadata_collector.py: Core collector functionality (32 tests)
- test_metadata_integration.py: Decorator integration (16 tests)
- test_metadata_end_to_end.py: End-to-end workflows (12 tests)
- test_langchain_callback.py: LangChain integration (6 tests)
- test_metadata_toggle.py: Enable/disable functionality (12 tests)
- test_metadata.py: Additional integration tests (16 tests)
@zephyrzilla zephyrzilla requested a review from a team as a code owner November 6, 2025 04:44
@zephyrzilla zephyrzilla self-assigned this Nov 6, 2025
@zephyrzilla zephyrzilla added the enhancement New feature or request label Nov 6, 2025