
[LEADS-40]: Get statistics about the token usage for lightspeed-evaluation#112

Merged
asamal4 merged 1 commit into lightspeed-core:main from bsatapat-jpg:dev on Dec 8, 2025

Conversation

@bsatapat-jpg
Collaborator

@bsatapat-jpg commented Dec 2, 2025

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude-4.5-sonnet
  • Generated by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Token usage (input/output) captured for both API calls and judge LLMs and shown in CSV exports, JSON/TXT reports, and CLI summary.
    • Lightweight TokenTracker utility exposed for collecting judge LLM token counts.
  • Tests

    • Added unit tests covering token aggregation, TokenTracker behavior, statistics, and report output with token sections.
  • Chores

    • Public outputs and exports extended to include token usage fields.


@coderabbitai
Contributor

coderabbitai bot commented Dec 2, 2025

Walkthrough

The streaming parser, models, pipeline, output, runner, and tests are updated to capture and propagate API token counts; a new TokenTracker captures judge-LLM token usage via litellm callbacks, and those metrics are aggregated and included in reports and the CSV column constants.
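The walkthrough describes TokenTracker only at a high level. As a rough sketch of how such a tracker could be wired to litellm's success-callback list, using the append/remove pattern suggested in the review below (method names follow the walkthrough and diagram; internals are assumptions, not the PR's literal code):

import litellm


class TokenTracker:
    """Accumulate judge-LLM token usage from litellm success callbacks (illustrative sketch)."""

    def __init__(self) -> None:
        self.input_tokens = 0
        self.output_tokens = 0

    def _on_success(self, _kwargs, completion_response, _start_time, _end_time) -> None:
        # litellm passes the completion response; usage carries prompt/completion counts.
        usage = getattr(completion_response, "usage", None)
        if usage is not None:
            self.input_tokens += getattr(usage, "prompt_tokens", 0) or 0
            self.output_tokens += getattr(usage, "completion_tokens", 0) or 0

    def start(self) -> None:
        """Register the callback without overwriting callbacks added elsewhere."""
        litellm.success_callback.append(self._on_success)

    def stop(self) -> None:
        """Unregister the callback if it is still present."""
        if self._on_success in litellm.success_callback:
            litellm.success_callback.remove(self._on_success)

    def get_counts(self) -> tuple[int, int]:
        """Return (input_tokens, output_tokens) accumulated so far."""
        return self.input_tokens, self.output_tokens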

Changes

Cohort / File(s) Summary
Streaming API & Models
src/lightspeed_evaluation/core/api/streaming_parser.py, src/lightspeed_evaluation/core/models/api.py
parse_streaming_response now extracts input_tokens/output_tokens from streaming end events and returns them; APIResponse gains input_tokens and output_tokens (ints, default 0).
Data Models
src/lightspeed_evaluation/core/models/data.py
TurnData adds api_input_tokens/api_output_tokens; EvaluationResult adds judge_llm_input_tokens, judge_llm_output_tokens, api_input_tokens, api_output_tokens (ints, ge=0, default 0).
Constants
src/lightspeed_evaluation/core/constants.py
Append judge_llm_input_tokens, judge_llm_output_tokens, api_input_tokens, api_output_tokens to SUPPORTED_CSV_COLUMNS.
LLM Token Tracking
src/lightspeed_evaluation/core/llm/custom.py, src/lightspeed_evaluation/core/llm/__init__.py
Add TokenTracker that registers/unregisters a litellm success callback to accumulate judge input/output tokens; export TokenTracker via package init.
Pipeline — Evaluator & Amender
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/pipeline/evaluation/amender.py
Evaluator starts/stops TokenTracker around evaluation and attaches judge token counts to EvaluationResult; Amender records API token counts from APIResponse onto TurnData.
Statistics & Output
src/lightspeed_evaluation/core/output/statistics.py, src/lightspeed_evaluation/core/output/generator.py
Add calculate_api_token_usage(evaluation_data) to aggregate API tokens; extend calculate_basic_stats with judge token totals; pass api_tokens into JSON/TXT summary generation and include token usage sections.
Runner Reporting
src/lightspeed_evaluation/runner/evaluation.py
OutputHandler.generate_reports now accepts evaluation_data; runner prints a "Token Usage Summary" with judge LLM and (when API enabled) API token totals and combined total.
Tests
tests/unit/core/output/test_statistics.py, tests/unit/runner/test_evaluation.py, tests/unit/pipeline/evaluation/test_evaluator.py, tests/unit/core/output/test_generator.py
Add tests for calculate_api_token_usage; extend calculate_basic_stats expectations; add TokenTracker lifecycle/counts/reset tests; update generator and runner tests to include token fields and api_tokens.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant StreamingParser
    participant Amender
    participant Evaluator
    participant TokenTracker
    participant Statistics
    participant Runner

    Client->>StreamingParser: stream events (chunks, end event with token counts)
    StreamingParser-->>Amender: APIResponse (includes api input/output tokens)
    Amender-->>Evaluator: turn_data (with api_input_tokens/api_output_tokens)
    Evaluator->>TokenTracker: start()
    Evaluator->>Evaluator: run judge-LLM evaluation (invokes litellm)
    TokenTracker--)TokenTracker: callback accumulates judge input/output tokens
    Evaluator->>TokenTracker: stop()
    TokenTracker-->>Evaluator: get_counts() → judge token totals
    Evaluator-->>Statistics: emit EvaluationResult (includes judge + API tokens)
    Statistics-->>Runner: aggregated stats (including token totals)
    Runner->>Runner: print Token Usage Summary (judge, API, combined)
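Read as code, the evaluator-side portion of this flow amounts to roughly the following (a sketch assuming TokenTracker exposes start()/stop()/get_counts() and that EvaluationResult carries the judge token fields named in the walkthrough):

def evaluate_with_token_tracking(tracker, run_judge_evaluation, build_result):
    """Wrap one metric evaluation with judge-LLM token tracking (illustrative)."""
    tracker.start()
    try:
        score, reason = run_judge_evaluation()  # judge-LLM calls go through litellm
    finally:
        tracker.stop()  # always unregister the callback, even if evaluation raises
    judge_input, judge_output = tracker.get_counts()
    result = build_result(score, reason)
    result.judge_llm_input_tokens = judge_input
    result.judge_llm_output_tokens = judge_output
    return result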

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Inspect TokenTracker registration/unregistration for idempotence and any concurrency/async edge cases.
  • Verify streaming end-event token extraction and defaulting in streaming_parser.py.
  • Confirm token propagation path: APIResponse → Amender → TurnData → EvaluationResult.
  • Validate aggregation and zero/default handling in calculate_api_token_usage and calculate_basic_stats.
  • Review report generation and runner printing for correct wiring and conditional API-enabled behavior.

Possibly related PRs

Suggested reviewers

  • VladimirKadlec
  • tisnik

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly summarizes the main change: adding token usage statistics tracking to the lightspeed-evaluation project, which aligns with the comprehensive changes across multiple files.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/core/output/statistics.py (1)

1-230: Fix trailing whitespace on lines 57, 220, 225.

The pipeline has flagged trailing whitespace on multiple lines. Additionally, Black formatting needs to be applied.

Run Black to fix:

#!/bin/bash
black src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

1-134: Fix formatting issues flagged by pipeline.

The pipeline has identified trailing whitespace on line 72 and requires Black formatting.

Run Black to fix:

#!/bin/bash
black src/lightspeed_evaluation/core/api/streaming_parser.py
🧹 Nitpick comments (3)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

174-175: Update error path cleanup to match normal path.

If you switch to append/remove pattern for callbacks, apply the same cleanup logic here.

             # Clear callback on error
-            litellm.success_callback = []
+            if token_tracking_callback in litellm.success_callback:
+                litellm.success_callback.remove(token_tracking_callback)

180-195: Consider adding token fields to error results for consistency.

The _create_error_result method doesn't include judge_llm_input_tokens and judge_llm_output_tokens fields. While token counts on error are typically zero, including them ensures the EvaluationResult objects are structurally consistent.

     def _create_error_result(
         self, request: EvaluationRequest, reason: str, execution_time: float
     ) -> EvaluationResult:
         """Create an ERROR result for failed evaluation."""
         return EvaluationResult(
             conversation_group_id=request.conv_data.conversation_group_id,
             turn_id=request.turn_id,
             metric_identifier=request.metric_identifier,
             result="ERROR",
             score=None,
             threshold=None,
             reason=reason,
             query=request.turn_data.query if request.turn_data else "",
             response=request.turn_data.response or "" if request.turn_data else "",
             execution_time=execution_time,
+            judge_llm_input_tokens=0,
+            judge_llm_output_tokens=0,
         )
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

14-85: Address complexity warning - too many branches.

The pipeline flagged this function with 17 branches exceeding the limit of 12. Consider refactoring to extract event handling into separate methods.

Refactor event handling into dedicated methods:

def parse_streaming_response(response: httpx.Response) -> dict[str, Any]:
    """Parse streaming response and extract data."""
    conversation_id = ""
    final_response = ""
    tool_calls: list[dict[str, Any]] = []
    token_count = 0
    input_tokens = 0
    output_tokens = 0

    for line in response.iter_lines():
        line = line.strip()
        if not line or not line.startswith(DATA_PREFIX):
            continue

        json_data = line.replace(DATA_PREFIX, "")
        parsed_data = _parse_sse_line(json_data)
        if not parsed_data:
            continue

        event, event_data = parsed_data
        
        # Delegate to event handlers
        result = _handle_event(
            event, event_data, conversation_id, final_response, 
            tool_calls, token_count, input_tokens, output_tokens
        )
        conversation_id = result["conversation_id"]
        final_response = result["final_response"]
        tool_calls = result["tool_calls"]
        token_count = result["token_count"]
        input_tokens = result["input_tokens"]
        output_tokens = result["output_tokens"]

    # ... validation and return logic

Then create separate handler methods for each event type.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7665def and 7e131b2.

📒 Files selected for processing (12)
  • src/lightspeed_evaluation/core/api/client.py (2 hunks)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/token_counter.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/token_tracker.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/metrics/token_counter.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/metrics/token_tracker.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/metrics/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Update MetricManager supported_metrics dictionary when registering new metrics

Files:

  • src/lightspeed_evaluation/core/metrics/token_counter.py
  • src/lightspeed_evaluation/core/metrics/token_tracker.py
🧠 Learnings (3)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/runner/evaluation.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/core/metrics/token_counter.py (1)
src/lightspeed_evaluation/core/metrics/token_tracker.py (1)
  • reset (38-43)
src/lightspeed_evaluation/core/metrics/token_tracker.py (1)
src/lightspeed_evaluation/core/metrics/token_counter.py (1)
  • reset (26-29)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/token_counter.py (3)
  • TokenCounter (9-65)
  • add_tokens (17-20)
  • get_counts (22-24)
src/lightspeed_evaluation/runner/evaluation.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_api_token_usage (216-230)
  • calculate_basic_stats (37-73)
🪛 GitHub Actions: Black
src/lightspeed_evaluation/core/api/client.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

src/lightspeed_evaluation/core/metrics/token_counter.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

src/lightspeed_evaluation/core/models/data.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

src/lightspeed_evaluation/core/metrics/token_tracker.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

src/lightspeed_evaluation/runner/evaluation.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

src/lightspeed_evaluation/core/api/streaming_parser.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

src/lightspeed_evaluation/core/output/generator.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

src/lightspeed_evaluation/core/output/statistics.py

[error] 1-1: Black would reformat this file. Run 'black' to format the code.

🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/core/metrics/token_tracker.py

[error] 52-52: D403: First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM'). Command: make docstyle.


[error] 89-89: D403: First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM'). Command: make docstyle.

🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/api/client.py

[error] 159-159: Trailing whitespace (trailing-whitespace)


[error] 166-166: Trailing whitespace (trailing-whitespace)


[error] 198-198: Trailing whitespace (trailing-whitespace)


[error] 203-203: Line too long (102/100) (line-too-long)


[error] 204-204: Trailing whitespace (trailing-whitespace)

src/lightspeed_evaluation/core/metrics/token_counter.py

[error] 48-48: Trailing whitespace (trailing-whitespace)


[error] 64-64: Trailing whitespace (trailing-whitespace)


[error] 88-88: Trailing newlines (trailing-newlines)


[error] 63-63: W1203: Use lazy % formatting in logging functions (logging-fstring-interpolation)


[error] 79-79: W0603: Using the global statement (global-statement)


[error] 85-85: W0602: Using global for '_current_counter' but no assignment is done (global-variable-not-assigned)

src/lightspeed_evaluation/core/models/data.py

[error] 72-72: Trailing whitespace (trailing-whitespace)

src/lightspeed_evaluation/core/metrics/token_tracker.py

[error] 32-32: Trailing whitespace (trailing-whitespace)


[error] 37-37: Trailing whitespace (trailing-whitespace)


[error] 44-44: Trailing whitespace (trailing-whitespace)


[error] 46-46: Trailing whitespace (trailing-whitespace)


[error] 68-68: Trailing whitespace (trailing-whitespace)


[error] 73-73: Trailing whitespace (trailing-whitespace)


[error] 81-81: Trailing whitespace (trailing-whitespace)


[error] 102-102: Trailing whitespace (trailing-whitespace)


[error] 114-114: Trailing whitespace (trailing-whitespace)


[error] 114-114: W0613: Unused argument 'kwargs' (unused-argument)


[error] 114-114: W0613: Unused argument 'start_time' (unused-argument)


[error] 114-114: W0613: Unused argument 'end_time' (unused-argument)

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py

[error] 71-71: Trailing whitespace (trailing-whitespace)


[error] 138-138: Trailing whitespace (trailing-whitespace)


[error] 68-68: R0902: Too many local variables (18/15) (too-many-locals)


[error] 123-123: W0718: Catching too general exception Exception (broad-exception-caught)


[error] 124-124: W1203: Use lazy % formatting in logging functions (logging-fstring-interpolation)


[error] 114-114: W0613: Unused argument 'kwargs' (unused-argument)


[error] 114-114: W0613: Unused argument 'start_time' (unused-argument)


[error] 114-114: W0613: Unused argument 'end_time' (unused-argument)

src/lightspeed_evaluation/runner/evaluation.py

[error] 97-97: Trailing whitespace (trailing-whitespace)


[error] 105-105: Trailing whitespace (trailing-whitespace)

src/lightspeed_evaluation/core/api/streaming_parser.py

[error] 72-72: Trailing whitespace (trailing-whitespace)


[error] 14-14: Too many branches (17/12) (too-many-branches)

src/lightspeed_evaluation/core/output/generator.py

[error] 243-243: Trailing whitespace (trailing-whitespace)

src/lightspeed_evaluation/core/output/statistics.py

[error] 57-57: Trailing whitespace (trailing-whitespace)


[error] 220-220: Trailing whitespace (trailing-whitespace)


[error] 225-225: Trailing whitespace (trailing-whitespace)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
🔇 Additional comments (11)
src/lightspeed_evaluation/core/constants.py (1)

35-36: LGTM!

The new CSV columns for judge LLM token tracking are consistent with the naming convention used elsewhere and align with the new EvaluationResult fields introduced in this PR.

src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

62-70: LGTM!

Token usage extraction and logging are well integrated into the amendment flow. Note that getattr with fallback is slightly redundant since APIResponse now defines these fields with default=0, but the defensive approach is harmless and adds resilience if the model changes.
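For illustration, the amendment step described here reduces to something like the following (attribute names follow the walkthrough; the getattr defaults mirror the defensive pattern noted above):

def record_api_tokens(turn_data, api_response) -> None:
    """Copy API token counts from the response onto the turn (illustrative sketch)."""
    turn_data.api_input_tokens = getattr(api_response, "input_tokens", 0)
    turn_data.api_output_tokens = getattr(api_response, "output_tokens", 0)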

src/lightspeed_evaluation/core/models/api.py (2)

93-94: LGTM!

The new token fields are well-defined with appropriate validation (ge=0) and sensible defaults. The field descriptions are clear.


113-123: LGTM!

Token extraction follows the established pattern for parsing raw API response data, with safe defaults when fields are absent.
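A minimal sketch of what such fields and the extraction pattern can look like in a pydantic model (field names follow this review; the validation details and raw-payload keys are assumptions):

from typing import Any

from pydantic import BaseModel, Field


class APIResponseSketch(BaseModel):
    """Illustrative response model with non-negative token counts defaulting to 0."""

    response: str = ""
    input_tokens: int = Field(default=0, ge=0, description="Prompt tokens consumed")
    output_tokens: int = Field(default=0, ge=0, description="Completion tokens generated")


def extract_tokens(raw: dict[str, Any]) -> tuple[int, int]:
    """Pull token counts from a raw API payload, defaulting to 0 when absent."""
    return int(raw.get("input_tokens", 0)), int(raw.get("output_tokens", 0))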

src/lightspeed_evaluation/core/api/client.py (1)

198-205: Fix formatting issues and consider adding output token estimation for consistency.

The pipeline reports trailing whitespace on lines 198 and 204, and line 203 exceeds 100 characters. The streaming path estimates only input tokens, while the standard path estimates both. To resolve this:

  1. Remove trailing whitespace on lines 198 and 204
  2. Reformat the logger.debug call on line 203 to stay within the 100-character limit
  3. Add output token estimation for the streaming path if parse_streaming_response does not already provide it
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

112-127: Thread safety concern with global callback mutation requires verification.

Mutating litellm.success_callback globally is not thread-safe if multiple evaluate_metric calls execute concurrently. They will overwrite each other's callbacks, leading to incorrect token counts or missed tracking.

Additionally:

  • Prefix unused parameters (kwargs, start_time, end_time) with _ to suppress linter warnings
  • Use lazy % formatting instead of f-string on line 124: logger.debug("Token tracking callback error: %s", e)
  • Consider using append() to add the callback instead of assignment to avoid overwriting existing callbacks

For proper thread safety, verify whether evaluations can run concurrently in this codebase (e.g., via ThreadPoolExecutor). If so, use thread-local or context variables to manage per-evaluation callbacks, or ensure evaluations are serialized.
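One concurrency-safe variant of that idea, sketched with contextvars instead of mutating the global callback list per evaluation (entirely illustrative; none of these names come from the PR):

from contextvars import ContextVar
from typing import Optional

import litellm


class _Counts:
    def __init__(self) -> None:
        self.input_tokens = 0
        self.output_tokens = 0


_active_counts: ContextVar[Optional[_Counts]] = ContextVar("_active_counts", default=None)


def _shared_success_callback(_kwargs, completion_response, _start_time, _end_time) -> None:
    """Single, permanently registered callback; writes to the counter bound to this context."""
    counts = _active_counts.get()
    usage = getattr(completion_response, "usage", None)
    if counts is not None and usage is not None:
        counts.input_tokens += getattr(usage, "prompt_tokens", 0) or 0
        counts.output_tokens += getattr(usage, "completion_tokens", 0) or 0


litellm.success_callback.append(_shared_success_callback)  # registered once, never removed


def run_with_token_counts(evaluate):
    """Run one evaluation with its own counter; other contexts/threads are unaffected."""
    counts = _Counts()
    token = _active_counts.set(counts)
    try:
        result = evaluate()
    finally:
        _active_counts.reset(token)
    return result, (counts.input_tokens, counts.output_tokens)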

src/lightspeed_evaluation/core/models/data.py (1)

74-79: LGTM! Token tracking fields properly added.

The token tracking fields are well-defined with appropriate validation constraints (ge=0) and descriptive field descriptions. The additions to both TurnData and EvaluationResult align with the PR's objective to track token usage across the evaluation pipeline.

Also applies to: 410-415

src/lightspeed_evaluation/core/output/statistics.py (2)

48-72: LGTM! Token statistics properly integrated.

The token tracking fields are correctly integrated into the basic statistics calculation. The implementation:

  • Properly initializes token fields in the empty results case
  • Accumulates token counts from all results
  • Returns comprehensive token statistics (input, output, and total)

216-230: LGTM! API token usage calculation is clean and correct.

The calculate_api_token_usage function properly aggregates API token usage from all turns across all conversations. The implementation is straightforward and returns a well-structured dictionary with total counts.
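For reference, the aggregation being approved here reduces to a sketch like this (assuming EvaluationData exposes its turns as a turns list; the returned key names follow the docstring suggested later in this review):

from typing import Any


def calculate_api_token_usage_sketch(evaluation_data: list) -> dict[str, Any]:
    """Sum per-turn API token counts across all conversations (illustrative)."""
    total_input = sum(
        turn.api_input_tokens for data in evaluation_data for turn in data.turns
    )
    total_output = sum(
        turn.api_output_tokens for data in evaluation_data for turn in data.turns
    )
    return {
        "total_api_input_tokens": total_input,
        "total_api_output_tokens": total_output,
        "total_api_tokens": total_input + total_output,
    }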

src/lightspeed_evaluation/core/api/streaming_parser.py (2)

19-21: LGTM! Token tracking properly integrated into streaming parser.

The token tracking implementation is well-structured:

  • Initializes tracking variables appropriately
  • Counts tokens during streaming events
  • Extracts authoritative counts from end events when available
  • Extends the return payload with token information

The multi-tier approach (explicit counts → streamed count → estimation) provides good fallback coverage.

Also applies to: 43-46, 55-60, 83-84
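The fallback chain described here can be read as the following simplified sketch (variable names are taken from the review comments rather than the parser itself):

def resolve_output_tokens(
    end_event_count: int, streamed_token_count: int, final_response: str
) -> int:
    """Pick the best available output-token figure (illustrative)."""
    if end_event_count:            # explicit count from the streaming end event
        return end_event_count
    if streamed_token_count:       # otherwise, the number of streamed "token" events
        return streamed_token_count
    if final_response:             # last resort: rough ~4 characters per token
        return max(1, len(final_response) // 4)
    return 0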


69-77: Verify token estimation approach aligns with other estimation logic.

The fallback token estimation (lines 74-77) estimates tokens using a 4 characters per token heuristic. Confirm this matches the approach in TokenCounter.estimate_tokens() in token_counter.py. If the heuristics differ, extract a shared estimation utility function to maintain consistency.
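If the heuristics are unified as suggested, the shared helper could be as small as the following (hypothetical; the ~4 characters per token figure comes from the comments above):

def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text when no real count is available."""
    return max(1, len(text) // 4) if text else 0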

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

14-15: Update docstring to document return structure and new token fields.

The function now returns input_tokens and output_tokens in addition to existing fields, but the docstring doesn't reflect this. Per coding guidelines, use Google-style docstrings for all public APIs.

Apply this diff to improve the docstring:

 def parse_streaming_response(response: httpx.Response) -> dict[str, Any]:
-    """Parse streaming response and extract data."""
+    """Parse streaming response and extract data.
+
+    Args:
+        response: The httpx streaming response to parse.
+
+    Returns:
+        A dictionary containing:
+            - response (str): The final response text.
+            - tool_calls (list): List of tool call sequences.
+            - conversation_id (str): The conversation identifier.
+            - input_tokens (int): Number of input tokens consumed.
+            - output_tokens (int): Number of output tokens generated.
+
+    Raises:
+        ValueError: If no final response or conversation ID is found.
+    """

Based on coding guidelines.

♻️ Duplicate comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

143-144: Callback registration overwrites existing callbacks.

Setting litellm.success_callback = [token_tracking_callback] replaces any existing callbacks that may have been registered elsewhere. This could cause issues if other parts of the system rely on their callbacks.

Consider appending and removing specifically:

             # Register callback
-            litellm.success_callback = [token_tracking_callback]
+            litellm.success_callback.append(token_tracking_callback)

             # Evaluate metric
             score, reason = self.handlers[framework].evaluate(  # type: ignore
                 metric_name, request.conv_data, evaluation_scope
             )

             # Clear callback
-            litellm.success_callback = []
+            if token_tracking_callback in litellm.success_callback:
+                litellm.success_callback.remove(token_tracking_callback)
🧹 Nitpick comments (5)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)

69-77: Document the token estimation approximation.

The 4 characters per token estimate at lines 74-77 is a reasonable fallback, but the accuracy varies by content type (English text, code, non-English text) and tokenizer. Since these statistics may be used for cost estimation or monitoring, consider:

  1. Adding a comment explaining when this fallback is triggered and its limitations
  2. Including the estimation method in token usage reports so users know when estimates are used versus actual counts

Apply this diff to clarify the estimation:

     # If still no output tokens, estimate from response length
     if not output_tokens and final_response:
-        # Rough estimation: ~4 characters per token for English text
+        # Fallback estimation when API doesn't provide token counts.
+        # Assumes ~4 characters per token for English text; accuracy varies by content type.
         output_tokens = max(1, len(final_response) // 4)
-        logger.debug("Estimated output tokens from response length: %d", output_tokens)
+        logger.warning(
+            "Output tokens estimated from response length: %d (actual count unavailable)",
+            output_tokens
+        )

Consider changing the log level to warning since estimates are less reliable than actual counts.


14-85: Consider refactoring to reduce cyclomatic complexity.

The linter flags "Too many branches" (R0912) for this function. While the current implementation is readable, you could reduce complexity by extracting event handlers:

Example refactoring approach:

def parse_streaming_response(response: httpx.Response) -> dict[str, Any]:
    """Parse streaming response and extract data."""
    parser_state = StreamingParserState()
    
    for line in response.iter_lines():
        line = line.strip()
        if not line or not line.startswith(DATA_PREFIX):
            continue
            
        json_data = line.replace(DATA_PREFIX, "")
        parsed_data = _parse_sse_line(json_data)
        
        if parsed_data:
            event, event_data = parsed_data
            _handle_event(event, event_data, parser_state)
    
    return parser_state.to_dict()

def _handle_event(event: str, event_data: dict, state: StreamingParserState) -> None:
    """Dispatch to appropriate event handler."""
    handlers = {
        "error": _handle_error_event,
        "start": _handle_start_event,
        "token": _handle_token_event,
        "turn_complete": _handle_turn_complete_event,
        "tool_call": _handle_tool_call_event,
        "end": _handle_end_event,
    }
    handler = handlers.get(event)
    if handler:
        handler(event_data, state)

This is optional and can be deferred if the current implementation meets your needs.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)

130-141: Address pylint warnings for unused callback parameters.

The callback signature is required by LiteLLM but the parameters are unused. Prefix them with underscore to suppress warnings:

             # Set up token tracking callback for LiteLLM
             def token_tracking_callback(
-                kwargs, completion_response, start_time, end_time
+                _kwargs, completion_response, _start_time, _end_time
             ):

Additionally, the bare except Exception is flagged. Consider catching a more specific exception or at minimum documenting why the broad catch is necessary:

                         token_counter.add_tokens(input_tokens, output_tokens)
-                except Exception as e:
+                except (AttributeError, TypeError) as e:
                     logger.debug("Token tracking callback error: %s", e)

197-212: Consider tracking partial token counts in error results.

When an evaluation fails partway through, any tokens consumed before the error are not captured in the error result. The _create_error_result method doesn't accept token counts.

If tracking partial token usage is important for cost analysis, consider passing the token counter:

     def _create_error_result(
-        self, request: EvaluationRequest, reason: str, execution_time: float
+        self, request: EvaluationRequest, reason: str, execution_time: float,
+        judge_input_tokens: int = 0, judge_output_tokens: int = 0
     ) -> EvaluationResult:
         """Create an ERROR result for failed evaluation."""
         return EvaluationResult(
             conversation_group_id=request.conv_data.conversation_group_id,
             turn_id=request.turn_id,
             metric_identifier=request.metric_identifier,
             result="ERROR",
             score=None,
             threshold=None,
             reason=reason,
             query=request.turn_data.query if request.turn_data else "",
             response=request.turn_data.response or "" if request.turn_data else "",
             execution_time=execution_time,
+            judge_llm_input_tokens=judge_input_tokens,
+            judge_llm_output_tokens=judge_output_tokens,
         )

Then update the exception handler to pass partial counts:

judge_input, judge_output = token_counter.get_counts()
return self._create_error_result(
    request, f"Evaluation error: {e}", execution_time,
    judge_input, judge_output
)

85-85: Address "too many local variables" pylint warning.

The evaluate_metric method has grown complex with token tracking logic. Consider extracting the callback setup and cleanup into a context manager to reduce local variable count:

from contextlib import contextmanager

@contextmanager
def _litellm_token_tracking(self, token_counter: TokenCounter):
    """Context manager for LiteLLM token tracking callback."""
    def callback(_kwargs, completion_response, _start_time, _end_time):
        try:
            if hasattr(completion_response, "usage"):
                usage = completion_response.usage
                input_tokens = getattr(usage, "prompt_tokens", 0)
                output_tokens = getattr(usage, "completion_tokens", 0)
                token_counter.add_tokens(input_tokens, output_tokens)
        except (AttributeError, TypeError) as e:
            logger.debug("Token tracking callback error: %s", e)
    
    litellm.success_callback.append(callback)
    try:
        yield
    finally:
        if callback in litellm.success_callback:
            litellm.success_callback.remove(callback)

This encapsulates the callback lifecycle and reduces complexity in the main method.
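If that refactor were adopted, the call site inside evaluate_metric could shrink to roughly the following (illustrative only; it reuses the TokenCounter and handler names from this discussion and is not code from the PR):

def _evaluate_with_tracking(self, framework, metric_name, request, evaluation_scope):
    """Evaluate one metric with judge token tracking scoped to this call (sketch)."""
    token_counter = TokenCounter()
    with self._litellm_token_tracking(token_counter):
        score, reason = self.handlers[framework].evaluate(
            metric_name, request.conv_data, evaluation_scope
        )
    judge_input_tokens, judge_output_tokens = token_counter.get_counts()
    return score, reason, judge_input_tokens, judge_output_tokens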

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e131b2 and 1e767a0.

📒 Files selected for processing (10)
  • src/lightspeed_evaluation/core/api/client.py (2 hunks)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/constants.py
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/output/generator.py
🧠 Learnings (2)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • evaluate (97-138)
src/lightspeed_evaluation/core/metrics/geval.py (1)
  • evaluate (45-116)
src/lightspeed_evaluation/core/metrics/ragas.py (1)
  • evaluate (100-140)
src/lightspeed_evaluation/core/metrics/script.py (1)
  • evaluate (23-38)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/api/streaming_parser.py

[error] 14-14: pylint: Too many branches (R0912).

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py

[error] 85-85: pylint: Too many local variables (R0914).


[warning] 140-140: pylint: Catching too general exception (broad-exception-caught) (W0718).


[warning] 131-131: pylint: Unused argument 'kwargs' (no-unused-vars) (W0613).


[warning] 131-131: pylint: Unused argument 'start_time' (no-unused-vars) (W0613).


[warning] 131-131: pylint: Unused argument 'end_time' (no-unused-vars) (W0613).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (7)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)

79-85: Token fields correctly added to return payload.

The addition of input_tokens and output_tokens to the return dictionary aligns with the PR objective to track token usage throughout the evaluation pipeline. The fallback logic ensures token counts are always available, even when the API doesn't provide explicit counts.


43-46: Verify that counting "inference" role tokens is accurate.

The logic assumes that each "token" event with role="inference" represents exactly one output token. Please verify:

  1. Is "inference" the only role that should be counted for output tokens?
  2. Does each event truly represent a single token, or could events represent multiple tokens?
  3. Are there other roles (e.g., "assistant") that should also be counted?
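For context, the assumption being questioned is equivalent to counting like this (illustrative; event and field names follow this discussion, and each matching event is assumed to represent exactly one token):

def count_output_tokens(events: list[tuple[str, dict]]) -> int:
    """Count streamed output tokens as one per 'token' event with role 'inference'."""
    return sum(
        1
        for event, data in events
        if event == "token" and data.get("role") == "inference"
    )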
src/lightspeed_evaluation/core/output/generator.py (1)

244-253: LGTM! Token usage section is well-integrated.

The formatting with :, for thousands separators provides good readability. The section placement after overall statistics is appropriate and maintains consistency with the report structure.
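A bare-bones version of such a section, using the same :, thousands-separator formatting (labels and layout are assumptions, not the generator's exact output):

def token_usage_section(judge_input: int, judge_output: int) -> str:
    """Render a small token-usage block with thousands separators (illustrative)."""
    return "\n".join(
        [
            "Token Usage:",
            f"  Judge LLM input tokens:  {judge_input:,}",
            f"  Judge LLM output tokens: {judge_output:,}",
            f"  Judge LLM total tokens:  {judge_input + judge_output:,}",
        ]
    )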

src/lightspeed_evaluation/core/output/statistics.py (1)

58-61: LGTM! Token aggregation is straightforward and correct.

The summation approach correctly handles the default value of 0 for missing token counts.

src/lightspeed_evaluation/core/models/data.py (2)

73-79: LGTM! API token tracking fields are well-defined.

The field definitions with ge=0 constraints and sensible defaults properly handle cases where token tracking is not available.


410-415: LGTM! Judge LLM token fields are properly structured.

The fields align with the TokenCounter implementation in the evaluator and integrate cleanly with the existing model.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

27-43: LGTM! Simple and effective token counter implementation.

The class is appropriately scoped for per-evaluation token tracking. The design is clean and thread-safe since each evaluation creates its own instance.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

216-230: Improve docstring to meet Google-style format requirements.

The docstring for this public API function still lacks Args and Returns sections as required by the coding guidelines.

 def calculate_api_token_usage(evaluation_data: list[EvaluationData]) -> dict[str, Any]:
-    """Calculate total API token usage from evaluation data."""
+    """Calculate total API token usage from evaluation data.
+
+    Args:
+        evaluation_data: List of evaluation data containing turn-level API token counts.
+
+    Returns:
+        Dictionary containing total_api_input_tokens, total_api_output_tokens,
+        and total_api_tokens.
+    """
🧹 Nitpick comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

27-43: Add Google-style docstrings to the public class and methods.

Per coding guidelines, public APIs should have Google-style docstrings. The TokenCounter class and its methods lack complete documentation.

 class TokenCounter:
-    """Simple token counter for metric evaluation."""
+    """Simple token counter for metric evaluation.
+
+    Accumulates input and output token counts across multiple LLM calls
+    during a single metric evaluation.
+    """

     def __init__(self) -> None:
-        """Initialize token counter."""
+        """Initialize token counter with zero counts."""
         self.input_tokens = 0
         self.output_tokens = 0

     def add_tokens(self, input_tokens: int, output_tokens: int) -> None:
-        """Add token counts."""
+        """Add token counts to the running totals.
+
+        Args:
+            input_tokens: Number of input/prompt tokens to add.
+            output_tokens: Number of output/completion tokens to add.
+        """
         self.input_tokens += input_tokens
         self.output_tokens += output_tokens

     def get_counts(self) -> tuple[int, int]:
-        """Get current token counts."""
+        """Get current token counts.
+
+        Returns:
+            Tuple of (input_tokens, output_tokens).
+        """
         return self.input_tokens, self.output_tokens

199-214: Consider tracking tokens even for error results.

The _create_error_result method doesn't include token counts, but LLM calls may have occurred before the error. For accurate usage tracking, consider passing the token_counter to this method.

     def _create_error_result(
-        self, request: EvaluationRequest, reason: str, execution_time: float
+        self, request: EvaluationRequest, reason: str, execution_time: float,
+        token_counter: Optional[TokenCounter] = None
     ) -> EvaluationResult:
         """Create an ERROR result for failed evaluation."""
+        judge_input, judge_output = token_counter.get_counts() if token_counter else (0, 0)
         return EvaluationResult(
             conversation_group_id=request.conv_data.conversation_group_id,
             turn_id=request.turn_id,
             metric_identifier=request.metric_identifier,
             result="ERROR",
             score=None,
             threshold=None,
             reason=reason,
             query=request.turn_data.query if request.turn_data else "",
             response=request.turn_data.response or "" if request.turn_data else "",
             execution_time=execution_time,
+            judge_llm_input_tokens=judge_input,
+            judge_llm_output_tokens=judge_output,
         )

Then update the call site at line 195-197:

             return self._create_error_result(
-                request, f"Evaluation error: {e}", execution_time
+                request, f"Evaluation error: {e}", execution_time, token_counter
             )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e767a0 and 292949a.

📒 Files selected for processing (10)
  • src/lightspeed_evaluation/core/api/client.py (2 hunks)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/constants.py
🧠 Learnings (6)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
src/lightspeed_evaluation/core/metrics/geval.py (1)
  • evaluate (45-116)
src/lightspeed_evaluation/core/metrics/script.py (1)
  • evaluate (23-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: tests (3.12)
  • GitHub Check: mypy
🔇 Additional comments (8)
src/lightspeed_evaluation/core/constants.py (1)

35-36: LGTM!

The new CSV column names align with the judge_llm_input_tokens and judge_llm_output_tokens fields in EvaluationResult, maintaining consistency across the codebase.

src/lightspeed_evaluation/core/output/statistics.py (1)

48-50: LGTM!

Token tracking integration in calculate_basic_stats is well-implemented. The initialization in the empty-results case and the accumulation logic are correct, with proper field access to judge_llm_input_tokens and judge_llm_output_tokens.

Also applies to: 58-61, 70-72

src/lightspeed_evaluation/core/api/client.py (2)

160-176: LGTM!

The token estimation fallback logic is reasonable. Using len(text) // 4 with a minimum of 1 is a sensible heuristic for English text when the API doesn't provide token counts. The debug logging helps with troubleshooting.


202-211: LGTM!

The streaming path correctly only estimates input tokens since output tokens are tracked/estimated in parse_streaming_response via token events. This avoids duplicate estimation logic.

src/lightspeed_evaluation/core/api/streaming_parser.py (3)

21-23: LGTM!

The token counting initialization and incremental tracking of "inference" tokens during streaming is well-implemented. This provides accurate output token counts when the API doesn't include them in the "end" event.

Also applies to: 45-48


57-62: LGTM!

The three-tier fallback for output tokens (explicit from API → counted from token events → estimated from response length) is a robust approach. The estimation is appropriately logged at debug level.

Also applies to: 71-79


81-87: LGTM!

The updated return structure correctly includes input_tokens and output_tokens, aligning with the APIResponse model and maintaining consistency with the standard query path.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

145-146: Consider using append/remove pattern for callback registration to preserve existing callbacks.

The current approach of assigning litellm.success_callback = [token_tracking_callback] overwrites any existing callbacks. While this appears isolated to the evaluation context, using append/remove would be safer if LiteLLM is used elsewhere with other callbacks:

             # Register callback
-            litellm.success_callback = [token_tracking_callback]
+            litellm.success_callback.append(token_tracking_callback)

             # Evaluate metric
             score, reason = self.handlers[framework].evaluate(  # type: ignore
                 metric_name, request.conv_data, evaluation_scope
             )

             # Clear callback
-            litellm.success_callback = []
+            if token_tracking_callback in litellm.success_callback:
+                litellm.success_callback.remove(token_tracking_callback)

Also applies to line 194 (error path).

@bsatapat-jpg force-pushed the dev branch 3 times, most recently from b866ff6 to ecda30f on December 2, 2025 at 18:06
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

71-79: Consider logging when using streamed token count fallback.

The fallback logic is sound, but only the estimation fallback logs a debug message. Consider adding a debug log when using token_count as fallback for consistency in tracking which token source was used.

     # Use streamed token count if explicit counts not provided
     if not output_tokens and token_count > 0:
         output_tokens = token_count
+        logger.debug("Using streamed token count as output tokens: %d", output_tokens)

     # If still no output tokens, estimate from response length
tests/unit/core/output/test_statistics.py (1)

1-11: Consider adding test coverage for non-zero token counts.

The existing tests verify the token fields exist with default values, but there's no test verifying actual token aggregation when judge_llm_input_tokens or judge_llm_output_tokens are non-zero on the EvaluationResult objects.
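A sketch of the kind of test being asked for, assuming calculate_basic_stats takes the list of EvaluationResult objects, a hypothetical make_result helper/fixture (not in the repo) that builds a valid result, and aggregate key names that follow the walkthrough:

from lightspeed_evaluation.core.output.statistics import calculate_basic_stats


def test_basic_stats_aggregate_non_zero_judge_tokens(make_result):
    """Aggregation should sum non-zero judge token counts across results (sketch)."""
    results = [
        make_result(judge_llm_input_tokens=100, judge_llm_output_tokens=40),
        make_result(judge_llm_input_tokens=50, judge_llm_output_tokens=10),
    ]
    stats = calculate_basic_stats(results)
    # Key names are assumed from the PR walkthrough and may differ in the real code.
    assert stats["total_judge_llm_input_tokens"] == 150
    assert stats["total_judge_llm_output_tokens"] == 50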

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 292949a and ecda30f.

📒 Files selected for processing (12)
  • src/lightspeed_evaluation/core/api/client.py (2 hunks)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/runner/evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/data.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_statistics.py
  • tests/unit/runner/test_evaluation.py
🧠 Learnings (7)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
🔇 Additional comments (21)
src/lightspeed_evaluation/core/api/streaming_parser.py (4)

21-23: LGTM! Token tracking initialization is appropriate.

The initialization of token_count, input_tokens, and output_tokens to 0 provides a clean baseline for accumulating token usage throughout the streaming response parsing.


45-48: Token counting logic for inference events looks correct.

Counting each token event where role == "inference" is a reasonable approach for tracking streamed output tokens.


57-62: LGTM! End event token extraction is properly handled.

Extracting input_tokens and output_tokens from the end event when available ensures accurate counts from the API when provided.


85-86: LGTM! Return payload correctly includes token fields.

The enriched return dictionary now includes input_tokens and output_tokens, enabling downstream token tracking.

src/lightspeed_evaluation/core/models/data.py (2)

73-79: LGTM! API token tracking fields properly defined on TurnData.

The api_input_tokens and api_output_tokens fields are well-defined with appropriate defaults, constraints (ge=0), and descriptive docstrings.


410-415: LGTM! Judge LLM token fields properly defined on EvaluationResult.

The judge_llm_input_tokens and judge_llm_output_tokens fields follow the same pattern as the API token fields with proper validation constraints.

tests/unit/runner/test_evaluation.py (3)

57-65: LGTM! Mock return values correctly updated to include token tracking fields.

The mock for calculate_basic_stats now includes total_judge_llm_input_tokens, total_judge_llm_output_tokens, and total_judge_llm_tokens, aligning with the updated function signature.


112-120: Mock values correctly reflect the zero-token scenario.

Appropriate for testing the custom output directory case where no actual evaluations are run.


210-218: LGTM! Token totals correctly mocked for error scenario testing.

The proportional token values (500/250/750) appropriately reflect the mixed results scenario being tested.

src/lightspeed_evaluation/core/api/client.py (2)

202-210: LGTM! Streaming input token estimation follows the same pattern.

The estimation logic for streaming queries mirrors the standard query path, maintaining consistency. The streaming parser handles output token estimation via token_count or response length fallback.


160-175: Consider distinguishing between missing tokens and explicit zero values.

The condition not response_data.get("input_tokens") treats an explicit 0 the same as a missing key. If the API returns 0 intentionally (e.g., for cached responses with no new input processing), this would incorrectly trigger estimation. Consider using response_data.get("input_tokens") is None to distinguish between missing tokens and legitimate zero values.
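A small, runnable illustration of the difference (response_data here is a stand-in dict, not the actual parsed payload):

response_data = {"input_tokens": 0, "output_tokens": 42}

# Falsy check: an explicit 0 looks the same as a missing key,
# so estimation would overwrite the value the API reported.
if not response_data.get("input_tokens"):
    print("falsy check: would re-estimate input tokens")

# None check: estimation runs only when the key is truly absent.
if response_data.get("input_tokens") is None:
    print("None check: would re-estimate input tokens")
else:
    print("None check: keeps the reported 0")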

tests/unit/core/output/test_statistics.py (2)

289-291: LGTM! Expected stats correctly include token tracking fields.

The expected dictionary now includes the three new token fields, all set to 0 as the test EvaluationResult objects use default values.


319-321: LGTM! Single result test correctly expects zero token counts.

Consistent with the mixed results test above.

src/lightspeed_evaluation/core/output/statistics.py (3)

48-50: LGTM! Empty results case correctly returns zero token counts.

Consistent with the non-empty case structure.


58-72: LGTM! Token aggregation is straightforward and correct.

Summing judge_llm_input_tokens and judge_llm_output_tokens across all results, then computing the total, is the expected behavior.


216-237: LGTM! API token usage calculation function properly implemented.

The function correctly iterates through evaluation data and turns to aggregate API token usage. The Google-style docstring has been added as requested in the previous review.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5)

27-42: LGTM! TokenCounter class is clean and straightforward.

Simple accumulator with appropriate type hints on methods. Consider using @dataclass for even more concise implementation, though the current approach is perfectly acceptable.
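For illustration, the @dataclass variant suggested above could look roughly like this; it is a sketch only, not the class as shipped in evaluator.py, though the add_tokens/get_counts interface matches what the review describes.

from dataclasses import dataclass


@dataclass
class TokenCounter:
    """Accumulates judge LLM token usage for a single evaluation (sketch)."""

    input_tokens: int = 0
    output_tokens: int = 0

    def add_tokens(self, input_tokens: int, output_tokens: int) -> None:
        """Add token counts from one judge LLM call."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def get_counts(self) -> tuple[int, int]:
        """Return (input_tokens, output_tokens) accumulated so far."""
        return self.input_tokens, self.output_tokens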


156-162: Callback cleanup and token retrieval are correctly placed.

The callback is cleared after evaluation, and token counts are retrieved before result construction. See previous comment regarding the cleanup approach.


189-190: LGTM! Token counts correctly passed to EvaluationResult.

The judge_llm_input_tokens and judge_llm_output_tokens fields are populated from the TokenCounter, enabling downstream aggregation in statistics.


196-197: Good practice: Callback cleanup in error path.

Ensures the callback is removed even when an exception occurs, preventing stale callbacks. Apply the same improvement suggested above for the cleanup approach.


131-149: Callback management may interfere with concurrent evaluations or other LiteLLM callbacks.

Setting litellm.success_callback = [token_tracking_callback] replaces any existing callbacks. If other parts of the system register LiteLLM callbacks, or if evaluations run concurrently, this could cause issues. Consider appending and removing the specific callback instead.

-            # Register callback
-            litellm.success_callback = [token_tracking_callback]
+            # Register callback (preserve any existing callbacks)
+            litellm.success_callback.append(token_tracking_callback)

And in cleanup (lines 156-157 and 196-197):

-            litellm.success_callback = []
+            if token_tracking_callback in litellm.success_callback:
+                litellm.success_callback.remove(token_tracking_callback)

Contributor

@VladimirKadlec VladimirKadlec left a comment


Nice 💪 . Added some comments, please add tests.

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 2 times, most recently from ce9a29b to 58aa1ec on December 3, 2025 10:27
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/api/client.py (1)

22-26: Duplicate function: consolidate with streaming_parser.py.

This estimate_tokens function is identical to the one in src/lightspeed_evaluation/core/api/streaming_parser.py (lines 12-16). See the review comment on that file for the recommended refactoring to extract this to a shared utility module.
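One possible shape for that shared helper; the module path core/api/tokens.py is purely illustrative and the project may choose a different home for it:

# src/lightspeed_evaluation/core/api/tokens.py  (hypothetical location)
"""Shared token estimation helper for client.py and streaming_parser.py."""


def estimate_tokens(text: str) -> int:
    """Estimate token count from text (~4 characters per token).

    Args:
        text: Input text to estimate tokens for.

    Returns:
        Estimated token count (minimum 1).
    """
    if not text:
        return 1
    return max(1, len(text) // 4)

Both callers would then import the single definition (for example, from lightspeed_evaluation.core.api.tokens import estimate_tokens), so the heuristic cannot drift between the two modules.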

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecda30f and 58aa1ec.

📒 Files selected for processing (12)
  • src/lightspeed_evaluation/core/api/client.py (3 hunks)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • tests/unit/runner/test_evaluation.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/core/output/test_statistics.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧠 Learnings (5)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
  • estimate_tokens (22-26)
src/lightspeed_evaluation/core/api/client.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • estimate_tokens (12-16)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (5)
src/lightspeed_evaluation/core/output/generator.py (1)

244-254: LGTM! Token usage reporting is well-integrated.

The new "Token Usage (Judge LLM)" section is cleanly integrated into the text summary and follows the existing formatting patterns. The thousand-separator formatting is appropriate for readability.

src/lightspeed_evaluation/core/api/streaming_parser.py (1)

29-31: Token tracking logic is sound.

The streaming token tracking implementation correctly:

  • Counts inference tokens during streaming (lines 53-56)
  • Extracts explicit counts from end events (lines 65-70)
  • Falls back to streamed count or estimation when needed (lines 79-86)
  • Returns token metadata in the response payload (lines 92-93)

The fallback chain (explicit → streamed → estimated) is reasonable and ensures token counts are always available.

Also applies to: 53-56, 65-70, 79-86, 92-93

src/lightspeed_evaluation/core/api/client.py (1)

167-180: Token estimation integration is correct.

The token estimation logic correctly:

  • Checks if tokens are missing using not response_data.get("input_tokens") / not raw_data.get("input_tokens")
  • Estimates from appropriate sources (query for input, response for output)
  • Logs debug messages for traceability

The implementation properly handles both standard and streaming query paths.

Also applies to: 207-214

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

27-42: LGTM! TokenCounter class is clean and well-designed.

The TokenCounter class provides a clear, simple interface for accumulating token usage with appropriate initialization, mutation (add_tokens), and access (get_counts) methods.


158-159: Token count propagation is correct.

The token counts are properly retrieved from the TokenCounter and attached to the EvaluationResult as judge_llm_input_tokens and judge_llm_output_tokens, aligning with the expanded data model.

Also applies to: 186-187

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58aa1ec and dda710c.

📒 Files selected for processing (12)
  • src/lightspeed_evaluation/core/api/client.py (3 hunks)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/runner/evaluation.py
  • tests/unit/core/output/test_statistics.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/output/generator.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/runner/test_evaluation.py
🧠 Learnings (5)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
src/lightspeed_evaluation/core/api/client.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • estimate_tokens (12-16)
src/lightspeed_evaluation/core/output/statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
  • EvaluationResult (258-288)
🪛 GitHub Actions: Pyright
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py

[error] 200-201: "token_tracking_callback" is possibly unbound (reportPossiblyUnboundVariable)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
  • GitHub Check: Pylinter
🔇 Additional comments (10)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)

29-31: LGTM: Token tracking logic is sound.

The implementation correctly:

  • Initializes token counters
  • Accumulates streaming tokens from "inference" role events
  • Extracts final counts from the "end" event

Also applies to: 53-56, 65-70


79-86: LGTM: Fallback estimation provides good resilience.

The cascading fallback logic (streamed count → estimated from response) ensures token counts are populated whenever possible, with appropriate debug logging.

src/lightspeed_evaluation/core/constants.py (1)

35-36: LGTM: CSV column constants updated correctly.

The additions align with the token tracking feature and enable token usage data export.

tests/unit/runner/test_evaluation.py (1)

57-65: LGTM: Tests properly updated for token tracking.

All test cases consistently include the new token fields in mock statistics, ensuring test coverage for the token accounting feature.

Also applies to: 112-120, 210-218

src/lightspeed_evaluation/core/api/client.py (1)

167-180: LGTM: Token estimation fallbacks are well-implemented.

The code appropriately estimates token counts when the API doesn't provide them, with helpful debug logging. The logic is consistent across both standard and streaming query paths.

Also applies to: 207-214

src/lightspeed_evaluation/core/models/api.py (1)

93-94: LGTM: API model correctly extended with token fields.

The new input_tokens and output_tokens fields are properly defined with:

  • Appropriate constraints (non-negative, default 0)
  • Clear descriptions
  • Correct extraction and initialization in from_raw_response

Also applies to: 113-115, 122-123

src/lightspeed_evaluation/core/output/statistics.py (2)

48-50: LGTM: Token totals correctly calculated and included in statistics.

The implementation properly:

  • Initializes token totals for empty results
  • Aggregates judge LLM token usage from all results
  • Returns comprehensive token statistics

Also applies to: 58-61, 70-72


216-238: LGTM: API token usage calculation is well-implemented.

The function:

  • Follows Google-style docstring guidelines
  • Correctly aggregates API token usage from turn data
  • Returns a consistent structure matching the statistics pattern
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

27-42: LGTM: TokenCounter is a clean internal utility.

The simple counter class provides clear token accumulation functionality for evaluation tracking.


91-92: LGTM: Token counter properly initialized per evaluation.

The counter is correctly initialized at the start of each evaluation.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

12-16: Update docstring to Google-style format.

A previous review already flagged this: the coding guidelines require Google-style docstrings with Args and Returns sections for public APIs.

src/lightspeed_evaluation/core/api/client.py (1)

22-26: Improve estimate_tokens docstring and avoid duplication with streaming_parser.

estimate_tokens is a public helper here, but it still has a single-line docstring; coding guidelines call for a Google-style docstring with explicit Args / Returns. Also, the same helper exists in src/lightspeed_evaluation/core/api/streaming_parser.py (lines 11-15), which introduces duplication and potential drift.

Consider updating the docstring and, in a follow-up, centralizing this helper in a shared module (or otherwise reusing a single definition) so both client.py and streaming_parser.py stay in sync.

Example docstring change:

 def estimate_tokens(text: str) -> int:
-    """Estimate token count from text (~4 characters per token)."""
+    """Estimate token count from text (~4 characters per token).
+
+    Args:
+        text: Input text to estimate tokens for.
+
+    Returns:
+        Estimated token count (minimum 1).
+    """
     if not text:
         return 1
     return max(1, len(text) // 4)

As per coding guidelines.

🧹 Nitpick comments (3)
tests/unit/core/output/test_statistics.py (1)

289-321: Consider adding a test case with non-zero token values.

The current tests verify token fields exist but only with zero values. A test that creates EvaluationResult objects with judge_llm_input_tokens and judge_llm_output_tokens set would strengthen coverage of the aggregation logic.

Example test case:

def test_calculate_basic_stats_with_token_usage(self):
    """Test calculate_basic_stats aggregates token usage."""
    results = [
        EvaluationResult(
            conversation_group_id="conv1",
            turn_id="turn1",
            metric_identifier="test:metric",
            result="PASS",
            score=0.8,
            threshold=0.7,
            reason="Good",
            judge_llm_input_tokens=100,
            judge_llm_output_tokens=50,
        ),
        EvaluationResult(
            conversation_group_id="conv1",
            turn_id="turn2",
            metric_identifier="test:metric",
            result="PASS",
            score=0.9,
            threshold=0.7,
            reason="Excellent",
            judge_llm_input_tokens=150,
            judge_llm_output_tokens=75,
        ),
    ]

    stats = calculate_basic_stats(results)

    assert stats["total_judge_llm_input_tokens"] == 250
    assert stats["total_judge_llm_output_tokens"] == 125
    assert stats["total_judge_llm_tokens"] == 375
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

199-201: Add defensive check for callback list existence in exception handler.

If litellm.success_callback is modified externally (set to None) between registration and exception handling, line 200 could raise an AttributeError. Consider adding a defensive check:

             # Remove our callback on error (preserve others)
-            if token_tracking_callback in litellm.success_callback:
+            if (
+                hasattr(litellm, "success_callback")
+                and litellm.success_callback
+                and token_tracking_callback in litellm.success_callback
+            ):
                 litellm.success_callback.remove(token_tracking_callback)

This mirrors the defensive pattern used during registration (lines 146-150).


206-221: Consider propagating partial token counts in error results.

When an evaluation error occurs, the _create_error_result method doesn't include judge_llm_input_tokens or judge_llm_output_tokens, even though some tokens may have been consumed before the error. This could lead to under-counting of total token usage.

If tracking all token usage is important for cost accounting, consider passing token counts to this method:

     def _create_error_result(
-        self, request: EvaluationRequest, reason: str, execution_time: float
+        self, request: EvaluationRequest, reason: str, execution_time: float,
+        judge_llm_input_tokens: int = 0, judge_llm_output_tokens: int = 0
     ) -> EvaluationResult:
         """Create an ERROR result for failed evaluation."""
         return EvaluationResult(
             ...
             execution_time=execution_time,
+            judge_llm_input_tokens=judge_llm_input_tokens,
+            judge_llm_output_tokens=judge_llm_output_tokens,
         )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dda710c and 13ebeab.

📒 Files selected for processing (12)
  • src/lightspeed_evaluation/core/api/client.py (3 hunks)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/models/api.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_statistics.py
  • tests/unit/runner/test_evaluation.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/core/constants.py
🧠 Learnings (5)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
  • estimate_tokens (22-26)
src/lightspeed_evaluation/core/api/client.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • estimate_tokens (12-16)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: ruff
  • GitHub Check: mypy
  • GitHub Check: pydocstyle
🔇 Additional comments (15)
src/lightspeed_evaluation/core/constants.py (1)

35-36: LGTM!

The new CSV columns for token tracking are correctly added and align with the naming convention used in the evaluation result fields (judge_llm_input_tokens, judge_llm_output_tokens).

tests/unit/runner/test_evaluation.py (3)

57-65: LGTM!

The mock return values correctly include the new token usage fields. The token values are consistent (100 + 50 = 150 for total).


112-120: LGTM!

Consistent mock data with zero token values for the empty results case.


210-218: LGTM!

Mock values correctly reflect token usage proportional to the evaluation count (500 + 250 = 750).

tests/unit/core/output/test_statistics.py (2)

289-291: LGTM!

The expected values correctly reflect the default token counts when EvaluationResult objects don't have token data set.


319-321: LGTM!

Consistent with the mixed results test case - token fields default to 0.

src/lightspeed_evaluation/core/api/streaming_parser.py (3)

29-31: Token counting logic is well-structured with appropriate fallbacks.

The fallback chain for output_tokens (explicit end event → streamed count → estimation) is sensible. Note that input_tokens has no fallback and will remain 0 if not provided by the server—this is likely intentional since input tokens can't be reliably estimated without the original prompt context.

Also applies to: 65-70, 79-86


53-56: LGTM!

Counting token events with role == "inference" correctly tracks streamed output tokens.


92-93: LGTM!

The return dictionary correctly includes the new token fields.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)

27-43: LGTM!

The TokenCounter class is a clean, focused utility for accumulating token usage. Type hints are properly included.


91-106: LGTM!

The callback is correctly defined before the try block, addressing the previous unbound variable concern. The callback properly extracts token usage from LiteLLM's response.


145-151: LGTM!

The callback registration correctly uses append to preserve existing callbacks, addressing the previous global mutation concern.


158-165: LGTM!

Callback cleanup uses safe membership check, and token counts are correctly retrieved and passed to EvaluationResult.

Also applies to: 192-193

src/lightspeed_evaluation/core/api/client.py (2)

167-180: Token estimation fallback for _standard_query looks correct.

Using estimate_tokens to backfill input_tokens and output_tokens when the API omits or zeros them is consistent with previous behavior and adds useful debug logging. No issues from a correctness or logging standpoint.


208-215: Streaming input_tokens estimation is consistent and minimal.

The streaming path mirrors the standard path by estimating input_tokens when missing/zero, using the same helper and debug logging. This keeps token accounting consistent across endpoint types without altering existing error handling.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/unit/runner/test_evaluation.py (1)

57-65: Token fields added to mocked stats look consistent with core statistics API.

The extended mock_stats.return_value dictionaries match the new token aggregates (total_judge_llm_input_tokens, total_judge_llm_output_tokens, total_judge_llm_tokens) and remain backward compatible with existing assertions. If you want stronger coverage, you could also assert that these keys are present in result, but it's optional since detailed token aggregation is already tested in test_statistics.py.

Also applies to: 112-120, 210-218

tests/unit/core/api/test_client.py (1)

7-7: estimate_tokens tests accurately capture the heuristic and edge cases.

The new TestEstimateTokens suite aligns with the implementation (4 chars/token with a floor of 1) and exercises a good range of inputs, including unicode and newlines. test_estimate_tokens_none_like_empty is slightly redundant with the empty-string test, but that’s only a naming nit and doesn’t hurt clarity.

Also applies to: 503-557
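Concretely, the heuristic those tests pin down behaves like this (illustrative assertions assuming the 4-characters-per-token rule with a floor of 1; the import path mirrors where the PR defines the helper):

from lightspeed_evaluation.core.api.client import estimate_tokens

assert estimate_tokens("") == 1               # empty text floors at 1
assert estimate_tokens("abc") == 1            # 3 // 4 == 0, floored to 1
assert estimate_tokens("abcd") == 1           # exactly one 4-character chunk
assert estimate_tokens("a" * 40) == 10        # 40 // 4
assert estimate_tokens("héllo wörld\n") == 3  # len() counts unicode chars and newlines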

src/lightspeed_evaluation/core/models/api.py (1)

93-95: APIResponse token fields and mapping from raw_data are sound and backward compatible.

Adding input_tokens/output_tokens with non-negative integer defaults, and wiring them via raw_data.get(..., 0), preserves existing behavior while allowing real API token counts to flow through when present. Consider adding a dedicated test for APIResponse.from_raw_response including token keys, but the core change itself looks good.

Also applies to: 113-116, 117-124

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13ebeab and 4fe7dc3.

📒 Files selected for processing (15)
  • src/lightspeed_evaluation/core/api/client.py (3 hunks)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
  • tests/unit/core/api/test_client.py (2 hunks)
  • tests/unit/core/api/test_streaming_parser.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/api/client.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/output/generator.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/runner/test_evaluation.py
  • tests/unit/core/api/test_client.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/output/test_statistics.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧠 Learnings (7)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
🧬 Code graph analysis (6)
tests/unit/core/api/test_client.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • estimate_tokens (12-16)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
  • estimate_tokens (22-26)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
  • TokenCounter (27-42)
  • get_counts (40-42)
  • add_tokens (35-38)
tests/unit/core/api/test_streaming_parser.py (3)
src/lightspeed_evaluation/core/api/client.py (1)
  • estimate_tokens (22-26)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • estimate_tokens (12-16)
tests/unit/core/api/test_client.py (5)
  • TestEstimateTokens (503-557)
  • test_estimate_tokens_empty_string (506-508)
  • test_estimate_tokens_short_text (514-517)
  • test_estimate_tokens_exact_boundary (519-522)
  • test_estimate_tokens_minimum_one (535-539)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
  • TurnData (35-314)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_api_token_usage (216-238)
  • calculate_basic_stats (37-73)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: mypy
🔇 Additional comments (11)
tests/unit/core/api/test_streaming_parser.py (1)

10-10: Good coverage for streaming_parser’s estimate_tokens helper.

The added tests validate the same contract as the client-side estimator and cover key edge cases without over-constraining the implementation. Duplication with client tests is acceptable here since both modules expose estimate_tokens as part of their public surface.

Also applies to: 257-283

tests/unit/pipeline/evaluation/test_evaluator.py (1)

12-15: TokenCounter tests are comprehensive and aligned with the implementation.

The new tests cover initialization, accumulation semantics, zero-add behavior, return type, and instance independence, matching the current TokenCounter contract and providing solid safety for future changes.

Also applies to: 461-513

tests/unit/core/output/test_statistics.py (1)

6-12: Statistics tests correctly exercise new token aggregation behavior.

Extending expectations for calculate_basic_stats and adding focused suites for calculate_api_token_usage and token totals provides good coverage of both judge-LLM and API token accounting. The tests use real EvaluationData, TurnData, and EvaluationResult instances, so they validate model integration as well as numeric correctness.

Also applies to: 282-323, 591-769

src/lightspeed_evaluation/core/output/statistics.py (1)

9-10: Token aggregation in statistics is consistent and model-driven.

calculate_basic_stats now always exposes judge token totals, with safe defaults for empty inputs and simple sums over EvaluationResult.judge_llm_input_tokens/judge_llm_output_tokens. The new calculate_api_token_usage walks EvaluationData.turns and aggregates api_input_tokens and api_output_tokens into clear totals, with a Google-style docstring and proper type hints. This matches the tests and the new token fields on the models.

Also applies to: 37-73, 216-238
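In outline, the aggregation described here amounts to something like the following sketch; the parameter and return keys are illustrative rather than the exact implementation:

def calculate_api_token_usage(evaluation_data: list) -> dict[str, int]:
    """Aggregate API token usage across all conversations and turns (sketch)."""
    total_input = 0
    total_output = 0
    for conversation in evaluation_data:
        for turn in conversation.turns:
            total_input += turn.api_input_tokens
            total_output += turn.api_output_tokens
    return {
        "total_api_input_tokens": total_input,
        "total_api_output_tokens": total_output,
        "total_api_tokens": total_input + total_output,
    }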

src/lightspeed_evaluation/core/api/streaming_parser.py (1)

12-17: Streaming parser token tracking is well layered and preserves existing behavior.

The new estimate_tokens helper and token logic in parse_streaming_response correctly prioritize explicit server counts, then streamed token events, and finally a length-based estimate, while continuing to enforce the existing invariants around final response and conversation ID. The extended return dict adds input_tokens/output_tokens without breaking callers that only rely on response, tool_calls, and conversation_id.

Also applies to: 22-32, 53-57, 65-71, 79-87, 88-94

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)

27-43: LGTM - Simple and effective token accumulator.

The TokenCounter class is clean and straightforward for per-evaluation token tracking. The implementation correctly uses instance variables to avoid cross-contamination between evaluations.


94-106: Token tracking callback properly defined before try block.

The callback is now defined before the try block, correctly addressing the previous unbound variable concern. The defensive hasattr and getattr usage for accessing usage attributes is good practice.
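Reduced to its essentials, the pattern described here looks roughly like the sketch below. It assumes LiteLLM's custom-callback signature of (kwargs, completion_response, start_time, end_time) and uses a simple dict accumulator in place of the evaluator's TokenCounter:

import litellm

counts = {"input": 0, "output": 0}


def token_tracking_callback(kwargs, completion_response, start_time, end_time):
    """Accumulate judge LLM token usage from a successful LiteLLM call."""
    usage = getattr(completion_response, "usage", None)
    if usage is None:
        return
    counts["input"] += getattr(usage, "prompt_tokens", 0) or 0
    counts["output"] += getattr(usage, "completion_tokens", 0) or 0


# Register without clobbering callbacks registered elsewhere.
if not hasattr(litellm, "success_callback") or litellm.success_callback is None:
    litellm.success_callback = []
litellm.success_callback.append(token_tracking_callback)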


145-151: Callback registration correctly preserves existing callbacks.

The implementation properly checks for the existence of success_callback and initializes it as an empty list if needed before appending. This addresses the previous concern about overwriting existing callbacks.


181-194: Token counts correctly propagated to EvaluationResult.

The successful evaluation path properly retrieves token counts and includes them in the result. This integrates well with the EvaluationResult model's new fields.

src/lightspeed_evaluation/core/models/data.py (2)

73-79: Token tracking fields for API usage are well-defined.

The api_input_tokens and api_output_tokens fields follow the existing patterns in TurnData with appropriate defaults, constraints (ge=0), and clear descriptions. Good placement alongside other API-populated fields.


410-415: Judge LLM token tracking fields properly added.

The judge_llm_input_tokens and judge_llm_output_tokens fields maintain consistency with the existing field structure and provide clear semantic separation from API token usage. The ge=0 constraint and default of 0 are appropriate.
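For readers skimming the model changes, the field definitions described above reduce to something like this representative pydantic sketch (an illustrative subset, not the verbatim source of EvaluationResult):

from pydantic import BaseModel, Field


class EvaluationResultTokens(BaseModel):
    """Illustrative subset of EvaluationResult showing the new token fields."""

    judge_llm_input_tokens: int = Field(
        default=0, ge=0, description="Input tokens consumed by the judge LLM"
    )
    judge_llm_output_tokens: int = Field(
        default=0, ge=0, description="Output tokens produced by the judge LLM"
    )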

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

196-201: Optionally harden error-path callback cleanup against library defaults.

In the exception path you do:

if token_tracking_callback in litellm.success_callback:
    litellm.success_callback.remove(token_tracking_callback)

This assumes litellm.success_callback is always an iterable; if some LiteLLM versions initialise it as None (or omit the attribute) and an exception is raised before your registration code runs, this membership test can itself raise, masking the original evaluation error. If your supported LiteLLM versions guarantee success_callback is always a list, you can ignore this; otherwise, wrapping the check with a simple guard on attribute existence/non-None would make the cleanup more robust.
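A guarded variant of that cleanup, written under the assumption that success_callback may be None or missing on some LiteLLM versions:

import litellm


def remove_callback_safely(callback) -> None:
    """Remove a LiteLLM success callback if registered, tolerating a None or missing list."""
    callbacks = getattr(litellm, "success_callback", None)
    if callbacks and callback in callbacks:
        callbacks.remove(callback)

Calling this helper from both the success path and the exception path keeps the cleanup logic in one place.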

🧹 Nitpick comments (2)
tests/unit/runner/test_evaluation.py (1)

57-65: Consider asserting token fields in runner tests for stronger coverage.

Extending the mocked calculate_basic_stats return dicts with the new total_judge_llm_* keys is good, but the tests still only assert on TOTAL/PASS/FAIL/ERROR. Optional: add simple expectations on the token fields here to catch any future wiring regressions between the runner and statistics.calculate_basic_stats, even though the detailed behavior is already well covered in tests/unit/core/output/test_statistics.py.

Also applies to: 112-120, 210-218
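If those expectations are added, key-presence checks are enough to catch wiring regressions. The snippet below is a sketch: result is a stand-in for whatever object the existing assertions already inspect, populated here with the mocked values (100/50/150):

result = {  # stand-in for the object the existing test already inspects
    "total_judge_llm_input_tokens": 100,
    "total_judge_llm_output_tokens": 50,
    "total_judge_llm_tokens": 150,
}

for key in (
    "total_judge_llm_input_tokens",
    "total_judge_llm_output_tokens",
    "total_judge_llm_tokens",
):
    assert key in result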

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

27-43: Judge LLM token tracking integration looks good; consider also populating tokens on ERROR results.

TokenCounter plus the LiteLLM success_callback hook correctly accumulates prompt_tokens/completion_tokens and attaches them to successful EvaluationResult instances via judge_llm_input_tokens/judge_llm_output_tokens. Right now, when you fall back to _create_error_result (e.g., score is None after some LiteLLM calls), those error results always carry zero tokens, so calculate_basic_stats will under-report total judge-LLM tokens for evaluations that failed after consuming tokens. If you want “total tokens consumed by judge LLM” semantics, consider grabbing the counts from token_counter and setting them on error results as well (either by extending _create_error_result or by mutating the returned object before returning).

Also applies to: 91-107, 145-151, 158-161, 164-166, 192-193

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4fe7dc3 and fea3845.

📒 Files selected for processing (12)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (3 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/generator.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_statistics.py
  • tests/unit/runner/test_evaluation.py
🧠 Learnings (6)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • tests/unit/runner/test_evaluation.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
src/lightspeed_evaluation/core/output/statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
  • EvaluationResult (258-288)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
  • TurnData (35-314)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_api_token_usage (216-238)
  • calculate_basic_stats (37-73)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
🔇 Additional comments (9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

62-70: API token usage propagation looks correct and backwards compatible.

Reading input_tokens/output_tokens with getattr(..., 0) and storing them on TurnData aligns with the new fields and keeps older responses (without tokens) working while adding useful debug logging.
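A hedged sketch of that propagation pattern, using only the attribute names discussed in this review; the logger call is illustrative:

# Copy token counts from the API response onto the turn, defaulting to 0 when
# the response predates the new fields.
turn_data.api_input_tokens = getattr(api_response, "input_tokens", 0)
turn_data.api_output_tokens = getattr(api_response, "output_tokens", 0)
logger.debug(
    "API token usage: input=%d output=%d",
    turn_data.api_input_tokens,
    turn_data.api_output_tokens,
)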

tests/unit/core/output/test_statistics.py (3)

6-6: Stats expectations updated consistently with new token fields.

The additional imports and the extended expected dicts (including total_judge_llm_* fields) align with the new behavior of calculate_basic_stats and give a strict equality check that will catch shape regressions.

Also applies to: 9-9, 290-293, 320-323


591-694: API token usage tests exercise key scenarios well.

The new TestCalculateApiTokenUsage class covers empty input, single turn, multiple turns, multiple conversations, and explicit zero-token cases, which matches how calculate_api_token_usage aggregates api_input_tokens/api_output_tokens across EvaluationData.turns.


696-769: Judge LLM token tracking tests for calculate_basic_stats are solid.

These tests verify presence of the new token fields, correct summation across multiple EvaluationResult instances, default-zero behavior when tokens aren’t provided, and the empty-results case, giving good confidence in the extended stats.

src/lightspeed_evaluation/core/models/api.py (1)

93-95: APIResponse token fields and extraction are consistent and safe.

Adding input_tokens/output_tokens as non-negative ints with defaults, and wiring from_raw_response to read them (defaulting to 0 when absent), cleanly extends the model without breaking existing callers.

Also applies to: 113-124
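A hedged sketch of the shape described above (Pydantic-style fields plus extraction with defaults); the field names come from this review, everything else is illustrative:

from typing import Any

from pydantic import BaseModel, Field


class APIResponseSketch(BaseModel):
    """Illustrative subset of APIResponse limited to the token fields."""

    input_tokens: int = Field(default=0, ge=0, description="Input tokens reported by the API")
    output_tokens: int = Field(default=0, ge=0, description="Output tokens reported by the API")

    @classmethod
    def from_raw_response(cls, raw: dict[str, Any]) -> "APIResponseSketch":
        """Build the model from raw response data, defaulting missing tokens to 0."""
        return cls(
            input_tokens=raw.get("input_tokens", 0),
            output_tokens=raw.get("output_tokens", 0),
        )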

src/lightspeed_evaluation/core/output/statistics.py (2)

37-51: Extension of calculate_basic_stats with judge token totals is coherent.

Summing judge_llm_input_tokens/judge_llm_output_tokens and exposing them (plus their sum) in both empty and non-empty cases matches the new data model and the updated tests, without changing existing count/rate semantics.

Also applies to: 58-61, 70-73
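A hedged sketch of the aggregation pattern, using the result field and stats key names from this review:

# Sum judge LLM tokens across all evaluation results and expose the totals.
total_input = sum(r.judge_llm_input_tokens for r in results)
total_output = sum(r.judge_llm_output_tokens for r in results)
stats["total_judge_llm_input_tokens"] = total_input
stats["total_judge_llm_output_tokens"] = total_output
stats["total_judge_llm_tokens"] = total_input + total_output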


216-238: calculate_api_token_usage helper is clear and aligns with EvaluationData/TurnData.

Iterating over EvaluationData.turns and accumulating api_input_tokens/api_output_tokens into a simple dict (total_api_input_tokens, total_api_output_tokens, total_api_tokens) provides exactly what the higher-level reporting needs, and the Google-style docstring matches the documented guidelines for new public APIs. Based on learnings, this placement in core/output/statistics is appropriate for evaluation features.
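A hedged sketch of the helper's overall shape (not the verbatim implementation), assuming EvaluationData.turns and the TurnData token fields described above:

from typing import Any


def calculate_api_token_usage_sketch(evaluation_data: list[Any]) -> dict[str, int]:
    """Aggregate API token usage across all conversations and turns."""
    total_input = 0
    total_output = 0
    for conversation in evaluation_data:
        for turn in conversation.turns:
            total_input += turn.api_input_tokens
            total_output += turn.api_output_tokens
    return {
        "total_api_input_tokens": total_input,
        "total_api_output_tokens": total_output,
        "total_api_tokens": total_input + total_output,
    }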

src/lightspeed_evaluation/core/models/data.py (2)

73-80: TurnData API token fields are well specified.

Adding api_input_tokens/api_output_tokens as non-negative ints with default 0 and clear descriptions fits the per-turn model and supports aggregation in calculate_api_token_usage without breaking existing YAML/evaluation data.


410-415: EvaluationResult judge token fields align with downstream statistics.

The new judge_llm_input_tokens and judge_llm_output_tokens fields (int, ge=0, default 0) integrate cleanly with MetricsEvaluator and calculate_basic_stats, enabling end-to-end judge-LLM token tracking while remaining backward compatible.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fea3845 and 2bad7d2.

📒 Files selected for processing (14)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (3 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/llm/custom.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/constants.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_statistics.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/llm/custom.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/llm/__init__.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧠 Learnings (6)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (5)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
  • TurnData (35-314)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_api_token_usage (216-238)
src/lightspeed_evaluation/core/output/statistics.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
  • EvaluationResult (258-288)
src/lightspeed_evaluation/core/llm/__init__.py (1)
src/lightspeed_evaluation/core/llm/custom.py (2)
  • BaseCustomLLM (67-132)
  • TokenTracker (10-64)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
src/lightspeed_evaluation/core/llm/custom.py (5)
  • TokenTracker (10-64)
  • get_counts (57-59)
  • reset (61-64)
  • start (40-47)
  • stop (49-55)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/llm/custom.py (4)
  • TokenTracker (10-64)
  • start (40-47)
  • stop (49-55)
  • get_counts (57-59)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/core/llm/custom.py

[error] 34-34: D403: First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM')

🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/llm/custom.py

[warning] 29-29: Pylint: Unused argument 'kwargs' (unused-argument)


[warning] 31-31: Pylint: Unused argument 'start_time' (unused-argument)


[warning] 32-32: Pylint: Unused argument 'end_time' (unused-argument)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: mypy
🔇 Additional comments (16)
src/lightspeed_evaluation/core/llm/__init__.py (1)

9-9: LGTM! TokenTracker properly exposed via lazy imports.

The new TokenTracker export follows the existing lazy-import pattern and integrates cleanly with the public API surface.

Also applies to: 22-22
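If the lazy-import pattern follows PEP 562, a hedged sketch of the export could look like this; the actual __init__.py mechanism may differ:

# Module-level __getattr__ resolves TokenTracker only on first access.
def __getattr__(name: str):
    if name == "TokenTracker":
        from lightspeed_evaluation.core.llm.custom import TokenTracker

        return TokenTracker
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")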

src/lightspeed_evaluation/core/output/generator.py (1)

244-253: LGTM! Token usage reporting integrates cleanly.

The new token usage block follows the existing report structure and formatting conventions. The statistics are pulled from the overall stats dictionary and displayed with appropriate number formatting.

tests/unit/runner/test_evaluation.py (1)

57-65: LGTM! Test mocks updated to match new stats shape.

The mock return values have been properly extended to include the three new token-tracking fields required by the updated calculate_basic_stats function.

Also applies to: 112-120, 210-218

src/lightspeed_evaluation/core/models/data.py (1)

73-79: LGTM! Token fields properly integrated into data models.

The new token-tracking fields follow existing patterns with appropriate validation constraints (ge=0), sensible defaults (0), and clear descriptions. Both TurnData (API tokens) and EvaluationResult (judge LLM tokens) are extended consistently.

Also applies to: 410-415

tests/unit/core/output/test_statistics.py (3)

6-9: LGTM! Existing tests properly updated for new token fields.

The updated imports and test assertions correctly accommodate the new token-tracking fields with appropriate zero defaults.

Also applies to: 290-293, 320-323


591-694: LGTM! Comprehensive test coverage for API token usage calculations.

The new test class thoroughly exercises calculate_api_token_usage across all relevant scenarios: empty data, single turn, multiple turns, multiple conversations, and zero tokens. Tests are well-structured and assertions are clear.


696-769: LGTM! Thorough verification of token field integration in basic stats.

The new test class validates that calculate_basic_stats correctly includes, sums, and defaults token fields. Coverage includes populated values, zero defaults, and empty results.

tests/unit/pipeline/evaluation/test_evaluator.py (1)

5-5: LGTM! Thorough TokenTracker test coverage.

The new test class comprehensively validates TokenTracker's initialization, lifecycle (start/stop), reset behavior, idempotency, and instance independence. All tests are well-structured and follow pytest conventions.

Also applies to: 459-517

src/lightspeed_evaluation/core/models/api.py (1)

93-94: LGTM! Token fields properly integrated into API response model.

The new input_tokens and output_tokens fields are correctly defined with appropriate validation constraints and defaults. The from_raw_response method cleanly extracts these values with sensible fallbacks.

Also applies to: 113-116, 122-123

src/lightspeed_evaluation/core/output/statistics.py (3)

9-9: LGTM: Import updated to support API token calculation.

The addition of EvaluationData to the imports is necessary for the new calculate_api_token_usage function and properly extends the module's capability to handle both judge LLM and API token aggregation.


48-72: LGTM: Judge LLM token aggregation implemented correctly.

The token tracking additions are well-integrated:

  • Empty results properly initialize all token fields to 0
  • Token aggregation uses efficient generator expressions
  • Total tokens correctly computed as sum of input and output
  • Field naming is consistent and clear

216-238: LGTM: API token calculation function is well-implemented.

The function correctly:

  • Follows Google-style docstring format as required by coding guidelines
  • Uses proper type hints
  • Iterates through conversation data and turns to aggregate API token usage
  • Returns a consistently structured dictionary
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)

8-8: LGTM: TokenTracker import enables judge LLM token measurement.

The import of TokenTracker from the custom LLM module provides the necessary infrastructure for tracking token usage during metric evaluation.


72-73: LGTM: Token tracking integration is correct.

The token tracking lifecycle is properly implemented:

  • Per-evaluation TokenTracker instance prevents cross-evaluation contamination
  • start() called before evaluation to register callbacks
  • stop() called after evaluation to unregister callbacks
  • Token counts retrieved and used in result construction

Also applies to: 112-113, 120-121, 125-126


153-154: LGTM: Token counts correctly added to evaluation result.

The judge LLM token counts are properly passed to the EvaluationResult constructor, aligning with the updated model definition that includes judge_llm_input_tokens and judge_llm_output_tokens fields.


160-161: LGTM: Token tracking properly cleaned up in error path.

Calling token_tracker.stop() in the exception handler ensures that the litellm callback is unregistered even when evaluation fails, preventing callback leaks.

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 3 times, most recently from c8daa18 to e753fe8 Compare December 4, 2025 05:32
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/llm/custom.py (1)

34-34: Pipeline failure: D403 pydocstyle error requires attention.

The CI reports a D403 error: "First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM')". While "LiteLLM" is the correct brand name, pydocstyle enforces sentence-case rules.

Options to resolve:

  1. Reword to avoid starting with the brand name:
-        """LiteLLM callback to capture token usage."""
+        """Capture token usage from LiteLLM completion responses."""
  2. Or add a pydocstyle ignore directive if preserving brand casing is important (sketched below).
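A hedged example of the second option; the function name and signature are inferred from the pylint warnings above rather than copied from custom.py, and pydocstyle generally honors an inline noqa comment on the offending definition:

def log_success_event(kwargs, completion_response, start_time, end_time):  # noqa: D403
    """LiteLLM callback to capture token usage."""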
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

14-22: Return type annotation should reflect the new token fields.

The function returns a dict[str, Any] which accommodates the new fields, but consider documenting the expected return structure in the docstring for clarity. The current docstring "Parse streaming response and extract data." is minimal.

 def parse_streaming_response(  # pylint: disable=too-many-branches
     response: httpx.Response,
 ) -> dict[str, Any]:
-    """Parse streaming response and extract data."""
+    """Parse streaming response and extract data.
+
+    Returns:
+        Dictionary with keys: response, tool_calls, conversation_id,
+        input_tokens, output_tokens.
+    """
     conversation_id = ""

Also applies to: 66-72

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

124-138: Consider using try/finally for cleaner token tracker lifecycle management.

The current implementation correctly stops the token tracker in both success and exception paths, but a try/finally pattern would be more idiomatic and prevent potential issues if new code paths are added later.

             # Start token tracking
             token_tracker.start()
 
-            # Evaluate metric
-            score, reason = self.handlers[framework].evaluate(  # type: ignore
-                metric_name, request.conv_data, evaluation_scope
-            )
-
-            # Stop token tracking
-            token_tracker.stop()
+            try:
+                # Evaluate metric
+                score, reason = self.handlers[framework].evaluate(  # type: ignore
+                    metric_name, request.conv_data, evaluation_scope
+                )
+            finally:
+                # Stop token tracking
+                token_tracker.stop()
 
             execution_time = time.time() - start_time
 
             # Get token counts
             judge_input_tokens, judge_output_tokens = token_tracker.get_counts()

This would also allow removing the token_tracker.stop() call in the outer exception handler (line 173).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 394e2f3 and c8daa18.

📒 Files selected for processing (14)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (3 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/llm/custom.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/llm/__init__.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/llm/custom.py
  • src/lightspeed_evaluation/core/output/generator.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_statistics.py
🧠 Learnings (8)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.

Applied to files:

  • src/lightspeed_evaluation/runner/evaluation.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/runner/evaluation.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_api_token_usage (216-238)
  • calculate_basic_stats (37-73)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/llm/custom.py (4)
  • TokenTracker (10-64)
  • start (40-47)
  • stop (49-55)
  • get_counts (57-59)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/core/llm/custom.py

[error] 34-34: Docstyle error: D403 - First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM').

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: mypy
  • GitHub Check: Pylinter
🔇 Additional comments (19)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

52-57: Token extraction from end event looks correct.

Using the dictionary .get() pattern would be cleaner and safer than the current membership (`in`) checks, but the existing approach is functionally correct. The independent extraction from the end event doesn't interfere with final_response construction.
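A hedged sketch of the .get()-based alternative; end_event is a hypothetical name for the parsed end-event payload, whose exact shape is assumed for illustration:

# Pull token counts from the parsed end event, keeping the current values
# (initialised to 0 earlier in the parser) when the keys are absent.
input_tokens = end_event.get("input_tokens", input_tokens)
output_tokens = end_event.get("output_tokens", output_tokens)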

src/lightspeed_evaluation/core/models/api.py (2)

93-94: Well-structured token fields with proper validation.

The Field(default=0, ge=0) constraints ensure non-negative integers with sensible defaults. The descriptive field documentation is clear.


113-123: Clean extraction pattern for token counts.

Using .get() with default 0 aligns with the field defaults and handles missing keys gracefully. The implementation correctly propagates tokens from raw response data to the model.

src/lightspeed_evaluation/core/output/generator.py (1)

244-253: Token usage section integrates well with the existing report structure.

The formatting with thousands separators (:,) improves readability. The section placement after overall statistics is logical. The implementation correctly references the keys from calculate_basic_stats.

src/lightspeed_evaluation/runner/evaluation.py (2)

39-42: Import structure is clean and follows existing patterns.

The grouped import from lightspeed_evaluation.core.output.statistics maintains consistency with the codebase style.


98-117: Token usage summary provides good visibility into resource consumption.

The implementation correctly:

  • Displays judge LLM tokens unconditionally (always available from evaluation results)
  • Conditionally displays API tokens only when system_config.api.enabled is true
  • Computes combined total when both are available

Verify that the amender correctly populates api_input_tokens and api_output_tokens on TurnData objects during pipeline.run_evaluation, as the token usage calculation depends on these fields being populated.
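A hedged sketch of that summary logic; system_config.api.enabled and the stats keys are taken from this review, while the print formatting is illustrative:

# Judge LLM tokens are always available from evaluation results.
print(f"Judge LLM tokens: {basic_stats['total_judge_llm_tokens']:,}")

# API tokens are only meaningful when the API was actually called.
if system_config.api.enabled:
    print(f"API tokens: {api_tokens['total_api_tokens']:,}")
    combined = basic_stats["total_judge_llm_tokens"] + api_tokens["total_api_tokens"]
    print(f"Combined total tokens: {combined:,}")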

src/lightspeed_evaluation/core/llm/custom.py (3)

10-19: Well-documented class with clear usage pattern.

The docstring provides a helpful usage example. The lifecycle pattern (start → use → stop → get_counts) is intuitive.
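A hedged usage sketch of that lifecycle, with the import path taken from the files listed in this review:

from lightspeed_evaluation.core.llm import TokenTracker

tracker = TokenTracker()
tracker.start()                 # registers the litellm success callback
# ... judge LLM calls made through litellm are counted here ...
tracker.stop()                  # unregisters the callback
input_tokens, output_tokens = tracker.get_counts()
tracker.reset()                 # clear counts before reusing the tracker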


57-64: Utility methods are clean and straightforward.

get_counts() returns a tuple which is appropriate for the two-value return. reset() is useful for reusing the same tracker across multiple evaluation runs.


40-55: Callback registration is safe for a single tracker instance, but consider concurrent usage.

The _callback_registered flag prevents double-registration within a single instance. However, if multiple TokenTracker instances are used concurrently (e.g., parallel evaluations), they would all append to the same global litellm.success_callback list, potentially causing token counts to be accumulated across instances.

If concurrent usage is expected, consider using instance-specific identification or documenting that only one tracker should be active at a time.
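One hedged way to make the single-active-tracker assumption explicit would be a small context-manager wrapper (illustrative only, not part of the current code):

from contextlib import contextmanager
from typing import Iterator

from lightspeed_evaluation.core.llm import TokenTracker


@contextmanager
def track_judge_tokens() -> Iterator[TokenTracker]:
    """Scope a single TokenTracker around a block of judge LLM calls."""
    tracker = TokenTracker()
    tracker.start()
    try:
        yield tracker
    finally:
        tracker.stop()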

tests/unit/core/output/test_statistics.py (4)

6-12: LGTM! Imports are correctly updated for the new token tracking tests.

The imports now include EvaluationData and TurnData models alongside the new calculate_api_token_usage function, which are all necessary for the new test cases.


290-293: LGTM! Token field assertions correctly added to existing test expectations.

The test expectations are properly updated to include the new total_judge_llm_input_tokens, total_judge_llm_output_tokens, and total_judge_llm_tokens fields with zero values, correctly reflecting that results without explicit token counts default to zero.


591-693: LGTM! Comprehensive test coverage for calculate_api_token_usage.

The test class covers all key scenarios:

  • Empty data handling (returns zeros)
  • Single turn token aggregation
  • Multiple turns within a conversation
  • Multiple conversations aggregation
  • Zero token edge case

Test naming follows the test_* convention as required by coding guidelines.


696-769: LGTM! Well-structured tests for judge LLM token tracking in basic stats.

The test class properly validates:

  • Token fields are present in the output
  • Token values are correctly summed across results (300 input + 150 output = 450 total)
  • Results without explicit tokens default to zero
  • Empty results return zero tokens
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)

8-8: LGTM! TokenTracker import correctly added.

The import from lightspeed_evaluation.core.llm.custom aligns with the new token tracking infrastructure.


66-81: LGTM! Docstring updated to Google-style format.

The docstring now properly documents the method's purpose, args, and return values including the new token tracking behavior. This addresses the previous review comment.


169-176: Error handling correctly stops token tracking.

The exception handler properly ensures the token tracker is stopped, and the error result intentionally omits token counts since partial counts from failed evaluations could be misleading. The TokenTracker.stop() implementation is safe to call even if tracking wasn't started.

src/lightspeed_evaluation/core/output/statistics.py (3)

9-9: LGTM! Import updated to include EvaluationData.

The import correctly adds EvaluationData which is needed by the new calculate_api_token_usage function.


48-72: LGTM! Token tracking fields correctly added to basic stats.

The implementation properly:

  • Returns zero values for empty results
  • Sums judge_llm_input_tokens and judge_llm_output_tokens from each result
  • Calculates the total as the sum of input and output tokens

This is consistent with the EvaluationResult model which defaults these fields to 0.


216-238: Verify that TurnData has api_input_tokens and api_output_tokens fields.

The function implementation appears sound on code review:

  • Type hints are present: list[EvaluationData] -> dict[str, Any]
  • Docstring follows Google-style format
  • Logic correctly iterates through conversations and turns to aggregate token counts
  • Return structure is consistent

However, verification is needed to confirm:

  • TurnData model contains the api_input_tokens and api_output_tokens fields being accessed
  • Function follows the coding guideline requirement for structured logging (if applicable for this context)
  • No error handling needed for missing or invalid token values

@bsatapat-jpg bsatapat-jpg requested a review from asamal4 December 4, 2025 05:36
Contributor

@VladimirKadlec VladimirKadlec left a comment

LGTM, thank you 👍

Collaborator

@asamal4 asamal4 left a comment

Both JSON and CSV files should also contain the token information for both API and Judge LLM.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/output/generator.py (1)

199-206: Fix pylint “too many arguments” by passing a consolidated stats dict into JSON/TXT summaries

The new token_usage sections and per-result judge_llm_*_tokens fields in both JSON and text summaries are well-structured and align with the aggregated statistics. However, pylint is correctly flagging _generate_json_summary and _generate_text_summary for having too many parameters / positional arguments (R0913/R0917), which will keep CI red until addressed.

A straightforward way to resolve this without losing clarity is to pass the existing stats dict into these helpers instead of three separate dict arguments:

@@
-    def _generate_json_summary(
-        self,
-        results: list[EvaluationResult],
-        base_filename: str,
-        basic_stats: dict[str, Any],
-        detailed_stats: dict[str, Any],
-        api_tokens: dict[str, Any],
-    ) -> Path:
+    def _generate_json_summary(
+        self,
+        results: list[EvaluationResult],
+        base_filename: str,
+        stats: dict[str, Any],
+    ) -> Path:
@@
-        summary = {
+        basic_stats = stats["basic"]
+        detailed_stats = stats["detailed"]
+        api_tokens = stats.get("api_tokens", {})
+
+        summary: dict[str, Any] = {
             "timestamp": datetime.now().isoformat(),
             "total_evaluations": len(results),
             "summary_stats": {
-                "overall": basic_stats,
-                "by_metric": detailed_stats["by_metric"],
-                "by_conversation": detailed_stats["by_conversation"],
+                "overall": basic_stats,
+                "by_metric": detailed_stats["by_metric"],
+                "by_conversation": detailed_stats["by_conversation"],
             },
@@
-            "results": [
+            "results": [
                 {
@@
-                    "execution_time": round(r.execution_time, 3),
-                    "judge_llm_input_tokens": r.judge_llm_input_tokens,
-                    "judge_llm_output_tokens": r.judge_llm_output_tokens,
+                    "execution_time": round(r.execution_time, 3),
+                    "judge_llm_input_tokens": r.judge_llm_input_tokens,
+                    "judge_llm_output_tokens": r.judge_llm_output_tokens,
                 }
                 for r in results
             ],
         }
@@
-    def _generate_text_summary(
-        self,
-        results: list[EvaluationResult],
-        base_filename: str,
-        basic_stats: dict[str, Any],
-        detailed_stats: dict[str, Any],
-        api_tokens: dict[str, Any],
-    ) -> Path:
+    def _generate_text_summary(
+        self,
+        results: list[EvaluationResult],
+        base_filename: str,
+        stats: dict[str, Any],
+    ) -> Path:
@@
-        stats = {
-            "overall": basic_stats,
-            "by_metric": detailed_stats["by_metric"],
-            "by_conversation": detailed_stats["by_conversation"],
-            "api_tokens": api_tokens,
-        }
+        stats = {
+            "overall": stats["basic"],
+            "by_metric": stats["detailed"]["by_metric"],
+            "by_conversation": stats["detailed"]["by_conversation"],
+            "api_tokens": stats.get("api_tokens", {}),
+        }

And update the call sites in _generate_individual_reports accordingly:

-        if "json" in enabled_outputs:
-            json_file = self._generate_json_summary(
-                results,
-                base_filename,
-                stats["basic"],
-                stats["detailed"],
-                stats.get("api_tokens", {}),
-            )
+        if "json" in enabled_outputs:
+            json_file = self._generate_json_summary(
+                results,
+                base_filename,
+                stats,
+            )
@@
-        if "txt" in enabled_outputs:
-            txt_file = self._generate_text_summary(
-                results,
-                base_filename,
-                stats["basic"],
-                stats["detailed"],
-                stats.get("api_tokens", {}),
-            )
+        if "txt" in enabled_outputs:
+            txt_file = self._generate_text_summary(
+                results,
+                base_filename,
+                stats,
+            )

This reduces both helper methods to 4 parameters (including self), fixes the Pylint R0913/R0917 failures, and keeps all token-usage behavior unchanged.

Also applies to: 218-245, 253-260, 268-269, 291-313

🧹 Nitpick comments (3)
tests/unit/core/output/test_generator.py (1)

106-110: Generator tests correctly exercise API token plumbing; consider slightly stricter TXT assertion

Passing an explicit api_tokens dict into the JSON/TXT summary helpers and asserting on the JSON token_usage.api.total_tokens plus the TXT “Token Usage (API Calls)” section ensures the new API token path is exercised. If you want to make the TXT test a bit stronger, you could also assert that the rendered text includes the concrete totals (e.g. "300") to catch formatting regressions, but the current checks are acceptable.

Also applies to: 113-114, 132-136, 139-140, 147-148
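A hedged sketch of the stricter check; txt_content is a hypothetical name for however the test reads the generated report back:

assert "Token Usage (API Calls)" in txt_content
assert "300" in txt_content  # concrete total from the api_tokens fixture used above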

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

8-9: TokenTracker integration is correct; consider whether to include tokens on error results and document concurrency assumptions

The way TokenTracker is wired into evaluate_metric looks sound: you create a scoped tracker per evaluation, start it immediately before invoking the framework handler, stop it right after, and also stop it again in the except path, relying on the tracker’s internal _callback_registered flag to avoid double‑unregistering. Attaching judge_llm_input_tokens / judge_llm_output_tokens to the success path of EvaluationResult cleanly exposes judge‑side usage to downstream statistics.

Two non‑blocking points to consider:

  • Token counts on ERROR results: When score is None or an exception is raised, you return _create_error_result(...) which doesn’t accept token counts, so EvaluationResult falls back to the default 0 values for judge tokens. If judge LLM calls can happen before an error is signalled by the metric handler, your aggregate judge token stats will under‑report usage for those ERROR cases. If you want totals to reflect all judge LLM usage, you could thread judge_input_tokens / judge_output_tokens through _create_error_result and set them there.

  • Concurrency and global callbacks: TokenTracker appends a callback to litellm.success_callback and removes it in stop(). This works well for the current single‑evaluation flow, but if evaluations are ever run concurrently in the same process, callbacks from multiple trackers will all see all completions, and each tracker will accumulate tokens for every active evaluation. It may be worth at least documenting the “single active evaluation per process” assumption, or in the future moving to a more scoped mechanism (e.g., a context manager around LiteLLM calls) if parallel evaluation is introduced.

Overall, the integration meets the PR’s objectives; these are future‑proofing considerations rather than blockers.

Also applies to: 66-86, 124-139, 155-167, 169-176

src/lightspeed_evaluation/core/output/generator.py (1)

41-51: New evaluation_data parameter and API token aggregation look correct

The generate_reports signature and docstring cleanly extend the API to accept optional evaluation_data, and the stats["api_tokens"] population (either via calculate_api_token_usage or a zeroed default) ensures downstream consumers always see a consistent shape.

As a small clarity improvement, you could make the None check explicit so that an empty list takes the same code path as a non-empty one (still producing zeroed totals) and only a missing argument falls back to the default:

-        if evaluation_data:
+        if evaluation_data is not None:
             stats["api_tokens"] = calculate_api_token_usage(evaluation_data)

Also applies to: 68-76

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8daa18 and 59df3e4.

📒 Files selected for processing (15)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (3 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/llm/custom.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (10 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (3 hunks)
  • tests/unit/core/output/test_generator.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/llm/custom.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • src/lightspeed_evaluation/core/llm/__init__.py
  • tests/unit/runner/test_evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
🧠 Learnings (7)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/llm/custom.py (4)
  • TokenTracker (10-64)
  • start (40-47)
  • stop (49-55)
  • get_counts (57-59)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (440-493)
  • EvaluationResult (378-423)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (2)
  • _generate_json_summary (199-251)
  • _generate_text_summary (253-362)
src/lightspeed_evaluation/core/output/generator.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_api_token_usage (216-238)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (317-375)
  • EvaluationResult (378-423)
  • TurnData (35-314)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_api_token_usage (216-238)
  • calculate_basic_stats (37-73)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/output/generator.py

[error] 199-199: pylint: R0913: Too many arguments (6/5) (too-many-arguments)


[error] 199-199: pylint: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)


[error] 253-253: pylint: R0913: Too many arguments (6/5) (too-many-arguments)


[error] 253-253: pylint: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
🔇 Additional comments (8)
src/lightspeed_evaluation/core/constants.py (1)

24-37: CSV columns extended for judge LLM tokens are consistent with model changes

The new judge_llm_input_tokens and judge_llm_output_tokens entries align with the added fields on EvaluationResult and keep CSV exports in sync with the in‑memory model; no issues from a correctness or compatibility standpoint.

src/lightspeed_evaluation/core/models/data.py (1)

73-80: Token fields on TurnData and EvaluationResult are well-shaped and backward compatible

Adding api_input_tokens / api_output_tokens to TurnData and judge_llm_input_tokens / judge_llm_output_tokens to EvaluationResult with default=0 and ge=0 keeps existing payloads valid under extra="forbid" while enforcing sensible constraints for counts; this pairs cleanly with the statistics helpers and output generators.

Also applies to: 410-415

src/lightspeed_evaluation/core/models/api.py (1)

93-95: APIResponse token fields correctly bridge raw responses into the model

Populating input_tokens and output_tokens on APIResponse and wiring them through from_raw_response with default 0 preserves compatibility with older responses while exposing token data when provided by the backend; this is a clean extension of the public API.

Also applies to: 113-123

src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

62-70: API token usage propagation into TurnData looks correct and defensive

Copying input_tokens / output_tokens from api_response onto turn_data.api_input_tokens / api_output_tokens with a getattr(..., 0) fallback, plus a debug log, gives robust token tracking while remaining compatible with responses that lack these fields.

tests/unit/core/output/test_statistics.py (1)

6-12: New statistics tests give solid coverage for API and judge token aggregation

Extending basic-stats expectations with total_judge_llm_* keys and adding dedicated suites for calculate_api_token_usage and token-aware calculate_basic_stats exercises the main scenarios (empty, single/multi-turn, multi-conversation, zero tokens) and validates both presence and correct summation of the new fields; this is a good level of coverage for the token usage feature.

Also applies to: 280-324, 591-769

src/lightspeed_evaluation/core/api/streaming_parser.py (1)

14-16: Streaming parser cleanly surfaces end‑event token counts into the payload

Initializing input_tokens / output_tokens to 0, updating them only on the "end" event when those fields are present, and returning them alongside response / conversation_id gives a backward-compatible way to expose quota-based token counts from lightspeed-stack instead of estimating them locally. This matches the earlier guidance to rely on stack‑provided token data rather than heuristics. Based on learnings, this is the preferred direction.

Also applies to: 21-22, 52-57, 70-72
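
As a rough sketch of that end-event handling, assuming SSE-style JSON events whose field names ("event", "data", "input_tokens", "output_tokens") follow the review text rather than a confirmed wire format:

import json
from typing import Any, Iterable

def parse_streaming_response_sketch(lines: Iterable[str]) -> dict[str, Any]:
    """Accumulate streamed tokens and pick up usage counts from the end event."""
    response_text = ""
    conversation_id = ""
    input_tokens = 0   # stays 0 when the backend reports no usage
    output_tokens = 0
    for line in lines:
        event = json.loads(line)
        data = event.get("data", {})
        if event.get("event") == "token":
            response_text += data.get("token", "")
        elif event.get("event") == "end":
            conversation_id = data.get("conversation_id", conversation_id)
            # Only the end event carries token counts; keep the defaults otherwise.
            input_tokens = data.get("input_tokens", 0) or 0
            output_tokens = data.get("output_tokens", 0) or 0
    return {
        "response": response_text,
        "conversation_id": conversation_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }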

src/lightspeed_evaluation/core/output/generator.py (2)

15-20: Imports for EvaluationData and API token statistics are appropriate

Importing EvaluationData and calculate_api_token_usage here is the right place to wire evaluation-time token data into the reporting layer; no issues with these additions.

Based on learnings, this keeps new evaluation features within src/lightspeed_evaluation/ as intended.


101-131: Wiring api_tokens into JSON/TXT report generation is consistent

Passing stats["basic"], stats["detailed"], and stats.get("api_tokens", {}) into the JSON and TXT generators keeps all report types in sync with the new token accounting without altering existing behavior when evaluation_data is absent (defaults remain zero).

No functional issues spotted in this wiring.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/llm/custom.py (1)

10-65: Consider adding Google-style docstrings to public methods.

The TokenTracker class is well-implemented with clear lifecycle management. Consider enhancing the docstrings for public methods (start, stop, get_counts, reset) to follow Google-style format with Args and Returns sections for better API documentation clarity.

Based on coding guidelines, which require Google-style docstrings for all public APIs.
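
One possible shape for those docstrings, shown as a sketch only (the method bodies are placeholders, not the tracker's real implementation):

class TokenTrackerDocstringSketch:
    """Illustrates Google-style docstrings for the tracker's public methods."""

    def start(self) -> None:
        """Register the litellm success callback and begin accumulating tokens.

        Calling start again while the tracker is already active is a no-op.
        """
        return None  # registration elided in this sketch

    def get_counts(self) -> tuple[int, int]:
        """Return the accumulated judge LLM token counts.

        Returns:
            tuple[int, int]: Total (input_tokens, output_tokens) recorded since
            the tracker was started or last reset.
        """
        return (0, 0)  # placeholder values in this sketch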

src/lightspeed_evaluation/core/output/generator.py (2)

199-219: JSON summary properly includes token usage data.

The token usage is correctly merged into overall stats and included in per-result output. The calculation of total_tokens as the sum of Judge LLM and API tokens is appropriate.

Note: The method now has 6 parameters, exceeding pylint's default limit of 5. Consider refactoring to pass a single stats dictionary instead of separate basic_stats, detailed_stats, and api_tokens parameters to reduce argument count.

Also applies to: 238-239
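
A sketch of that refactor, assuming the callers already assemble a stats dict with "basic", "detailed", and "api_tokens" keys; the helper name below is hypothetical, and the summary keys mirror the tests discussed in this review:

from typing import Any

def generate_json_summary_from_stats(
    results: list[Any], stats: dict[str, Any]
) -> dict[str, Any]:
    """Build the JSON summary from a single stats mapping instead of three args."""
    basic_stats = stats.get("basic", {})
    detailed_stats = stats.get("detailed", {})
    api_tokens = stats.get("api_tokens", {})
    overall = {**basic_stats, **api_tokens}
    return {
        "summary_stats": {"overall": overall, "detailed": detailed_stats},
        "results_count": len(results),
    }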


250-310: Text summary properly formatted with token usage sections.

The token usage sections for Judge LLM and API Calls are well-formatted with clear labels and thousands separators. The output structure is user-friendly and informative.

Note: Similar to the JSON generation method, this method exceeds pylint's parameter limit (6 > 5). Consider the same refactoring suggestion to consolidate stats parameters into a single dictionary.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 59df3e4 and dc16383.

📒 Files selected for processing (15)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (3 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/llm/custom.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (9 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (3 hunks)
  • tests/unit/core/output/test_generator.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/output/test_generator.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/llm/custom.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/llm/__init__.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_statistics.py
🧠 Learnings (7)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.

Applied to files:

  • src/lightspeed_evaluation/runner/evaluation.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/runner/evaluation.py (2)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_api_token_usage (216-238)
  • calculate_basic_stats (37-73)
src/lightspeed_evaluation/core/output/generator.py (1)
  • generate_reports (41-88)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (317-375)
  • EvaluationResult (378-427)
  • TurnData (35-314)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_api_token_usage (216-238)
src/lightspeed_evaluation/core/llm/__init__.py (1)
src/lightspeed_evaluation/core/llm/custom.py (2)
  • BaseCustomLLM (67-132)
  • TokenTracker (10-64)
src/lightspeed_evaluation/core/output/statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (317-375)
  • EvaluationResult (378-427)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
  • EvaluationResult (258-288)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/output/generator.py

[error] 199-199: R0913: Too many arguments (6/5) (too-many-arguments)


[error] 250-250: R0913: Too many arguments (6/5) (too-many-arguments)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (11)
src/lightspeed_evaluation/core/models/api.py (1)

93-94: LGTM! Token tracking fields properly integrated.

The input_tokens and output_tokens fields are well-defined with appropriate validation (ge=0) and defaults. The extraction from raw_data safely handles missing values by defaulting to 0.

Also applies to: 113-116, 122-123

src/lightspeed_evaluation/core/llm/__init__.py (1)

9-9: LGTM! TokenTracker correctly exposed via lazy imports.

The TokenTracker is properly added to both TYPE_CHECKING imports and the lazy import mapping, following the same pattern as BaseCustomLLM.

Also applies to: 22-22

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

84-85: LGTM! Token tracking lifecycle properly managed.

The TokenTracker is correctly initialized per evaluation and started/stopped around the metric evaluation. Token counts are properly captured and passed to the result.

Also applies to: 124-138


175-205: Verify whether error results should include judge LLM token counts.

When an exception occurs during evaluation (line 175), the token tracker is stopped (line 179) but the judge LLM token counts are not retrieved or included in the error result. If the error occurred after some judge LLM calls, those tokens were consumed but not tracked in the result.

Consider capturing the token counts from the tracker before creating the error result, similar to the success path.

# Example: Capture tokens even on error
except Exception as e:
    execution_time = time.time() - start_time
    token_tracker.stop()
    judge_input_tokens, judge_output_tokens = token_tracker.get_counts()
    return self._create_error_result(
        request, f"Evaluation error: {e}", execution_time,
        judge_input_tokens, judge_output_tokens
    )

This would require updating the _create_error_result signature to accept the judge token counts.

src/lightspeed_evaluation/runner/evaluation.py (1)

39-42: LGTM! Token usage reporting well-integrated.

The token usage summary is properly integrated with conditional API token calculation when API is enabled. The display format is clear and informative, showing both Judge LLM and API token usage with proper formatting.

Also applies to: 82-82, 98-117

src/lightspeed_evaluation/core/output/statistics.py (2)

9-9: LGTM! Token tracking properly integrated into basic stats.

The token fields are correctly initialized to 0 for empty results and properly summed from EvaluationResult fields in the non-empty case. The implementation is clean and consistent.

Also applies to: 48-50, 58-61, 70-72
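
A minimal sketch of that summation; the attribute names follow EvaluationResult, and the stats keys mirror the totals referenced elsewhere in this review:

def judge_token_totals_sketch(results) -> dict[str, int]:
    """Sum judge LLM token counts across evaluation results."""
    total_input = sum(r.judge_llm_input_tokens for r in results)
    total_output = sum(r.judge_llm_output_tokens for r in results)
    return {
        "total_judge_llm_input_tokens": total_input,
        "total_judge_llm_output_tokens": total_output,
        "total_judge_llm_tokens": total_input + total_output,
    }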


216-238: LGTM! API token usage calculation well-implemented.

The calculate_api_token_usage function correctly aggregates token usage across all conversations and turns. The Google-style docstring is clear and complete, meeting the coding guidelines.

Based on coding guidelines, which require Google-style docstrings for all public APIs.
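
A hedged sketch of that aggregation, assuming each EvaluationData exposes its turns as a turns list (the real implementation lives in core/output/statistics.py):

def calculate_api_token_usage_sketch(evaluation_data) -> dict[str, int]:
    """Walk every conversation's turns and sum the per-turn API token counts."""
    total_input = 0
    total_output = 0
    for conversation in evaluation_data or []:
        for turn in conversation.turns:
            total_input += turn.api_input_tokens
            total_output += turn.api_output_tokens
    return {
        "total_api_input_tokens": total_input,
        "total_api_output_tokens": total_output,
        "total_api_tokens": total_input + total_output,
    }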

tests/unit/core/output/test_statistics.py (3)

6-9: LGTM! Existing tests properly updated for token fields.

The imports are correct and existing test expectations are updated to include the new token fields with appropriate default values.

Also applies to: 290-292, 320-322


591-693: LGTM! Comprehensive test coverage for API token usage calculation.

The test suite thoroughly validates calculate_api_token_usage across multiple scenarios including empty data, single/multiple turns, multiple conversations, and zero token values. The tests are well-structured and use proper model construction.


696-763: LGTM! Token tracking in basic stats properly tested.

The tests validate both the presence of token fields and the correct summation logic, including edge cases like defaults and empty results. The test coverage is thorough and assertions are clear.

src/lightspeed_evaluation/core/output/generator.py (1)

15-17: LGTM! API token usage properly integrated into report generation.

The integration of EvaluationData and calculate_api_token_usage is clean and handles both scenarios (with and without evaluation data) appropriately. The conditional logic ensures API tokens default to zero when evaluation data is not provided.

Also applies to: 41-76

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

184-205: Include judge LLM token counts in error results for consistency.

The error result includes api_input_tokens and api_output_tokens but omits judge_llm_input_tokens and judge_llm_output_tokens. If an evaluation partially completes before failing, the token tracker may have accumulated counts that should be reported for accurate usage tracking and billing.

     def _create_error_result(
-        self, request: EvaluationRequest, reason: str, execution_time: float
+        self, request: EvaluationRequest, reason: str, execution_time: float,
+        judge_input_tokens: int = 0, judge_output_tokens: int = 0
     ) -> EvaluationResult:
         """Create an ERROR result for failed evaluation."""
         return EvaluationResult(
             conversation_group_id=request.conv_data.conversation_group_id,
             turn_id=request.turn_id,
             metric_identifier=request.metric_identifier,
             result="ERROR",
             score=None,
             threshold=None,
             reason=reason,
             query=request.turn_data.query if request.turn_data else "",
             response=request.turn_data.response or "" if request.turn_data else "",
             execution_time=execution_time,
+            judge_llm_input_tokens=judge_input_tokens,
+            judge_llm_output_tokens=judge_output_tokens,
             api_input_tokens=(
                 request.turn_data.api_input_tokens if request.turn_data else 0
             ),
             api_output_tokens=(
                 request.turn_data.api_output_tokens if request.turn_data else 0
             ),
         )

Then update the exception handler call:

         except Exception as e:
             execution_time = time.time() - start_time
             token_tracker.stop()
+            judge_input, judge_output = token_tracker.get_counts()
             return self._create_error_result(
-                request, f"Evaluation error: {e}", execution_time
+                request, f"Evaluation error: {e}", execution_time,
+                judge_input, judge_output
             )
🧹 Nitpick comments (6)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

62-70: Consider direct attribute access instead of getattr.

Since APIResponse is a well-defined model with input_tokens and output_tokens fields, using direct attribute access with or 0 would be more idiomatic:

-            turn_data.api_input_tokens = getattr(api_response, "input_tokens", 0)
-            turn_data.api_output_tokens = getattr(api_response, "output_tokens", 0)
+            turn_data.api_input_tokens = api_response.input_tokens or 0
+            turn_data.api_output_tokens = api_response.output_tokens or 0

Using getattr with a fallback suggests uncertainty about attribute existence, which isn't the case for typed models. If the fields are Optional[int], the or 0 pattern handles None values cleanly.

tests/unit/pipeline/evaluation/test_evaluator.py (1)

484-506: Add cleanup fixture to prevent test pollution of litellm.success_callback.

The start() method modifies the global litellm.success_callback list. If a test fails before stop() is called, subsequent tests may be affected by leftover callbacks. Consider adding a cleanup fixture:

@pytest.fixture(autouse=True)
def cleanup_litellm_callbacks(self):
    """Ensure litellm callbacks are cleaned up after each test."""
    import litellm
    # Copy the list: restoring the original reference would keep any callbacks
    # a failing test appended to that same list object.
    original = list(getattr(litellm, "success_callback", []) or [])
    yield
    litellm.success_callback = original

Alternatively, ensure each test that calls start() uses a try/finally pattern or properly stops the tracker in all code paths.

src/lightspeed_evaluation/runner/evaluation.py (1)

81-82: CLI token usage summary wiring is consistent with statistics and generator

Using calculate_basic_stats(results) for the CLI summary and then printing judge LLM token totals from summary aligns with the extended stats API. The conditional calculate_api_token_usage(evaluation_data) when system_config.api.enabled is true correctly mirrors how OutputHandler.generate_reports computes api_tokens, and the combined total_tokens matches JSON/text reporting semantics.

If you want to avoid double-walking evaluation_data, you could optionally reuse the api_tokens dict already computed inside OutputHandler.generate_reports (e.g., by having it return or expose the aggregate), but that’s a micro-optimization and not required for correctness.

-        # Generate reports based on configuration (pass evaluation_data for API token stats)
-        output_handler.generate_reports(results, evaluation_data)
+        # Generate reports based on configuration (pass evaluation_data for API token stats)
+        # Optionally, OutputHandler.generate_reports could return api_tokens to avoid
+        # recalculating here, if this ever becomes a hot path.
+        output_handler.generate_reports(results, evaluation_data)

Also applies to: 88-117

tests/unit/core/output/test_generator.py (2)

106-110: JSON summary tests correctly cover API token aggregation

The api_tokens fixture in this test and the assertion against data["summary_stats"]["overall"]["total_api_tokens"] validate that _generate_json_summary is merging API token stats into the overall summary as intended. This is a good, focused check on the new behavior.

If you later want stronger coverage, you could also assert on total_tokens (judge + API) in summary_stats["overall"], but the current assertions are sufficient for this change.

Also applies to: 112-114, 124-126


132-136: Text summary test exercises API token section end-to-end

Passing api_tokens into _generate_text_summary and asserting that "Token Usage (API Calls)" appears in the rendered text confirms the new API token section is wired through correctly.

As an optional enhancement, you might also assert that the expected aggregate value (e.g., "Total Tokens: 300") shows up, but the presence check already guards against regressions in this area.

Also applies to: 139-140, 146-147

src/lightspeed_evaluation/core/output/generator.py (1)

15-20: Report generation now cleanly incorporates EvaluationData‑based API token stats

Importing EvaluationData and extending generate_reports to accept evaluation_data: Optional[list[EvaluationData]] ties the output layer directly to the conversation data model. Computing stats["api_tokens"] via calculate_api_token_usage(evaluation_data) when evaluation_data is provided, and defaulting to a zeroed dict otherwise, matches the behavior expected by generator tests and keeps JSON/TXT generation simple.

If you ever need to distinguish “no evaluation data” from “evaluation data with zero tokens”, consider switching the condition to if evaluation_data is not None: so that an explicitly empty list still flows through calculate_api_token_usage, but the current implementation is functionally equivalent for token totals.

-        # Calculate API token usage if evaluation data is provided
-        if evaluation_data:
+        # Calculate API token usage if evaluation data is provided
+        if evaluation_data is not None:
             stats["api_tokens"] = calculate_api_token_usage(evaluation_data)
         else:

Also applies to: 41-51, 68-76

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc16383 and 92f5a35.

📒 Files selected for processing (15)
  • src/lightspeed_evaluation/core/api/streaming_parser.py (3 hunks)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/llm/custom.py (1 hunks)
  • src/lightspeed_evaluation/core/models/api.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (9 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (3 hunks)
  • tests/unit/core/output/test_generator.py (2 hunks)
  • tests/unit/core/output/test_statistics.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
  • tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/models/api.py
  • tests/unit/runner/test_evaluation.py
  • src/lightspeed_evaluation/core/output/statistics.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_statistics.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/output/test_generator.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/llm/__init__.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/llm/custom.py
  • src/lightspeed_evaluation/core/output/generator.py
🧠 Learnings (6)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/runner/evaluation.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.

Applied to files:

  • src/lightspeed_evaluation/runner/evaluation.py
🧬 Code graph analysis (6)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/llm/custom.py (4)
  • TokenTracker (10-64)
  • start (40-47)
  • stop (49-55)
  • get_counts (57-59)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationRequest (444-497)
  • EvaluationResult (378-427)
src/lightspeed_evaluation/runner/evaluation.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_api_token_usage (216-238)
  • calculate_basic_stats (37-73)
src/lightspeed_evaluation/core/llm/__init__.py (1)
src/lightspeed_evaluation/core/llm/custom.py (2)
  • BaseCustomLLM (67-132)
  • TokenTracker (10-64)
src/lightspeed_evaluation/core/output/generator.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (317-375)
  • EvaluationResult (378-427)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_api_token_usage (216-238)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
src/lightspeed_evaluation/core/llm/custom.py (5)
  • TokenTracker (10-64)
  • get_counts (57-59)
  • reset (61-64)
  • start (40-47)
  • stop (49-55)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (4)
  • _generate_json_summary (199-248)
  • OutputHandler (24-363)
  • _calculate_stats (90-99)
  • _generate_text_summary (250-359)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: mypy
🔇 Additional comments (12)
tests/unit/pipeline/evaluation/test_evaluator.py (1)

459-517: Good coverage of TokenTracker lifecycle.

The tests cover initialization, counts retrieval, reset behavior, idempotent start/stop operations, and instance independence. This provides solid baseline coverage for the TokenTracker class.

src/lightspeed_evaluation/core/llm/__init__.py (1)

9-9: LGTM!

The TokenTracker export follows the established lazy import pattern consistently with BaseCustomLLM. The TYPE_CHECKING import and lazy import mapping are properly synchronized.

Also applies to: 22-22

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

66-81: LGTM!

The evaluate_metric docstring follows Google-style format with proper Args and Returns sections, addressing the previous review feedback.


124-138: Token tracking lifecycle is correctly implemented.

The TokenTracker is properly started before evaluation, stopped after, and counts are retrieved. The exception handler also ensures the tracker is stopped on error paths.
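
A rough sketch of that lifecycle (start, evaluate, stop, read counts) using the TokenTracker API added in this PR; run_metric and the returned dict are simplified placeholders rather than the evaluator's real structure:

import time

from lightspeed_evaluation.core.llm import TokenTracker

def evaluate_with_token_tracking_sketch(run_metric, request) -> dict:
    """Wrap a metric call with judge LLM token tracking."""
    tracker = TokenTracker()
    start_time = time.time()
    tracker.start()
    try:
        score = run_metric(request)
    finally:
        # Always unregister the callback, even if the metric raises.
        tracker.stop()
    judge_input, judge_output = tracker.get_counts()
    return {
        "score": score,
        "execution_time": time.time() - start_time,
        "judge_llm_input_tokens": judge_input,
        "judge_llm_output_tokens": judge_output,
    }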

src/lightspeed_evaluation/core/llm/custom.py (1)

10-64: Well-structured TokenTracker implementation.

The class provides a clean API with proper lifecycle management (start/stop), idempotent operations, and defensive attribute access in the callback. The docstring provides a clear usage example.
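
For orientation, a re-sketch of that pattern rather than the repository's exact code: register a function on litellm.success_callback, accumulate usage from each completion, and unregister on stop. The callback signature follows litellm's custom-callback convention, and access to the usage object stays defensive because providers vary in what they return:

import litellm

class TokenTrackerSketch:
    """Accumulates judge LLM token usage via a litellm success callback."""

    def __init__(self) -> None:
        self.input_tokens = 0
        self.output_tokens = 0
        self._active = False

    def _callback(self, kwargs, completion_response, start_time, end_time) -> None:
        usage = getattr(completion_response, "usage", None)
        if usage is not None:
            self.input_tokens += getattr(usage, "prompt_tokens", 0) or 0
            self.output_tokens += getattr(usage, "completion_tokens", 0) or 0

    def start(self) -> None:
        if not self._active:
            litellm.success_callback.append(self._callback)
            self._active = True

    def stop(self) -> None:
        if self._callback in litellm.success_callback:
            litellm.success_callback.remove(self._callback)
        self._active = False

    def get_counts(self) -> tuple[int, int]:
        return self.input_tokens, self.output_tokens

    def reset(self) -> None:
        self.input_tokens = 0
        self.output_tokens = 0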

src/lightspeed_evaluation/runner/evaluation.py (1)

39-42: Token statistics imports are correctly scoped to the “heavy modules” section

Importing calculate_basic_stats and calculate_api_token_usage alongside OutputHandler keeps all reporting/statistics dependencies loaded only after config and logging are initialized, which matches the existing pattern in this runner. No issues here.

tests/unit/core/output/test_statistics.py (3)

6-12: Basic stats tests correctly updated for new judge token fields

Importing EvaluationData, EvaluationResult, and TurnData plus extending the expected dicts in the mixed/single-result tests to include total_judge_llm_input_tokens, total_judge_llm_output_tokens, and total_judge_llm_tokens keeps the tests aligned with the expanded calculate_basic_stats contract. The zero expectations for cases without explicit tokens are consistent with the model defaults and the implementation.

Also applies to: 282-295, 312-324


591-694: API token usage aggregation tests provide thorough coverage

The TestCalculateApiTokenUsage suite exercises calculate_api_token_usage across empty input, single/multiple turns within a conversation, multiple conversations, and explicit zero-token cases. This is a solid set of scenarios and closely matches how EvaluationData and TurnData will be used in production.


696-769: Judge LLM token tracking behavior is well-specified via new basic-stats tests

The TestCalculateBasicStatsWithTokens class clearly verifies:

  • presence of the three judge token fields in the stats dict,
  • correct summation across multiple EvaluationResult instances, and
  • zero defaults both for results without token fields set and for empty result lists.

This tightly couples the tests to the intended calculate_basic_stats semantics and should catch any future regressions in token handling.
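
For flavor, a sketch of the kind of assertion these tests make; make_result is a hypothetical factory fixture for EvaluationResult, and the stats keys match those listed above:

from lightspeed_evaluation.core.output.statistics import calculate_basic_stats

def test_judge_token_totals_sum_across_results(make_result):
    results = [
        make_result(judge_llm_input_tokens=100, judge_llm_output_tokens=40),
        make_result(judge_llm_input_tokens=50, judge_llm_output_tokens=10),
    ]
    stats = calculate_basic_stats(results)
    assert stats["total_judge_llm_input_tokens"] == 150
    assert stats["total_judge_llm_output_tokens"] == 50
    assert stats["total_judge_llm_tokens"] == 200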

src/lightspeed_evaluation/core/models/data.py (1)

73-79: Token fields on TurnData and EvaluationResult are well-modeled and backward‑compatible

Adding api_input_tokens / api_output_tokens to TurnData and both judge and API token fields to EvaluationResult with default=0 and ge=0 constraints cleanly encodes token usage without breaking existing data files (missing fields deserialize as zero). The Field descriptions make the intent explicit, and the types line up with how calculate_basic_stats and calculate_api_token_usage consume these attributes.

Also applies to: 410-419

src/lightspeed_evaluation/core/output/generator.py (2)

199-219: JSON summary correctly merges judge and API token stats and exposes per‑result judge tokens

_generate_json_summary now:

  • accepts an explicit api_tokens dict,
  • builds overall_stats by combining basic_stats with total_api_input_tokens, total_api_output_tokens, total_api_tokens, and a total_tokens field that sums judge and API totals, and
  • includes judge_llm_input_tokens and judge_llm_output_tokens per result.

This structure matches the new tests and gives downstream consumers a single place (summary_stats["overall"]) to read aggregate token usage.

Also applies to: 225-240
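
Illustratively, the merge described above can be thought of as the following sketch (key names follow this review; the generator's actual JSON layout may differ):

def build_overall_stats_sketch(basic_stats: dict, api_tokens: dict) -> dict:
    """Combine basic stats with API token totals and an overall total_tokens."""
    overall = dict(basic_stats)
    overall["total_api_input_tokens"] = api_tokens.get("total_api_input_tokens", 0)
    overall["total_api_output_tokens"] = api_tokens.get("total_api_output_tokens", 0)
    overall["total_api_tokens"] = api_tokens.get("total_api_tokens", 0)
    overall["total_tokens"] = (
        overall.get("total_judge_llm_tokens", 0) + overall["total_api_tokens"]
    )
    return overall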


250-267: Text summary adds clear, separate sections for judge and API token usage

The updated _generate_text_summary keeps basic_stats as the “overall” block and introduces:

  • a “Token Usage (Judge LLM)” section driven by total_judge_llm_* fields, and
  • a “Token Usage (API Calls)” section reading from api_tokens with sensible .get(..., 0) defaults.

This mirrors the JSON structure at a human-readable level and safely handles cases where API usage is disabled or evaluation data is not supplied.

Also applies to: 288-310
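
A sketch of those two sections, with labels taken from this review and thousands separators as noted; the exact wording and spacing in the real report may differ:

def format_token_sections_sketch(basic_stats: dict, api_tokens: dict) -> str:
    """Render the Judge LLM and API Calls token sections for the text summary."""
    lines = [
        "Token Usage (Judge LLM)",
        f"  Input Tokens:  {basic_stats.get('total_judge_llm_input_tokens', 0):,}",
        f"  Output Tokens: {basic_stats.get('total_judge_llm_output_tokens', 0):,}",
        f"  Total Tokens:  {basic_stats.get('total_judge_llm_tokens', 0):,}",
        "",
        "Token Usage (API Calls)",
        f"  Input Tokens:  {api_tokens.get('total_api_input_tokens', 0):,}",
        f"  Output Tokens: {api_tokens.get('total_api_output_tokens', 0):,}",
        f"  Total Tokens:  {api_tokens.get('total_api_tokens', 0):,}",
    ]
    return "\n".join(lines)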

Collaborator

@asamal4 asamal4 left a comment

LGTM

@asamal4 asamal4 merged commit 659d90c into lightspeed-core:main Dec 8, 2025
15 checks passed
@bsatapat-jpg bsatapat-jpg deleted the dev branch December 8, 2025 05:37