[LEADS-40]: Get statistics about the token usage for lightspeed-evaluation#112
asamal4 merged 1 commit into lightspeed-core:main from
Conversation
Walkthrough

Streaming parser, models, pipeline, output, runner, and tests are updated to capture and propagate API token counts; a new TokenTracker captures judge-LLM token usage via litellm callbacks, and those metrics are aggregated and included in reports and CSV constants. A minimal sketch of this callback pattern follows the sequence diagram below.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Client
participant StreamingParser
participant Amender
participant Evaluator
participant TokenTracker
participant Statistics
participant Runner
Client->>StreamingParser: stream events (chunks, end event with token counts)
StreamingParser-->>Amender: APIResponse (includes api input/output tokens)
Amender-->>Evaluator: turn_data (with api_input_tokens/api_output_tokens)
Evaluator->>TokenTracker: start()
Evaluator->>Evaluator: run judge-LLM evaluation (invokes litellm)
TokenTracker--)TokenTracker: callback accumulates judge input/output tokens
Evaluator->>TokenTracker: stop()
TokenTracker-->>Evaluator: get_counts() → judge token totals
Evaluator-->>Statistics: emit EvaluationResult (includes judge + API tokens)
Statistics-->>Runner: aggregated stats (including token totals)
Runner->>Runner: print Token Usage Summary (judge, API, combined)
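For readers skimming the walkthrough, here is a minimal sketch of the judge-LLM accounting pattern described above. It assumes litellm's custom success-callback signature `(kwargs, completion_response, start_time, end_time)` and a `usage` object exposing `prompt_tokens`/`completion_tokens`; names such as `make_token_callback` are illustrative and not the PR's exact TokenTracker code.

```python
import litellm


class TokenCounter:
    """Accumulates judge-LLM token counts across litellm calls."""

    def __init__(self) -> None:
        self.input_tokens = 0
        self.output_tokens = 0

    def add_tokens(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def get_counts(self) -> tuple[int, int]:
        return self.input_tokens, self.output_tokens


def make_token_callback(counter: TokenCounter):
    """Build a litellm success callback that records usage into `counter`."""

    def _callback(kwargs, completion_response, start_time, end_time):
        usage = getattr(completion_response, "usage", None)
        if usage is not None:
            counter.add_tokens(
                getattr(usage, "prompt_tokens", 0),
                getattr(usage, "completion_tokens", 0),
            )

    return _callback


counter = TokenCounter()
callback = make_token_callback(counter)
litellm.success_callback.append(callback)  # register before the judge-LLM calls
try:
    ...  # run the judge-LLM evaluation here
finally:
    # Remove only our callback so other registered callbacks are untouched.
    if callback in litellm.success_callback:
        litellm.success_callback.remove(callback)
judge_input, judge_output = counter.get_counts()
```

The append/remove cleanup shown here is also the pattern several review comments below recommend over reassigning the whole callback list.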
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/core/output/statistics.py (1)
1-230: Fix trailing whitespace on lines 57, 220, 225. The pipeline has flagged trailing whitespace on multiple lines. Additionally, Black formatting needs to be applied.
Run Black to fix:
#!/bin/bash
black src/lightspeed_evaluation/core/output/statistics.py

src/lightspeed_evaluation/core/api/streaming_parser.py (1)
1-134: Fix formatting issues flagged by pipeline. The pipeline has identified trailing whitespace on line 72 and requires Black formatting.
Run Black to fix:
#!/bin/bash
black src/lightspeed_evaluation/core/api/streaming_parser.py
🧹 Nitpick comments (3)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
174-175: Update error path cleanup to match normal path. If you switch to the append/remove pattern for callbacks, apply the same cleanup logic here.

  # Clear callback on error
- litellm.success_callback = []
+ if token_tracking_callback in litellm.success_callback:
+     litellm.success_callback.remove(token_tracking_callback)
180-195: Consider adding token fields to error results for consistency. The _create_error_result method doesn't include judge_llm_input_tokens and judge_llm_output_tokens fields. While token counts on error are typically zero, including them ensures the EvaluationResult objects are structurally consistent.

  def _create_error_result(
      self, request: EvaluationRequest, reason: str, execution_time: float
  ) -> EvaluationResult:
      """Create an ERROR result for failed evaluation."""
      return EvaluationResult(
          conversation_group_id=request.conv_data.conversation_group_id,
          turn_id=request.turn_id,
          metric_identifier=request.metric_identifier,
          result="ERROR",
          score=None,
          threshold=None,
          reason=reason,
          query=request.turn_data.query if request.turn_data else "",
          response=request.turn_data.response or "" if request.turn_data else "",
          execution_time=execution_time,
+         judge_llm_input_tokens=0,
+         judge_llm_output_tokens=0,
      )

src/lightspeed_evaluation/core/api/streaming_parser.py (1)
14-85: Address complexity warning - too many branches. The pipeline flagged this function with 17 branches exceeding the limit of 12. Consider refactoring to extract event handling into separate methods.
Refactor event handling into dedicated methods:
def parse_streaming_response(response: httpx.Response) -> dict[str, Any]:
    """Parse streaming response and extract data."""
    conversation_id = ""
    final_response = ""
    tool_calls: list[dict[str, Any]] = []
    token_count = 0
    input_tokens = 0
    output_tokens = 0

    for line in response.iter_lines():
        line = line.strip()
        if not line or not line.startswith(DATA_PREFIX):
            continue

        json_data = line.replace(DATA_PREFIX, "")
        parsed_data = _parse_sse_line(json_data)
        if not parsed_data:
            continue

        event, event_data = parsed_data

        # Delegate to event handlers
        result = _handle_event(
            event, event_data, conversation_id, final_response,
            tool_calls, token_count, input_tokens, output_tokens
        )
        conversation_id = result["conversation_id"]
        final_response = result["final_response"]
        tool_calls = result["tool_calls"]
        token_count = result["token_count"]
        input_tokens = result["input_tokens"]
        output_tokens = result["output_tokens"]

    # ... validation and return logic

Then create separate handler methods for each event type.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- src/lightspeed_evaluation/core/api/client.py (2 hunks)
- src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
- src/lightspeed_evaluation/core/constants.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/token_counter.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/token_tracker.py (1 hunks)
- src/lightspeed_evaluation/core/models/api.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/output/generator.py (1 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions fromcore.system.exceptionsfor error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/api/client.pysrc/lightspeed_evaluation/core/models/api.pysrc/lightspeed_evaluation/core/metrics/token_counter.pysrc/lightspeed_evaluation/pipeline/evaluation/amender.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/metrics/token_tracker.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/runner/evaluation.pysrc/lightspeed_evaluation/core/api/streaming_parser.pysrc/lightspeed_evaluation/core/constants.pysrc/lightspeed_evaluation/core/output/generator.pysrc/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/metrics/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Update MetricManager supported_metrics dictionary when registering new metrics
Files:
src/lightspeed_evaluation/core/metrics/token_counter.pysrc/lightspeed_evaluation/core/metrics/token_tracker.py
🧠 Learnings (3)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/core/api/client.pysrc/lightspeed_evaluation/runner/evaluation.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/runner/evaluation.pysrc/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/core/metrics/token_counter.py (1)
src/lightspeed_evaluation/core/metrics/token_tracker.py (1)
reset(38-43)
src/lightspeed_evaluation/core/metrics/token_tracker.py (1)
src/lightspeed_evaluation/core/metrics/token_counter.py (1)
reset(26-29)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/token_counter.py (3)
TokenCounter(9-65)add_tokens(17-20)get_counts(22-24)
src/lightspeed_evaluation/runner/evaluation.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_api_token_usage(216-230)calculate_basic_stats(37-73)
🪛 GitHub Actions: Black
src/lightspeed_evaluation/core/api/client.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
src/lightspeed_evaluation/core/metrics/token_counter.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
src/lightspeed_evaluation/core/models/data.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
src/lightspeed_evaluation/core/metrics/token_tracker.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
src/lightspeed_evaluation/runner/evaluation.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
src/lightspeed_evaluation/core/api/streaming_parser.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
src/lightspeed_evaluation/core/output/generator.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
src/lightspeed_evaluation/core/output/statistics.py
[error] 1-1: Black would reformat this file. Run 'black' to format the code.
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/core/metrics/token_tracker.py
[error] 52-52: D403: First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM'). Command: make docstyle.
[error] 89-89: D403: First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM'). Command: make docstyle.
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/api/client.py
[error] 159-159: Trailing whitespace (trailing-whitespace)
[error] 166-166: Trailing whitespace (trailing-whitespace)
[error] 198-198: Trailing whitespace (trailing-whitespace)
[error] 203-203: Line too long (102/100) (line-too-long)
[error] 204-204: Trailing whitespace (trailing-whitespace)
src/lightspeed_evaluation/core/metrics/token_counter.py
[error] 48-48: Trailing whitespace (trailing-whitespace)
[error] 64-64: Trailing whitespace (trailing-whitespace)
[error] 88-88: Trailing newlines (trailing-newlines)
[error] 63-63: W1203: Use lazy % formatting in logging functions (logging-fstring-interpolation)
[error] 79-79: W0603: Using the global statement (global-statement)
[error] 85-85: W0602: Using global for '_current_counter' but no assignment is done (global-variable-not-assigned)
src/lightspeed_evaluation/core/models/data.py
[error] 72-72: Trailing whitespace (trailing-whitespace)
src/lightspeed_evaluation/core/metrics/token_tracker.py
[error] 32-32: Trailing whitespace (trailing-whitespace)
[error] 37-37: Trailing whitespace (trailing-whitespace)
[error] 44-44: Trailing whitespace (trailing-whitespace)
[error] 46-46: Trailing whitespace (trailing-whitespace)
[error] 68-68: Trailing whitespace (trailing-whitespace)
[error] 73-73: Trailing whitespace (trailing-whitespace)
[error] 81-81: Trailing whitespace (trailing-whitespace)
[error] 102-102: Trailing whitespace (trailing-whitespace)
[error] 114-114: Trailing whitespace (trailing-whitespace)
[error] 114-114: W0613: Unused argument 'kwargs' (unused-argument)
[error] 114-114: W0613: Unused argument 'start_time' (unused-argument)
[error] 114-114: W0613: Unused argument 'end_time' (unused-argument)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
[error] 71-71: Trailing whitespace (trailing-whitespace)
[error] 138-138: Trailing whitespace (trailing-whitespace)
[error] 68-68: R0902: Too many local variables (18/15) (too-many-locals)
[error] 123-123: W0718: Catching too general exception Exception (broad-exception-caught)
[error] 124-124: W1203: Use lazy % formatting in logging functions (logging-fstring-interpolation)
[error] 114-114: W0613: Unused argument 'kwargs' (unused-argument)
[error] 114-114: W0613: Unused argument 'start_time' (unused-argument)
[error] 114-114: W0613: Unused argument 'end_time' (unused-argument)
src/lightspeed_evaluation/runner/evaluation.py
[error] 97-97: Trailing whitespace (trailing-whitespace)
[error] 105-105: Trailing whitespace (trailing-whitespace)
src/lightspeed_evaluation/core/api/streaming_parser.py
[error] 72-72: Trailing whitespace (trailing-whitespace)
[error] 14-14: Too many branches (17/12) (too-many-branches)
src/lightspeed_evaluation/core/output/generator.py
[error] 243-243: Trailing whitespace (trailing-whitespace)
src/lightspeed_evaluation/core/output/statistics.py
[error] 57-57: Trailing whitespace (trailing-whitespace)
[error] 220-220: Trailing whitespace (trailing-whitespace)
[error] 225-225: Trailing whitespace (trailing-whitespace)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: mypy
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
🔇 Additional comments (11)
src/lightspeed_evaluation/core/constants.py (1)
35-36: LGTM! The new CSV columns for judge LLM token tracking are consistent with the naming convention used elsewhere and align with the new EvaluationResult fields introduced in this PR.

src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

62-70: LGTM! Token usage extraction and logging are well integrated into the amendment flow. Note that getattr with fallback is slightly redundant since APIResponse now defines these fields with default=0, but the defensive approach is harmless and adds resilience if the model changes.

src/lightspeed_evaluation/core/models/api.py (2)
93-94: LGTM! The new token fields are well-defined with appropriate validation (ge=0) and sensible defaults. The field descriptions are clear.

113-123: LGTM! Token extraction follows the established pattern for parsing raw API response data, with safe defaults when fields are absent.
src/lightspeed_evaluation/core/api/client.py (1)
198-205: Fix formatting issues and consider adding output token estimation for consistency. The pipeline reports trailing whitespace on lines 198 and 204, and line 203 exceeds 100 characters. The streaming path estimates only input tokens, while the standard path estimates both. To resolve this:
- Remove trailing whitespace on lines 198 and 204
- Reformat the logger.debug call on line 203 to stay within the 100-character limit
- Add output token estimation for the streaming path if parse_streaming_response does not already provide it

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
112-127: Thread safety concern with global callback mutation requires verification. Mutating litellm.success_callback globally is not thread-safe if multiple evaluate_metric calls execute concurrently. They will overwrite each other's callbacks, leading to incorrect token counts or missed tracking.

Additionally:

- Prefix unused parameters (kwargs, start_time, end_time) with _ to suppress linter warnings
- Use lazy % formatting instead of f-string on line 124: logger.debug("Token tracking callback error: %s", e)
- Consider using append() to add the callback instead of assignment to avoid overwriting existing callbacks

For proper thread safety, verify whether evaluations can run concurrently in this codebase (e.g., via ThreadPoolExecutor). If so, use thread-local or context variables to manage per-evaluation callbacks (a sketch follows below), or ensure evaluations are serialized.
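If concurrent evaluations do turn out to be possible, one way to make the tracking context-safe is to register a single callback once and look the active counter up through a context variable. This is a sketch under that assumption, not the PR's implementation; names such as `_active_counter` and `evaluate_with_tracking` are hypothetical, and it assumes litellm fires the success callback in the caller's context.

```python
import contextvars
from typing import Callable

import litellm

# Hypothetical names for illustration only.
_active_counter: contextvars.ContextVar = contextvars.ContextVar(
    "_active_counter", default=None
)


def _token_callback(kwargs, completion_response, start_time, end_time):
    """Single globally registered callback; records usage into the active counter."""
    counter = _active_counter.get()
    usage = getattr(completion_response, "usage", None)
    if counter is not None and usage is not None:
        counter.add_tokens(
            getattr(usage, "prompt_tokens", 0),
            getattr(usage, "completion_tokens", 0),
        )


litellm.success_callback.append(_token_callback)  # registered once at startup


def evaluate_with_tracking(counter, run: Callable[[], object]):
    """Bind `counter` for one evaluation, then restore whatever was bound before."""
    token = _active_counter.set(counter)
    try:
        return run()
    finally:
        _active_counter.reset(token)
```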
src/lightspeed_evaluation/core/models/data.py (1)
74-79: LGTM! Token tracking fields properly added. The token tracking fields are well-defined with appropriate validation constraints (ge=0) and descriptive field descriptions. The additions to both TurnData and EvaluationResult align with the PR's objective to track token usage across the evaluation pipeline.

Also applies to: 410-415
src/lightspeed_evaluation/core/output/statistics.py (2)
48-72: LGTM! Token statistics properly integrated. The token tracking fields are correctly integrated into the basic statistics calculation. The implementation:
- Properly initializes token fields in the empty results case
- Accumulates token counts from all results
- Returns comprehensive token statistics (input, output, and total)
216-230: LGTM! API token usage calculation is clean and correct. The calculate_api_token_usage function properly aggregates API token usage from all turns across all conversations. The implementation is straightforward and returns a well-structured dictionary with total counts.

src/lightspeed_evaluation/core/api/streaming_parser.py (2)
19-21: LGTM! Token tracking properly integrated into streaming parser. The token tracking implementation is well-structured:
- Initializes tracking variables appropriately
- Counts tokens during streaming events
- Extracts authoritative counts from end events when available
- Extends the return payload with token information
The multi-tier approach (explicit counts → streamed count → estimation) provides good fallback coverage.
Also applies to: 43-46, 55-60, 83-84
69-77: Verify token estimation approach aligns with other estimation logic. The fallback token estimation (lines 74-77) estimates tokens using a 4 characters per token heuristic. Confirm this matches the approach in TokenCounter.estimate_tokens() in token_counter.py. If the heuristics differ, extract a shared estimation utility function to maintain consistency.
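If the two heuristics do need to be unified, the shared helper could be as small as the following sketch. The module location and name are assumptions; the ~4-characters-per-token ratio is the fallback heuristic already used in this PR.

```python
def estimate_tokens(text: str) -> int:
    """Roughly estimate token count as a last-resort fallback.

    Assumes ~4 characters per token for English text; used only when the
    API response does not report token counts.
    """
    if not text:
        return 0
    return max(1, len(text) // 4)
```

Both client.py and streaming_parser.py could then import this single helper instead of carrying their own copies, which is also what the later duplicate-function comment in this review points at.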
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
14-15: Update docstring to document return structure and new token fields.The function now returns
input_tokensandoutput_tokensin addition to existing fields, but the docstring doesn't reflect this. Per coding guidelines, use Google-style docstrings for all public APIs.Apply this diff to improve the docstring:
def parse_streaming_response(response: httpx.Response) -> dict[str, Any]: - """Parse streaming response and extract data.""" + """Parse streaming response and extract data. + + Args: + response: The httpx streaming response to parse. + + Returns: + A dictionary containing: + - response (str): The final response text. + - tool_calls (list): List of tool call sequences. + - conversation_id (str): The conversation identifier. + - input_tokens (int): Number of input tokens consumed. + - output_tokens (int): Number of output tokens generated. + + Raises: + ValueError: If no final response or conversation ID is found. + """Based on coding guidelines.
♻️ Duplicate comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
143-144: Callback registration overwrites existing callbacks.Setting
litellm.success_callback = [token_tracking_callback]replaces any existing callbacks that may have been registered elsewhere. This could cause issues if other parts of the system rely on their callbacks.Consider appending and removing specifically:
# Register callback - litellm.success_callback = [token_tracking_callback] + litellm.success_callback.append(token_tracking_callback) # Evaluate metric score, reason = self.handlers[framework].evaluate( # type: ignore metric_name, request.conv_data, evaluation_scope ) # Clear callback - litellm.success_callback = [] + if token_tracking_callback in litellm.success_callback: + litellm.success_callback.remove(token_tracking_callback)
🧹 Nitpick comments (5)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
69-77: Document the token estimation approximation.The 4 characters per token estimate at lines 74-77 is a reasonable fallback, but the accuracy varies by content type (English text, code, non-English text) and tokenizer. Since these statistics may be used for cost estimation or monitoring, consider:
- Adding a comment explaining when this fallback is triggered and its limitations
- Including the estimation method in token usage reports so users know when estimates are used versus actual counts
Apply this diff to clarify the estimation:
# If still no output tokens, estimate from response length if not output_tokens and final_response: - # Rough estimation: ~4 characters per token for English text + # Fallback estimation when API doesn't provide token counts. + # Assumes ~4 characters per token for English text; accuracy varies by content type. output_tokens = max(1, len(final_response) // 4) - logger.debug("Estimated output tokens from response length: %d", output_tokens) + logger.warning( + "Output tokens estimated from response length: %d (actual count unavailable)", + output_tokens + )Consider changing the log level to
warningsince estimates are less reliable than actual counts.
14-85: Consider refactoring to reduce cyclomatic complexity.The linter flags "Too many branches" (R0912) for this function. While the current implementation is readable, you could reduce complexity by extracting event handlers:
Example refactoring approach:
def parse_streaming_response(response: httpx.Response) -> dict[str, Any]: """Parse streaming response and extract data.""" parser_state = StreamingParserState() for line in response.iter_lines(): line = line.strip() if not line or not line.startswith(DATA_PREFIX): continue json_data = line.replace(DATA_PREFIX, "") parsed_data = _parse_sse_line(json_data) if parsed_data: event, event_data = parsed_data _handle_event(event, event_data, parser_state) return parser_state.to_dict() def _handle_event(event: str, event_data: dict, state: StreamingParserState) -> None: """Dispatch to appropriate event handler.""" handlers = { "error": _handle_error_event, "start": _handle_start_event, "token": _handle_token_event, "turn_complete": _handle_turn_complete_event, "tool_call": _handle_tool_call_event, "end": _handle_end_event, } handler = handlers.get(event) if handler: handler(event_data, state)This is optional and can be deferred if the current implementation meets your needs.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
130-141: Address pylint warnings for unused callback parameters.The callback signature is required by LiteLLM but the parameters are unused. Prefix them with underscore to suppress warnings:
# Set up token tracking callback for LiteLLM def token_tracking_callback( - kwargs, completion_response, start_time, end_time + _kwargs, completion_response, _start_time, _end_time ):Additionally, the bare
except Exceptionis flagged. Consider catching a more specific exception or at minimum documenting why the broad catch is necessary:token_counter.add_tokens(input_tokens, output_tokens) - except Exception as e: + except (AttributeError, TypeError) as e: logger.debug("Token tracking callback error: %s", e)
197-212: Consider tracking partial token counts in error results.When an evaluation fails partway through, any tokens consumed before the error are not captured in the error result. The
_create_error_resultmethod doesn't accept token counts.If tracking partial token usage is important for cost analysis, consider passing the token counter:
def _create_error_result( - self, request: EvaluationRequest, reason: str, execution_time: float + self, request: EvaluationRequest, reason: str, execution_time: float, + judge_input_tokens: int = 0, judge_output_tokens: int = 0 ) -> EvaluationResult: """Create an ERROR result for failed evaluation.""" return EvaluationResult( conversation_group_id=request.conv_data.conversation_group_id, turn_id=request.turn_id, metric_identifier=request.metric_identifier, result="ERROR", score=None, threshold=None, reason=reason, query=request.turn_data.query if request.turn_data else "", response=request.turn_data.response or "" if request.turn_data else "", execution_time=execution_time, + judge_llm_input_tokens=judge_input_tokens, + judge_llm_output_tokens=judge_output_tokens, )Then update the exception handler to pass partial counts:
judge_input, judge_output = token_counter.get_counts() return self._create_error_result( request, f"Evaluation error: {e}", execution_time, judge_input, judge_output )
85-85: Address "too many local variables" pylint warning.The
evaluate_metricmethod has grown complex with token tracking logic. Consider extracting the callback setup and cleanup into a context manager to reduce local variable count:from contextlib import contextmanager @contextmanager def _litellm_token_tracking(self, token_counter: TokenCounter): """Context manager for LiteLLM token tracking callback.""" def callback(_kwargs, completion_response, _start_time, _end_time): try: if hasattr(completion_response, "usage"): usage = completion_response.usage input_tokens = getattr(usage, "prompt_tokens", 0) output_tokens = getattr(usage, "completion_tokens", 0) token_counter.add_tokens(input_tokens, output_tokens) except (AttributeError, TypeError) as e: logger.debug("Token tracking callback error: %s", e) litellm.success_callback.append(callback) try: yield finally: if callback in litellm.success_callback: litellm.success_callback.remove(callback)This encapsulates the callback lifecycle and reduces complexity in the main method.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- src/lightspeed_evaluation/core/api/client.py (2 hunks)
- src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
- src/lightspeed_evaluation/core/constants.py (1 hunks)
- src/lightspeed_evaluation/core/models/api.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/output/generator.py (1 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/api/client.py
- src/lightspeed_evaluation/core/constants.py
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions fromcore.system.exceptionsfor error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/api/streaming_parser.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/core/output/generator.py
🧠 Learnings (2)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
evaluate(97-138)src/lightspeed_evaluation/core/metrics/geval.py (1)
evaluate(45-116)src/lightspeed_evaluation/core/metrics/ragas.py (1)
evaluate(100-140)src/lightspeed_evaluation/core/metrics/script.py (1)
evaluate(23-38)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/api/streaming_parser.py
[error] 14-14: pylint: Too many branches (R0912).
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
[error] 85-85: pylint: Too many local variables (R0914).
[warning] 140-140: pylint: Catching too general exception (broad-exception-caught) (W0718).
[warning] 131-131: pylint: Unused argument 'kwargs' (no-unused-vars) (W0613).
[warning] 131-131: pylint: Unused argument 'start_time' (no-unused-vars) (W0613).
[warning] 131-131: pylint: Unused argument 'end_time' (no-unused-vars) (W0613).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (7)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
79-85: Token fields correctly added to return payload.The addition of
input_tokensandoutput_tokensto the return dictionary aligns with the PR objective to track token usage throughout the evaluation pipeline. The fallback logic ensures token counts are always available, even when the API doesn't provide explicit counts.
43-46: Verify that counting "inference" role tokens is accurate.The logic assumes that each "token" event with
role="inference"represents exactly one output token. Please verify:
- Is "inference" the only role that should be counted for output tokens?
- Does each event truly represent a single token, or could events represent multiple tokens?
- Are there other roles (e.g., "assistant") that should also be counted?
src/lightspeed_evaluation/core/output/generator.py (1)
244-253: LGTM! Token usage section is well-integrated.The formatting with
:,for thousands separators provides good readability. The section placement after overall statistics is appropriate and maintains consistency with the report structure.src/lightspeed_evaluation/core/output/statistics.py (1)
58-61: LGTM! Token aggregation is straightforward and correct.The summation approach correctly handles the default value of 0 for missing token counts.
src/lightspeed_evaluation/core/models/data.py (2)
73-79: LGTM! API token tracking fields are well-defined.The field definitions with
ge=0constraints and sensible defaults properly handle cases where token tracking is not available.
410-415: LGTM! Judge LLM token fields are properly structured.The fields align with the
TokenCounterimplementation in the evaluator and integrate cleanly with the existing model.src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
27-43: LGTM! Simple and effective token counter implementation.The class is appropriately scoped for per-evaluation token tracking. The design is clean and thread-safe since each evaluation creates its own instance.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
216-230: Improve docstring to meet Google-style format requirements.The docstring for this public API function still lacks
ArgsandReturnssections as required by the coding guidelines.def calculate_api_token_usage(evaluation_data: list[EvaluationData]) -> dict[str, Any]: - """Calculate total API token usage from evaluation data.""" + """Calculate total API token usage from evaluation data. + + Args: + evaluation_data: List of evaluation data containing turn-level API token counts. + + Returns: + Dictionary containing total_api_input_tokens, total_api_output_tokens, + and total_api_tokens. + """
🧹 Nitpick comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
27-43: Add Google-style docstrings to the public class and methods.Per coding guidelines, public APIs should have Google-style docstrings. The
TokenCounterclass and its methods lack complete documentation.class TokenCounter: - """Simple token counter for metric evaluation.""" + """Simple token counter for metric evaluation. + + Accumulates input and output token counts across multiple LLM calls + during a single metric evaluation. + """ def __init__(self) -> None: - """Initialize token counter.""" + """Initialize token counter with zero counts.""" self.input_tokens = 0 self.output_tokens = 0 def add_tokens(self, input_tokens: int, output_tokens: int) -> None: - """Add token counts.""" + """Add token counts to the running totals. + + Args: + input_tokens: Number of input/prompt tokens to add. + output_tokens: Number of output/completion tokens to add. + """ self.input_tokens += input_tokens self.output_tokens += output_tokens def get_counts(self) -> tuple[int, int]: - """Get current token counts.""" + """Get current token counts. + + Returns: + Tuple of (input_tokens, output_tokens). + """ return self.input_tokens, self.output_tokens
199-214: Consider tracking tokens even for error results.The
_create_error_resultmethod doesn't include token counts, but LLM calls may have occurred before the error. For accurate usage tracking, consider passing thetoken_counterto this method.def _create_error_result( - self, request: EvaluationRequest, reason: str, execution_time: float + self, request: EvaluationRequest, reason: str, execution_time: float, + token_counter: Optional[TokenCounter] = None ) -> EvaluationResult: """Create an ERROR result for failed evaluation.""" + judge_input, judge_output = token_counter.get_counts() if token_counter else (0, 0) return EvaluationResult( conversation_group_id=request.conv_data.conversation_group_id, turn_id=request.turn_id, metric_identifier=request.metric_identifier, result="ERROR", score=None, threshold=None, reason=reason, query=request.turn_data.query if request.turn_data else "", response=request.turn_data.response or "" if request.turn_data else "", execution_time=execution_time, + judge_llm_input_tokens=judge_input, + judge_llm_output_tokens=judge_output, )Then update the call site at line 195-197:
return self._create_error_result( - request, f"Evaluation error: {e}", execution_time + request, f"Evaluation error: {e}", execution_time, token_counter )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- src/lightspeed_evaluation/core/api/client.py (2 hunks)
- src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
- src/lightspeed_evaluation/core/constants.py (1 hunks)
- src/lightspeed_evaluation/core/models/api.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/output/generator.py (1 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions fromcore.system.exceptionsfor error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/api/streaming_parser.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/api/client.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/core/constants.py
🧠 Learnings (6)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(317-375)EvaluationResult(378-423)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationRequest(440-493)EvaluationResult(378-423)src/lightspeed_evaluation/core/metrics/geval.py (1)
evaluate(45-116)src/lightspeed_evaluation/core/metrics/script.py (1)
evaluate(23-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
🔇 Additional comments (8)
src/lightspeed_evaluation/core/constants.py (1)
35-36: LGTM! The new CSV column names align with the
judge_llm_input_tokens and judge_llm_output_tokens fields in EvaluationResult, maintaining consistency across the codebase.

src/lightspeed_evaluation/core/output/statistics.py (1)
48-50: LGTM! Token tracking integration in
calculate_basic_stats is well-implemented. The initialization in the empty-results case and the accumulation logic are correct, with proper field access to judge_llm_input_tokens and judge_llm_output_tokens.

Also applies to: 58-61, 70-72
src/lightspeed_evaluation/core/api/client.py (2)
160-176: LGTM! The token estimation fallback logic is reasonable. Using
len(text) // 4 with a minimum of 1 is a sensible heuristic for English text when the API doesn't provide token counts. The debug logging helps with troubleshooting.
202-211: LGTM! The streaming path correctly only estimates input tokens since output tokens are tracked/estimated in
parse_streaming_response via token events. This avoids duplicate estimation logic.

src/lightspeed_evaluation/core/api/streaming_parser.py (3)
21-23: LGTM!The token counting initialization and incremental tracking of "inference" tokens during streaming is well-implemented. This provides accurate output token counts when the API doesn't include them in the "end" event.
Also applies to: 45-48
57-62: LGTM!The three-tier fallback for output tokens (explicit from API → counted from token events → estimated from response length) is a robust approach. The estimation is appropriately logged at debug level.
Also applies to: 71-79
81-87: LGTM!The updated return structure correctly includes
input_tokensandoutput_tokens, aligning with theAPIResponsemodel and maintaining consistency with the standard query path.src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
145-146: Consider using append/remove pattern for callback registration to preserve existing callbacks.The current approach of assigning
litellm.success_callback = [token_tracking_callback]overwrites any existing callbacks. While this appears isolated to the evaluation context, using append/remove would be safer if LiteLLM is used elsewhere with other callbacks:# Register callback - litellm.success_callback = [token_tracking_callback] + litellm.success_callback.append(token_tracking_callback) # Evaluate metric score, reason = self.handlers[framework].evaluate( # type: ignore metric_name, request.conv_data, evaluation_scope ) # Clear callback - litellm.success_callback = [] + if token_tracking_callback in litellm.success_callback: + litellm.success_callback.remove(token_tracking_callback)Also applies to line 194 (error path).
b866ff6 to ecda30f
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
71-79: Consider logging when using streamed token count fallback.The fallback logic is sound, but only the estimation fallback logs a debug message. Consider adding a debug log when using
token_countas fallback for consistency in tracking which token source was used.# Use streamed token count if explicit counts not provided if not output_tokens and token_count > 0: output_tokens = token_count + logger.debug("Using streamed token count as output tokens: %d", output_tokens) # If still no output tokens, estimate from response lengthtests/unit/core/output/test_statistics.py (1)
1-11: Consider adding test coverage for non-zero token counts. The existing tests verify the token fields exist with default values, but there's no test verifying actual token aggregation when judge_llm_input_tokens or judge_llm_output_tokens are non-zero on the EvaluationResult objects.
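A sketch of what such a test might look like. The constructor arguments and values below are assumptions based on how EvaluationResult is built elsewhere in this PR, so the required fields and types may need adjusting to the real model; the expected stat keys match the mocks used in test_evaluation.py.

```python
from lightspeed_evaluation.core.models.data import EvaluationResult
from lightspeed_evaluation.core.output.statistics import calculate_basic_stats


def test_calculate_basic_stats_aggregates_judge_tokens():
    """Non-zero judge-LLM token counts should be summed into the stats."""
    results = [
        EvaluationResult(
            conversation_group_id="conv_1",
            turn_id=1,  # hypothetical value; match the project's fixtures
            metric_identifier="ragas:faithfulness",  # hypothetical metric id
            result="PASS",
            score=0.9,
            threshold=0.8,
            reason="ok",
            query="q1",
            response="r1",
            execution_time=0.1,
            judge_llm_input_tokens=100,
            judge_llm_output_tokens=40,
        ),
        EvaluationResult(
            conversation_group_id="conv_1",
            turn_id=2,
            metric_identifier="ragas:faithfulness",
            result="FAIL",
            score=0.2,
            threshold=0.8,
            reason="below threshold",
            query="q2",
            response="r2",
            execution_time=0.2,
            judge_llm_input_tokens=50,
            judge_llm_output_tokens=10,
        ),
    ]

    stats = calculate_basic_stats(results)

    assert stats["total_judge_llm_input_tokens"] == 150
    assert stats["total_judge_llm_output_tokens"] == 50
    assert stats["total_judge_llm_tokens"] == 200
```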
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- src/lightspeed_evaluation/core/api/client.py (2 hunks)
- src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
- src/lightspeed_evaluation/core/constants.py (1 hunks)
- src/lightspeed_evaluation/core/models/api.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/output/generator.py (1 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
- tests/unit/core/output/test_statistics.py (2 hunks)
- tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/runner/evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions fromcore.system.exceptionsfor error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/api/client.pysrc/lightspeed_evaluation/core/api/streaming_parser.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/models/data.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming conventiontest_*.pyfor files,test_*for functions, andTest*for classes
Files:
tests/unit/core/output/test_statistics.pytests/unit/runner/test_evaluation.py
🧠 Learnings (7)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/statistics.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationRequest(440-493)EvaluationResult(378-423)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(317-375)EvaluationResult(378-423)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: mypy
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.11)
🔇 Additional comments (21)
src/lightspeed_evaluation/core/api/streaming_parser.py (4)
21-23: LGTM! Token tracking initialization is appropriate.The initialization of
token_count,input_tokens, andoutput_tokensto 0 provides a clean baseline for accumulating token usage throughout the streaming response parsing.
45-48: Token counting logic for inference events looks correct.Counting each token event where
role == "inference"is a reasonable approach for tracking streamed output tokens.
57-62: LGTM! End event token extraction is properly handled.Extracting
input_tokensandoutput_tokensfrom the end event when available ensures accurate counts from the API when provided.
85-86: LGTM! Return payload correctly includes token fields.The enriched return dictionary now includes
input_tokensandoutput_tokens, enabling downstream token tracking.src/lightspeed_evaluation/core/models/data.py (2)
73-79: LGTM! API token tracking fields properly defined on TurnData.The
api_input_tokensandapi_output_tokensfields are well-defined with appropriate defaults, constraints (ge=0), and descriptive docstrings.
410-415: LGTM! Judge LLM token fields properly defined on EvaluationResult.The
judge_llm_input_tokensandjudge_llm_output_tokensfields follow the same pattern as the API token fields with proper validation constraints.tests/unit/runner/test_evaluation.py (3)
57-65: LGTM! Mock return values correctly updated to include token tracking fields.The mock for
calculate_basic_statsnow includestotal_judge_llm_input_tokens,total_judge_llm_output_tokens, andtotal_judge_llm_tokens, aligning with the updated function signature.
112-120: Mock values correctly reflect zero-token scenario.Appropriate for testing the custom output directory case where no actual evaluations are run.
210-218: LGTM! Token totals correctly mocked for error scenario testing.The proportional token values (500/250/750) appropriately reflect the mixed results scenario being tested.
src/lightspeed_evaluation/core/api/client.py (2)
202-210: LGTM! Streaming input token estimation follows same pattern.The estimation logic for streaming queries mirrors the standard query path, maintaining consistency. The streaming parser handles output token estimation via
token_countor response length fallback.
160-175: Consider distinguishing between missing tokens and explicit zero values. The condition not response_data.get("input_tokens") treats an explicit 0 the same as a missing key. If the API returns 0 intentionally (e.g., for cached responses with no new input processing), this would incorrectly trigger estimation. Consider using response_data.get("input_tokens") is None to distinguish between missing tokens and legitimate zero values.
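A small sketch of the suggested distinction; the helper name is hypothetical, and the ~4-chars-per-token fallback mirrors the estimate already used in this PR.

```python
def resolve_input_tokens(response_data: dict, query_text: str) -> int:
    """Prefer the API-reported count, even when it is 0; estimate only when absent."""
    reported = response_data.get("input_tokens")
    if reported is not None:
        return reported
    # Fallback heuristic: ~4 characters per token.
    return max(1, len(query_text) // 4) if query_text else 0
```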
289-291: LGTM! Expected stats correctly include token tracking fields.The expected dictionary now includes the three new token fields, all set to 0 as the test
EvaluationResultobjects use default values.
319-321: LGTM! Single result test correctly expects zero token counts.Consistent with the mixed results test above.
src/lightspeed_evaluation/core/output/statistics.py (3)
48-50: LGTM! Empty results case correctly returns zero token counts. Consistent with the non-empty case structure.

58-72: LGTM! Token aggregation is straightforward and correct. Summing judge_llm_input_tokens and judge_llm_output_tokens across all results, then computing the total, is the expected behavior.

216-237: LGTM! API token usage calculation function properly implemented. The function correctly iterates through evaluation data and turns to aggregate API token usage. The Google-style docstring has been added as requested in the previous review.
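For readers without the diff open, the aggregation described here boils down to a double loop, roughly like this sketch. The `turns` attribute name and the loose typing are assumptions; the returned key names come from the docstring suggestion earlier in this review.

```python
def calculate_api_token_usage(evaluation_data: list) -> dict:
    """Sum per-turn API token counts across all conversations (sketch)."""
    total_input = 0
    total_output = 0
    for conversation in evaluation_data:
        for turn in conversation.turns:  # hypothetical attribute name
            total_input += turn.api_input_tokens
            total_output += turn.api_output_tokens
    return {
        "total_api_input_tokens": total_input,
        "total_api_output_tokens": total_output,
        "total_api_tokens": total_input + total_output,
    }
```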
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5)
27-42: LGTM! TokenCounter class is clean and straightforward. Simple accumulator with appropriate type hints on methods. Consider using @dataclass for an even more concise implementation, though the current approach is perfectly acceptable.
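The @dataclass variant alluded to here would look roughly like the following, behaviorally equivalent to the accumulator in the PR:

```python
from dataclasses import dataclass


@dataclass
class TokenCounter:
    """Simple token counter for metric evaluation."""

    input_tokens: int = 0
    output_tokens: int = 0

    def add_tokens(self, input_tokens: int, output_tokens: int) -> None:
        """Add token counts to the running totals."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def get_counts(self) -> tuple[int, int]:
        """Return (input_tokens, output_tokens)."""
        return self.input_tokens, self.output_tokens
```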
156-162: Callback cleanup and token retrieval are correctly placed.The callback is cleared after evaluation, and token counts are retrieved before result construction. See previous comment regarding the cleanup approach.
189-190: LGTM! Token counts correctly passed to EvaluationResult.The
judge_llm_input_tokensandjudge_llm_output_tokensfields are populated from the TokenCounter, enabling downstream aggregation in statistics.
196-197: Good practice: Callback cleanup in error path.Ensures the callback is removed even when an exception occurs, preventing stale callbacks. Apply the same improvement suggested above for the cleanup approach.
131-149: Callback management may interfere with concurrent evaluations or other LiteLLM callbacks.Setting
litellm.success_callback = [token_tracking_callback]replaces any existing callbacks. If other parts of the system register LiteLLM callbacks, or if evaluations run concurrently, this could cause issues. Consider appending and removing the specific callback instead.- # Register callback - litellm.success_callback = [token_tracking_callback] + # Register callback (preserve any existing callbacks) + litellm.success_callback.append(token_tracking_callback)And in cleanup (lines 156-157 and 196-197):
- litellm.success_callback = [] + if token_tracking_callback in litellm.success_callback: + litellm.success_callback.remove(token_tracking_callback)
VladimirKadlec
left a comment
Nice 💪 . Added some comments, please add tests.
ce9a29b to 58aa1ec
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/api/client.py (1)
22-26: Duplicate function: consolidate with streaming_parser.py. This estimate_tokens function is identical to the one in src/lightspeed_evaluation/core/api/streaming_parser.py (lines 12-16). See the review comment on that file for the recommended refactoring to extract this to a shared utility module.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- src/lightspeed_evaluation/core/api/client.py (3 hunks)
- src/lightspeed_evaluation/core/api/streaming_parser.py (4 hunks)
- src/lightspeed_evaluation/core/constants.py (1 hunks)
- src/lightspeed_evaluation/core/models/api.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/output/generator.py (1 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
- tests/unit/core/output/test_statistics.py (2 hunks)
- tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- tests/unit/runner/test_evaluation.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/constants.py
- tests/unit/core/output/test_statistics.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from `core.system.exceptions` for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/api/client.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧠 Learnings (5)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
estimate_tokens(22-26)
src/lightspeed_evaluation/core/api/client.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
estimate_tokens(12-16)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
`EvaluationRequest` (440-493), `EvaluationResult` (378-423)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (5)
src/lightspeed_evaluation/core/output/generator.py (1)
244-254: LGTM! Token usage reporting is well-integrated. The new "Token Usage (Judge LLM)" section is cleanly integrated into the text summary and follows the existing formatting patterns. The thousand-separator formatting is appropriate for readability (see the sketch below).
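For a rough idea of the kind of block being described, here is a formatting sketch (illustrative only; the generator's real output and stat keys may differ slightly):

```python
def token_usage_lines(stats: dict) -> list[str]:
    # Thousand-separator formatting for readability, as in the report.
    return [
        "Token Usage (Judge LLM):",
        f"  Input tokens:  {stats.get('total_judge_llm_input_tokens', 0):,}",
        f"  Output tokens: {stats.get('total_judge_llm_output_tokens', 0):,}",
        f"  Total tokens:  {stats.get('total_judge_llm_tokens', 0):,}",
    ]
```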
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
29-31: Token tracking logic is sound. The streaming token tracking implementation correctly:
- Counts inference tokens during streaming (lines 53-56)
- Extracts explicit counts from end events (lines 65-70)
- Falls back to streamed count or estimation when needed (lines 79-86)
- Returns token metadata in the response payload (lines 92-93)
The fallback chain (explicit → streamed → estimated) is reasonable and ensures token counts are always available (sketched below).
Also applies to: 53-56, 65-70, 79-86, 92-93
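A condensed sketch of that fallback chain (the names `end_event`, `streamed_token_count`, and `response_text` are placeholders, not the parser's actual variables):

```python
def resolve_output_tokens(end_event: dict, streamed_token_count: int, response_text: str) -> int:
    # 1. Prefer an explicit count reported in the end event.
    explicit = end_event.get("output_tokens", 0)
    if explicit:
        return explicit
    # 2. Fall back to the number of streamed inference token events.
    if streamed_token_count:
        return streamed_token_count
    # 3. Last resort: estimate from the response text (~4 chars per token).
    return max(1, len(response_text) // 4)
```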
src/lightspeed_evaluation/core/api/client.py (1)
167-180: Token estimation integration is correct. The token estimation logic correctly:
- Checks if tokens are missing using `not response_data.get("input_tokens")` / `not raw_data.get("input_tokens")`
- Estimates from appropriate sources (query for input, response for output)
- Logs debug messages for traceability
The implementation properly handles both standard and streaming query paths (see the sketch below).
Also applies to: 207-214
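In spirit, the backfill amounts to something like the following sketch (the `estimate_tokens` heuristic matches the one quoted later in this thread; the `backfill_token_counts` wrapper and its dict keys are illustrative assumptions):

```python
def estimate_tokens(text: str) -> int:
    """Estimate token count from text (~4 characters per token)."""
    if not text:
        return 1
    return max(1, len(text) // 4)


def backfill_token_counts(response_data: dict, query: str) -> dict:
    # Only estimate when the API did not report a count (missing or zero).
    if not response_data.get("input_tokens"):
        response_data["input_tokens"] = estimate_tokens(query)
    if not response_data.get("output_tokens"):
        response_data["output_tokens"] = estimate_tokens(response_data.get("response", ""))
    return response_data
```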
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
27-42: LGTM! TokenCounter class is clean and well-designed. The `TokenCounter` class provides a clear, simple interface for accumulating token usage with appropriate initialization, mutation (`add_tokens`), and access (`get_counts`) methods.
158-159: Token count propagation is correct. The token counts are properly retrieved from the `TokenCounter` and attached to the `EvaluationResult` as `judge_llm_input_tokens` and `judge_llm_output_tokens`, aligning with the expanded data model. Also applies to: 186-187
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
src/lightspeed_evaluation/core/api/client.py(3 hunks)src/lightspeed_evaluation/core/api/streaming_parser.py(4 hunks)src/lightspeed_evaluation/core/constants.py(1 hunks)src/lightspeed_evaluation/core/models/api.py(2 hunks)src/lightspeed_evaluation/core/models/data.py(2 hunks)src/lightspeed_evaluation/core/output/generator.py(1 hunks)src/lightspeed_evaluation/core/output/statistics.py(4 hunks)src/lightspeed_evaluation/pipeline/evaluation/amender.py(1 hunks)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py(5 hunks)src/lightspeed_evaluation/runner/evaluation.py(2 hunks)tests/unit/core/output/test_statistics.py(2 hunks)tests/unit/runner/test_evaluation.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/runner/evaluation.py
- tests/unit/core/output/test_statistics.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/output/generator.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from `core.system.exceptions` for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/models/api.pysrc/lightspeed_evaluation/core/constants.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/core/api/client.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/api/streaming_parser.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/runner/test_evaluation.py
🧠 Learnings (5)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/statistics.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
`EvaluationRequest` (440-493), `EvaluationResult` (378-423)
src/lightspeed_evaluation/core/api/client.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
estimate_tokens(12-16)
src/lightspeed_evaluation/core/output/statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
`EvaluationData` (317-375), `EvaluationResult` (378-423)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
EvaluationResult(258-288)
🪛 GitHub Actions: Pyright
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
[error] 200-201: "token_tracking_callback" is possibly unbound (reportPossiblyUnboundVariable)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.11)
- GitHub Check: mypy
- GitHub Check: Pylinter
🔇 Additional comments (10)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
29-31: LGTM: Token tracking logic is sound. The implementation correctly:
- Initializes token counters
- Accumulates streaming tokens from "inference" role events
- Extracts final counts from the "end" event
Also applies to: 53-56, 65-70
79-86: LGTM: Fallback estimation provides good resilience. The cascading fallback logic (streamed count → estimated from response) ensures token counts are always populated when possible, with appropriate debug logging.
src/lightspeed_evaluation/core/constants.py (1)
35-36: LGTM: CSV column constants updated correctly. The additions align with the token tracking feature and enable token usage data export.
tests/unit/runner/test_evaluation.py (1)
57-65: LGTM: Tests properly updated for token tracking. All test cases consistently include the new token fields in mock statistics, ensuring test coverage for the token accounting feature.
Also applies to: 112-120, 210-218
src/lightspeed_evaluation/core/api/client.py (1)
167-180: LGTM: Token estimation fallbacks are well-implemented. The code appropriately estimates token counts when the API doesn't provide them, with helpful debug logging. The logic is consistent across both standard and streaming query paths.
Also applies to: 207-214
src/lightspeed_evaluation/core/models/api.py (1)
93-94: LGTM: API model correctly extended with token fields. The new `input_tokens` and `output_tokens` fields are properly defined with:
- Appropriate constraints (non-negative, default 0)
- Clear descriptions
- Correct extraction and initialization in `from_raw_response`
Also applies to: 113-115, 122-123
src/lightspeed_evaluation/core/output/statistics.py (2)
48-50: LGTM: Token totals correctly calculated and included in statistics. The implementation properly:
- Initializes token totals for empty results
- Aggregates judge LLM token usage from all results
- Returns comprehensive token statistics
Also applies to: 58-61, 70-72
216-238: LGTM: API token usage calculation is well-implemented (a sketch follows after this list). The function:
- Follows Google-style docstring guidelines
- Correctly aggregates API token usage from turn data
- Returns a consistent structure matching the statistics pattern
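A minimal sketch of that aggregation shape (the `api_input_tokens`/`api_output_tokens` field names come from `TurnData`; the function body here is illustrative, not the module's actual code):

```python
def calculate_api_token_usage_sketch(evaluation_data) -> dict[str, int]:
    # Sum per-turn API token counts across all conversations.
    input_total = 0
    output_total = 0
    for conversation in evaluation_data:
        for turn in conversation.turns:
            input_total += turn.api_input_tokens
            output_total += turn.api_output_tokens
    return {
        "total_api_input_tokens": input_total,
        "total_api_output_tokens": output_total,
        "total_api_tokens": input_total + output_total,
    }
```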
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
27-42: LGTM: TokenCounter is a clean internal utility. The simple counter class provides clear token accumulation functionality for evaluation tracking.
91-92: LGTM: Token counter properly initialized per evaluation. The counter is correctly initialized at the start of each evaluation.
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
12-16: Update docstring to Google-style format. A previous review already flagged this: the coding guidelines require Google-style docstrings with `Args` and `Returns` sections for public APIs.
src/lightspeed_evaluation/core/api/client.py (1)
22-26: Improve `estimate_tokens` docstring and avoid duplication with streaming_parser. `estimate_tokens` is a public helper here, but it still has a single-line docstring; coding guidelines call for a Google-style docstring with explicit `Args`/`Returns`. Also, the same helper exists in `src/lightspeed_evaluation/core/api/streaming_parser.py` (lines 11-15), which introduces duplication and potential drift.

Consider updating the docstring and, in a follow-up, centralizing this helper in a shared module (or otherwise reusing a single definition) so both `client.py` and `streaming_parser.py` stay in sync.

Example docstring change:

```diff
 def estimate_tokens(text: str) -> int:
-    """Estimate token count from text (~4 characters per token)."""
+    """Estimate token count from text (~4 characters per token).
+
+    Args:
+        text: Input text to estimate tokens for.
+
+    Returns:
+        Estimated token count (minimum 1).
+    """
     if not text:
         return 1
     return max(1, len(text) // 4)
```

As per coding guidelines.
🧹 Nitpick comments (3)
tests/unit/core/output/test_statistics.py (1)
289-321: Consider adding a test case with non-zero token values. The current tests verify token fields exist but only with zero values. A test that creates `EvaluationResult` objects with `judge_llm_input_tokens` and `judge_llm_output_tokens` set would strengthen coverage of the aggregation logic.

Example test case:

```python
def test_calculate_basic_stats_with_token_usage(self):
    """Test calculate_basic_stats aggregates token usage."""
    results = [
        EvaluationResult(
            conversation_group_id="conv1",
            turn_id="turn1",
            metric_identifier="test:metric",
            result="PASS",
            score=0.8,
            threshold=0.7,
            reason="Good",
            judge_llm_input_tokens=100,
            judge_llm_output_tokens=50,
        ),
        EvaluationResult(
            conversation_group_id="conv1",
            turn_id="turn2",
            metric_identifier="test:metric",
            result="PASS",
            score=0.9,
            threshold=0.7,
            reason="Excellent",
            judge_llm_input_tokens=150,
            judge_llm_output_tokens=75,
        ),
    ]

    stats = calculate_basic_stats(results)

    assert stats["total_judge_llm_input_tokens"] == 250
    assert stats["total_judge_llm_output_tokens"] == 125
    assert stats["total_judge_llm_tokens"] == 375
```

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
199-201: Add defensive check for callback list existence in exception handler. If `litellm.success_callback` is modified externally (set to `None`) between registration and exception handling, line 200 could raise an `AttributeError`. Consider adding a defensive check:

```diff
         # Remove our callback on error (preserve others)
-        if token_tracking_callback in litellm.success_callback:
+        if (
+            hasattr(litellm, "success_callback")
+            and litellm.success_callback
+            and token_tracking_callback in litellm.success_callback
+        ):
             litellm.success_callback.remove(token_tracking_callback)
```

This mirrors the defensive pattern used during registration (lines 146-150).

206-221: Consider propagating partial token counts in error results. When an evaluation error occurs, the `_create_error_result` method doesn't include `judge_llm_input_tokens` or `judge_llm_output_tokens`, even though some tokens may have been consumed before the error. This could lead to under-counting of total token usage.

If tracking all token usage is important for cost accounting, consider passing token counts to this method:

```diff
 def _create_error_result(
-    self, request: EvaluationRequest, reason: str, execution_time: float
+    self, request: EvaluationRequest, reason: str, execution_time: float,
+    judge_llm_input_tokens: int = 0, judge_llm_output_tokens: int = 0
 ) -> EvaluationResult:
     """Create an ERROR result for failed evaluation."""
     return EvaluationResult(
         ...
         execution_time=execution_time,
+        judge_llm_input_tokens=judge_llm_input_tokens,
+        judge_llm_output_tokens=judge_llm_output_tokens,
     )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
src/lightspeed_evaluation/core/api/client.py(3 hunks)src/lightspeed_evaluation/core/api/streaming_parser.py(4 hunks)src/lightspeed_evaluation/core/constants.py(1 hunks)src/lightspeed_evaluation/core/models/api.py(2 hunks)src/lightspeed_evaluation/core/models/data.py(2 hunks)src/lightspeed_evaluation/core/output/generator.py(1 hunks)src/lightspeed_evaluation/core/output/statistics.py(4 hunks)src/lightspeed_evaluation/pipeline/evaluation/amender.py(1 hunks)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py(5 hunks)src/lightspeed_evaluation/runner/evaluation.py(2 hunks)tests/unit/core/output/test_statistics.py(2 hunks)tests/unit/runner/test_evaluation.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/models/api.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/core/output/test_statistics.pytests/unit/runner/test_evaluation.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from `core.system.exceptions` for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/api/streaming_parser.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/core/api/client.pysrc/lightspeed_evaluation/core/constants.py
🧠 Learnings (5)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
estimate_tokens(22-26)
src/lightspeed_evaluation/core/api/client.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
estimate_tokens(12-16)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
- GitHub Check: ruff
- GitHub Check: mypy
- GitHub Check: pydocstyle
🔇 Additional comments (15)
src/lightspeed_evaluation/core/constants.py (1)
35-36: LGTM! The new CSV columns for token tracking are correctly added and align with the naming convention used in the evaluation result fields (`judge_llm_input_tokens`, `judge_llm_output_tokens`).
tests/unit/runner/test_evaluation.py (3)
57-65: LGTM! The mock return values correctly include the new token usage fields. The token values are consistent (100 + 50 = 150 for total).
112-120: LGTM! Consistent mock data with zero token values for the empty results case.
210-218: LGTM! Mock values correctly reflect token usage proportional to the evaluation count (500 + 250 = 750).
tests/unit/core/output/test_statistics.py (2)
289-291: LGTM! The expected values correctly reflect the default token counts when `EvaluationResult` objects don't have token data set.
319-321: LGTM! Consistent with the mixed results test case - token fields default to 0.
src/lightspeed_evaluation/core/api/streaming_parser.py (3)
29-31: Token counting logic is well-structured with appropriate fallbacks. The fallback chain for `output_tokens` (explicit end event → streamed count → estimation) is sensible. Note that `input_tokens` has no fallback and will remain 0 if not provided by the server; this is likely intentional, since input tokens can't be reliably estimated without the original prompt context. Also applies to: 65-70, 79-86
53-56: LGTM! Counting token events with `role == "inference"` correctly tracks streamed output tokens.
92-93: LGTM! The return dictionary correctly includes the new token fields.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)
27-43: LGTM! The `TokenCounter` class is a clean, focused utility for accumulating token usage. Type hints are properly included.
91-106: LGTM! The callback is correctly defined before the `try` block, addressing the previous unbound variable concern. The callback properly extracts token usage from LiteLLM's response.
145-151: LGTM! The callback registration correctly uses append to preserve existing callbacks, addressing the previous global mutation concern.
158-165: LGTM! Callback cleanup uses a safe membership check, and token counts are correctly retrieved and passed to `EvaluationResult`. Also applies to: 192-193
src/lightspeed_evaluation/core/api/client.py (2)
167-180: Token estimation fallback for `_standard_query` looks correct. Using `estimate_tokens` to backfill `input_tokens` and `output_tokens` when the API omits or zeros them is consistent with previous behavior and adds useful debug logging. No issues from a correctness or logging standpoint.
208-215: Streaming `input_tokens` estimation is consistent and minimal. The streaming path mirrors the standard path by estimating `input_tokens` when missing/zero, using the same helper and debug logging. This keeps token accounting consistent across endpoint types without altering existing error handling.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/unit/runner/test_evaluation.py (1)
57-65: Token fields added to mocked stats look consistent with core statistics API. The extended `mock_stats.return_value` dictionaries match the new token aggregates (`total_judge_llm_input_tokens`, `total_judge_llm_output_tokens`, `total_judge_llm_tokens`) and remain backward compatible with existing assertions. If you want stronger coverage, you could also assert that these keys are present in `result`, but it's optional since detailed token aggregation is already tested in `test_statistics.py`. Also applies to: 112-120, 210-218
tests/unit/core/api/test_client.py (1)
7-7: estimate_tokens tests accurately capture the heuristic and edge cases. The new `TestEstimateTokens` suite aligns with the implementation (4 chars/token with a floor of 1) and exercises a good range of inputs, including unicode and newlines. `test_estimate_tokens_none_like_empty` is slightly redundant with the empty-string test, but that's only a naming nit and doesn't hurt clarity. Also applies to: 503-557
src/lightspeed_evaluation/core/models/api.py (1)
93-95: APIResponse token fields and mapping from raw_data are sound and backward compatible. Adding `input_tokens`/`output_tokens` with non-negative integer defaults, and wiring them via `raw_data.get(..., 0)`, preserves existing behavior while allowing real API token counts to flow through when present. Consider adding a dedicated test for `APIResponse.from_raw_response` including token keys, but the core change itself looks good. Also applies to: 113-116, 117-124
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
src/lightspeed_evaluation/core/api/client.py(3 hunks)src/lightspeed_evaluation/core/api/streaming_parser.py(4 hunks)src/lightspeed_evaluation/core/constants.py(1 hunks)src/lightspeed_evaluation/core/models/api.py(2 hunks)src/lightspeed_evaluation/core/models/data.py(2 hunks)src/lightspeed_evaluation/core/output/generator.py(1 hunks)src/lightspeed_evaluation/core/output/statistics.py(4 hunks)src/lightspeed_evaluation/pipeline/evaluation/amender.py(1 hunks)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py(5 hunks)src/lightspeed_evaluation/runner/evaluation.py(2 hunks)tests/unit/core/api/test_client.py(2 hunks)tests/unit/core/api/test_streaming_parser.py(2 hunks)tests/unit/core/output/test_statistics.py(4 hunks)tests/unit/pipeline/evaluation/test_evaluator.py(2 hunks)tests/unit/runner/test_evaluation.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/api/client.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/output/generator.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/runner/test_evaluation.pytests/unit/core/api/test_client.pytests/unit/pipeline/evaluation/test_evaluator.pytests/unit/core/api/test_streaming_parser.pytests/unit/core/output/test_statistics.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from `core.system.exceptions` for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/models/api.pysrc/lightspeed_evaluation/core/api/streaming_parser.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧠 Learnings (7)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pytests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
🧬 Code graph analysis (6)
tests/unit/core/api/test_client.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
estimate_tokens(12-16)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
estimate_tokens(22-26)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationRequest(440-493)EvaluationResult(378-423)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
TokenCounter(27-42)get_counts(40-42)add_tokens(35-38)
tests/unit/core/api/test_streaming_parser.py (3)
src/lightspeed_evaluation/core/api/client.py (1)
estimate_tokens(22-26)src/lightspeed_evaluation/core/api/streaming_parser.py (1)
estimate_tokens(12-16)tests/unit/core/api/test_client.py (5)
TestEstimateTokens(503-557)test_estimate_tokens_empty_string(506-508)test_estimate_tokens_short_text(514-517)test_estimate_tokens_exact_boundary(519-522)test_estimate_tokens_minimum_one(535-539)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
EvaluationData(317-375)EvaluationResult(378-423)TurnData(35-314)src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_api_token_usage(216-238)calculate_basic_stats(37-73)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
🔇 Additional comments (11)
tests/unit/core/api/test_streaming_parser.py (1)
10-10: Good coverage for streaming_parser's estimate_tokens helper. The added tests validate the same contract as the client-side estimator and cover key edge cases without over-constraining the implementation. Duplication with client tests is acceptable here since both modules expose `estimate_tokens` as part of their public surface. Also applies to: 257-283
tests/unit/pipeline/evaluation/test_evaluator.py (1)
12-15: TokenCounter tests are comprehensive and aligned with the implementation. The new tests cover initialization, accumulation semantics, zero-add behavior, return type, and instance independence, matching the current `TokenCounter` contract and providing solid safety for future changes. Also applies to: 461-513
tests/unit/core/output/test_statistics.py (1)
6-6: Statistics tests correctly exercise new token aggregation behavior. Extending expectations for `calculate_basic_stats` and adding focused suites for `calculate_api_token_usage` and token totals provides good coverage of both judge-LLM and API token accounting. The tests use real `EvaluationData`, `TurnData`, and `EvaluationResult` instances, so they validate model integration as well as numeric correctness. Also applies to: 282-323, 591-769
src/lightspeed_evaluation/core/output/statistics.py (1)
9-10: Token aggregation in statistics is consistent and model-driven. `calculate_basic_stats` now always exposes judge token totals, with safe defaults for empty inputs and simple sums over `EvaluationResult.judge_llm_input_tokens`/`judge_llm_output_tokens`. The new `calculate_api_token_usage` walks `EvaluationData.turns` and aggregates `api_input_tokens` and `api_output_tokens` into clear totals, with a Google-style docstring and proper type hints. This matches the tests and the new token fields on the models. Also applies to: 37-73, 216-238
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
12-17: Streaming parser token tracking is well layered and preserves existing behavior. The new `estimate_tokens` helper and token logic in `parse_streaming_response` correctly prioritize explicit server counts, then streamed token events, and finally a length-based estimate, while continuing to enforce the existing invariants around final response and conversation ID. The extended return dict adds `input_tokens`/`output_tokens` without breaking callers that only rely on `response`, `tool_calls`, and `conversation_id`. Also applies to: 22-32, 53-57, 65-71, 79-87, 88-94
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)
27-43: LGTM - Simple and effective token accumulator. The `TokenCounter` class is clean and straightforward for per-evaluation token tracking. The implementation correctly uses instance variables to avoid cross-contamination between evaluations.
94-106: Token tracking callback properly defined before try block. The callback is now defined before the try block, correctly addressing the previous unbound variable concern. The defensive `hasattr` and `getattr` usage for accessing usage attributes is good practice.
145-151: Callback registration correctly preserves existing callbacks. The implementation properly checks for the existence of `success_callback` and initializes it as an empty list if needed before appending. This addresses the previous concern about overwriting existing callbacks.
181-194: Token counts correctly propagated to EvaluationResult. The successful evaluation path properly retrieves token counts and includes them in the result. This integrates well with the `EvaluationResult` model's new fields.
src/lightspeed_evaluation/core/models/data.py (2)
73-79: Token tracking fields for API usage are well-defined. The `api_input_tokens` and `api_output_tokens` fields follow the existing patterns in `TurnData` with appropriate defaults, constraints (ge=0), and clear descriptions. Good placement alongside other API-populated fields.
410-415: Judge LLM token tracking fields properly added. The `judge_llm_input_tokens` and `judge_llm_output_tokens` fields maintain consistency with the existing field structure and provide clear semantic separation from API token usage. The `ge=0` constraint and default of `0` are appropriate.
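The field declarations being described presumably follow the usual Pydantic pattern, roughly like this sketch (not the file's exact contents):

```python
from pydantic import BaseModel, Field


class EvaluationResultSketch(BaseModel):
    """Illustrative subset of the result model's new token fields."""

    judge_llm_input_tokens: int = Field(
        default=0, ge=0, description="Input tokens consumed by the judge LLM"
    )
    judge_llm_output_tokens: int = Field(
        default=0, ge=0, description="Output tokens produced by the judge LLM"
    )
```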
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
196-201: Optionally harden error-path callback cleanup against library defaults. In the exception path you do:

```python
if token_tracking_callback in litellm.success_callback:
    litellm.success_callback.remove(token_tracking_callback)
```

This assumes `litellm.success_callback` is always an iterable; if some LiteLLM versions initialise it as `None` (or omit the attribute) and an exception is raised before your registration code runs, this membership test can itself raise, masking the original evaluation error. If your supported LiteLLM versions guarantee `success_callback` is always a list, you can ignore this; otherwise, wrapping the check with a simple guard on attribute existence/non-`None` would make the cleanup more robust.
🧹 Nitpick comments (2)
tests/unit/runner/test_evaluation.py (1)
57-65: Consider asserting token fields in runner tests for stronger coverage.Extending the mocked
calculate_basic_statsreturn dicts with the newtotal_judge_llm_*keys is good, but the tests still only assert onTOTAL/PASS/FAIL/ERROR. Optional: add simple expectations on the token fields here to catch any future wiring regressions between the runner andstatistics.calculate_basic_stats, even though the detailed behavior is already well covered intests/unit/core/output/test_statistics.py.Also applies to: 112-120, 210-218
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
27-43: Judge LLM token tracking integration looks good; consider also populating tokens on ERROR results. `TokenCounter` plus the LiteLLM `success_callback` hook correctly accumulates `prompt_tokens`/`completion_tokens` and attaches them to successful `EvaluationResult` instances via `judge_llm_input_tokens`/`judge_llm_output_tokens`. Right now, when you fall back to `_create_error_result` (e.g., `score is None` after some LiteLLM calls), those error results always carry zero tokens, so `calculate_basic_stats` will under-report total judge-LLM tokens for evaluations that failed after consuming tokens. If you want "total tokens consumed by judge LLM" semantics, consider grabbing the counts from `token_counter` and setting them on error results as well (either by extending `_create_error_result` or by mutating the returned object before returning). Also applies to: 91-107, 145-151, 158-161, 164-166, 192-193
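The callback hook being described amounts to roughly the following sketch (LiteLLM custom success callbacks receive `kwargs, completion_response, start_time, end_time`; the helper name and the exact shape of the `usage` object are assumptions):

```python
def make_token_callback(counter):
    """Build a litellm success callback that feeds token usage into `counter` (sketch)."""

    def _callback(kwargs, completion_response, start_time, end_time):
        # Pull prompt/completion token counts from the response's usage block, if present.
        usage = getattr(completion_response, "usage", None)
        if usage is not None:
            counter.add_tokens(
                getattr(usage, "prompt_tokens", 0) or 0,
                getattr(usage, "completion_tokens", 0) or 0,
            )

    return _callback
```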
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
src/lightspeed_evaluation/core/api/streaming_parser.py(3 hunks)src/lightspeed_evaluation/core/constants.py(1 hunks)src/lightspeed_evaluation/core/models/api.py(2 hunks)src/lightspeed_evaluation/core/models/data.py(2 hunks)src/lightspeed_evaluation/core/output/generator.py(1 hunks)src/lightspeed_evaluation/core/output/statistics.py(4 hunks)src/lightspeed_evaluation/pipeline/evaluation/amender.py(1 hunks)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py(5 hunks)src/lightspeed_evaluation/runner/evaluation.py(2 hunks)tests/unit/core/output/test_statistics.py(4 hunks)tests/unit/pipeline/evaluation/test_evaluator.py(2 hunks)tests/unit/runner/test_evaluation.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/constants.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/output/generator.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from `core.system.exceptions` for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/pipeline/evaluation/amender.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/models/api.pysrc/lightspeed_evaluation/core/models/data.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/core/output/test_statistics.pytests/unit/runner/test_evaluation.py
🧠 Learnings (6)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/core/output/statistics.pytests/unit/runner/test_evaluation.pysrc/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/statistics.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationRequest(440-493)EvaluationResult(378-423)
src/lightspeed_evaluation/core/output/statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(317-375)EvaluationResult(378-423)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
EvaluationResult(258-288)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
EvaluationData(317-375)EvaluationResult(378-423)TurnData(35-314)src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_api_token_usage(216-238)calculate_basic_stats(37-73)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: mypy
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
🔇 Additional comments (9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
62-70: API token usage propagation looks correct and backwards compatible. Reading `input_tokens`/`output_tokens` with `getattr(..., 0)` and storing them on `TurnData` aligns with the new fields and keeps older responses (without tokens) working while adding useful debug logging.
tests/unit/core/output/test_statistics.py (3)
6-6: Stats expectations updated consistently with new token fields. The additional imports and the extended `expected` dicts (including `total_judge_llm_*` fields) align with the new behavior of `calculate_basic_stats` and give a strict equality check that will catch shape regressions. Also applies to: 9-9, 290-293, 320-323
591-694: API token usage tests exercise key scenarios well. The new `TestCalculateApiTokenUsage` class covers empty input, single turn, multiple turns, multiple conversations, and explicit zero-token cases, which matches how `calculate_api_token_usage` aggregates `api_input_tokens`/`api_output_tokens` across `EvaluationData.turns`.
696-769: Judge LLM token tracking tests for `calculate_basic_stats` are solid. These tests verify presence of the new token fields, correct summation across multiple `EvaluationResult` instances, default-zero behavior when tokens aren't provided, and the empty-results case, giving good confidence in the extended stats.
src/lightspeed_evaluation/core/models/api.py (1)
93-95: APIResponse token fields and extraction are consistent and safe. Adding `input_tokens`/`output_tokens` as non-negative ints with defaults, and wiring `from_raw_response` to read them (defaulting to 0 when absent), cleanly extends the model without breaking existing callers. Also applies to: 113-124
src/lightspeed_evaluation/core/output/statistics.py (2)
37-51: Extension of `calculate_basic_stats` with judge token totals is coherent. Summing `judge_llm_input_tokens`/`judge_llm_output_tokens` and exposing them (plus their sum) in both empty and non-empty cases matches the new data model and the updated tests, without changing existing count/rate semantics. Also applies to: 58-61, 70-73
216-238: `calculate_api_token_usage` helper is clear and aligns with EvaluationData/TurnData. Iterating over `EvaluationData.turns` and accumulating `api_input_tokens`/`api_output_tokens` into a simple dict (`total_api_input_tokens`, `total_api_output_tokens`, `total_api_tokens`) provides exactly what the higher-level reporting needs, and the Google-style docstring matches the documented guidelines for new public APIs. Based on learnings, this placement in `core/output/statistics` is appropriate for evaluation features.
src/lightspeed_evaluation/core/models/data.py (2)
73-80: TurnData API token fields are well specified. Adding `api_input_tokens`/`api_output_tokens` as non-negative ints with default 0 and clear descriptions fits the per-turn model and supports aggregation in `calculate_api_token_usage` without breaking existing YAML/evaluation data.
410-415: EvaluationResult judge token fields align with downstream statistics. The new `judge_llm_input_tokens` and `judge_llm_output_tokens` fields (int, ge=0, default 0) integrate cleanly with `MetricsEvaluator` and `calculate_basic_stats`, enabling end-to-end judge-LLM token tracking while remaining backward compatible.
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
src/lightspeed_evaluation/core/api/streaming_parser.py(3 hunks)src/lightspeed_evaluation/core/constants.py(1 hunks)src/lightspeed_evaluation/core/llm/__init__.py(2 hunks)src/lightspeed_evaluation/core/llm/custom.py(1 hunks)src/lightspeed_evaluation/core/models/api.py(2 hunks)src/lightspeed_evaluation/core/models/data.py(2 hunks)src/lightspeed_evaluation/core/output/generator.py(1 hunks)src/lightspeed_evaluation/core/output/statistics.py(4 hunks)src/lightspeed_evaluation/pipeline/evaluation/amender.py(1 hunks)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py(4 hunks)src/lightspeed_evaluation/runner/evaluation.py(2 hunks)tests/unit/core/output/test_statistics.py(4 hunks)tests/unit/pipeline/evaluation/test_evaluator.py(2 hunks)tests/unit/runner/test_evaluation.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/constants.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/core/output/test_statistics.pytests/unit/runner/test_evaluation.pytests/unit/pipeline/evaluation/test_evaluator.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from `core.system.exceptions` for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/models/api.pysrc/lightspeed_evaluation/core/output/generator.pysrc/lightspeed_evaluation/core/llm/custom.pysrc/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/core/llm/__init__.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧠 Learnings (6)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/output/statistics.pytests/unit/pipeline/evaluation/test_evaluator.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (5)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
EvaluationData(317-375)EvaluationResult(378-423)TurnData(35-314)src/lightspeed_evaluation/core/output/statistics.py (1)
calculate_api_token_usage(216-238)
src/lightspeed_evaluation/core/output/statistics.py (1)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
EvaluationResult(258-288)
src/lightspeed_evaluation/core/llm/__init__.py (1)
src/lightspeed_evaluation/core/llm/custom.py (2)
BaseCustomLLM(67-132)TokenTracker(10-64)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
src/lightspeed_evaluation/core/llm/custom.py (5)
TokenTracker(10-64)get_counts(57-59)reset(61-64)start(40-47)stop(49-55)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/llm/custom.py (4)
TokenTracker(10-64)
start(40-47)
stop(49-55)
get_counts(57-59)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationRequest(440-493)EvaluationResult(378-423)
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/core/llm/custom.py
[error] 34-34: D403: First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM')
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/llm/custom.py
[warning] 29-29: Pylint: Unused argument 'kwargs' (unused-argument)
[warning] 31-31: Pylint: Unused argument 'start_time' (unused-argument)
[warning] 32-32: Pylint: Unused argument 'end_time' (unused-argument)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
🔇 Additional comments (16)
src/lightspeed_evaluation/core/llm/__init__.py (1)
9-9: LGTM! TokenTracker properly exposed via lazy imports.The new TokenTracker export follows the existing lazy-import pattern and integrates cleanly with the public API surface.
Also applies to: 22-22
src/lightspeed_evaluation/core/output/generator.py (1)
244-253: LGTM! Token usage reporting integrates cleanly.The new token usage block follows the existing report structure and formatting conventions. The statistics are pulled from the overall stats dictionary and displayed with appropriate number formatting.
tests/unit/runner/test_evaluation.py (1)
57-65: LGTM! Test mocks updated to match new stats shape. The mock return values have been properly extended to include the three new token-tracking fields required by the updated calculate_basic_stats function.
Also applies to: 112-120, 210-218
src/lightspeed_evaluation/core/models/data.py (1)
73-79: LGTM! Token fields properly integrated into data models. The new token-tracking fields follow existing patterns with appropriate validation constraints (ge=0), sensible defaults (0), and clear descriptions. Both TurnData (API tokens) and EvaluationResult (judge LLM tokens) are extended consistently.
Also applies to: 410-415
tests/unit/core/output/test_statistics.py (3)
6-9: LGTM! Existing tests properly updated for new token fields.The updated imports and test assertions correctly accommodate the new token-tracking fields with appropriate zero defaults.
Also applies to: 290-293, 320-323
591-694: LGTM! Comprehensive test coverage for API token usage calculations. The new test class thoroughly exercises calculate_api_token_usage across all relevant scenarios: empty data, single turn, multiple turns, multiple conversations, and zero tokens. Tests are well-structured and assertions are clear.
696-769: LGTM! Thorough verification of token field integration in basic stats. The new test class validates that calculate_basic_stats correctly includes, sums, and defaults token fields. Coverage includes populated values, zero defaults, and empty results.
tests/unit/pipeline/evaluation/test_evaluator.py (1)
5-5: LGTM! Thorough TokenTracker test coverage.The new test class comprehensively validates TokenTracker's initialization, lifecycle (start/stop), reset behavior, idempotency, and instance independence. All tests are well-structured and follow pytest conventions.
Also applies to: 459-517
src/lightspeed_evaluation/core/models/api.py (1)
93-94: LGTM! Token fields properly integrated into API response model. The new input_tokens and output_tokens fields are correctly defined with appropriate validation constraints and defaults. The from_raw_response method cleanly extracts these values with sensible fallbacks.
Also applies to: 113-116, 122-123
src/lightspeed_evaluation/core/output/statistics.py (3)
9-9: LGTM: Import updated to support API token calculation. The addition of EvaluationData to the imports is necessary for the new calculate_api_token_usage function and properly extends the module's capability to handle both judge LLM and API token aggregation.
48-72: LGTM: Judge LLM token aggregation implemented correctly; a sketch of the aggregation follows the list. The token tracking additions are well-integrated:
- Empty results properly initialize all token fields to 0
- Token aggregation uses efficient generator expressions
- Total tokens correctly computed as sum of input and output
- Field naming is consistent and clear
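As a rough, non-authoritative sketch of the aggregation just described (the field and key names are taken from this review thread, not verified against the merged code), the summation amounts to:

```python
from typing import Any


def summarize_judge_tokens(results: list[Any]) -> dict[str, int]:
    """Sum judge-LLM token counts from evaluation results (illustrative only)."""
    if not results:
        # Empty results still produce explicit zero totals.
        return {
            "total_judge_llm_input_tokens": 0,
            "total_judge_llm_output_tokens": 0,
            "total_judge_llm_tokens": 0,
        }
    input_tokens = sum(r.judge_llm_input_tokens for r in results)
    output_tokens = sum(r.judge_llm_output_tokens for r in results)
    return {
        "total_judge_llm_input_tokens": input_tokens,
        "total_judge_llm_output_tokens": output_tokens,
        "total_judge_llm_tokens": input_tokens + output_tokens,
    }
```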
216-238: LGTM: API token calculation function is well-implemented (see the sketch after this list). The function correctly:
- Follows Google-style docstring format as required by coding guidelines
- Uses proper type hints
- Iterates through conversation data and turns to aggregate API token usage
- Returns a consistently structured dictionary
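A minimal sketch of that iteration, assuming each conversation exposes a turns list whose items carry api_input_tokens and api_output_tokens as discussed in this PR; the real function may differ in naming and error handling:

```python
from typing import Any


def sum_api_tokens(evaluation_data: list[Any]) -> dict[str, int]:
    """Aggregate API token usage across all conversations and turns (sketch)."""
    total_input = 0
    total_output = 0
    for conversation in evaluation_data:
        for turn in conversation.turns:
            total_input += turn.api_input_tokens
            total_output += turn.api_output_tokens
    return {
        "total_api_input_tokens": total_input,
        "total_api_output_tokens": total_output,
        "total_api_tokens": total_input + total_output,
    }
```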
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)
8-8: LGTM: TokenTracker import enables judge LLM token measurement. The import of TokenTracker from the custom LLM module provides the necessary infrastructure for tracking token usage during metric evaluation.
72-73: LGTM: Token tracking integration is correct. The token tracking lifecycle is properly implemented:
- Per-evaluation TokenTracker instance prevents cross-evaluation contamination
- start() called before evaluation to register callbacks
- stop() called after evaluation to unregister callbacks
- Token counts retrieved and used in result construction
Also applies to: 112-113, 120-121, 125-126
153-154: LGTM: Token counts correctly added to evaluation result. The judge LLM token counts are properly passed to the EvaluationResult constructor, aligning with the updated model definition that includes judge_llm_input_tokens and judge_llm_output_tokens fields.
160-161: LGTM: Token tracking properly cleaned up in error path. Calling token_tracker.stop() in the exception handler ensures that the litellm callback is unregistered even when evaluation fails, preventing callback leaks.
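To make the lifecycle above concrete, here is a hedged sketch of how a per-evaluation tracker can be scoped around the judge-LLM call; the handler interface and result keys are assumptions based on this thread, not the actual evaluator API:

```python
import time
from typing import Any


def evaluate_with_token_tracking(handler: Any, request: Any, tracker: Any) -> dict[str, Any]:
    """Run one metric evaluation while capturing judge-LLM token usage (sketch)."""
    start = time.time()
    tracker.start()  # register the litellm success callback
    try:
        score, reason = handler.evaluate(request)
    finally:
        tracker.stop()  # always unregister, even if evaluation raises
    input_tokens, output_tokens = tracker.get_counts()
    return {
        "score": score,
        "reason": reason,
        "execution_time": time.time() - start,
        "judge_llm_input_tokens": input_tokens,
        "judge_llm_output_tokens": output_tokens,
    }
```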
Force-pushed from c8daa18 to e753fe8 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/llm/custom.py (1)
34-34: Pipeline failure: D403 pydocstyle error requires attention. The CI reports a D403 error: "First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM')". While "LiteLLM" is the correct brand name, pydocstyle enforces sentence-case rules.
Options to resolve:
- Reword so the docstring does not start with the brand name, e.g. change """LiteLLM callback to capture token usage.""" to """Capture token usage from LiteLLM completion responses."""
- Or add a pydocstyle ignore directive if preserving brand casing is important.
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
14-22: Return type annotation should reflect the new token fields. The function returns a dict[str, Any] which accommodates the new fields, but consider documenting the expected return structure in the docstring for clarity. The current docstring "Parse streaming response and extract data." is minimal:
def parse_streaming_response(  # pylint: disable=too-many-branches
    response: httpx.Response,
) -> dict[str, Any]:
-    """Parse streaming response and extract data."""
+    """Parse streaming response and extract data.
+
+    Returns:
+        Dictionary with keys: response, tool_calls, conversation_id,
+        input_tokens, output_tokens.
+    """
    conversation_id = ""
Also applies to: 66-72
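For context, a simplified sketch of how token counts can be pulled from the terminal "end" event while the rest of the stream is handled as before; the event shape is an assumption based on this discussion, not the actual lightspeed-stack contract:

```python
import json
from typing import Any

DATA_PREFIX = "data: "


def parse_stream_lines(lines: list[str]) -> dict[str, Any]:
    """Collect response text and token counts from streamed event lines (sketch)."""
    result: dict[str, Any] = {"response": "", "input_tokens": 0, "output_tokens": 0}
    for line in lines:
        if not line.startswith(DATA_PREFIX):
            continue
        event = json.loads(line.removeprefix(DATA_PREFIX))
        data = event.get("data", {})
        if event.get("event") == "token":
            result["response"] += data.get("token", "")
        elif event.get("event") == "end":
            # Token counts are only present on the terminal event.
            result["input_tokens"] = data.get("input_tokens", 0)
            result["output_tokens"] = data.get("output_tokens", 0)
    return result
```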
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
124-138: Consider using try/finally for cleaner token tracker lifecycle management. The current implementation correctly stops the token tracker in both success and exception paths, but a try/finally pattern would be more idiomatic and prevent potential issues if new code paths are added later.
     # Start token tracking
     token_tracker.start()

-    # Evaluate metric
-    score, reason = self.handlers[framework].evaluate(  # type: ignore
-        metric_name, request.conv_data, evaluation_scope
-    )
-
-    # Stop token tracking
-    token_tracker.stop()
+    try:
+        # Evaluate metric
+        score, reason = self.handlers[framework].evaluate(  # type: ignore
+            metric_name, request.conv_data, evaluation_scope
+        )
+    finally:
+        # Stop token tracking
+        token_tracker.stop()

     execution_time = time.time() - start_time

     # Get token counts
     judge_input_tokens, judge_output_tokens = token_tracker.get_counts()
This would also allow removing the token_tracker.stop() call in the outer exception handler (line 173).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
src/lightspeed_evaluation/core/api/streaming_parser.py (3 hunks)
src/lightspeed_evaluation/core/constants.py (1 hunks)
src/lightspeed_evaluation/core/llm/__init__.py (2 hunks)
src/lightspeed_evaluation/core/llm/custom.py (1 hunks)
src/lightspeed_evaluation/core/models/api.py (2 hunks)
src/lightspeed_evaluation/core/models/data.py (2 hunks)
src/lightspeed_evaluation/core/output/generator.py (1 hunks)
src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
src/lightspeed_evaluation/runner/evaluation.py (2 hunks)
tests/unit/core/output/test_statistics.py (4 hunks)
tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/llm/__init__.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- tests/unit/runner/test_evaluation.py
- tests/unit/pipeline/evaluation/test_evaluator.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/runner/evaluation.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/llm/custom.py
src/lightspeed_evaluation/core/output/generator.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes
Files:
tests/unit/core/output/test_statistics.py
🧠 Learnings (8)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/runner/evaluation.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/statistics.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.
Applied to files:
src/lightspeed_evaluation/runner/evaluation.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/runner/evaluation.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_api_token_usage(216-238)calculate_basic_stats(37-73)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/llm/custom.py (4)
TokenTracker(10-64)
start(40-47)
stop(49-55)
get_counts(57-59)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationRequest(440-493)EvaluationResult(378-423)
🪛 GitHub Actions: Pydocstyle
src/lightspeed_evaluation/core/llm/custom.py
[error] 34-34: Docstyle error: D403 - First word of the first line should be properly capitalized ('Litellm', not 'LiteLLM').
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: mypy
- GitHub Check: Pylinter
🔇 Additional comments (19)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
52-57: Token extraction from end event looks correct. The conditional extraction with a dictionary .get() pattern would be cleaner and safer than "in" checks, but the current approach is functionally correct. The independent extraction from the "end" event doesn't interfere with final_response construction.
src/lightspeed_evaluation/core/models/api.py (2)
93-94: Well-structured token fields with proper validation. The Field(default=0, ge=0) constraints ensure non-negative integers with sensible defaults. The descriptive field documentation is clear.
113-123: Clean extraction pattern for token counts. Using .get() with default 0 aligns with the field defaults and handles missing keys gracefully. The implementation correctly propagates tokens from raw response data to the model.
src/lightspeed_evaluation/core/output/generator.py (1)
244-253: Token usage section integrates well with the existing report structure. The formatting with thousands separators (:,) improves readability. The section placement after overall statistics is logical. The implementation correctly references the keys from calculate_basic_stats.
src/lightspeed_evaluation/runner/evaluation.py (2)
39-42: Import structure is clean and follows existing patterns. The grouped import from lightspeed_evaluation.core.output.statistics maintains consistency with the codebase style.
98-117: Token usage summary provides good visibility into resource consumption. The implementation correctly:
- Displays judge LLM tokens unconditionally (always available from evaluation results)
- Conditionally displays API tokens only when system_config.api.enabled is true
- Computes combined total when both are available
Verify that the amender correctly populates api_input_tokens and api_output_tokens on TurnData objects during pipeline.run_evaluation, as the token usage calculation depends on these fields being populated.
src/lightspeed_evaluation/core/llm/custom.py (3)
10-19: Well-documented class with clear usage pattern. The docstring provides a helpful usage example. The lifecycle pattern (start → use → stop → get_counts) is intuitive.
57-64: Utility methods are clean and straightforward. get_counts() returns a tuple, which is appropriate for the two-value return; reset() is useful for reusing the same tracker across multiple evaluation runs.
40-55: Callback registration is thread-safe for a single instance, but consider concurrent usage. The _callback_registered flag prevents double-registration within a single instance. However, if multiple TokenTracker instances are used concurrently (e.g., parallel evaluations), they would all append to the same global litellm.success_callback list, potentially causing token counts to be accumulated across instances. If concurrent usage is expected, consider using instance-specific identification or documenting that only one tracker should be active at a time.
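For readers who don't have the diff open, a self-contained sketch of the callback-based approach under review follows; the method names mirror the discussion above, while the internals are assumptions rather than the merged implementation:

```python
import litellm


class TokenTrackerSketch:
    """Accumulate judge-LLM token usage via a litellm success callback (sketch)."""

    def __init__(self) -> None:
        self.input_tokens = 0
        self.output_tokens = 0
        self._callback_registered = False

    def _callback(self, kwargs, completion_response, start_time, end_time) -> None:
        # litellm invokes success callbacks with the completion response;
        # usage fields may be missing for some providers, so default to 0.
        usage = getattr(completion_response, "usage", None)
        if usage is not None:
            self.input_tokens += getattr(usage, "prompt_tokens", 0) or 0
            self.output_tokens += getattr(usage, "completion_tokens", 0) or 0

    def start(self) -> None:
        if not self._callback_registered:
            litellm.success_callback.append(self._callback)
            self._callback_registered = True

    def stop(self) -> None:
        if self._callback_registered and self._callback in litellm.success_callback:
            litellm.success_callback.remove(self._callback)
        self._callback_registered = False

    def get_counts(self) -> tuple[int, int]:
        return self.input_tokens, self.output_tokens
```

Because litellm.success_callback is process-global, the single-active-tracker caveat raised above applies to this sketch as well.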
tests/unit/core/output/test_statistics.py (4)
6-12: LGTM! Imports are correctly updated for the new token tracking tests. The imports now include the EvaluationData and TurnData models alongside the new calculate_api_token_usage function, which are all necessary for the new test cases.
290-293: LGTM! Token field assertions correctly added to existing test expectations. The test expectations are properly updated to include the new total_judge_llm_input_tokens, total_judge_llm_output_tokens, and total_judge_llm_tokens fields with zero values, correctly reflecting that results without explicit token counts default to zero.
591-693: LGTM! Comprehensive test coverage for calculate_api_token_usage. The test class covers all key scenarios:
- Empty data handling (returns zeros)
- Single turn token aggregation
- Multiple turns within a conversation
- Multiple conversations aggregation
- Zero token edge case
Test naming follows the test_* convention as required by coding guidelines.
696-769: LGTM! Well-structured tests for judge LLM token tracking in basic stats.The test class properly validates:
- Token fields are present in the output
- Token values are correctly summed across results (300 input + 150 output = 450 total)
- Results without explicit tokens default to zero
- Empty results return zero tokens
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
8-8: LGTM! TokenTracker import correctly added. The import from lightspeed_evaluation.core.llm.custom aligns with the new token tracking infrastructure.
66-81: LGTM! Docstring updated to Google-style format.The docstring now properly documents the method's purpose, args, and return values including the new token tracking behavior. This addresses the previous review comment.
169-176: Error handling correctly stops token tracking. The exception handler properly ensures the token tracker is stopped, and the error result intentionally omits token counts since partial counts from failed evaluations could be misleading. The TokenTracker.stop() implementation is safe to call even if tracking wasn't started.
src/lightspeed_evaluation/core/output/statistics.py (3)
9-9: LGTM! Import updated to include EvaluationData. The import correctly adds EvaluationData, which is needed by the new calculate_api_token_usage function.
48-72: LGTM! Token tracking fields correctly added to basic stats. The implementation properly:
- Returns zero values for empty results
- Sums judge_llm_input_tokens and judge_llm_output_tokens from each result
- Calculates the total as the sum of input and output tokens
This is consistent with the EvaluationResult model, which defaults these fields to 0.
216-238: Verify that TurnData has api_input_tokens and api_output_tokens fields. The function implementation appears sound on code review:
- Type hints are present: list[EvaluationData] -> dict[str, Any]
- Docstring follows Google-style format
- Logic correctly iterates through conversations and turns to aggregate token counts
- Return structure is consistent
However, verification is needed to confirm:
- TurnData model contains the api_input_tokens and api_output_tokens fields being accessed
- Function follows the coding guideline requirement for structured logging (if applicable for this context)
- No error handling needed for missing or invalid token values
VladimirKadlec
left a comment
LGTM, thank you 👍
asamal4
left a comment
Both the JSON and CSV files should also contain the token information for both the API and the judge LLM.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/output/generator.py (1)
199-206: Fix pylint “too many arguments” by passing a consolidatedstatsdict into JSON/TXT summariesThe new
token_usagesections and per-resultjudge_llm_*_tokensfields in both JSON and text summaries are well-structured and align with the aggregated statistics. However, pylint is correctly flagging_generate_json_summaryand_generate_text_summaryfor having too many parameters / positional arguments (R0913/R0917), which will keep CI red until addressed.A straightforward way to resolve this without losing clarity is to pass the existing
statsdict into these helpers instead of three separate dict arguments:@@ - def _generate_json_summary( - self, - results: list[EvaluationResult], - base_filename: str, - basic_stats: dict[str, Any], - detailed_stats: dict[str, Any], - api_tokens: dict[str, Any], - ) -> Path: + def _generate_json_summary( + self, + results: list[EvaluationResult], + base_filename: str, + stats: dict[str, Any], + ) -> Path: @@ - summary = { + basic_stats = stats["basic"] + detailed_stats = stats["detailed"] + api_tokens = stats.get("api_tokens", {}) + + summary: dict[str, Any] = { "timestamp": datetime.now().isoformat(), "total_evaluations": len(results), "summary_stats": { - "overall": basic_stats, - "by_metric": detailed_stats["by_metric"], - "by_conversation": detailed_stats["by_conversation"], + "overall": basic_stats, + "by_metric": detailed_stats["by_metric"], + "by_conversation": detailed_stats["by_conversation"], }, @@ - "results": [ + "results": [ { @@ - "execution_time": round(r.execution_time, 3), - "judge_llm_input_tokens": r.judge_llm_input_tokens, - "judge_llm_output_tokens": r.judge_llm_output_tokens, + "execution_time": round(r.execution_time, 3), + "judge_llm_input_tokens": r.judge_llm_input_tokens, + "judge_llm_output_tokens": r.judge_llm_output_tokens, } for r in results ], } @@ - def _generate_text_summary( - self, - results: list[EvaluationResult], - base_filename: str, - basic_stats: dict[str, Any], - detailed_stats: dict[str, Any], - api_tokens: dict[str, Any], - ) -> Path: + def _generate_text_summary( + self, + results: list[EvaluationResult], + base_filename: str, + stats: dict[str, Any], + ) -> Path: @@ - stats = { - "overall": basic_stats, - "by_metric": detailed_stats["by_metric"], - "by_conversation": detailed_stats["by_conversation"], - "api_tokens": api_tokens, - } + stats = { + "overall": stats["basic"], + "by_metric": stats["detailed"]["by_metric"], + "by_conversation": stats["detailed"]["by_conversation"], + "api_tokens": stats.get("api_tokens", {}), + }And update the call sites in
_generate_individual_reportsaccordingly:- if "json" in enabled_outputs: - json_file = self._generate_json_summary( - results, - base_filename, - stats["basic"], - stats["detailed"], - stats.get("api_tokens", {}), - ) + if "json" in enabled_outputs: + json_file = self._generate_json_summary( + results, + base_filename, + stats, + ) @@ - if "txt" in enabled_outputs: - txt_file = self._generate_text_summary( - results, - base_filename, - stats["basic"], - stats["detailed"], - stats.get("api_tokens", {}), - ) + if "txt" in enabled_outputs: + txt_file = self._generate_text_summary( + results, + base_filename, + stats, + )This reduces both helper methods to 4 parameters (including
self), fixes the Pylint R0913/R0917 failures, and keeps all token-usage behavior unchanged.Also applies to: 218-245, 253-260, 268-269, 291-313
🧹 Nitpick comments (3)
tests/unit/core/output/test_generator.py (1)
106-110: Generator tests correctly exercise API token plumbing; consider slightly stricter TXT assertion. Passing an explicit api_tokens dict into the JSON/TXT summary helpers and asserting on the JSON token_usage.api.total_tokens plus the TXT “Token Usage (API Calls)” section ensures the new API token path is exercised. If you want to make the TXT test a bit stronger, you could also assert that the rendered text includes the concrete totals (e.g. "300") to catch formatting regressions, but the current checks are acceptable.
Also applies to: 113-114, 132-136, 139-140, 147-148
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
8-9: TokenTracker integration is correct; consider whether to include tokens on error results and document concurrency assumptions. The way TokenTracker is wired into evaluate_metric looks sound: you create a scoped tracker per evaluation, start it immediately before invoking the framework handler, stop it right after, and also stop it again in the except path, relying on the tracker's internal _callback_registered flag to avoid double-unregistering. Attaching judge_llm_input_tokens / judge_llm_output_tokens to the success path of EvaluationResult cleanly exposes judge-side usage to downstream statistics.
Two non-blocking points to consider:
- Token counts on ERROR results: when score is None or an exception is raised, you return _create_error_result(...), which doesn't accept token counts, so EvaluationResult falls back to the default 0 values for judge tokens. If judge LLM calls can happen before an error is signalled by the metric handler, your aggregate judge token stats will under-report usage for those ERROR cases. If you want totals to reflect all judge LLM usage, you could thread judge_input_tokens / judge_output_tokens through _create_error_result and set them there.
- Concurrency and global callbacks: TokenTracker appends a callback to litellm.success_callback and removes it in stop(). This works well for the current single-evaluation flow, but if evaluations are ever run concurrently in the same process, callbacks from multiple trackers will all see all completions, and each tracker will accumulate tokens for every active evaluation. It may be worth at least documenting the "single active evaluation per process" assumption, or in the future moving to a more scoped mechanism (e.g., a context manager around LiteLLM calls) if parallel evaluation is introduced.
Overall, the integration meets the PR's objectives; these are future-proofing considerations rather than blockers.
Also applies to: 66-86, 124-139, 155-167, 169-176
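The "more scoped mechanism" mentioned above could look roughly like the following context manager; this is purely illustrative and still relies on litellm's global callback list, so it narrows the tracking window rather than solving concurrent use:

```python
from contextlib import contextmanager
from typing import Iterator

import litellm


@contextmanager
def track_judge_tokens() -> Iterator[dict[str, int]]:
    """Accumulate token usage for litellm calls made inside the with-block (sketch)."""
    counts = {"input": 0, "output": 0}

    def _on_success(kwargs, completion_response, start_time, end_time) -> None:
        usage = getattr(completion_response, "usage", None)
        if usage is not None:
            counts["input"] += getattr(usage, "prompt_tokens", 0) or 0
            counts["output"] += getattr(usage, "completion_tokens", 0) or 0

    litellm.success_callback.append(_on_success)
    try:
        yield counts
    finally:
        litellm.success_callback.remove(_on_success)
```

Usage would be `with track_judge_tokens() as counts:` around the handler call, reading `counts` afterwards.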
src/lightspeed_evaluation/core/output/generator.py (1)
41-51: New evaluation_data parameter and API token aggregation look correct. The generate_reports signature and docstring cleanly extend the API to accept an optional evaluation_data, and the stats["api_tokens"] population (either via calculate_api_token_usage or a zeroed default) ensures downstream consumers always see a consistent shape. As a small clarity improvement, you could make the conditional explicit so an empty list is treated the same as a non-empty one, without changing semantics:
-    if evaluation_data:
+    if evaluation_data is not None:
         stats["api_tokens"] = calculate_api_token_usage(evaluation_data)
Also applies to: 68-76
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
src/lightspeed_evaluation/core/api/streaming_parser.py(3 hunks)src/lightspeed_evaluation/core/constants.py(1 hunks)src/lightspeed_evaluation/core/llm/__init__.py(2 hunks)src/lightspeed_evaluation/core/llm/custom.py(1 hunks)src/lightspeed_evaluation/core/models/api.py(2 hunks)src/lightspeed_evaluation/core/models/data.py(2 hunks)src/lightspeed_evaluation/core/output/generator.py(10 hunks)src/lightspeed_evaluation/core/output/statistics.py(4 hunks)src/lightspeed_evaluation/pipeline/evaluation/amender.py(1 hunks)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py(4 hunks)src/lightspeed_evaluation/runner/evaluation.py(3 hunks)tests/unit/core/output/test_generator.py(2 hunks)tests/unit/core/output/test_statistics.py(4 hunks)tests/unit/pipeline/evaluation/test_evaluator.py(2 hunks)tests/unit/runner/test_evaluation.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/llm/custom.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- src/lightspeed_evaluation/core/llm/__init__.py
- tests/unit/runner/test_evaluation.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions fromcore.system.exceptionsfor error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming conventiontest_*.pyfor files,test_*for functions, andTest*for classes
Files:
tests/unit/core/output/test_generator.py
tests/unit/core/output/test_statistics.py
🧠 Learnings (7)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/llm/custom.py (4)
TokenTracker(10-64)start(40-47)stop(49-55)get_counts(57-59)src/lightspeed_evaluation/core/models/data.py (2)
EvaluationRequest(440-493)EvaluationResult(378-423)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (2)
_generate_json_summary(199-251)_generate_text_summary(253-362)
src/lightspeed_evaluation/core/output/generator.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(317-375)EvaluationResult(378-423)src/lightspeed_evaluation/core/output/statistics.py (1)
calculate_api_token_usage(216-238)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
EvaluationData(317-375)EvaluationResult(378-423)TurnData(35-314)src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_api_token_usage(216-238)calculate_basic_stats(37-73)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/output/generator.py
[error] 199-199: pylint: R0913: Too many arguments (6/5) (too-many-arguments)
[error] 199-199: pylint: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
[error] 253-253: pylint: R0913: Too many arguments (6/5) (too-many-arguments)
[error] 253-253: pylint: R0917: Too many positional arguments (6/5) (too-many-positional-arguments)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: mypy
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
🔇 Additional comments (8)
src/lightspeed_evaluation/core/constants.py (1)
24-37: CSV columns extended for judge LLM tokens are consistent with model changesThe new
judge_llm_input_tokensandjudge_llm_output_tokensentries align with the added fields onEvaluationResultand keep CSV exports in sync with the in‑memory model; no issues from a correctness or compatibility standpoint.src/lightspeed_evaluation/core/models/data.py (1)
73-80: Token fields on TurnData and EvaluationResult are well-shaped and backward compatibleAdding
api_input_tokens/api_output_tokenstoTurnDataandjudge_llm_input_tokens/judge_llm_output_tokenstoEvaluationResultwithdefault=0andge=0keeps existing payloads valid underextra="forbid"while enforcing sensible constraints for counts; this pairs cleanly with the statistics helpers and output generators.Also applies to: 410-415
src/lightspeed_evaluation/core/models/api.py (1)
93-95: APIResponse token fields correctly bridge raw responses into the modelPopulating
input_tokensandoutput_tokensonAPIResponseand wiring them throughfrom_raw_responsewith default0preserves compatibility with older responses while exposing token data when provided by the backend; this is a clean extension of the public API.Also applies to: 113-123
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
62-70: API token usage propagation into TurnData looks correct and defensiveCopying
input_tokens/output_tokensfromapi_responseontoturn_data.api_input_tokens/api_output_tokenswith agetattr(..., 0)fallback, plus a debug log, gives robust token tracking while remaining compatible with responses that lack these fields.tests/unit/core/output/test_statistics.py (1)
6-12: New statistics tests give solid coverage for API and judge token aggregationExtending basic-stats expectations with
total_judge_llm_*keys and adding dedicated suites forcalculate_api_token_usageand token-awarecalculate_basic_statsexercises the main scenarios (empty, single/multi-turn, multi-conversation, zero tokens) and validates both presence and correct summation of the new fields; this is a good level of coverage for the token usage feature.Also applies to: 280-324, 591-769
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
14-16: Streaming parser cleanly surfaces end‑event token counts into the payloadInitializing
input_tokens/output_tokensto 0, updating them only on the"end"event when those fields are present, and returning them alongsideresponse/conversation_idgives a backward-compatible way to expose quota-based token counts from lightspeed-stack instead of estimating them locally. This matches the earlier guidance to rely on stack‑provided token data rather than heuristics. Based on learnings, this is the preferred direction.Also applies to: 21-22, 52-57, 70-72
src/lightspeed_evaluation/core/output/generator.py (2)
15-20: Imports forEvaluationDataand API token statistics are appropriateImporting
EvaluationDataandcalculate_api_token_usagehere is the right place to wire evaluation-time token data into the reporting layer; no issues with these additions.Based on learnings, this keeps new evaluation features within
src/lightspeed_evaluation/as intended.
101-131: Wiringapi_tokensinto JSON/TXT report generation is consistentPassing
stats["basic"],stats["detailed"], andstats.get("api_tokens", {})into the JSON and TXT generators keeps all report types in sync with the new token accounting without altering existing behavior whenevaluation_datais absent (defaults remain zero).No functional issues spotted in this wiring.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/llm/custom.py (1)
10-65: Consider adding Google-style docstrings to public methods. The TokenTracker class is well-implemented with clear lifecycle management. Consider enhancing the docstrings for public methods (start, stop, get_counts, reset) to follow the Google-style format with Args and Returns sections for better API documentation clarity.
Based on coding guidelines, which require Google-style docstrings for all public APIs.
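As a small illustration of the requested style (the wording and class are invented for this example, not taken from the PR):

```python
class TrackerDocstringExample:
    """Minimal holder used only to illustrate Google-style docstrings."""

    def __init__(self) -> None:
        self.input_tokens = 0
        self.output_tokens = 0

    def get_counts(self) -> tuple[int, int]:
        """Return the accumulated token counts.

        Returns:
            tuple[int, int]: Input (prompt) and output (completion) token
            totals captured since tracking started.
        """
        return self.input_tokens, self.output_tokens
```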
src/lightspeed_evaluation/core/output/generator.py (2)
199-219: JSON summary properly includes token usage data.The token usage is correctly merged into overall stats and included in per-result output. The calculation of total_tokens as the sum of Judge LLM and API tokens is appropriate.
Note: The method now has 6 parameters, exceeding pylint's default limit of 5. Consider refactoring to pass a single stats dictionary instead of separate basic_stats, detailed_stats, and api_tokens parameters to reduce argument count.
Also applies to: 238-239
250-310: Text summary properly formatted with token usage sections.The token usage sections for Judge LLM and API Calls are well-formatted with clear labels and thousands separators. The output structure is user-friendly and informative.
Note: Similar to the JSON generation method, this method exceeds pylint's parameter limit (6 > 5). Consider the same refactoring suggestion to consolidate stats parameters into a single dictionary.
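One hedged way to stay under the pylint limit is to pass the single consolidated stats mapping the caller already holds, instead of three separate dictionaries; the key names below follow this review's discussion, and the real helpers build much fuller reports:

```python
import json
from pathlib import Path
from typing import Any


def generate_json_summary(
    results: list[Any], base_filename: str, stats: dict[str, Any]
) -> Path:
    """Write a JSON summary from one consolidated stats mapping (sketch)."""
    summary = {
        "total_evaluations": len(results),
        "overall": stats["basic"],
        "by_metric": stats["detailed"]["by_metric"],
        "token_usage": {"api": stats.get("api_tokens", {})},
    }
    out_path = Path(f"{base_filename}_summary.json")
    out_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return out_path
```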
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
src/lightspeed_evaluation/core/api/streaming_parser.py(3 hunks)src/lightspeed_evaluation/core/constants.py(1 hunks)src/lightspeed_evaluation/core/llm/__init__.py(2 hunks)src/lightspeed_evaluation/core/llm/custom.py(1 hunks)src/lightspeed_evaluation/core/models/api.py(2 hunks)src/lightspeed_evaluation/core/models/data.py(2 hunks)src/lightspeed_evaluation/core/output/generator.py(9 hunks)src/lightspeed_evaluation/core/output/statistics.py(4 hunks)src/lightspeed_evaluation/pipeline/evaluation/amender.py(1 hunks)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py(5 hunks)src/lightspeed_evaluation/runner/evaluation.py(3 hunks)tests/unit/core/output/test_generator.py(2 hunks)tests/unit/core/output/test_statistics.py(4 hunks)tests/unit/pipeline/evaluation/test_evaluator.py(2 hunks)tests/unit/runner/test_evaluation.py(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/api/streaming_parser.py
- tests/unit/runner/test_evaluation.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/core/output/test_generator.py
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions fromcore.system.exceptionsfor error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/llm/custom.py
src/lightspeed_evaluation/runner/evaluation.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/llm/__init__.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming conventiontest_*.pyfor files,test_*for functions, andTest*for classes
Files:
tests/unit/core/output/test_statistics.py
🧠 Learnings (7)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
src/lightspeed_evaluation/runner/evaluation.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.
Applied to files:
src/lightspeed_evaluation/runner/evaluation.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to src/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/statistics.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/runner/evaluation.py (2)
src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_api_token_usage(216-238)calculate_basic_stats(37-73)src/lightspeed_evaluation/core/output/generator.py (1)
generate_reports(41-88)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
EvaluationData(317-375)EvaluationResult(378-427)TurnData(35-314)src/lightspeed_evaluation/core/output/statistics.py (1)
calculate_api_token_usage(216-238)
src/lightspeed_evaluation/core/llm/__init__.py (1)
src/lightspeed_evaluation/core/llm/custom.py (2)
BaseCustomLLM(67-132)TokenTracker(10-64)
src/lightspeed_evaluation/core/output/statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(317-375)EvaluationResult(378-427)lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/models.py (1)
EvaluationResult(258-288)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/output/generator.py
[error] 199-199: R0913: Too many arguments (6/5) (too-many-arguments)
[error] 250-250: R0913: Too many arguments (6/5) (too-many-arguments)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: mypy
🔇 Additional comments (11)
src/lightspeed_evaluation/core/models/api.py (1)
93-94: LGTM! Token tracking fields properly integrated. The input_tokens and output_tokens fields are well-defined with appropriate validation (ge=0) and defaults. The extraction from raw_data safely handles missing values by defaulting to 0.
Also applies to: 113-116, 122-123
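A stripped-down sketch of that pattern, limited to the token-related fields (the real APIResponse model carries more data):

```python
from typing import Any

from pydantic import BaseModel, Field


class APIResponseSketch(BaseModel):
    """Subset of an API response model showing the token-count fields."""

    response: str = ""
    input_tokens: int = Field(default=0, ge=0, description="Tokens sent to the model")
    output_tokens: int = Field(default=0, ge=0, description="Tokens returned by the model")

    @classmethod
    def from_raw_response(cls, raw: dict[str, Any]) -> "APIResponseSketch":
        # Missing keys fall back to 0, matching the field defaults.
        return cls(
            response=raw.get("response", ""),
            input_tokens=raw.get("input_tokens", 0),
            output_tokens=raw.get("output_tokens", 0),
        )
```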
src/lightspeed_evaluation/core/llm/__init__.py (1)
9-9: LGTM! TokenTracker correctly exposed via lazy imports.The TokenTracker is properly added to both TYPE_CHECKING imports and the lazy import mapping, following the same pattern as BaseCustomLLM.
Also applies to: 22-22
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
84-85: LGTM! Token tracking lifecycle properly managed.The TokenTracker is correctly initialized per evaluation and started/stopped around the metric evaluation. Token counts are properly captured and passed to the result.
Also applies to: 124-138
175-205: Verify whether error results should include judge LLM token counts.When an exception occurs during evaluation (line 175), the token tracker is stopped (line 179) but the judge LLM token counts are not retrieved or included in the error result. If the error occurred after some judge LLM calls, those tokens were consumed but not tracked in the result.
Consider capturing the token counts from the tracker before creating the error result, similar to the success path.
# Example: Capture tokens even on error
except Exception as e:
    execution_time = time.time() - start_time
    token_tracker.stop()
    judge_input_tokens, judge_output_tokens = token_tracker.get_counts()
    return self._create_error_result(
        request, f"Evaluation error: {e}", execution_time,
        judge_input_tokens, judge_output_tokens
    )
This would require updating the _create_error_result signature to accept the judge token counts.
src/lightspeed_evaluation/runner/evaluation.py (1)
39-42: LGTM! Token usage reporting well-integrated. The token usage summary is properly integrated with conditional API token calculation when the API is enabled. The display format is clear and informative, showing both Judge LLM and API token usage with proper formatting.
Also applies to: 82-82, 98-117
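The summary block could be reduced to something like the sketch below; the label text and dictionary keys are assumptions drawn from this thread rather than the actual runner output:

```python
from typing import Any, Optional


def print_token_usage_summary(
    summary: dict[str, Any], api_tokens: Optional[dict[str, Any]] = None
) -> None:
    """Print judge-LLM, API, and combined token totals (sketch)."""
    judge_total = summary.get("total_judge_llm_tokens", 0)
    print("Token Usage Summary")
    print(f"  Judge LLM tokens: {judge_total:,}")
    if api_tokens:  # only shown when the API was actually queried
        api_total = api_tokens.get("total_api_tokens", 0)
        print(f"  API tokens:       {api_total:,}")
        print(f"  Combined total:   {judge_total + api_total:,}")
```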
src/lightspeed_evaluation/core/output/statistics.py (2)
9-9: LGTM! Token tracking properly integrated into basic stats.The token fields are correctly initialized to 0 for empty results and properly summed from EvaluationResult fields in the non-empty case. The implementation is clean and consistent.
Also applies to: 48-50, 58-61, 70-72
216-238: LGTM! API token usage calculation well-implemented.The calculate_api_token_usage function correctly aggregates token usage across all conversations and turns. The Google-style docstring is clear and complete, meeting the coding guidelines.
Based on coding guidelines, which require Google-style docstrings for all public APIs.
tests/unit/core/output/test_statistics.py (3)
6-9: LGTM! Existing tests properly updated for token fields.The imports are correct and existing test expectations are updated to include the new token fields with appropriate default values.
Also applies to: 290-292, 320-322
591-693: LGTM! Comprehensive test coverage for API token usage calculation.The test suite thoroughly validates calculate_api_token_usage across multiple scenarios including empty data, single/multiple turns, multiple conversations, and zero token values. The tests are well-structured and use proper model construction.
696-763: LGTM! Token tracking in basic stats properly tested.The tests validate both the presence of token fields and the correct summation logic, including edge cases like defaults and empty results. The test coverage is thorough and assertions are clear.
src/lightspeed_evaluation/core/output/generator.py (1)
15-17: LGTM! API token usage properly integrated into report generation.The integration of EvaluationData and calculate_api_token_usage is clean and handles both scenarios (with and without evaluation data) appropriately. The conditional logic ensures API tokens default to zero when evaluation data is not provided.
Also applies to: 41-76
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
184-205: Include judge LLM token counts in error results for consistency.The error result includes
api_input_tokensandapi_output_tokensbut omitsjudge_llm_input_tokensandjudge_llm_output_tokens. If an evaluation partially completes before failing, the token tracker may have accumulated counts that should be reported for accurate usage tracking and billing.def _create_error_result( - self, request: EvaluationRequest, reason: str, execution_time: float + self, request: EvaluationRequest, reason: str, execution_time: float, + judge_input_tokens: int = 0, judge_output_tokens: int = 0 ) -> EvaluationResult: """Create an ERROR result for failed evaluation.""" return EvaluationResult( conversation_group_id=request.conv_data.conversation_group_id, turn_id=request.turn_id, metric_identifier=request.metric_identifier, result="ERROR", score=None, threshold=None, reason=reason, query=request.turn_data.query if request.turn_data else "", response=request.turn_data.response or "" if request.turn_data else "", execution_time=execution_time, + judge_llm_input_tokens=judge_input_tokens, + judge_llm_output_tokens=judge_output_tokens, api_input_tokens=( request.turn_data.api_input_tokens if request.turn_data else 0 ), api_output_tokens=( request.turn_data.api_output_tokens if request.turn_data else 0 ), )Then update the exception handler call:
except Exception as e: execution_time = time.time() - start_time token_tracker.stop() + judge_input, judge_output = token_tracker.get_counts() return self._create_error_result( - request, f"Evaluation error: {e}", execution_time + request, f"Evaluation error: {e}", execution_time, + judge_input, judge_output )
🧹 Nitpick comments (6)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
62-70: Consider direct attribute access instead ofgetattr.Since
APIResponseis a well-defined model withinput_tokensandoutput_tokensfields, using direct attribute access withor 0would be more idiomatic:- turn_data.api_input_tokens = getattr(api_response, "input_tokens", 0) - turn_data.api_output_tokens = getattr(api_response, "output_tokens", 0) + turn_data.api_input_tokens = api_response.input_tokens or 0 + turn_data.api_output_tokens = api_response.output_tokens or 0Using
getattrwith a fallback suggests uncertainty about attribute existence, which isn't the case for typed models. If the fields areOptional[int], theor 0pattern handlesNonevalues cleanly.tests/unit/pipeline/evaluation/test_evaluator.py (1)
484-506: Add cleanup fixture to prevent test pollution oflitellm.success_callback.The
start()method modifies the globallitellm.success_callbacklist. If a test fails beforestop()is called, subsequent tests may be affected by leftover callbacks. Consider adding a cleanup fixture:@pytest.fixture(autouse=True) def cleanup_litellm_callbacks(self): """Ensure litellm callbacks are cleaned up after each test.""" import litellm original = getattr(litellm, 'success_callback', None) yield litellm.success_callback = original if original is not None else []Alternatively, ensure each test that calls
start()uses a try/finally pattern or properly stops the tracker in all code paths.src/lightspeed_evaluation/runner/evaluation.py (1)
81-82: CLI token usage summary wiring is consistent with statistics and generator

Using `calculate_basic_stats(results)` for the CLI summary and then printing judge LLM token totals from `summary` aligns with the extended stats API. The conditional `calculate_api_token_usage(evaluation_data)` when `system_config.api.enabled` is true correctly mirrors how `OutputHandler.generate_reports` computes `api_tokens`, and the combined `total_tokens` matches JSON/text reporting semantics.

If you want to avoid double-walking `evaluation_data`, you could optionally reuse the `api_tokens` dict already computed inside `OutputHandler.generate_reports` (e.g., by having it return or expose the aggregate), but that's a micro-optimization and not required for correctness.

```diff
-    # Generate reports based on configuration (pass evaluation_data for API token stats)
-    output_handler.generate_reports(results, evaluation_data)
+    # Generate reports based on configuration (pass evaluation_data for API token stats)
+    # Optionally, OutputHandler.generate_reports could return api_tokens to avoid
+    # recalculating here, if this ever becomes a hot path.
+    output_handler.generate_reports(results, evaluation_data)
```

Also applies to: 88-117
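For reference, a minimal sketch of how the judge and API totals could be combined on the CLI side — the function names come from `core.output.statistics` as discussed above, while the specific dict keys (`total_judge_llm_tokens`, `total_api_tokens`) are taken from the tests in this PR; the helper itself is hypothetical:

```python
from lightspeed_evaluation.core.output.statistics import (
    calculate_api_token_usage,
    calculate_basic_stats,
)


def print_token_usage_summary(results, evaluation_data, api_enabled: bool) -> None:
    """Hypothetical helper: print judge-LLM, API, and combined token totals."""
    summary = calculate_basic_stats(results)
    judge_total = summary.get("total_judge_llm_tokens", 0)

    api_total = 0
    if api_enabled and evaluation_data:
        api_total = calculate_api_token_usage(evaluation_data).get("total_api_tokens", 0)

    print(f"Judge LLM tokens: {judge_total}")
    print(f"API tokens:       {api_total}")
    print(f"Combined total:   {judge_total + api_total}")
```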
tests/unit/core/output/test_generator.py (2)
106-110: JSON summary tests correctly cover API token aggregation

The `api_tokens` fixture in this test and the assertion against `data["summary_stats"]["overall"]["total_api_tokens"]` validate that `_generate_json_summary` is merging API token stats into the overall summary as intended. This is a good, focused check on the new behavior.

If you later want stronger coverage, you could also assert on `total_tokens` (judge + API) in `summary_stats["overall"]`, but the current assertions are sufficient for this change.

Also applies to: 112-114, 124-126
132-136: Text summary test exercises API token section end-to-end

Passing `api_tokens` into `_generate_text_summary` and asserting that `"Token Usage (API Calls)"` appears in the rendered text confirms the new API token section is wired through correctly.

As an optional enhancement, you might also assert that the expected aggregate value (e.g., `"Total Tokens: 300"`) shows up, but the presence check already guards against regressions in this area.

Also applies to: 139-140, 146-147
src/lightspeed_evaluation/core/output/generator.py (1)
15-20: Report generation now cleanly incorporates EvaluationData‑based API token stats

Importing `EvaluationData` and extending `generate_reports` to accept `evaluation_data: Optional[list[EvaluationData]]` ties the output layer directly to the conversation data model. Computing `stats["api_tokens"]` via `calculate_api_token_usage(evaluation_data)` when `evaluation_data` is provided, and defaulting to a zeroed dict otherwise, matches the behavior expected by generator tests and keeps JSON/TXT generation simple.

If you ever need to distinguish "no evaluation data" from "evaluation data with zero tokens", consider switching the condition to `if evaluation_data is not None:` so that an explicitly empty list still flows through `calculate_api_token_usage`, but the current implementation is functionally equivalent for token totals.

```diff
-    # Calculate API token usage if evaluation data is provided
-    if evaluation_data:
+    # Calculate API token usage if evaluation data is provided
+    if evaluation_data is not None:
         stats["api_tokens"] = calculate_api_token_usage(evaluation_data)
     else:
```

Also applies to: 41-51, 68-76
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- src/lightspeed_evaluation/core/api/streaming_parser.py (3 hunks)
- src/lightspeed_evaluation/core/constants.py (1 hunks)
- src/lightspeed_evaluation/core/llm/__init__.py (2 hunks)
- src/lightspeed_evaluation/core/llm/custom.py (1 hunks)
- src/lightspeed_evaluation/core/models/api.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/output/generator.py (9 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (3 hunks)
- tests/unit/core/output/test_generator.py (2 hunks)
- tests/unit/core/output/test_statistics.py (4 hunks)
- tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
- tests/unit/runner/test_evaluation.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/models/api.py
- tests/unit/runner/test_evaluation.py
- src/lightspeed_evaluation/core/output/statistics.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
- tests/unit/core/output/test_statistics.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/core/output/test_generator.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from `core.system.exceptions` for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/llm/__init__.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/llm/custom.py
- src/lightspeed_evaluation/core/output/generator.py
🧠 Learnings (6)
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.
Applied to files:
src/lightspeed_evaluation/runner/evaluation.py
🧬 Code graph analysis (6)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
src/lightspeed_evaluation/core/llm/custom.py (4)
`TokenTracker` (10-64), `start` (40-47), `stop` (49-55), `get_counts` (57-59)
src/lightspeed_evaluation/core/models/data.py (2)
`EvaluationRequest` (444-497), `EvaluationResult` (378-427)
src/lightspeed_evaluation/runner/evaluation.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
`calculate_api_token_usage` (216-238), `calculate_basic_stats` (37-73)
src/lightspeed_evaluation/core/llm/__init__.py (1)
src/lightspeed_evaluation/core/llm/custom.py (2)
`BaseCustomLLM` (67-132), `TokenTracker` (10-64)
src/lightspeed_evaluation/core/output/generator.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
`EvaluationData` (317-375), `EvaluationResult` (378-427)
src/lightspeed_evaluation/core/output/statistics.py (1)
`calculate_api_token_usage` (216-238)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
src/lightspeed_evaluation/core/llm/custom.py (5)
`TokenTracker` (10-64), `get_counts` (57-59), `reset` (61-64), `start` (40-47), `stop` (49-55)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (4)
`_generate_json_summary` (199-248), `OutputHandler` (24-363), `_calculate_stats` (90-99), `_generate_text_summary` (250-359)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
🔇 Additional comments (12)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
459-517: Good coverage of TokenTracker lifecycle.

The tests cover initialization, counts retrieval, reset behavior, idempotent start/stop operations, and instance independence. This provides solid baseline coverage for the TokenTracker class.
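As a sketch of the kind of idempotency check described here (the actual tests in `test_evaluator.py` will differ in structure and naming):

```python
from lightspeed_evaluation.core.llm.custom import TokenTracker


def test_start_and_stop_are_idempotent():
    tracker = TokenTracker()
    tracker.start()
    tracker.start()  # a second start should not register a duplicate callback
    tracker.stop()
    tracker.stop()  # a second stop should be a harmless no-op
    # No LLM calls were made, so both counters should still be zero.
    assert tracker.get_counts() == (0, 0)
```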
src/lightspeed_evaluation/core/llm/__init__.py (1)
9-9: LGTM!

The `TokenTracker` export follows the established lazy import pattern consistently with `BaseCustomLLM`. The TYPE_CHECKING import and lazy import mapping are properly synchronized.

Also applies to: 22-22
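For context, a minimal sketch of the lazy-import pattern being referenced — this assumes a module-level `__getattr__` keyed by a lazy-import mapping; the actual `core/llm/__init__.py` may implement it differently:

```python
from importlib import import_module
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imports for type checkers only; no runtime cost
    from .custom import BaseCustomLLM, TokenTracker

# Assumed mapping of exported name -> relative module; names mirror this PR.
_LAZY_IMPORTS = {
    "BaseCustomLLM": ".custom",
    "TokenTracker": ".custom",
}


def __getattr__(name: str):
    """Resolve exported names lazily on first attribute access (PEP 562)."""
    if name in _LAZY_IMPORTS:
        module = import_module(_LAZY_IMPORTS[name], __package__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```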
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
66-81: LGTM!

The `evaluate_metric` docstring follows Google-style format with proper Args and Returns sections, addressing the previous review feedback.
124-138: Token tracking lifecycle is correctly implemented.

The `TokenTracker` is properly started before evaluation, stopped after, and counts are retrieved. The exception handler also ensures the tracker is stopped on error paths.
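Schematically, the lifecycle under review looks roughly like the following — a sketch only, where `run_judge_evaluation` is a hypothetical stand-in for the actual metric evaluation call:

```python
from lightspeed_evaluation.core.llm.custom import TokenTracker


def evaluate_with_token_tracking(request, run_judge_evaluation):
    """Start the tracker, run the judge-LLM evaluation, and always stop it."""
    tracker = TokenTracker()
    tracker.start()
    try:
        result = run_judge_evaluation(request)
    finally:
        # Stop on both success and error paths so the litellm callback is removed.
        tracker.stop()
    judge_input, judge_output = tracker.get_counts()
    return result, judge_input, judge_output
```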
src/lightspeed_evaluation/core/llm/custom.py (1)
10-64: Well-structured TokenTracker implementation.

The class provides a clean API with proper lifecycle management (start/stop), idempotent operations, and defensive attribute access in the callback. The docstring provides a clear usage example.
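For readers without the diff open, a compressed sketch of what such a tracker can look like — it mirrors the behavior described above (litellm success callback, idempotent start/stop, defensive attribute access) but is not a copy of the actual `custom.py` implementation:

```python
import litellm


class TokenTrackerSketch:
    """Accumulate judge-LLM token usage via a litellm success callback."""

    def __init__(self) -> None:
        self.input_tokens = 0
        self.output_tokens = 0
        self._active = False

    def _callback(self, kwargs, completion_response, start_time, end_time) -> None:
        # Defensive access: not every response object carries usage data.
        usage = getattr(completion_response, "usage", None)
        if usage is not None:
            self.input_tokens += getattr(usage, "prompt_tokens", 0) or 0
            self.output_tokens += getattr(usage, "completion_tokens", 0) or 0

    def start(self) -> None:
        if not self._active:  # idempotent: never register the callback twice
            litellm.success_callback.append(self._callback)
            self._active = True

    def stop(self) -> None:
        if self._active:
            if self._callback in litellm.success_callback:
                litellm.success_callback.remove(self._callback)
            self._active = False

    def get_counts(self) -> tuple[int, int]:
        return self.input_tokens, self.output_tokens
```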
src/lightspeed_evaluation/runner/evaluation.py (1)
39-42: Token statistics imports are correctly scoped to the "heavy modules" section

Importing `calculate_basic_stats` and `calculate_api_token_usage` alongside `OutputHandler` keeps all reporting/statistics dependencies loaded only after config and logging are initialized, which matches the existing pattern in this runner. No issues here.

tests/unit/core/output/test_statistics.py (3)
6-12: Basic stats tests correctly updated for new judge token fields

Importing `EvaluationData`, `EvaluationResult`, and `TurnData`, plus extending the `expected` dicts in the mixed/single-result tests to include `total_judge_llm_input_tokens`, `total_judge_llm_output_tokens`, and `total_judge_llm_tokens`, keeps the tests aligned with the expanded `calculate_basic_stats` contract. The zero expectations for cases without explicit tokens are consistent with the model defaults and the implementation.

Also applies to: 282-295, 312-324
591-694: API token usage aggregation tests provide thorough coverage

The `TestCalculateApiTokenUsage` suite exercises `calculate_api_token_usage` across empty input, single/multiple turns within a conversation, multiple conversations, and explicit zero-token cases. This is a solid set of scenarios and closely matches how `EvaluationData` and `TurnData` will be used in production.
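For orientation, the aggregation these tests pin down presumably reduces to something like the following — an assumed shape only: the dict keys and token attribute names come from the fields exercised in this PR, while the `turns` attribute name is an assumption, and the real implementation lives in `statistics.py`:

```python
def calculate_api_token_usage_sketch(evaluation_data) -> dict[str, int]:
    """Sum API input/output tokens across every turn of every conversation."""
    total_input = 0
    total_output = 0
    for conversation in evaluation_data:
        for turn in conversation.turns:  # attribute name assumed for illustration
            total_input += turn.api_input_tokens or 0
            total_output += turn.api_output_tokens or 0
    return {
        "total_api_input_tokens": total_input,
        "total_api_output_tokens": total_output,
        "total_api_tokens": total_input + total_output,
    }
```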
696-769: Judge LLM token tracking behavior is well-specified via new basic-stats tests

The `TestCalculateBasicStatsWithTokens` class clearly verifies:

- presence of the three judge token fields in the stats dict,
- correct summation across multiple `EvaluationResult` instances, and
- zero defaults both for results without token fields set and for empty result lists.

This tightly couples the tests to the intended `calculate_basic_stats` semantics and should catch any future regressions in token handling.
73-79: Token fields on TurnData and EvaluationResult are well-modeled and backward‑compatible

Adding `api_input_tokens`/`api_output_tokens` to `TurnData` and both judge and API token fields to `EvaluationResult` with `default=0` and `ge=0` constraints cleanly encodes token usage without breaking existing data files (missing fields deserialize as zero). The Field descriptions make the intent explicit, and the types line up with how `calculate_basic_stats` and `calculate_api_token_usage` consume these attributes.

Also applies to: 410-419
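A minimal sketch of how such fields are typically declared in Pydantic, assuming the same `default=0` / `ge=0` constraints (the class and descriptions here are illustrative, not copied from the model):

```python
from pydantic import BaseModel, Field


class TurnTokenFields(BaseModel):
    """Illustrative subset of TurnData limited to the new token counters."""

    api_input_tokens: int = Field(
        default=0, ge=0, description="Input tokens reported by the API response"
    )
    api_output_tokens: int = Field(
        default=0, ge=0, description="Output tokens reported by the API response"
    )
```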
src/lightspeed_evaluation/core/output/generator.py (2)
199-219: JSON summary correctly merges judge and API token stats and exposes per‑result judge tokens
`_generate_json_summary` now:

- accepts an explicit `api_tokens` dict,
- builds `overall_stats` by combining `basic_stats` with `total_api_input_tokens`, `total_api_output_tokens`, `total_api_tokens`, and a `total_tokens` field that sums judge and API totals, and
- includes `judge_llm_input_tokens` and `judge_llm_output_tokens` per result.

This structure matches the new tests and gives downstream consumers a single place (`summary_stats["overall"]`) to read aggregate token usage, as sketched below.

Also applies to: 225-240
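As a concrete (illustrative) picture of the merged block, `summary_stats["overall"]` presumably ends up shaped roughly like this — the values are made up, only the key names follow the ones listed above:

```python
overall = {
    # judge-LLM usage from calculate_basic_stats
    "total_judge_llm_input_tokens": 1200,
    "total_judge_llm_output_tokens": 340,
    "total_judge_llm_tokens": 1540,
    # API usage from calculate_api_token_usage
    "total_api_input_tokens": 200,
    "total_api_output_tokens": 100,
    "total_api_tokens": 300,
    # combined judge + API total
    "total_tokens": 1840,
    # ...plus the existing pass/fail statistics from basic_stats
}
```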
250-267: Text summary adds clear, separate sections for judge and API token usage

The updated `_generate_text_summary` keeps `basic_stats` as the "overall" block and introduces:

- a "Token Usage (Judge LLM)" section driven by `total_judge_llm_*` fields, and
- a "Token Usage (API Calls)" section reading from `api_tokens` with sensible `.get(..., 0)` defaults.

This mirrors the JSON structure at a human-readable level and safely handles cases where API usage is disabled or evaluation data is not supplied (see the sketch after the line references below).
Also applies to: 288-310
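A self-contained sketch of how those two sections might render — the section labels follow the ones asserted in the tests, but the exact layout produced by `_generate_text_summary` will differ:

```python
def render_token_sections(stats: dict, api_tokens: dict) -> str:
    """Render judge-LLM and API token sections as plain text (illustrative)."""
    lines = [
        "Token Usage (Judge LLM)",
        f"  Input Tokens:  {stats.get('total_judge_llm_input_tokens', 0)}",
        f"  Output Tokens: {stats.get('total_judge_llm_output_tokens', 0)}",
        f"  Total Tokens:  {stats.get('total_judge_llm_tokens', 0)}",
        "",
        "Token Usage (API Calls)",
        f"  Input Tokens:  {api_tokens.get('total_api_input_tokens', 0)}",
        f"  Output Tokens: {api_tokens.get('total_api_output_tokens', 0)}",
        f"  Total Tokens:  {api_tokens.get('total_api_tokens', 0)}",
    ]
    return "\n".join(lines)


print(render_token_sections(
    {"total_judge_llm_input_tokens": 1200, "total_judge_llm_output_tokens": 340,
     "total_judge_llm_tokens": 1540},
    {"total_api_input_tokens": 200, "total_api_output_tokens": 100,
     "total_api_tokens": 300},
))
```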
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.