
add additional fields to output for non-error scenarios #114

Merged

asamal4 merged 2 commits into lightspeed-core:main from asamal4:expand-csv-output on Dec 11, 2025

Conversation

asamal4 (Collaborator) commented on Dec 6, 2025

Description

Added additional fields to the output. These will help us do quick comparisons for non-error scenarios. For error scenarios, the data remains minimal, per the existing logic.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Added six evaluation fields for richer tool/expectation metadata (tool calls, contexts, expected response, expected intent, expected keywords, expected tool calls) and included them in CSV exports.
  • Changed Behavior

    • Replaced numeric token-count columns with new string-based metadata fields and adjusted CSV column ordering.
  • Tests

    • Updated unit tests to validate the new fields and CSV output for success and error cases.
  • Documentation

    • Updated CSV schema examples and configuration docs to reflect new fields and ordering.


coderabbitai bot (Contributor) commented on Dec 6, 2025

Walkthrough

Reorders CSV columns, replaces two numeric token fields on EvaluationResult with six optional string fields, serializes additional turn_data fields into JSON when building results, and updates tests and configuration to align with the new CSV schema and model fields.
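
To make the walkthrough concrete, here is a minimal sketch of the serialization idea: structured turn data (lists and dicts) is flattened to JSON strings so it fits into flat CSV columns, and empty values map to None. The helper mirrors the _to_json_str shown in the review diffs below; the turn dict is a simplified, hypothetical example, not the project's actual TurnData model.

import json
from typing import Any, Optional

def _to_json_str(value: Any) -> Optional[str]:
    """Flatten lists/dicts to a JSON string; None and empty values become None (mirrors the PR's helper)."""
    if value is None or value == [] or value == {}:
        return None
    try:
        return json.dumps(value, indent=None, default=str)
    except (TypeError, ValueError):
        return str(value)

# Hypothetical turn data, simplified for illustration only.
turn = {
    "contexts": ["Context"],
    "tool_calls": [{"name": "get_pods", "arguments": {"namespace": "default"}}],
    "expected_keywords": ["operator", "install"],
    "expected_tool_calls": [],  # empty -> None -> empty CSV cell
}

csv_fields = {name: _to_json_str(value) for name, value in turn.items()}
print(csv_fields["contexts"])             # '["Context"]'
print(csv_fields["expected_tool_calls"])  # None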

Changes

  • Core constants & model (src/lightspeed_evaluation/core/constants.py, src/lightspeed_evaluation/core/models/data.py):
    Reordered SUPPORTED_CSV_COLUMNS (moved execution_time, repositioned token/judge fields) and added the new CSV fields; EvaluationResult removed api_input_tokens/api_output_tokens and added optional str fields: tool_calls, contexts, expected_response, expected_intent, expected_keywords, expected_tool_calls.
  • Evaluator logic (src/lightspeed_evaluation/pipeline/evaluation/evaluator.py):
    Added a _to_json_str(value) helper, imported json and Any, switched to local turn_data usage, and JSON-serialized query/response plus the six new turn_data fields into EvaluationResult instances.
  • Tests, config & docs (tests/unit/core/output/test_generator.py, tests/unit/pipeline/evaluation/test_evaluator.py, config/system.yaml, docs/configuration.md):
    Tests updated to assert the new fields and adjusted CSV expectations (the generator test now expects three rows); system.yaml and docs/configuration.md extended to list and demonstrate the expanded CSV/output schema, including the new fields and adjusted ordering.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • Pay attention to:
    • _to_json_str() handling of None/empty values and JSON escaping for CSV output (see the sketch after this list).
    • Any remaining callers or serializers that expected api_input_tokens/api_output_tokens.
    • Consistency between SUPPORTED_CSV_COLUMNS ordering and CSV generator/export logic and tests.
    • Tests and exception paths ensuring optional fields are None or empty where appropriate.
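
On the JSON-escaping point above: Python's csv module quotes any field containing commas or double quotes, so JSON strings survive the round trip, and None is written as an empty cell. A small self-contained check, independent of the project's actual generator code:

import csv
import io
import json

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["contexts", "expected_keywords"])
writer.writerow([json.dumps(["Context A", "Context B"]), None])  # None becomes an empty cell

buf.seek(0)
rows = list(csv.reader(buf))
assert json.loads(rows[1][0]) == ["Context A", "Context B"]  # embedded quotes/commas survived
assert rows[1][1] == ""                                       # None round-trips as an empty string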

Possibly related PRs

Suggested reviewers

  • VladimirKadlec
  • tisnik

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'add additional fields to output for non-error scenarios' directly reflects the main change: adding new fields (tool_calls, contexts, expected_response, expected_intent, expected_keywords, expected_tool_calls) to the EvaluationResult model and supporting configuration for non-error evaluation scenarios.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ac11be and 7aa1060.

📒 Files selected for processing (2)
  • config/system.yaml (1 hunks)
  • docs/configuration.md (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/configuration.md
🧰 Additional context used
📓 Path-based instructions (2)
config/system.yaml

📄 CodeRabbit inference engine (AGENTS.md)

Add new metrics metadata to config/system.yaml in the metrics_metadata section

Files:

  • config/system.yaml
config/*.yaml

📄 CodeRabbit inference engine (AGENTS.md)

Add sample evaluation data YAML file when adding new features

Files:

  • config/system.yaml
🧠 Learnings (2)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to config/system.yaml : Add new metrics metadata to `config/system.yaml` in the metrics_metadata section

Applied to files:

  • config/system.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (2)
config/system.yaml (2)

49-151: Clarify whether new CSV fields require metrics metadata entries.

The new CSV columns include tool_calls, contexts, expected_response, expected_intent, expected_keywords, and expected_tool_calls (lines 174–183), but these do not appear as new entries in the metrics_metadata section. Per the coding guidelines, new metrics metadata should be added to this section. Confirm whether these are:

  1. Metrics that require threshold/description metadata entries, or
  2. Reference/context fields that don't require metrics metadata

If they are metrics, please add appropriate metadata entries.


163-183: CSV columns configuration verified: all fields properly aligned and serialized.

The CSV columns at lines 163-183 match the SUPPORTED_CSV_COLUMNS in constants.py exactly, and all 20 fields (including api_input_tokens, api_output_tokens, judge_llm_input_tokens, judge_llm_output_tokens, tool_calls, contexts, expected_response, expected_intent, expected_keywords, and expected_tool_calls) are properly defined in the EvaluationResult model and dynamically serialized by the CSV generator using attribute reflection. No action required.



coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

169-184: Consider populating additional fields in error results for debugging.

The _create_error_result method currently only populates query and response. For better debuggability, consider also populating contexts, expected_response, and other fields when available, so users can see what input data led to the error.

This is optional since error results are intentionally minimal, but having context can help diagnose issues.

     def _create_error_result(
         self, request: EvaluationRequest, reason: str, execution_time: float
     ) -> EvaluationResult:
         """Create an ERROR result for failed evaluation."""
+        turn_data = request.turn_data
         return EvaluationResult(
             conversation_group_id=request.conv_data.conversation_group_id,
             turn_id=request.turn_id,
             metric_identifier=request.metric_identifier,
             result="ERROR",
             score=None,
             threshold=None,
             reason=reason,
-            query=request.turn_data.query if request.turn_data else "",
-            response=request.turn_data.response or "" if request.turn_data else "",
+            query=turn_data.query if turn_data else "",
+            response=turn_data.response or "" if turn_data else "",
+            contexts=_to_json_str(turn_data.contexts) if turn_data else None,
+            expected_response=turn_data.expected_response if turn_data else None,
             execution_time=execution_time,
         )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba0bd33 and 7f3176a.

📒 Files selected for processing (5)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • tests/unit/core/output/test_generator.py (3 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/output/test_generator.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
  • tests/unit/core/output/test_generator.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • tests/unit/core/output/test_generator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • tests/unit/core/output/test_generator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • tests/unit/core/output/test_generator.py
🧬 Code graph analysis (3)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (370-428)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (8)
src/lightspeed_evaluation/core/constants.py (1)

35-40: LGTM!

The new CSV columns are correctly added and align with the new optional fields in EvaluationResult. The naming is consistent and the ordering (actual values followed by expected values) is logical.

tests/unit/pipeline/evaluation/test_evaluator.py (2)

156-158: LGTM!

The test correctly validates that the new fields (query, response, contexts) are populated from turn data in successful evaluation paths. The JSON serialization of contexts as '["Context"]' aligns with the _to_json_str helper behavior.


367-370: LGTM!

The test correctly expects contexts and expected_response to be None in the exception handling path, which aligns with _create_error_result not populating these optional fields.

tests/unit/core/output/test_generator.py (2)

328-331: LGTM!

The test correctly uses pre-serialized JSON strings for contexts and expected_keywords, matching the expected format after _to_json_str processing in the evaluator.


375-390: LGTM!

Comprehensive assertions validating CSV content with new fields. The test correctly expects:

  • JSON-serialized strings for contexts and expected_keywords
  • Empty string for contexts in the ERROR row (CSV representation of None)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

26-33: LGTM!

The _to_json_str helper correctly handles serialization with appropriate fallbacks:

  • Returns None for None, empty lists, and empty dicts
  • Uses json.dumps with default=str for robust serialization
  • Falls back to str() on serialization errors

As per coding guidelines, the function has type hints for all parameters and return type.
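
As a quick behavioral check of the points above (the import path is assumed from the file listed in this review; _to_json_str is a private helper, so importing it here is for illustration only):

# Hedged sketch: assumes the helper is importable from the evaluator module named above.
from lightspeed_evaluation.pipeline.evaluation.evaluator import _to_json_str

assert _to_json_str(None) is None
assert _to_json_str([]) is None
assert _to_json_str({}) is None
assert _to_json_str(["Context"]) == '["Context"]'
assert _to_json_str({"name": "get_pods"}) == '{"name": "get_pods"}'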


138-159: LGTM!

The result construction correctly populates the new fields:

  • Extracts turn_data for cleaner repeated access
  • Applies _to_json_str to complex fields (tool_calls, contexts, expected_keywords, expected_tool_calls)
  • Preserves string fields (expected_response, expected_intent) directly
  • Handles missing turn_data gracefully with appropriate defaults
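
A condensed sketch of the construction pattern described in this comment. Field names follow the review; the request shape, the import paths, and the build_result wrapper are assumptions for illustration, not the evaluator's actual code.

# Hedged sketch: import paths follow the files listed in this review.
from lightspeed_evaluation.core.models.data import EvaluationResult
from lightspeed_evaluation.pipeline.evaluation.evaluator import _to_json_str

def build_result(request, status, score, threshold, reason, execution_time):
    """Illustrative mirror of the result construction described above."""
    turn_data = request.turn_data  # may be None
    return EvaluationResult(
        conversation_group_id=request.conv_data.conversation_group_id,
        turn_id=request.turn_id,
        metric_identifier=request.metric_identifier,
        result=status,
        score=score,
        threshold=threshold,
        reason=reason,
        query=turn_data.query if turn_data else "",
        response=turn_data.response or "" if turn_data else "",
        # Complex fields are flattened to JSON strings; plain strings pass through as-is.
        tool_calls=_to_json_str(turn_data.tool_calls) if turn_data else None,
        contexts=_to_json_str(turn_data.contexts) if turn_data else None,
        expected_response=turn_data.expected_response if turn_data else None,
        expected_intent=turn_data.expected_intent if turn_data else None,
        expected_keywords=_to_json_str(turn_data.expected_keywords) if turn_data else None,
        expected_tool_calls=_to_json_str(turn_data.expected_tool_calls) if turn_data else None,
        execution_time=execution_time,
    )
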
src/lightspeed_evaluation/core/models/data.py (1)

400-420: LGTM!

The new optional fields are well-defined with:

  • Appropriate types (Optional[str] for JSON-serialized data)
  • Clear descriptions documenting their purpose
  • Consistent default values (None)

The field additions align with the evaluator's usage and CSV column definitions in constants. As per coding guidelines, the model uses type hints appropriately.
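
For reference, a hedged sketch of how the six optional fields could look on the Pydantic model. Names and None defaults come from this review; the class name and descriptions are illustrative, not copied from the repository.

from typing import Optional
from pydantic import BaseModel, Field

class NewResultFieldsSketch(BaseModel):
    """Illustrative subset showing only the six new optional fields."""

    tool_calls: Optional[str] = Field(default=None, description="Actual tool calls, JSON-serialized")
    contexts: Optional[str] = Field(default=None, description="Retrieved contexts, JSON-serialized")
    expected_response: Optional[str] = Field(default=None, description="Expected response text")
    expected_intent: Optional[str] = Field(default=None, description="Expected intent")
    expected_keywords: Optional[str] = Field(default=None, description="Expected keywords, JSON-serialized")
    expected_tool_calls: Optional[str] = Field(default=None, description="Expected tool calls, JSON-serialized")

# All six default to None, so error results that omit them remain valid.
assert NewResultFieldsSketch().contexts is None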

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

27-35: Consider edge case: nested objects with circular references.

The _to_json_str helper uses default=str as a fallback for non-serializable types, which is good. However, if value contains circular references or deeply nested objects, json.dumps could raise RecursionError (not caught) or produce unexpectedly large strings.

Consider adding a recursion depth limit or handling RecursionError:

 def _to_json_str(value: Any) -> Optional[str]:
     """Convert any value to JSON string. Returns None for empty values."""
     if value is None or value == [] or value == {}:
         return None
     try:
         return json.dumps(value, indent=None, default=str)
-    except (TypeError, ValueError):
+    except (TypeError, ValueError, RecursionError):
         return str(value)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 032bd43 and 12f424a.

📒 Files selected for processing (5)
  • src/lightspeed_evaluation/core/constants.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • tests/unit/core/output/test_generator.py (3 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_generator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/constants.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • tests/unit/core/output/test_generator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • tests/unit/core/output/test_generator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • tests/unit/core/output/test_generator.py
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • tests/unit/core/output/test_generator.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
tests/unit/core/output/test_generator.py (2)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (72-106)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (378-446)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (72-106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
🔇 Additional comments (8)
tests/unit/pipeline/evaluation/test_evaluator.py (2)

157-160: LGTM! Test validates new fields correctly.

The test appropriately verifies that turn-level evaluation results include the new query, response, and contexts fields populated from the input turn data. The contexts field is correctly expected as a JSON-serialized string.


368-372: LGTM! Exception path test correctly validates minimal field population.

The test appropriately verifies that when an exception occurs during evaluation, the result still carries the basic query and response from the turn data, while optional fields like contexts and expected_response are set to None. This aligns with the error handling behavior in the evaluator.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

165-195: LGTM! Result construction properly populates new fields.

The updated EvaluationResult construction correctly:

  • Uses a local turn_data variable for consistent access
  • Populates new fields (tool_calls, contexts, expected_response, expected_intent, expected_keywords, expected_tool_calls) using _to_json_str for serialization
  • Maintains null-safety with if turn_data guards
  • Retains existing token tracking fields

206-227: Verify field defaults in EvaluationResult model for optional fields.

The _create_error_result method omits new optional fields (tool_calls, contexts, expected_response, expected_intent, expected_keywords, expected_tool_calls), while the success path populates them. Check whether these fields have default=None in the EvaluationResult model definition. If they do, the current implementation is safe; if not, explicitly set them to None for consistency with test expectations (lines 368-372 in test_evaluator.py expect these as None).

tests/unit/core/output/test_generator.py (2)

331-373: LGTM! Test coverage expanded to validate new fields.

The test appropriately validates:

  1. Row 1: New fields query, response, contexts, and expected_keywords are populated and correctly serialized
  2. Row 2: The expected_response field is included
  3. Row 3: ERROR scenario with minimal fields (query populated, contexts empty)

This aligns with the PR objective to add additional fields for non-error scenarios while keeping error output minimal.


390-406: LGTM! Assertions validate CSV structure and new fields.

The assertions correctly verify:

  • Three rows in the CSV output
  • Row 1 contains the new fields with proper values
  • Row 2 includes expected_response
  • Row 3 (ERROR) has minimal data with empty contexts
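
A self-contained illustration of the ERROR-row expectation, independent of the project's generator: when a field is None it lands in the CSV as an empty cell, while populated fields keep their JSON-string form.

import csv
import io
import json

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["result", "query", "contexts"])
writer.writeheader()
writer.writerow({"result": "PASS", "query": "example query", "contexts": json.dumps(["Context"])})
writer.writerow({"result": "ERROR", "query": "example query", "contexts": None})

buf.seek(0)
rows = list(csv.DictReader(buf))
assert json.loads(rows[0]["contexts"]) == ["Context"]  # success row keeps the JSON string
assert rows[1]["contexts"] == ""                       # ERROR row: None becomes an empty cell
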
src/lightspeed_evaluation/core/constants.py (1)

25-46: LGTM! CSV columns reordered and expanded logically.

The updated SUPPORTED_CSV_COLUMNS list:

  • Reorders execution_time to follow threshold (more logical grouping with evaluation metadata)
  • Places token-related fields together (api_input_tokens, api_output_tokens, judge_llm_input_tokens, judge_llm_output_tokens)
  • Adds six new fields at the end for turn data (tool_calls, contexts, expected_response, expected_intent, expected_keywords, expected_tool_calls)

The ordering is logical and groups related fields together.
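
Spelled out, that ordering corresponds to a column list like the one below, reconstructed from the csv_columns example later in this thread; the actual constant lives in src/lightspeed_evaluation/core/constants.py.

# Reconstructed for illustration from the config example in this PR discussion.
SUPPORTED_CSV_COLUMNS = [
    # Identifiers and metric results
    "conversation_group_id", "turn_id", "metric_identifier",
    "result", "score", "threshold", "reason", "execution_time",
    # Turn input/output
    "query", "response",
    # Token usage
    "api_input_tokens", "api_output_tokens",
    "judge_llm_input_tokens", "judge_llm_output_tokens",
    # New turn-data fields added in this PR
    "tool_calls", "contexts",
    "expected_response", "expected_intent",
    "expected_keywords", "expected_tool_calls",
]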

src/lightspeed_evaluation/core/models/data.py (1)

410-438: Note: Inconsistency between AI summary and actual implementation.

The AI summary states that api_input_tokens and api_output_tokens were "replaced" with the new fields, but the actual code shows these fields are retained (lines 410-413) alongside the new optional string fields (lines 420-438). This is the correct approach, as token tracking is independent of the new turn data fields.

The field definitions are well-structured:

  • Existing token fields retained for backward compatibility
  • Six new optional string fields added with clear descriptions
  • All new fields default to None for optional population

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
config/system.yaml (1)

163-183: Clarify data field vs. metric distinction in documentation.

The new CSV columns include output data fields (tokens, contexts, expected values) that are distinct from metrics defined in metrics_metadata. Consider adding a brief comment in the YAML to clarify that csv_columns includes both evaluation metrics and raw data fields from turn execution, to aid future maintainers.

  # CSV columns to include (includes both metrics and raw turn execution data)
  csv_columns:
    # Identifiers
    - "conversation_group_id"
    - "turn_id"
    # Metric evaluation results
    - "metric_identifier"
    - "result"
    - "score"
    - "threshold"
    - "reason"
    # Turn execution metadata
    - "execution_time"
    # Turn input/output
    - "query"
    - "response"
    # API call token metrics
    - "api_input_tokens"
    - "api_output_tokens"
    - "judge_llm_input_tokens"
    - "judge_llm_output_tokens"
    # Tool and context data
    - "tool_calls"
    - "contexts"
    # Expected values for comparison
    - "expected_response"
    - "expected_intent"
    - "expected_keywords"
    - "expected_tool_calls"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12f424a and 2ac11be.

📒 Files selected for processing (1)
  • config/system.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
config/system.yaml

📄 CodeRabbit inference engine (AGENTS.md)

Add new metrics metadata to config/system.yaml in the metrics_metadata section

Files:

  • config/system.yaml
config/*.yaml

📄 CodeRabbit inference engine (AGENTS.md)

Add sample evaluation data YAML file when adding new features

Files:

  • config/system.yaml
🧠 Learnings (2)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to config/system.yaml : Add new metrics metadata to `config/system.yaml` in the metrics_metadata section

Applied to files:

  • config/system.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
🔇 Additional comments (1)
config/system.yaml (1)

163-183: Verify all new CSV columns are supported by the data model and output generator.

The CSV configuration adds 12 new columns including token counts, contexts, and expected values. Before merging, ensure these columns are:

  1. Defined in the data model (EvaluationResult or similar)
  2. Properly serialized in the output generator
  3. Consistently named across the codebase

Additionally, confirm that a sample evaluation data YAML file has been added per project guidelines.

VladimirKadlec (Contributor) left a comment


Please add the description of the new output columns to the table in docs/configuration.md

Other than that LGTM.

asamal4 (Collaborator, Author) commented on Dec 11, 2025

Please add the description of the new output columns to the table in docs/configuration.md

@VladimirKadlec Done, thank you!

asamal4 merged commit 8688ff6 into lightspeed-core:main on Dec 11, 2025. 15 checks passed.