Add optional property tag to group eval conversations #134

asamal4 merged 2 commits into lightspeed-core:main from
Conversation
Walkthrough: Adds a new optional string field `tag` to the evaluation data and result models for grouping eval conversations; the tag is propagated through evaluation and error handling, aggregated into by_tag statistics, and included in the generated output.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Conv as Conversation Data (config / input)
    participant Eval as Evaluator
    participant Err as ErrorHandler
    participant Result as EvaluationResult (model)
    participant Stats as Statistics
    participant Out as Output Generator
    Conv->>Eval: request with conv_data.tag
    Eval->>Result: create EvaluationResult(tag=conv_data.tag)
    Eval->>Err: (on errors) create_error_result(tag=conv_data.tag)
    Result->>Stats: accumulate by_metric / by_conversation / by_tag
    Stats->>Out: provide detailed_stats (includes by_tag)
    Out->>Client: render JSON/text including tag and by_tag breakdown
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/models/data.py (1)
397-399: Consider adding min_length=1 validation for consistency. The tag field in EvaluationData (Line 331) has min_length=1 validation, but EvaluationResult.tag does not. This inconsistency could allow empty tags in results while preventing them in data models, potentially causing issues with tag-based grouping and statistics.

🔧 Suggested fix for consistency

```diff
 tag: str = Field(
-    default="eval", description="Tag for grouping and filtering results"
+    default="eval",
+    min_length=1,
+    description="Tag for grouping and filtering results"
 )
```
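For context, here is a minimal standalone sketch of what the suggested constraint does (ResultSketch is a hypothetical model used only for illustration, not the project's EvaluationResult):

```python
# Hypothetical model showing the effect of min_length=1 on a tag field.
from pydantic import BaseModel, Field, ValidationError

class ResultSketch(BaseModel):
    tag: str = Field(default="eval", min_length=1)

assert ResultSketch().tag == "eval"           # the default still applies
assert ResultSketch(tag="prod").tag == "prod"
try:
    ResultSketch(tag="")                      # an empty tag now fails validation
except ValidationError as exc:
    print(f"rejected with {len(exc.errors())} error(s)")
```

With the constraint in place, empty tags are caught at model construction instead of surfacing later as empty keys in the by_tag breakdown.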
tests/unit/core/models/test_data.py (1)

297-302: Consider adding a test for empty tag validation. While the current tests cover default and explicit tag values well, there's no test verifying that empty tags are rejected for EvaluationData (which has min_length=1 validation on Line 331 of data.py). This would ensure the validation works as expected.

🧪 Suggested test addition

Add this test to the TestEvaluationData class:

```python
def test_empty_tag_rejected(self):
    """Test that empty tag is rejected."""
    turn = TurnData(turn_id="turn1", query="Query")
    with pytest.raises(ValidationError):
        EvaluationData(conversation_group_id="conv1", turns=[turn], tag="")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- README.md
- config/evaluation_data.yaml
- docs/configuration.md
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/models/test_data.py
- tests/unit/core/output/test_statistics.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/output/generator.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
- tests/unit/core/models/test_data.py
- tests/unit/core/output/test_statistics.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
- tests/unit/core/models/test_data.py
- tests/unit/core/output/test_statistics.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
- docs/configuration.md
- config/evaluation_data.yaml
- README.md
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
config/evaluation_data.yaml
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
Applied to files:
config/evaluation_data.yaml
📚 Learning: 2025-07-16T09:42:00.691Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/data/script/eval3/setup.sh:1-3
Timestamp: 2025-07-16T09:42:00.691Z
Learning: Scripts in the lsc_agent_eval/data directory are meant to be simple examples/samples for teams to customize according to their needs, not production-ready code.
Applied to files:
config/evaluation_data.yaml
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Do not add new features to lsc_agent_eval/ directory - add to src/lightspeed_evaluation/ instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/errors.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult(389-461)
tests/unit/core/models/test_data.py (1)
src/lightspeed_evaluation/core/models/data.py (3)
TurnData (36-315), EvaluationData (318-386), EvaluationResult (389-461)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult (389-461)

src/lightspeed_evaluation/core/output/statistics.py (1)
calculate_detailed_stats(81-114)
🔇 Additional comments (20)
docs/configuration.md (1)
216-216: LGTM! Documentation is clear and consistent. The documentation accurately describes the new tag CSV column and its purpose for grouping eval conversations. The placement after conversation_group_id aligns with the implementation in the constants file.

Also applies to: 249-249
README.md (1)
157-157: LGTM! Documentation is accurate and complete. The README correctly documents the new tag field with its optional nature, default value, and purpose. The documentation is consistent with the implementation in the data models.

Also applies to: 214-214
src/lightspeed_evaluation/core/constants.py (1)
29-29: LGTM! Appropriate placement of the new column. The tag column is logically positioned after conversation_group_id, which makes sense given that tags are used for grouping conversations.

tests/unit/core/models/test_data.py (1)
336-347: LGTM! Comprehensive test coverage for the tag field. The tests effectively verify both default and explicit tag values for EvaluationResult, ensuring the new feature works as intended.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
168-168: LGTM! Tag propagation implemented correctly. The tag field is correctly propagated from request.conv_data.tag to both successful evaluation results (line 168) and error results (line 222), ensuring consistent tagging across all evaluation outcomes.

Also applies to: 222-222
config/evaluation_data.yaml (1)
5-5: LGTM! Tag field examples are clear and well-documented. The configuration examples demonstrate meaningful tag usage (basic, tools, infrastructure) with helpful inline documentation of the default value. This provides good guidance for users adopting the new tag feature.
Also applies to: 60-60, 99-99
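For reference, a minimal sketch of how a tagged conversation maps onto the data model (the group id and query text are made up, and the import path is inferred from the file layout; the constructor usage mirrors the unit tests reviewed below):

```python
from lightspeed_evaluation.core.models.data import EvaluationData, TurnData

turn = TurnData(turn_id="turn1", query="What is OpenShift?")
conv = EvaluationData(
    conversation_group_id="conv_tools_1",
    turns=[turn],
    tag="tools",  # optional; falls back to the default "eval" when omitted
)
print(conv.tag)  # -> "tools"
```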
src/lightspeed_evaluation/pipeline/evaluation/errors.py (6)
18-48: LGTM! Tag parameter properly added to result creation. The _create_result method now correctly accepts and propagates the tag parameter with an appropriate default value ("eval"), proper documentation, and correct parameter handling (keyword-only with *).
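A rough illustration of that keyword-only pattern follows; the parameter list and the stand-in result type are assumptions for the sketch, not the handler's real signature:

```python
from dataclasses import dataclass

@dataclass
class _ResultSketch:
    """Stand-in for EvaluationResult; the real model has more fields."""
    conversation_group_id: str
    metric_id: str
    reason: str
    tag: str = "eval"

def create_error_result(conv_data, metric_id: str, reason: str, *, tag: str = "eval") -> _ResultSketch:
    """Keyword-only `tag` with an "eval" default, as described above."""
    return _ResultSketch(
        conversation_group_id=conv_data.conversation_group_id,
        metric_id=metric_id,
        reason=reason,
        tag=tag,
    )

# Callers propagate the conversation's tag explicitly, e.g.:
#   create_error_result(conv_data, "some_metric", "API call failed", tag=conv_data.tag)
```

Keeping tag keyword-only makes the call sites explicit (tag=conv_data.tag), which is what the comments below verify for each error path.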
50-72: LGTM! Error result creation includes tag. The create_error_result method properly propagates the tag parameter through to _create_result, maintaining consistency in error handling.
74-96: LGTM! Skipped result creation includes tag. The create_skipped_result method correctly handles tag propagation, ensuring skipped evaluations are also properly tagged.
98-149: LGTM! All metrics error marking propagates tag correctly. The mark_all_metrics_as_error method correctly passes tag=conv_data.tag to all error results for both turn-level and conversation-level metrics, ensuring comprehensive tag propagation during error scenarios.
151-189: LGTM! Turn metrics error marking includes tag. The mark_turn_metrics_as_error method properly propagates tag=conv_data.tag when creating error results for individual turn metrics.
191-244: LGTM! Cascade marking propagates tag correctly. The _mark_cascade helper method correctly passes tag=conv_data.tag when marking remaining turns and conversation-level metrics, ensuring tag consistency in cascading error/skip scenarios.

tests/unit/core/output/test_statistics.py (2)
683-740: LGTM! Comprehensive test coverage for by_tag statistics. The test validates that calculate_detailed_stats correctly aggregates results by tag, checking pass/fail counts, rates, and score statistics for multiple tags (production, staging). This provides good coverage of the new tag-based aggregation functionality.
741-759: LGTM! Default tag behavior properly tested. The test validates that when no explicit tag is provided, the default "eval" tag is used in the by_tag statistics. This ensures backward compatibility and correct default behavior.
src/lightspeed_evaluation/core/output/statistics.py (1)
81-114: LGTM! By-tag statistics properly integrated. The calculate_detailed_stats function now correctly:

- Initializes and populates by_tag statistics alongside existing categories
- Includes score statistics for tag-level aggregation (consistent with by_metric)
- Properly finalizes tag statistics with confidence intervals
- Returns the expanded structure including by_tag

The implementation follows the same pattern as the existing by_metric statistics, ensuring consistency.
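For readers who have not opened statistics.py, here is a condensed sketch of the kind of by_tag aggregation described above; the bucket keys, the PASS value, and the result/score attributes are assumptions, only the grouping-by-tag idea comes from the review:

```python
from collections import defaultdict
from typing import Any

def aggregate_by_tag(results: list[Any]) -> dict[str, dict[str, Any]]:
    """Group pass/fail counts and scores per result.tag (sketch only)."""
    by_tag: dict[str, dict[str, Any]] = defaultdict(
        lambda: {"pass": 0, "fail": 0, "scores": []}
    )
    for r in results:
        bucket = by_tag[r.tag]
        bucket["pass" if r.result == "PASS" else "fail"] += 1
        if getattr(r, "score", None) is not None:
            bucket["scores"].append(r.score)
    for bucket in by_tag.values():
        total = bucket["pass"] + bucket["fail"]
        bucket["pass_rate"] = bucket["pass"] / total if total else 0.0
    return dict(by_tag)
```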
src/lightspeed_evaluation/core/output/generator.py (5)

100-100: LGTM! Consistent default structure. Adding "by_tag": {} to the default empty detailed stats structure ensures consistency with the other breakdown categories and prevents potential KeyError issues downstream.
227-232: LGTM! Good defensive coding. The use of .get("by_tag", {}) provides a safe fallback and is consistent with how other statistics categories are handled.
241-259: LGTM! Proper inclusion of tag field in JSON output. The tag field is correctly added to each result object in the JSON output, consistent with the serialization pattern for other fields.

302-304: LGTM! Consistent text output formatting. The "By Tag" breakdown section is properly implemented with defensive coding and score statistics support, following the same pattern as other breakdown categories.
166-202: Tag field is already included in CSV columns configuration. Verification confirms that SUPPORTED_CSV_COLUMNS in src/lightspeed_evaluation/core/constants.py includes "tag" at line 29. The CSV generation code correctly handles dynamic column inclusion using hasattr(), so no action is required.
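A small sketch of that hasattr-based dynamic column handling; the column list and writer setup shown here are assumptions, only the presence of "tag" in SUPPORTED_CSV_COLUMNS is confirmed by the review:

```python
import csv

SUPPORTED_CSV_COLUMNS = ["conversation_group_id", "tag", "metric_id", "result"]  # assumed subset

def write_results_csv(path: str, results: list) -> None:
    """Write one row per result, skipping columns a result object lacks."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=SUPPORTED_CSV_COLUMNS)
        writer.writeheader()
        for r in results:
            writer.writerow(
                {col: getattr(r, col) for col in SUPPORTED_CSV_COLUMNS if hasattr(r, col)}
            )
```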
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
81-114: LGTM! Consistent implementation of by_tag statistics. The by_tag aggregation follows the same pattern as by_metric, appropriately using include_scores=True and properly finalizing statistics with confidence intervals. The empty-result handling is correct.

📝 Optional: Enhance docstring for clarity

Consider making the docstring more explicit about the returned structure:

```diff
 def calculate_detailed_stats(results: list[EvaluationResult]) -> dict[str, Any]:
-    """Calculate detailed statistics broken down by different categories."""
+    """Calculate detailed statistics broken down by different categories.
+
+    Returns:
+        Dictionary with keys 'by_metric', 'by_conversation', and 'by_tag',
+        each containing aggregated pass/fail/error counts, rates, and score statistics.
+    """
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- README.md
- config/evaluation_data.yaml
- docs/configuration.md
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/models/test_data.py
- tests/unit/core/output/test_statistics.py
🚧 Files skipped from review as they are similar to previous changes (4)
- config/evaluation_data.yaml
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
📓 Path-based instructions (3)
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
- tests/unit/core/models/test_data.py
- tests/unit/core/output/test_statistics.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
- tests/unit/core/models/test_data.py
- tests/unit/core/output/test_statistics.py
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
- README.md
- docs/configuration.md
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Do not add new features to lsc_agent_eval/ directory - add to src/lightspeed_evaluation/ instead
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
tests/unit/core/models/test_data.py (1)
src/lightspeed_evaluation/core/models/data.py (3)
TurnData (36-315), EvaluationData (318-386), EvaluationResult (389-463)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult (389-463)

src/lightspeed_evaluation/core/output/statistics.py (1)
calculate_detailed_stats(81-114)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.11)
🔇 Additional comments (13)
docs/configuration.md (1)
216-216: LGTM! Documentation is clear and consistent. The documentation properly describes the new "tag" CSV column and includes it in the example configuration. This aligns well with the field description in README.md.
Also applies to: 249-249
README.md (1)
157-157: LGTM! Clear documentation of the new tag field. The documentation properly describes the tag field as optional with a default value of "eval", making it easy for users to understand and use this feature for grouping evaluation conversations.
Also applies to: 214-214
tests/unit/core/output/test_statistics.py (1)
683-758: LGTM! Comprehensive test coverage for by_tag functionality. The tests properly validate:
- Tag-based aggregation with explicit tag values ("production", "staging")
- Default tag behavior (defaulting to "eval")
- Pass/fail/error counts and rates per tag
- Score statistics inclusion for tag breakdowns
This provides solid coverage for the new feature.
src/lightspeed_evaluation/core/output/generator.py (4)
100-100: LGTM! Proper default for empty results. The empty detailed stats correctly includes the by_tag key, ensuring consistent structure across all code paths.

227-232: LGTM! Proper inclusion of by_tag in JSON summary. The by_tag statistics are correctly added to the summary_stats structure with appropriate fallback to an empty dict.

241-259: LGTM! Tag field properly included in result output. Each evaluation result now includes the tag field (line 244), enabling downstream consumers to filter or group results by tag.
302-304: LGTM! Text summary includes by_tag breakdown. The "By Tag" section is appropriately added to the text output with include_scores=True, providing score statistics for tag-based groupings consistent with the by_metric section.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
166-204: LGTM! Tag propagation implemented correctly. The tag field is properly propagated from request.conv_data.tag to the EvaluationResult in the success path, enabling conversation grouping by tag.

215-238: LGTM! Tag propagation in error path is consistent. The tag field is correctly propagated in the error result path, ensuring all evaluation results include the tag regardless of success or failure status.
tests/unit/core/models/test_data.py (4)
276-296: LGTM! Explicit tag value test is comprehensive. The test correctly verifies that explicit tag values are properly stored in EvaluationData.

297-309: LGTM! Tag validation tests are thorough. The tests comprehensively cover both the default tag value ("eval") and the validation constraint that rejects empty tags.

327-342: LGTM! Default tag assertion added correctly. The test properly verifies that EvaluationResult has the default tag value "eval".

343-366: LGTM! EvaluationResult tag tests are comprehensive. The tests thoroughly verify both explicit tag values and validation constraints for EvaluationResult, maintaining consistency with the EvaluationData tests.
VladimirKadlec
left a comment
LGTM, one comment for the future refactor.
```python
turn_data = request.turn_data
return EvaluationResult(
    conversation_group_id=request.conv_data.conversation_group_id,
    tag=request.conv_data.tag,
```
EvaluationResult is becoming really crazy :-/
Not in this PR, but for some future refactor, I'd recommend:
- storing the whole request
- storing the whole turn_data
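A rough sketch of what that suggested refactor could look like; this is purely illustrative, the class and field names are assumptions, and nothing here is part of this PR:

```python
from typing import Any, Optional
from pydantic import BaseModel

class EvaluationResultSketch(BaseModel):
    """Nest the originals instead of copying individual fields onto the result."""
    request: Any                     # the whole evaluation request, incl. conv_data.tag
    turn_data: Optional[Any] = None  # the whole TurnData for turn-level metrics
    result: str
    reason: str = ""

# Consumers would then read result.request.conv_data.tag rather than result.tag.
```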
Description
Adds an optional property tag (default: eval) to group eval conversations. Some refactoring/modularization was already done to keep this a minimal change (#132 & #131).

Additional refactoring can be done for errors.py (will be covered in a separate PR).
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation
Tests