
Add optional property tag to group eval conversations #134

Merged
asamal4 merged 2 commits into lightspeed-core:main from asamal4:add-tag
Jan 11, 2026

Conversation

@asamal4
Collaborator

asamal4 commented on Jan 8, 2026

Description

  • Added optional property tag (default value: eval) to group eval conversations.
  • Added statistics by tag

Some refactoring/modularization was already done to keep this a minimal change (see #132 and #131).
Additional refactoring of errors.py will be covered in a separate PR.
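
For illustration, here is a minimal sketch of how such a tag field could be declared and used, assuming Pydantic models as the change summary below indicates; the class is a trimmed-down stand-in, not the actual EvaluationData definition:

```python
# Simplified sketch of the new optional property, not the actual model definition.
from pydantic import BaseModel, Field


class EvaluationData(BaseModel):
    """Trimmed-down stand-in for the real EvaluationData model."""

    conversation_group_id: str
    tag: str = Field(
        default="eval",  # default group when no tag is supplied
        min_length=1,    # empty tags are rejected
        description="Tag for grouping and filtering eval conversations",
    )


basic = EvaluationData(conversation_group_id="conv1")               # tag == "eval"
tools = EvaluationData(conversation_group_id="conv2", tag="tools")  # explicit group
```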

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes: LEADS-150

Checklist before requesting a review

  • [ ] I have performed a self-review of my code.
  • [ ] PR has passed all pre-merge test jobs.
  • [ ] If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Added an optional "tag" field (default "eval") for grouping and filtering evaluation conversations.
    • Added tag-based breakdowns in evaluation statistics and output summaries.
  • Documentation

    • Updated docs and examples to document the new tag field and CSV column.
  • Tests

    • Added tests for tag behavior, defaulting, validation, and tag-based statistics.


@coderabbitai
Contributor

coderabbitai bot commented Jan 8, 2026

Walkthrough

Adds a new optional string field tag (default "eval") across the evaluation pipeline: models, config, CSV constants, statistics aggregation, output generation, error/result creation, and tests to enable grouping/filtering of conversations.

Changes

Cohort / File(s) — Summary

  • Docs & Config (README.md, docs/configuration.md, config/evaluation_data.yaml): Documented the tag column, added tag to the csv_columns example and to the sample groups (basic, tools, infrastructure).
  • Constants & Models (src/lightspeed_evaluation/core/constants.py, src/lightspeed_evaluation/core/models/data.py): Added "tag" to SUPPORTED_CSV_COLUMNS; added tag: str = Field(default="eval", ...) to EvaluationData and EvaluationResult (validation and descriptions).
  • Statistics & Output (src/lightspeed_evaluation/core/output/statistics.py, src/lightspeed_evaluation/core/output/generator.py): Introduced by_tag aggregation in detailed stats; included by_tag and tag when rendering JSON and text summaries.
  • Pipeline: Errors & Evaluator (src/lightspeed_evaluation/pipeline/evaluation/errors.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py): Added a tag parameter (default "eval") to error/result creation paths and propagated request.conv_data.tag into EvaluationResult instances across success/error flows.
  • Tests (tests/unit/core/models/test_data.py, tests/unit/core/output/test_statistics.py): Added tests for tag field default/validation in models and tests for by_tag breakdown and default-tag behavior in statistics.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Conv as Conversation Data (config / input)
    participant Eval as Evaluator
    participant Err as ErrorHandler
    participant Result as EvaluationResult (model)
    participant Stats as Statistics
    participant Out as Output Generator

    Conv->>Eval: request with conv_data.tag
    Eval->>Result: create EvaluationResult(tag=conv_data.tag)
    Eval->>Err: (on errors) create_error_result(tag=conv_data.tag)
    Result->>Stats: accumulate by_metric / by_conversation / by_tag
    Stats->>Out: provide detailed_stats (includes by_tag)
    Out->>Client: render JSON/text including tag and by_tag breakdown
```
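
The propagation step in the diagram boils down to copying the conversation's tag onto every result the evaluator produces. A hedged sketch with simplified stand-in types (the real request/result objects carry many more fields):

```python
# Hedged sketch of tag propagation; ConvData/EvaluationResult here are stand-ins,
# not the project's actual classes.
from dataclasses import dataclass


@dataclass
class ConvData:
    conversation_group_id: str
    tag: str = "eval"


@dataclass
class EvaluationResult:
    conversation_group_id: str
    tag: str
    status: str


def build_result(conv_data: ConvData, status: str) -> EvaluationResult:
    # Success and error paths use the same rule: the tag always comes from conv_data.
    return EvaluationResult(
        conversation_group_id=conv_data.conversation_group_id,
        tag=conv_data.tag,
        status=status,
    )
```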

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • VladimirKadlec
  • tisnik
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title accurately describes the main change: adding an optional tag property to group evaluation conversations, which is the core feature introduced across the codebase.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.




coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/models/data.py (1)

397-399: Consider adding min_length=1 validation for consistency.

The tag field in EvaluationData (Line 331) has min_length=1 validation, but EvaluationResult.tag does not. This inconsistency could allow empty tags in results while preventing them in data models, potentially causing issues with tag-based grouping and statistics.

🔧 Suggested fix for consistency
```diff
 tag: str = Field(
-    default="eval", description="Tag for grouping and filtering results"
+    default="eval",
+    min_length=1,
+    description="Tag for grouping and filtering results"
 )
```
tests/unit/core/models/test_data.py (1)

297-302: Consider adding a test for empty tag validation.

While the current tests cover default and explicit tag values well, there's no test verifying that empty tags are rejected for EvaluationData (which has min_length=1 validation on Line 331 of data.py). This would ensure the validation works as expected.

🧪 Suggested test addition

Add this test to the TestEvaluationData class:

```python
def test_empty_tag_rejected(self):
    """Test that empty tag is rejected."""
    turn = TurnData(turn_id="turn1", query="Query")

    with pytest.raises(ValidationError):
        EvaluationData(conversation_group_id="conv1", turns=[turn], tag="")
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef377db and 9d43a51.

📒 Files selected for processing (11)
  • README.md
  • config/evaluation_data.yaml
  • docs/configuration.md
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/output/test_statistics.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/output/generator.py
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/models/test_data.py
  • tests/unit/core/output/test_statistics.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/models/test_data.py
  • tests/unit/core/output/test_statistics.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • docs/configuration.md
  • config/evaluation_data.yaml
  • README.md
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-07-16T10:41:09.399Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-07-16T09:42:00.691Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/data/script/eval3/setup.sh:1-3
Timestamp: 2025-07-16T09:42:00.691Z
Learning: Scripts in the lsc_agent_eval/data directory are meant to be simple examples/samples for teams to customize according to their needs, not production-ready code.

Applied to files:

  • config/evaluation_data.yaml
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Do not add new features to lsc_agent_eval/ directory - add to src/lightspeed_evaluation/ instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (389-461)
tests/unit/core/models/test_data.py (1)
src/lightspeed_evaluation/core/models/data.py (3)
  • TurnData (36-315)
  • EvaluationData (318-386)
  • EvaluationResult (389-461)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (389-461)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_detailed_stats (81-114)
🔇 Additional comments (20)
docs/configuration.md (1)

216-216: LGTM! Documentation is clear and consistent.

The documentation accurately describes the new tag CSV column and its purpose for grouping eval conversations. The placement after conversation_group_id aligns with the implementation in the constants file.

Also applies to: 249-249

README.md (1)

157-157: LGTM! Documentation is accurate and complete.

The README correctly documents the new tag field with its optional nature, default value, and purpose. The documentation is consistent with the implementation in the data models.

Also applies to: 214-214

src/lightspeed_evaluation/core/constants.py (1)

29-29: LGTM! Appropriate placement of the new column.

The tag column is logically positioned after conversation_group_id, which makes sense given that tags are used for grouping conversations.

tests/unit/core/models/test_data.py (1)

336-347: LGTM! Comprehensive test coverage for the tag field.

The tests effectively verify both default and explicit tag values for EvaluationResult, ensuring the new feature works as intended.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

168-168: LGTM! Tag propagation implemented correctly.

The tag field is correctly propagated from request.conv_data.tag to both successful evaluation results (line 168) and error results (line 222), ensuring consistent tagging across all evaluation outcomes.

Also applies to: 222-222

config/evaluation_data.yaml (1)

5-5: LGTM! Tag field examples are clear and well-documented.

The configuration examples demonstrate meaningful tag usage (basic, tools, infrastructure) with helpful inline documentation of the default value. This provides good guidance for users adopting the new tag feature.

Also applies to: 60-60, 99-99

src/lightspeed_evaluation/pipeline/evaluation/errors.py (6)

18-48: LGTM! Tag parameter properly added to result creation.

The _create_result method now correctly accepts and propagates the tag parameter with an appropriate default value ("eval"), proper documentation, and correct parameter handling (keyword-only with *).


50-72: LGTM! Error result creation includes tag.

The create_error_result method properly propagates the tag parameter through to _create_result, maintaining consistency in error handling.


74-96: LGTM! Skipped result creation includes tag.

The create_skipped_result method correctly handles tag propagation, ensuring skipped evaluations are also properly tagged.


98-149: LGTM! All metrics error marking propagates tag correctly.

The mark_all_metrics_as_error method correctly passes tag=conv_data.tag to all error results for both turn-level and conversation-level metrics, ensuring comprehensive tag propagation during error scenarios.


151-189: LGTM! Turn metrics error marking includes tag.

The mark_turn_metrics_as_error method properly propagates tag=conv_data.tag when creating error results for individual turn metrics.


191-244: LGTM! Cascade marking propagates tag correctly.

The _mark_cascade helper method correctly passes tag=conv_data.tag when marking remaining turns and conversation-level metrics, ensuring tag consistency in cascading error/skip scenarios.
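
Taken together, the errors.py comments describe one recurring signature change: a keyword-only tag parameter defaulting to "eval" on the result-creation helpers. A rough sketch under that assumption (the real helpers take metric, reason, and other arguments; names besides tag are illustrative):

```python
# Rough sketch of the keyword-only tag parameter; not the actual errors.py signature.
from typing import Any


def create_error_result(
    conversation_group_id: str,
    *,
    tag: str = "eval",  # keyword-only, same default as the data model
    reason: str = "",
) -> dict[str, Any]:
    """Build a minimal error-result payload that carries the conversation's tag."""
    return {
        "conversation_group_id": conversation_group_id,
        "tag": tag,
        "status": "ERROR",
        "reason": reason,
    }


# Call sites propagate the conversation's tag explicitly, e.g.:
# create_error_result(conv_data.conversation_group_id, tag=conv_data.tag, reason="API timeout")
```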

tests/unit/core/output/test_statistics.py (2)

683-740: LGTM! Comprehensive test coverage for by_tag statistics.

The test validates that calculate_detailed_stats correctly aggregates results by tag, checking pass/fail counts, rates, and score statistics for multiple tags (production, staging). This provides good coverage of the new tag-based aggregation functionality.


741-759: LGTM! Default tag behavior properly tested.

The test validates that when no explicit tag is provided, the default "eval" tag is used in the by_tag statistics. This ensures backward compatibility and correct default behavior.

src/lightspeed_evaluation/core/output/statistics.py (1)

81-114: LGTM! By-tag statistics properly integrated.

The calculate_detailed_stats function now correctly:

  • Initializes and populates by_tag statistics alongside existing categories
  • Includes score statistics for tag-level aggregation (consistent with by_metric)
  • Properly finalizes tag statistics with confidence intervals
  • Returns the expanded structure including by_tag

The implementation follows the same pattern as existing by_metric statistics, ensuring consistency.
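
As a hedged sketch of that pattern (dict-based and much simplified; the real calculate_detailed_stats works on EvaluationResult objects and also produces by_metric/by_conversation breakdowns and confidence intervals):

```python
# Simplified by_tag aggregation in the spirit of calculate_detailed_stats.
from collections import defaultdict
from typing import Any


def aggregate_by_tag(results: list[dict[str, Any]]) -> dict[str, Any]:
    by_tag: dict[str, dict[str, Any]] = defaultdict(
        lambda: {"total": 0, "pass": 0, "fail": 0, "scores": []}
    )
    for result in results:
        bucket = by_tag[result.get("tag", "eval")]  # untagged results fall into "eval"
        bucket["total"] += 1
        bucket["pass" if result["status"] == "PASS" else "fail"] += 1
        if result.get("score") is not None:
            bucket["scores"].append(result["score"])
    for bucket in by_tag.values():  # finalize rates (real code also adds confidence intervals)
        bucket["pass_rate"] = bucket["pass"] / bucket["total"]
    return dict(by_tag)


stats = aggregate_by_tag([
    {"tag": "production", "status": "PASS", "score": 0.9},
    {"tag": "staging", "status": "FAIL", "score": 0.2},
    {"status": "PASS", "score": 0.8},  # no tag -> grouped under the default "eval"
])
```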

src/lightspeed_evaluation/core/output/generator.py (5)

100-100: LGTM! Consistent default structure.

Adding "by_tag": {} to the default empty detailed stats structure ensures consistency with the other breakdown categories and prevents potential KeyError issues downstream.


227-232: LGTM! Good defensive coding.

The use of .get("by_tag", {}) provides a safe fallback and is consistent with how other statistics categories are handled.


241-259: LGTM! Proper inclusion of tag field in JSON output.

The tag field is correctly added to each result object in the JSON output, consistent with the serialization pattern for other fields.


302-304: LGTM! Consistent text output formatting.

The "By Tag" breakdown section is properly implemented with defensive coding and score statistics support, following the same pattern as other breakdown categories.


166-202: Tag field is already included in CSV columns configuration.

Verification confirms that SUPPORTED_CSV_COLUMNS in src/lightspeed_evaluation/core/constants.py includes "tag" at line 29. The CSV generation code correctly handles dynamic column inclusion using hasattr(), so no action is required.
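
The output-side handling described in the generator.py comments above amounts to a defensive lookup for by_tag plus per-result tagging. A hedged sketch (keys and field names are illustrative, not the generator's actual schema):

```python
# Hedged sketch of by_tag handling in summary output; fields are illustrative.
import json
from typing import Any


def render_json_summary(detailed_stats: dict[str, Any], results: list[dict[str, Any]]) -> str:
    summary = {
        "by_metric": detailed_stats.get("by_metric", {}),
        "by_conversation": detailed_stats.get("by_conversation", {}),
        "by_tag": detailed_stats.get("by_tag", {}),  # safe fallback when stats are empty
        "results": [
            {
                "conversation_group_id": r["conversation_group_id"],
                "tag": r.get("tag", "eval"),  # every result row carries its tag
                "status": r["status"],
            }
            for r in results
        ],
    }
    return json.dumps(summary, indent=2)
```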


coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/output/statistics.py (1)

81-114: LGTM! Consistent implementation of by_tag statistics.

The by_tag aggregation follows the same pattern as by_metric, appropriately using include_scores=True and properly finalizing statistics with confidence intervals. The empty-result handling is correct.

📝 Optional: Enhance docstring for clarity

Consider making the docstring more explicit about the returned structure:

```diff
 def calculate_detailed_stats(results: list[EvaluationResult]) -> dict[str, Any]:
-    """Calculate detailed statistics broken down by different categories."""
+    """Calculate detailed statistics broken down by different categories.
+
+    Returns:
+        Dictionary with keys 'by_metric', 'by_conversation', and 'by_tag',
+        each containing aggregated pass/fail/error counts, rates, and score statistics.
+    """
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d43a51 and 6230230.

📒 Files selected for processing (11)
  • README.md
  • config/evaluation_data.yaml
  • docs/configuration.md
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/output/test_statistics.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • config/evaluation_data.yaml
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
📓 Path-based instructions (3)
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/models/test_data.py
  • tests/unit/core/output/test_statistics.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/models/test_data.py
  • tests/unit/core/output/test_statistics.py
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • README.md
  • docs/configuration.md
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Do not add new features to lsc_agent_eval/ directory - add to src/lightspeed_evaluation/ instead

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (2)
tests/unit/core/models/test_data.py (1)
src/lightspeed_evaluation/core/models/data.py (3)
  • TurnData (36-315)
  • EvaluationData (318-386)
  • EvaluationResult (389-463)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (389-463)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_detailed_stats (81-114)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
🔇 Additional comments (13)
docs/configuration.md (1)

216-216: LGTM! Documentation is clear and consistent.

The documentation properly describes the new "tag" CSV column and includes it in the example configuration. This aligns well with the field description in README.md.

Also applies to: 249-249

README.md (1)

157-157: LGTM! Clear documentation of the new tag field.

The documentation properly describes the tag field as optional with a default value of "eval", making it easy for users to understand and use this feature for grouping evaluation conversations.

Also applies to: 214-214

tests/unit/core/output/test_statistics.py (1)

683-758: LGTM! Comprehensive test coverage for by_tag functionality.

The tests properly validate:

  • Tag-based aggregation with explicit tag values ("production", "staging")
  • Default tag behavior (defaulting to "eval")
  • Pass/fail/error counts and rates per tag
  • Score statistics inclusion for tag breakdowns

This provides solid coverage for the new feature.

src/lightspeed_evaluation/core/output/generator.py (4)

100-100: LGTM! Proper default for empty results.

The empty detailed stats correctly includes the by_tag key, ensuring consistent structure across all code paths.


227-232: LGTM! Proper inclusion of by_tag in JSON summary.

The by_tag statistics are correctly added to the summary_stats structure with appropriate fallback to an empty dict.


241-259: LGTM! Tag field properly included in result output.

Each evaluation result now includes the tag field (line 244), enabling downstream consumers to filter or group results by tag.


302-304: LGTM! Text summary includes by_tag breakdown.

The "By Tag" section is appropriately added to the text output with include_scores=True, providing score statistics for tag-based groupings consistent with the by_metric section.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

166-204: LGTM! Tag propagation implemented correctly.

The tag field is properly propagated from request.conv_data.tag to the EvaluationResult in the success path, enabling conversation grouping by tag.


215-238: LGTM! Tag propagation in error path is consistent.

The tag field is correctly propagated in the error result path, ensuring all evaluation results include the tag regardless of success or failure status.

tests/unit/core/models/test_data.py (4)

276-296: LGTM! Explicit tag value test is comprehensive.

The test correctly verifies that explicit tag values are properly stored in EvaluationData.


297-309: LGTM! Tag validation tests are thorough.

The tests comprehensively cover both the default tag value ("eval") and the validation constraint that rejects empty tags.


327-342: LGTM! Default tag assertion added correctly.

The test properly verifies that EvaluationResult has the default tag value "eval".


343-366: LGTM! EvaluationResult tag tests are comprehensive.

The tests thoroughly verify both explicit tag values and validation constraints for EvaluationResult, maintaining consistency with the EvaluationData tests.


VladimirKadlec left a comment


LGTM, one comment for the future refactor.

```python
turn_data = request.turn_data
return EvaluationResult(
    conversation_group_id=request.conv_data.conversation_group_id,
    tag=request.conv_data.tag,
```


EvaluationResult is becoming really crazy :-/

Not in this PR, but for some future refactor, I'd recommend:

  • storing the whole request
  • storing the whole turn_data
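
A purely hypothetical sketch of that suggestion (not part of this PR; all names are illustrative):

```python
# Hypothetical future refactor: store the whole request/turn_data instead of
# copying individual fields. Not the project's actual code.
from dataclasses import dataclass
from typing import Any


@dataclass
class EvaluationResultV2:
    request: Any    # whole evaluation request, including conv_data and its tag
    turn_data: Any  # whole turn payload
    status: str

    @property
    def tag(self) -> str:
        # Derived rather than duplicated, so tag-based grouping keeps working.
        return self.request.conv_data.tag
```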

asamal4 merged commit 4fb4eb8 into lightspeed-core:main on Jan 11, 2026
15 checks passed