add support for fail_on_invalid_data option #94
Conversation
Walkthrough: Adds config flag `fail_on_invalid_data` that controls whether invalid evaluation data fails the run or is logged and the affected metrics skipped.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Config as Config
    participant Validator as DataValidator
    participant Data as Turn/Eval Data
    participant Processor as ConversationProcessor
    participant Evaluator as MetricsEvaluator
    Config->>Validator: __init__(api_enabled, fail_on_invalid_data)
    Validator->>Validator: validate(data)
    alt fail_on_invalid_data = true
        Validator-->>Config: return False on validation error
        Note over Validator: validation fails (hard)
    else fail_on_invalid_data = false
        Validator->>Data: add_invalid_metric(metric)
        Validator->>Validator: log notice, continue
        Note over Validator: metric flagged but validation continues
    end
    Processor->>Data: is_metric_invalid(metric)?
    alt metric invalid
        Processor->>Processor: log error, emit ERROR result, skip metric
    else metric valid
        Processor->>Evaluator: evaluate(EvaluationRequest)
        Evaluator-->>Processor: result / None
    end
```
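For illustration, a minimal Python sketch of the branch the diagram describes. This is not the project's actual `DataValidator` code; only the `fail_on_invalid_data` flag and the `add_invalid_metric` hook come from this PR, the rest is a simplified stand-in:

```python
import logging

logger = logging.getLogger(__name__)


class DataValidatorSketch:
    """Simplified stand-in for DataValidator, showing only the new branch."""

    def __init__(self, api_enabled: bool, fail_on_invalid_data: bool = True) -> None:
        self.api_enabled = api_enabled
        self.fail_on_invalid_data = fail_on_invalid_data

    def handle_invalid_metric(self, turn_data, metric: str, reason: str) -> bool:
        """Return False to fail validation hard, or flag the metric and continue."""
        if self.fail_on_invalid_data:
            logger.error("Invalid metric %s: %s", metric, reason)
            return False  # hard failure: validation reports the data set as invalid
        # soft path: mark the metric so the processor can skip it later
        turn_data.add_invalid_metric(metric)
        logger.warning("Invalid metric %s: %s (continuing)", metric, reason)
        return True
```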
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Force-pushed from adef500 to 7eb05b7.
Thanks @VladimirKadlec for working on this 👍 I agree with comments from Asutosh.
Force-pushed from 3a498c2 to 059c7c9.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/models/system.py (1)
265-268: Improve field description for clarity. The description has a minor grammatical issue (missing comma) and could be more explicit about the behavior in both states.
Consider this improved description:
```diff
 fail_on_invalid_data: bool = Field(
     default=True,
-    description="If False don't fail on invalid conversations",
+    description="When True, raise errors on validation failures; when False, log warnings and continue processing",
 )
```
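For reference, the suggested description dropped into a stand-alone Pydantic model (the container class below is hypothetical; the real field lives in `SystemConfig`):

```python
from pydantic import BaseModel, Field


class ValidationSettingsSketch(BaseModel):
    """Hypothetical container; only the field mirrors the suggested change."""

    fail_on_invalid_data: bool = Field(
        default=True,
        description=(
            "When True, raise errors on validation failures; "
            "when False, log warnings and continue processing"
        ),
    )
```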
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- README.md (1 hunks)
- config/system.yaml (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/models/system.py (1 hunks)
- src/lightspeed_evaluation/core/system/validator.py (6 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (2 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (1 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- README.md
- config/system.yaml
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
- tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
- tests/unit/pipeline/evaluation/test_processor.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
src/lightspeed_evaluation/core/system/validator.py (1)
DataValidator(79-331)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
- is_metric_invalid (95-97)
- is_metric_invalid (355-357)
🔇 Additional comments (2)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
65-68: LGTM! The DataValidator initialization correctly passes the `fail_on_invalid_data` configuration parameter, aligning with the PR objective to add configurable validation behavior.
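As a rough sketch of that wiring, with simplified stand-ins for SystemConfig and DataValidator (the attribute names below are assumptions, not the real model fields):

```python
from dataclasses import dataclass


@dataclass
class SystemConfigStub:
    """Stand-in for the relevant slice of the Pydantic SystemConfig."""

    api_enabled: bool = True
    fail_on_invalid_data: bool = True


class DataValidatorStub:
    """Stand-in mirroring the constructor signature shown in the sequence diagram."""

    def __init__(self, api_enabled: bool, fail_on_invalid_data: bool = True) -> None:
        self.api_enabled = api_enabled
        self.fail_on_invalid_data = fail_on_invalid_data


def build_validator(config: SystemConfigStub) -> DataValidatorStub:
    # The pipeline forwards the flag from system config at construction time.
    return DataValidatorStub(
        api_enabled=config.api_enabled,
        fail_on_invalid_data=config.fail_on_invalid_data,
    )
```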
tests/unit/pipeline/evaluation/test_processor.py (1)
1-394: Excellent test coverage! The test suite comprehensively covers the new invalid metric handling functionality with well-structured test cases including:
- Valid metrics scenarios
- Invalid metric detection and skipping
- Mixed valid/invalid metrics
- Edge cases (empty metrics, evaluator returning None)
- Verification of correct request construction and metric ordering
The tests properly use fixtures, mocks, and log capture to validate behavior.
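A condensed, self-contained sketch of the "invalid metric detection and skipping" case, assuming pytest and pytest-mock are installed. The stand-in classes and helper below only mimic the behavior the real tests exercise; they are not the actual test code:

```python
import logging


class TurnStub:
    """Mimics TurnData's invalid-metric tracking for this sketch."""

    def __init__(self, metrics):
        self.metrics = metrics
        self._invalid = set()

    def add_invalid_metric(self, metric):
        self._invalid.add(metric)

    def is_metric_invalid(self, metric):
        return metric in self._invalid


def evaluate_turn(turn, evaluator, logger=logging.getLogger(__name__)):
    """Simplified stand-in for the processor's per-turn metric loop."""
    results = []
    for metric in turn.metrics:
        if turn.is_metric_invalid(metric):
            logger.error("Skipping invalid metric: %s", metric)
            results.append(("ERROR", metric))
            continue
        results.append(("PASS", evaluator(metric)))
    return results


def test_invalid_metric_is_skipped(mocker, caplog):
    turn = TurnStub(metrics=["valid_metric", "broken_metric"])
    turn.add_invalid_metric("broken_metric")
    evaluator = mocker.Mock(side_effect=lambda metric: metric)

    with caplog.at_level(logging.ERROR):
        results = evaluate_turn(turn, evaluator)

    assert results[1][0] == "ERROR"       # invalid metric yields an ERROR result
    assert evaluator.call_count == 1      # evaluator only sees the valid metric
    assert "Skipping invalid metric" in caplog.text
```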
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (2 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/pipeline/evaluation/test_processor.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
🧬 Code graph analysis (1)
tests/unit/pipeline/evaluation/test_processor.py (8)
src/lightspeed_evaluation/core/models/data.py (8)
- EvaluationData (309-367)
- EvaluationRequest (426-479)
- EvaluationResult (370-409)
- TurnData (35-306)
- add_invalid_metric (91-93)
- add_invalid_metric (351-353)
- is_metric_invalid (95-97)
- is_metric_invalid (355-357)

src/lightspeed_evaluation/core/models/system.py (1)
- SystemConfig (271-301)

src/lightspeed_evaluation/core/system/loader.py (1)
- ConfigLoader (70-125)

src/lightspeed_evaluation/pipeline/evaluation/processor.py (3)
- ConversationProcessor (37-299)
- ProcessorComponents (27-34)
- _evaluate_turn (181-214)

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
- MetricsEvaluator (25-172)

src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
- APIDataAmender (13-80)

src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
- EvaluationErrorHandler (10-201)

src/lightspeed_evaluation/core/metrics/manager.py (1)
- MetricManager (17-166)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
🔇 Additional comments (11)
tests/unit/pipeline/evaluation/test_processor.py (11)
1-19: LGTM! Imports are well-organized and appropriate for the test module. Correctly uses pytest as required by the coding guidelines.
21-121: LGTM! Fixtures are well-structured and follow pytest best practices. Correctly uses the `mocker` fixture from pytest-mock as required by coding guidelines. Good use of fixture composition for assembling the processor with all its dependencies.
126-150: LGTM! Comprehensive test of the happy path with valid metrics. Correctly verifies result count, types, evaluator invocation count, and metric identifier ordering.
152-184: LGTM! Well-structured test for invalid metric handling. Correctly verifies that invalid metrics produce ERROR results without invoking the evaluator, while valid metrics are still evaluated. Good use of `caplog` to verify error logging.
218-254: LGTM! Excellent test for mixed valid/invalid metrics. Correctly verifies that result order is preserved, invalid metrics produce ERROR results at the correct position, and only valid metrics trigger evaluator calls.
256-273: LGTM! Good edge case test for an empty metrics list. Correctly verifies no results and no evaluator invocation.
275-299: LGTM! Thorough verification of EvaluationRequest construction. Correctly checks that all key fields (conv_data, metric_identifier, turn_id, turn_idx) are properly populated from the input data.
301-324: LGTM! Good test for handling the evaluator returning None. Correctly reconfigures the mock and verifies graceful handling with empty results while confirming the evaluator is still invoked.
326-347: LGTM! Good test for turn index handling in multi-turn conversations. Correctly verifies that turn_idx and turn_id are properly set when evaluating a non-first turn.
349-374: LGTM! Excellent test for metric evaluation order preservation. Correctly verifies that metrics are evaluated in the order provided, which is important for consistent behavior.
376-401: LGTM! Comprehensive test for TurnData's invalid metric tracking. While this tests the data model directly rather than the processor, it's appropriately placed here since this functionality is central to the processor's behavior. Good coverage of initial state, additions, and idempotent behavior.
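A minimal sketch, assuming simple set-backed tracking, of the helper behavior this test pins down (initial empty state, additions, idempotent re-adds). The real `TurnData` is a richer Pydantic model, so treat this purely as an illustration:

```python
class InvalidMetricTrackerSketch:
    """Illustrative stand-in for TurnData's add_invalid_metric / is_metric_invalid."""

    def __init__(self) -> None:
        self._invalid_metrics: set[str] = set()

    def add_invalid_metric(self, metric: str) -> None:
        # A set makes repeated additions of the same metric naturally idempotent.
        self._invalid_metrics.add(metric)

    def is_metric_invalid(self, metric: str) -> bool:
        return metric in self._invalid_metrics


tracker = InvalidMetricTrackerSketch()
assert not tracker.is_metric_invalid("some_metric")  # initial state: nothing flagged
tracker.add_invalid_metric("some_metric")
tracker.add_invalid_metric("some_metric")            # re-adding is a no-op
assert tracker.is_metric_invalid("some_metric")
```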
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_processor.py (1)
41-61: Consider constructing `EvaluationResult` with all required fields in the mock.
`mock_metrics_evaluator` currently only sets a subset of `EvaluationResult` fields. Given the model also exposes `query`, `response`, and `execution_time`, it's safer (and closer to production behavior) to populate them from the request so tests don't silently diverge from real data or break if those fields are required. For example:
```diff
-    def evaluate_metric(request):
-        """Mock evaluate_metric that returns a result based on metric."""
-        return EvaluationResult(
-            conversation_group_id=request.conv_data.conversation_group_id,
-            turn_id=request.turn_id,
-            metric_identifier=request.metric_identifier,
-            result="PASS",
-            score=0.85,
-            reason="Test evaluation",
-            threshold=0.7,
-        )
+    def evaluate_metric(request):
+        """Mock evaluate_metric that returns a result based on metric."""
+        return EvaluationResult(
+            conversation_group_id=request.conv_data.conversation_group_id,
+            turn_id=request.turn_id,
+            metric_identifier=request.metric_identifier,
+            result="PASS",
+            score=0.85,
+            reason="Test evaluation",
+            threshold=0.7,
+            query=getattr(getattr(request, "turn_data", None), "query", ""),
+            response=getattr(getattr(request, "turn_data", None), "response", ""),
+            execution_time=0.0,
+        )
```

Also, good call using the `mocker` fixture and `spec=MetricsEvaluator`/`ConfigLoader`, which aligns with the pytest/pytest-mock testing guidelines. As per coding guidelines, ...
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/pipeline/evaluation/test_processor.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
🧬 Code graph analysis (1)
tests/unit/pipeline/evaluation/test_processor.py (1)
src/lightspeed_evaluation/core/models/data.py (8)
- EvaluationData (309-367)
- EvaluationRequest (426-479)
- EvaluationResult (370-409)
- TurnData (35-306)
- add_invalid_metric (91-93)
- add_invalid_metric (351-353)
- is_metric_invalid (95-97)
- is_metric_invalid (355-357)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
🔇 Additional comments (5)
tests/unit/pipeline/evaluation/test_processor.py (5)
126-151: Valid-metric path is well covered. This test nicely asserts both the number of results and that `evaluate_metric` is invoked twice with correctly ordered `metric_identifier` values, which pins the expected request sequence for the happy path.
152-216: Invalid and mixed-metric tests accurately encode ERROR semantics and logging. The three tests around invalid metrics (single invalid, all invalid, and mixed valid/invalid) clearly assert:
- ERROR `EvaluationResult`s for invalid metrics,
- that valid metrics are still evaluated in-place,
- evaluator call counts (0 / 1 / 2 as appropriate), and
- presence of the expected log messages via `caplog`.

The updated docstring for the "all invalid metrics" case now matches the asserted behavior, resolving the earlier mismatch noted in prior review comments.
Also applies to: 218-255
256-324: Edge cases for empty metric lists and `None` evaluator results are handled correctly. The tests for an empty `turn_metrics` list and for `evaluate_metric` returning `None` both assert that no results are produced while still verifying evaluator invocation where relevant. This gives good coverage of "no-op" and "dropped result" scenarios without over-specifying internal behavior.
326-374: Turn index and metric evaluation order are well specified. The tests around multiple turns and metric ordering ensure:
- `turn_idx` and `turn_id` are correctly propagated into `EvaluationRequest`, and
- metrics are evaluated strictly in the order provided.

These constraints are important for downstream consumers and make future refactors to `_evaluate_turn` safer.
376-401: `TurnData` invalid-metric helpers are thoroughly tested. This test cleanly covers the default state, marking metrics invalid, multiple invalid metrics, and idempotent re-adding of the same metric. It gives solid confidence in the internal `_invalid_metrics` behavior that the processor relies on.
LCORE-647
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation