
chore: refactor processor & errors.py #119

Merged
asamal4 merged 1 commit into lightspeed-core:main from asamal4:refactor-proc on Dec 16, 2025

Conversation

@asamal4 asamal4 (Collaborator) commented Dec 15, 2025

Description

Refactor to eliminate the too-many-locals issue in processor.py and reduce duplicated code in errors.py.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

Release Notes

  • Refactor
    • Improved error handling and setup/cleanup script management in the evaluation pipeline for more consistent behavior across different failure scenarios.
    • Standardized error result construction across the metric evaluation system for better consistency and maintainability.


@coderabbitai coderabbitai bot (Contributor) commented Dec 15, 2025

Walkthrough

This PR introduces a new create_error_result helper method in EvaluationErrorHandler and refactors error-marking logic to centralize ERROR result construction. Additionally, it restructures ConversationProcessor.process_conversation with a new TurnProcessingContext dataclass and adds helper methods for setup/cleanup script execution, per-turn API amendment, error cascade handling, and turn/conversation-level evaluation orchestration.
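The exact fields of TurnProcessingContext are not shown on this page. A minimal sketch of such a context object, assuming field names based only on the walkthrough description (per-turn and per-conversation metric state, accumulated results, and a mutable conversation ID), might look like this:

    from dataclasses import dataclass, field
    from typing import Optional

    from lightspeed_evaluation.core.models.data import EvaluationData, EvaluationResult


    @dataclass
    class TurnProcessingContext:
        """Per-conversation processing state (field names here are assumptions)."""

        conv_data: EvaluationData
        turn_metrics: list[list[str]] = field(default_factory=list)    # resolved metrics per turn
        conversation_metrics: list[str] = field(default_factory=list)  # resolved conversation-level metrics
        results: list[EvaluationResult] = field(default_factory=list)  # accumulated evaluation results
        conversation_id: Optional[str] = None                          # updated as the API assigns an ID across turns

The dataclass trades a long parameter list for a single object that every helper method can read and mutate, which is how the refactor reduces the number of locals in process_conversation.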

Changes

  • Error handling centralization (src/lightspeed_evaluation/pipeline/evaluation/errors.py):
    Adds a create_error_result helper method to standardize ERROR result construction; refactors mark_all_metrics_as_error, mark_turn_metrics_as_error, and mark_cascade_failure to use the new helper; updates variable names from metric_identifier to metric_id; adds an Optional typing import. (A sketch of this delegation pattern follows this list.)
  • Conversation processor restructuring (src/lightspeed_evaluation/pipeline/evaluation/processor.py):
    Introduces a TurnProcessingContext dataclass to encapsulate per-turn and per-conversation metric state; reworks process_conversation to build the context early, run the setup script with failure handling, delegate to new orchestration methods, and ensure cleanup runs; adds helper methods _build_processing_context, _has_metrics_to_evaluate, _handle_setup_failure, _process_turns_and_conversation, _process_turn_api, _handle_api_error, _evaluate_turn, _evaluate_conversation, _run_setup_script, _run_cleanup_script; integrates error_handler.create_error_result for result construction.
  • Test mock updates (tests/unit/pipeline/evaluation/test_processor.py):
    Updates the mock_error_handler fixture to include a create_error_result side effect that constructs ERROR EvaluationResult entries with conversation ID, turn ID, metric ID, reason, and query fields.
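As referenced above, here is a hedged sketch of the delegation pattern in errors.py. The create_error_result signature matches the diff quoted later in this review; the EvaluationResult keyword names and the mark_turn_metrics_as_error body are illustrative assumptions, not the actual code:

    from typing import Optional

    from lightspeed_evaluation.core.models.data import EvaluationResult


    class EvaluationErrorHandler:
        """Sketch only: just the create_error_result signature comes from this PR."""

        def create_error_result(  # pylint: disable=too-many-arguments
            self,
            conv_id: str,
            metric_id: str,
            reason: str,
            *,
            turn_id: Optional[str] = None,
            query: str = "",
        ) -> EvaluationResult:
            """Create an ERROR EvaluationResult with common defaults."""
            # Keyword names below are assumptions; the summary only says the result
            # carries conversation ID, turn ID, metric ID, reason, and query.
            return EvaluationResult(
                conversation_group_id=conv_id,
                turn_id=turn_id,
                metric_identifier=metric_id,
                result="ERROR",
                reason=reason,
                query=query,
            )

        def mark_turn_metrics_as_error(
            self,
            conv_id: str,
            turn_id: str,
            turn_metrics: list[str],
            reason: str,
            query: str = "",
        ) -> list[EvaluationResult]:
            """Hypothetical delegation: one ERROR result per metric of the turn."""
            return [
                self.create_error_result(conv_id, metric_id, reason, turn_id=turn_id, query=query)
                for metric_id in turn_metrics
            ]

mark_all_metrics_as_error and mark_cascade_failure follow the same pattern, looping over turn-level and conversation-level metrics and delegating each ERROR result to the factory.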

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ConversationProcessor
    participant ErrorHandler
    participant SetupScript
    participant TurnAPI
    participant TurnEvaluator
    participant ConvEvaluator
    participant CleanupScript

    Client->>ConversationProcessor: process_conversation(conv_data)
    ConversationProcessor->>ConversationProcessor: _build_processing_context()
    ConversationProcessor->>ConversationProcessor: _has_metrics_to_evaluate()
    
    alt No metrics to evaluate
        ConversationProcessor-->>Client: return []
    else Metrics exist
        ConversationProcessor->>SetupScript: _run_setup_script()
        
        alt Setup fails
            SetupScript-->>ConversationProcessor: error
            ConversationProcessor->>ErrorHandler: mark all metrics as ERROR
            ErrorHandler-->>ConversationProcessor: error results
            ConversationProcessor->>CleanupScript: _run_cleanup_script()
            CleanupScript-->>ConversationProcessor: done
            ConversationProcessor-->>Client: return error results
        else Setup succeeds
            SetupScript-->>ConversationProcessor: success
            
            ConversationProcessor->>ConversationProcessor: _process_turns_and_conversation()
            
            loop For each turn
                ConversationProcessor->>TurnAPI: _process_turn_api()
                
                alt API call fails
                    TurnAPI-->>ConversationProcessor: error message
                    ConversationProcessor->>ErrorHandler: _handle_api_error(turn)
                    ErrorHandler-->>ConversationProcessor: mark turn + cascade to remaining turns/conv metrics as ERROR
                    ConversationProcessor->>ConversationProcessor: break early from turns loop
                else API call succeeds
                    TurnAPI-->>ConversationProcessor: success
                    ConversationProcessor->>TurnEvaluator: _evaluate_turn()
                    TurnEvaluator-->>ConversationProcessor: turn evaluation results
                end
            end
            
            ConversationProcessor->>ConvEvaluator: _evaluate_conversation()
            ConvEvaluator-->>ConversationProcessor: conversation evaluation results
            
            ConversationProcessor->>CleanupScript: _run_cleanup_script()
            CleanupScript-->>ConversationProcessor: done
            
            ConversationProcessor-->>Client: return all results
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–60 minutes

  • Error cascade logic (_handle_api_error method and early break in turn processing loop): verify correct propagation of errors to remaining turns and conversation-level metrics
  • Setup/cleanup script integration: ensure setup failures correctly trigger metric marking and cleanup execution; validate that cleanup always runs regardless of intermediate failures
  • Context building (_build_processing_context): confirm accurate resolution of per-turn and per-conversation metrics
  • Refactoring of existing methods: validate that new helper method usage (create_error_result) produces identical results to prior inlined construction across mark_all_metrics_as_error, mark_turn_metrics_as_error, and mark_cascade_failure
  • Test mock coverage: ensure mock create_error_result side effect accurately reflects actual implementation behavior for all call sites

Suggested reviewers

  • tisnik
  • VladimirKadlec

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'chore: refactor processor & errors.py' accurately reflects the main changes: refactoring in processor.py and errors.py to eliminate local variables and reduce code duplication through a new helper method.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 94.12%, which is sufficient; the required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

18-43: Consider adding Returns section to docstring.

The method has proper type hints and follows Google-style docstring format for Args, but the Returns section could be more explicit about what fields are populated in the returned EvaluationResult.

Consider this enhancement:

     def create_error_result(  # pylint: disable=too-many-arguments
         self,
         conv_id: str,
         metric_id: str,
         reason: str,
         *,
         turn_id: Optional[str] = None,
         query: str = "",
     ) -> EvaluationResult:
         """Create an ERROR EvaluationResult with common defaults.
 
         Args:
             conv_id: Conversation group ID
             metric_id: Metric identifier
             reason: Error reason
             turn_id: Turn ID (None for conversation-level)
             query: Query text
+
+        Returns:
+            EvaluationResult with result="ERROR" and other fields set to defaults
         """
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aacebdf and d8d4dec.

📒 Files selected for processing (3)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (6 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (5 hunks)
  • tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/pipeline/evaluation/test_processor.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/pipeline/evaluation/test_processor.py
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use structured logging with appropriate levels in lightspeed_evaluation modules

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧬 Code graph analysis (3)
tests/unit/pipeline/evaluation/test_processor.py (2)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (378-446)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • create_error_result (18-43)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationResult (378-446)
  • EvaluationData (317-375)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (317-375)
  • EvaluationResult (378-446)
  • TurnData (35-314)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (4)
  • mark_all_metrics_as_error (45-92)
  • mark_turn_metrics_as_error (94-131)
  • mark_cascade_failure (133-184)
  • create_error_result (18-43)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_single_turn (20-78)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (72-106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: mypy
🔇 Additional comments (14)
tests/unit/pipeline/evaluation/test_processor.py (1)

535-550: LGTM! Test fixture properly mocks the new error result factory.

The create_error_result_side_effect function correctly mirrors the signature and behavior of the new create_error_result method in EvaluationErrorHandler, enabling proper testing of error result construction.
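For readers unfamiliar with the fixture, a hedged sketch of what such a side effect might look like using pytest-mock (the path-based guideline above asks for the mocker fixture); the EvaluationResult keyword names are assumptions, and the real fixture in test_processor.py may differ:

    import pytest

    from lightspeed_evaluation.core.models.data import EvaluationResult


    @pytest.fixture
    def mock_error_handler(mocker):
        """Hypothetical sketch of the fixture, not the actual test code."""
        handler = mocker.MagicMock()

        def create_error_result_side_effect(conv_id, metric_id, reason, *, turn_id=None, query=""):
            # Keyword names are assumptions; the review only lists conversation ID,
            # turn ID, metric ID, reason, and query as the populated fields.
            return EvaluationResult(
                conversation_group_id=conv_id,
                turn_id=turn_id,
                metric_identifier=metric_id,
                result="ERROR",
                reason=reason,
                query=query,
            )

        handler.create_error_result.side_effect = create_error_result_side_effect
        return handler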

src/lightspeed_evaluation/pipeline/evaluation/errors.py (3)

68-89: LGTM! Effective refactoring using the centralized error factory.

The refactoring successfully eliminates code duplication by delegating ERROR result construction to create_error_result, improving maintainability and consistency.


120-129: LGTM! Concise list comprehension using the error factory.

The refactoring uses a clean list comprehension with create_error_result, reducing code duplication while maintaining clarity.


159-181: LGTM! Consistent refactoring pattern applied to cascade failure.

The refactoring consistently applies the create_error_result factory pattern to cascade failure scenarios, maintaining the same benefits of reduced duplication and improved maintainability.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (10)

37-44: LGTM! Dataclass effectively reduces parameter passing.

The TurnProcessingContext dataclass successfully encapsulates per-conversation processing state, addressing the PR objective of reducing local variables. The mutable conversation_id field appropriately tracks API state across turns.


56-88: LGTM! Well-structured refactoring improves readability.

The refactored process_conversation method effectively addresses the PR objective by extracting helper methods and reducing local variables. The try-finally block ensures cleanup runs regardless of outcomes.
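A rough skeleton of the flow the reviewer describes, using the helper names listed in the walkthrough; the bodies, return types, and the boolean setup-script convention are assumptions:

    def process_conversation(self, conv_data: EvaluationData) -> list[EvaluationResult]:
        """Sketch of the orchestration; only the helper names come from this PR."""
        context = self._build_processing_context(conv_data)
        if not self._has_metrics_to_evaluate(context):
            return []  # nothing to evaluate, so setup/cleanup are skipped entirely

        try:
            if not self._run_setup_script(conv_data):       # assumed boolean return
                return self._handle_setup_failure(context)  # marks every metric as ERROR
            self._process_turns_and_conversation(context)   # turn loop + conversation metrics
            return context.results
        finally:
            self._run_cleanup_script(conv_data)              # runs on every exit path past the early return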


89-106: LGTM! Clean extraction of context building logic.

The method effectively encapsulates metric resolution logic, contributing to the PR objective of reducing local variables in the main processing method.


108-112: LGTM! Clear helper method for metrics check.

The method provides a readable abstraction for checking whether any metrics need evaluation.


114-127: LGTM! Proper setup failure handling with cleanup.

The method correctly marks all metrics as ERROR and ensures cleanup runs even after setup failure, maintaining proper resource cleanup guarantees.


129-167: LGTM! Well-orchestrated turn and conversation processing.

The method provides clear orchestration of the processing flow, with appropriate early returns for API errors and proper cascade failure handling.


169-182: LGTM! Clean extraction of API processing logic.

The method appropriately encapsulates API call handling and maintains conversation state through the context object.


184-221: LGTM! Proper API error handling with cascade.

The method correctly implements cascade failure logic, marking the failed turn and all subsequent turns and conversation metrics as ERROR with appropriate error messages.
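A conceptual sketch of that cascade, inlined with create_error_result for clarity; the actual implementation delegates to the error handler's mark_* helpers, and the attribute and parameter names here are assumptions:

    def _handle_api_error(
        self, context: TurnProcessingContext, failed_idx: int, error_msg: str
    ) -> None:
        """Illustrative only: mark the failed turn, later turns, and conversation metrics as ERROR."""
        reason = f"API error: {error_msg}"
        conv_id = context.conv_data.conversation_group_id  # assumed attribute name
        for idx, turn in enumerate(context.conv_data.turns):
            if idx < failed_idx:
                continue  # earlier turns already produced real results
            for metric_id in context.turn_metrics[idx]:
                context.results.append(
                    self.error_handler.create_error_result(
                        conv_id, metric_id, reason, turn_id=turn.turn_id, query=turn.query
                    )
                )
        # Conversation-level metrics can no longer be evaluated either.
        for metric_id in context.conversation_metrics:
            context.results.append(
                self.error_handler.create_error_result(conv_id, metric_id, reason)
            )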


237-244: LGTM! Consistent use of centralized error factory.

The refactoring correctly uses error_handler.create_error_result for invalid turn metrics, aligning with the centralized error result construction pattern.


268-271: LGTM! Consistent use of centralized error factory.

The refactoring correctly uses error_handler.create_error_result for invalid conversation metrics, maintaining consistency with the error handling pattern.


@VladimirKadlec VladimirKadlec left a comment


LGTM 👍

@asamal4 asamal4 merged commit 5fc3f0c into lightspeed-core:main Dec 16, 2025
15 checks passed
