
LCORE-748: Added unit test coverage for the evaluation framework #95

Merged
tisnik merged 1 commit into lightspeed-core:main from bsatapat-jpg:dev
Nov 11, 2025

Conversation

@bsatapat-jpg (Collaborator) commented Nov 8, 2025

Issue Link: https://issues.redhat.com/browse/LCORE-748

Summary by CodeRabbit

  • Tests
    • Removed centralized pytest configuration and global fixtures.
    • Updated test path resolution and import setup.
    • Removed an obsolete trivial test.
    • Added extensive new unit test suites covering API client, metrics management, script execution, system configuration loading/validation, data validation, evaluation pipeline logic, error handling, and amendment flows.
    • Introduced test package initializers to reorganize test structure.

@coderabbitai bot (Contributor) commented Nov 8, 2025

Warning

Rate limit exceeded

@bsatapat-jpg has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 34 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7791b and b3f2e77.

📒 Files selected for processing (16)
  • tests/conftest.py (0 hunks)
  • tests/script/test_compare_evaluations.py (2 hunks)
  • tests/test_evaluation.py (0 hunks)
  • tests/unit/core/api/__init__.py (1 hunks)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/metrics/test_manager.py (1 hunks)
  • tests/unit/core/script/__init__.py (1 hunks)
  • tests/unit/core/script/test_manager.py (1 hunks)
  • tests/unit/core/system/__init__.py (1 hunks)
  • tests/unit/core/system/test_loader.py (1 hunks)
  • tests/unit/core/system/test_validator.py (1 hunks)
  • tests/unit/pipeline/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_amender.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)

Walkthrough

Removed the centralized pytest fixture module and a trivial test, adjusted a script test import path, and added many unit test modules and package initializers covering core API, metrics, scripting, system config/validation, and pipeline evaluation components.

Changes

  • Removed fixtures & trivial test (tests/conftest.py, tests/test_evaluation.py): Deleted global pytest configuration, fixtures, mocks, temp-file/env handling, custom markers and collection hooks, and a trivial test.
  • Updated script test import (tests/script/test_compare_evaluations.py): Adjusted sys.path and module resolution to import script.compare_evaluations from the project root.
  • Test package initializers (tests/unit/core/api/__init__.py, tests/unit/core/script/__init__.py, tests/unit/core/system/__init__.py, tests/unit/pipeline/__init__.py, tests/unit/pipeline/evaluation/__init__.py): Added __init__.py files with module docstrings to organize tests into packages.
  • Core API tests (tests/unit/core/api/test_client.py): Added comprehensive unit tests for APIClient (init, queries, conversation handling, attachments, streaming, HTTP/timeouts/errors, tool-call formatting).
  • Core metrics tests (tests/unit/core/metrics/test_manager.py): Added tests for MetricManager: resolution with defaults/None/empty lists, threshold precedence, counting, and edge cases.
  • Core script tests (tests/unit/core/script/test_manager.py): Added tests for ScriptExecutionManager: execution success/failure, path/permission checks, stdout/stderr, relative paths, timeouts, and ScriptExecutionError cases.
  • Core system tests (tests/unit/core/system/test_loader.py, tests/unit/core/system/test_validator.py): Added tests for ConfigLoader (YAML parsing, defaults, metrics population) and DataValidator (pydantic error formatting, metric validation, file loading, accumulated errors).
  • Pipeline evaluation tests (tests/unit/pipeline/evaluation/test_amender.py, tests/unit/pipeline/evaluation/test_errors.py, tests/unit/pipeline/evaluation/test_evaluator.py): Added tests for APIDataAmender (amend flows, tool_calls, attachments, error handling), EvaluationErrorHandler (marking errors, summaries), and MetricsEvaluator (initialization, metric evaluation, status logic, exception handling).

Sequence Diagram(s)

sequenceDiagram
    participant Pytest as Pytest (collector)
    participant Tests as Test files
    participant Conftest as removed `tests/conftest.py`
    participant LocalFixtures as local per-test fixtures

    note right of Pytest: Previous flow (with conftest)
    Pytest->>Conftest: load session fixtures / hooks
    Conftest-->>Pytest: provide fixtures, env setup, markers

    note right of Pytest: New flow (conftest removed)
    Pytest->>Tests: collect test modules
    Tests->>LocalFixtures: use in-file or per-package fixtures / mocks
    LocalFixtures-->>Tests: provide isolated setup/teardown

    rect rgba(0,128,128,0.08)
      note over Tests,LocalFixtures: Tests now rely on local fixtures and explicit setup\ninstead of centralized conftest-provided state
    end
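To make the new flow concrete, here is a minimal sketch of a local, per-module fixture doing the work a shared conftest.py fixture used to do; the fixture name and values below are hypothetical, not taken from this repository:

import pytest


@pytest.fixture
def eval_settings():
    """Local stand-in for state a shared conftest.py fixture used to provide."""
    # Hypothetical values, for illustration only
    return {"judge_model": "fake-model", "turn_threshold": 0.8}


def test_turn_threshold_default(eval_settings):
    assert eval_settings["turn_threshold"] == 0.8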

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing extra attention:
    • Removal of tests/conftest.py — ensure no tests implicitly depended on globally provided fixtures/markers.
    • tests/unit/core/system/test_loader.py — many YAML parsing/edge-case scenarios and global metric mapping interactions.
    • tests/unit/pipeline/evaluation/test_evaluator.py — dense evaluator logic with many mocked metrics and error paths.
    • tests/unit/core/api/test_client.py — streaming/HTTP mocking and parsing of streaming responses.

Possibly related PRs

  • PR #84: Overlaps tests/conftest.py changes; directly related to test fixture refactor/removal.
  • PR #62: Related to ScriptExecutionManager and script-related tests introduced here.
  • PR #55: Related to metric handling and MetricManager behaviors exercised by new metric tests.

Suggested reviewers

  • VladimirKadlec
  • tisnik
  • asamal4

Poem

🐰
I nudged the conftest from its nest,
New tests sprung up — a busy quest,
API, scripts, metrics in a row,
I hop and sniff where failures go,
Carrots for green CI success! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: adding unit test coverage for the evaluation framework. It is specific, clear, and directly reflects the changeset's primary objective.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 99.07%, which is sufficient; the required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
tests/script/test_compare_evaluations.py (1)

181-183: Consider cleaning up sys.path modification after the fixture.

The fixture modifies sys.path without cleanup. While pytest's test isolation typically prevents issues, explicitly removing the added path after each test would improve test hygiene and prevent potential sys.path pollution.

Consider refactoring to use a yield fixture with cleanup:

     @pytest.fixture
     def comparison_instance(self):
         """Create an EvaluationComparison instance for testing."""
-        # Import here to avoid module loading issues
         import sys
 
-        # Add project root to path (tests/script/ -> tests/ -> project_root/)
-        sys.path.append(str(Path(__file__).parent.parent.parent))
+        project_root = str(Path(__file__).parent.parent.parent)
+        sys.path.insert(0, project_root)
+        
         from script.compare_evaluations import EvaluationComparison
-
-        return EvaluationComparison(alpha=0.05)
+        
+        yield EvaluationComparison(alpha=0.05)
+        
+        # Cleanup
+        if project_root in sys.path:
+            sys.path.remove(project_root)
tests/unit/core/script/test_manager.py (1)

154-172: Move the os import to the top of the file.

The os module is imported inline at line 164, but it's better to import it at the module level alongside other imports for consistency and clarity.

Apply this diff:

 """Unit tests for core script manager module."""

 import tempfile
+import os
 from pathlib import Path

 import pytest

And remove the inline import:

             # Use relative path
-            import os
             original_cwd = os.getcwd()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64f8892 and 156a541.

📒 Files selected for processing (16)
  • tests/conftest.py (0 hunks)
  • tests/script/test_compare_evaluations.py (2 hunks)
  • tests/test_evaluation.py (0 hunks)
  • tests/unit/core/api/__init__.py (1 hunks)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/metrics/test_manager.py (1 hunks)
  • tests/unit/core/script/__init__.py (1 hunks)
  • tests/unit/core/script/test_manager.py (1 hunks)
  • tests/unit/core/system/__init__.py (1 hunks)
  • tests/unit/core/system/test_loader.py (1 hunks)
  • tests/unit/core/system/test_validator.py (1 hunks)
  • tests/unit/pipeline/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_amender.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/conftest.py
  • tests/test_evaluation.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**

📄 CodeRabbit inference engine (AGENTS.md)

Place unit tests under tests/unit/ mirroring the source structure

Files:

  • tests/unit/core/script/__init__.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/pipeline/evaluation/__init__.py
  • tests/unit/pipeline/__init__.py
  • tests/unit/core/system/__init__.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/api/__init__.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/unit/core/script/__init__.py
  • tests/script/test_compare_evaluations.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/pipeline/evaluation/__init__.py
  • tests/unit/pipeline/__init__.py
  • tests/unit/core/system/__init__.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/api/__init__.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/script/test_compare_evaluations.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/api/test_client.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • tests/script/test_compare_evaluations.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/core/metrics/test_manager.py
🧬 Code graph analysis (8)
tests/unit/core/script/test_manager.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
tests/unit/pipeline/evaluation/test_evaluator.py (5)
src/lightspeed_evaluation/core/models/data.py (5)
  • EvaluationData (264-311)
  • EvaluationRequest (370-423)
  • TurnData (35-261)
  • for_turn (398-412)
  • for_conversation (415-423)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (70-125)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
  • MetricsEvaluator (25-167)
  • evaluate_metric (60-140)
  • _determine_status (159-163)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricManager (17-136)
  • MetricLevel (10-14)
  • get_effective_threshold (50-80)
tests/unit/core/api/test_client.py (4)
src/lightspeed_evaluation/core/models/system.py (1)
  • APIConfig (117-156)
src/lightspeed_evaluation/core/models/api.py (2)
  • APIRequest (30-77)
  • APIResponse (80-116)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/core/api/client.py (3)
  • query (71-105)
  • _handle_response_errors (198-206)
  • _extract_error_message (208-229)
tests/unit/pipeline/evaluation/test_errors.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (3)
  • EvaluationErrorHandler (10-85)
  • mark_all_metrics_as_error (17-77)
  • get_error_summary (79-85)
tests/unit/core/system/test_validator.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/core/system/validator.py (4)
  • DataValidator (75-311)
  • format_pydantic_error (65-72)
  • validate_evaluation_data (139-157)
  • load_evaluation_data (85-137)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
  • APIResponse (80-116)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)
  • APIDataAmender (13-78)
  • amend_conversation_data (20-67)
  • get_amendment_summary (69-78)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
tests/unit/core/system/test_loader.py (3)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/core/system/loader.py (3)
  • ConfigLoader (70-125)
  • populate_metric_mappings (29-48)
  • load_system_config (79-108)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (5)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • get_effective_threshold (50-80)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
🪛 GitHub Actions: Ruff
tests/unit/pipeline/evaluation/test_evaluator.py

[error] 37-37: F401: 'lightspeed_evaluation.core.metrics.manager.MetricLevel' imported but unused. Remove unused import or run 'ruff --fix' to fix.

tests/unit/core/api/test_client.py

[error] 5-5: F401: 'lightspeed_evaluation.core.models.APIRequest' imported but unused. Remove unused import or run 'ruff --fix' to fix.

tests/unit/pipeline/evaluation/test_errors.py

[error] 3-3: F401: 'pytest' imported but unused. Remove unused import or run 'ruff --fix' to fix.

tests/unit/pipeline/evaluation/test_amender.py

[error] 3-3: F401: 'pytest' imported but unused. Remove unused import or run 'ruff --fix' to fix.

tests/unit/core/system/test_loader.py

[error] 8-8: F401: 'lightspeed_evaluation.core.system.exceptions.DataValidationError' imported but unused. Remove unused import or run 'ruff --fix' to fix.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (12)
tests/script/test_compare_evaluations.py (1)

16-17: LGTM!

The path calculation correctly navigates from tests/script/ to the project root and then locates the target script. The comment clearly documents the path relationship.

tests/unit/core/system/test_validator.py (1)

1-422: LGTM! Comprehensive test coverage for the validator module.

The test suite provides excellent coverage:

  • Validation of Pydantic errors
  • Valid and invalid data scenarios
  • Metric availability checks
  • API-enabled behavior
  • File loading edge cases
  • Error accumulation

All tests follow pytest conventions and use the mocker fixture correctly.

tests/unit/core/script/test_manager.py (1)

1-195: Excellent test coverage for script execution scenarios.

The tests comprehensively cover success, failure, error handling, and edge cases including timeout behavior.

tests/unit/core/metrics/test_manager.py (1)

1-329: LGTM! Comprehensive test coverage for MetricManager.

The test suite thoroughly validates:

  • Metric resolution with defaults
  • Threshold hierarchy (level-specific vs system defaults)
  • Metric counting across various scenarios
  • Edge cases with empty metadata

All tests follow pytest best practices.
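As an illustration of the threshold hierarchy these tests exercise, here is a minimal sketch; effective_threshold below is a hypothetical helper, not MetricManager's actual API:

from typing import Optional


def effective_threshold(
    metric_threshold: Optional[float],
    level_default: Optional[float],
    system_default: float,
) -> float:
    """Most specific value wins: per-metric, then per-level, then system-wide default."""
    if metric_threshold is not None:
        return metric_threshold
    if level_default is not None:
        return level_default
    return system_default


assert effective_threshold(0.5, 0.9, 0.7) == 0.5    # explicit metric threshold wins
assert effective_threshold(None, 0.9, 0.7) == 0.9   # falls back to the level default
assert effective_threshold(None, None, 0.7) == 0.7  # falls back to the system default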

tests/unit/core/api/test_client.py (1)

10-281: Excellent test coverage for APIClient.

The test suite comprehensively covers:

  • Configuration validation
  • Standard and streaming query paths
  • Error handling (HTTP, timeout, validation)
  • Tool call formatting
  • Internal error extraction

All tests use pytest-mock correctly.
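For readers unfamiliar with the pattern, here is a minimal sketch of mocking an HTTP call with pytest-mock's mocker; WeatherClient below is a hypothetical example, not the repository's APIClient:

import requests


class WeatherClient:
    """Hypothetical client used only to illustrate the mocking pattern."""

    def fetch(self, city: str) -> dict:
        response = requests.get(f"https://api.example.invalid/weather/{city}", timeout=5)
        response.raise_for_status()
        return response.json()


class TestWeatherClient:
    def test_fetch_returns_parsed_json(self, mocker):
        # Patch requests.get via pytest-mock's mocker fixture (not unittest.mock)
        fake_response = mocker.Mock()
        fake_response.json.return_value = {"temp_c": 21}
        fake_response.raise_for_status.return_value = None
        mocker.patch("requests.get", return_value=fake_response)

        assert WeatherClient().fetch("Brno") == {"temp_c": 21}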

tests/unit/core/system/test_loader.py (1)

18-372: LGTM! Comprehensive test coverage for system loader.

The test suite thoroughly validates:

  • Metric mapping population and clearing
  • File loading error handling
  • YAML validation
  • Configuration sections and defaults
  • Metrics metadata handling

All tests follow pytest best practices with proper cleanup.

tests/unit/pipeline/__init__.py (1)

1-2: LGTM!

This package initializer correctly establishes the namespace for pipeline unit tests with an appropriate docstring.

tests/unit/pipeline/evaluation/__init__.py (1)

1-2: LGTM!

This package initializer correctly establishes the namespace for pipeline evaluation unit tests with an appropriate docstring.

tests/unit/pipeline/evaluation/test_errors.py (1)

9-197: Comprehensive test coverage for EvaluationErrorHandler.

The test suite thoroughly covers error handling scenarios including turn-level metrics, conversation-level metrics, mixed scenarios, empty metrics, error summaries, and accumulation across multiple conversations. All tests follow pytest conventions and properly construct test data.

tests/unit/pipeline/evaluation/test_amender.py (1)

10-309: Excellent test coverage for APIDataAmender.

The test suite comprehensively covers the amendment flow including scenarios with/without API clients, single and multi-turn conversations, tool calls, attachments, error handling at different stages, empty contexts, and amendment summaries. The tests properly use pytest-mock's mocker fixture and validate both success and failure paths.

tests/unit/pipeline/evaluation/test_evaluator.py (2)

15-60: Well-structured test fixtures.

The fixtures properly mock ConfigLoader, MetricManager, and ScriptExecutionManager with appropriate configurations and behaviors for testing the evaluator.


62-420: Comprehensive test coverage for MetricsEvaluator.

The test suite thoroughly exercises the evaluator's functionality including initialization validation, turn-level and conversation-level metric evaluation, pass/fail scenarios, unsupported framework handling, None score handling, exception handling, script metric gating when API is disabled, and status determination logic with various threshold values. All tests properly use mocking to isolate the component under test.
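To illustrate the kind of status rule these tests cover, a hedged sketch follows; determine_status is hypothetical, and the only repository-confirmed detail is that some metrics fall back to a simple score > 0 check when no threshold is configured:

from typing import Optional


def determine_status(score: Optional[float], threshold: Optional[float]) -> str:
    """Hypothetical rule: a missing score is an error; otherwise compare to the threshold,
    falling back to score > 0 when no threshold is configured."""
    if score is None:
        return "ERROR"
    if threshold is None:
        return "PASS" if score > 0 else "FAIL"
    return "PASS" if score >= threshold else "FAIL"


assert determine_status(None, 0.8) == "ERROR"
assert determine_status(0.85, 0.8) == "PASS"
assert determine_status(0.2, None) == "PASS"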


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/unit/pipeline/evaluation/test_amender.py (1)

3-3: Remove unused import to fix pipeline failure.

The pytest import is unused and causes a pipeline failure. Since this was already flagged in a previous review, please remove it.

Apply this diff:

-import pytest
-
 from lightspeed_evaluation.core.models import APIResponse, EvaluationData, TurnData
tests/unit/core/system/test_loader.py (1)

8-8: Remove unused import to fix pipeline failure.

The DataValidationError import is unused and causes a pipeline failure. Since this was already flagged in a previous review, please remove it.

Apply this diff:

 import pytest
 
-from lightspeed_evaluation.core.system.exceptions import DataValidationError
 from lightspeed_evaluation.core.system.loader import (
     ConfigLoader,
     populate_metric_mappings,
🧹 Nitpick comments (2)
tests/unit/core/script/test_manager.py (2)

152-152: Move import to module level.

The os module should be imported at the top of the file rather than inside the function for better readability and convention compliance.

Apply this diff to move the import to the top of the file with other imports:

 import tempfile
 from pathlib import Path
+import os
 
 import pytest

And remove the import from inside the function:

             # Use relative path
-            import os
-
             original_cwd = os.getcwd()

15-181: Consider using pytest fixtures for test script management.

Multiple tests repeat the pattern of creating temporary scripts, making them executable, and cleaning them up. Pytest fixtures could reduce this duplication and improve test maintainability.

Example fixture:

@pytest.fixture
def executable_script():
    """Create temporary scripts on demand and remove them after the test."""
    created: list[Path] = []

    def _make_script(content: str, executable: bool = True) -> Path:
        # Write the script body to a temporary file
        with tempfile.NamedTemporaryFile(mode="w", suffix=".sh", delete=False) as f:
            f.write(content)
        script_path = Path(f.name)
        script_path.chmod(0o755 if executable else 0o644)
        created.append(script_path)
        return script_path

    # Yield the factory itself so tests get a Path back from each call
    yield _make_script

    # Cleanup after the test
    for script_path in created:
        script_path.unlink(missing_ok=True)

This would simplify tests like test_run_script_success to:

def test_run_script_success(self, executable_script):
    script_path = executable_script("#!/bin/bash\nexit 0\n")
    manager = ScriptExecutionManager()
    assert manager.run_script(script_path) is True
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 156a541 and 47de9b4.

📒 Files selected for processing (16)
  • tests/conftest.py (0 hunks)
  • tests/script/test_compare_evaluations.py (2 hunks)
  • tests/test_evaluation.py (0 hunks)
  • tests/unit/core/api/__init__.py (1 hunks)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/metrics/test_manager.py (1 hunks)
  • tests/unit/core/script/__init__.py (1 hunks)
  • tests/unit/core/script/test_manager.py (1 hunks)
  • tests/unit/core/system/__init__.py (1 hunks)
  • tests/unit/core/system/test_loader.py (1 hunks)
  • tests/unit/core/system/test_validator.py (1 hunks)
  • tests/unit/pipeline/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_amender.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/test_evaluation.py
  • tests/conftest.py
✅ Files skipped from review due to trivial changes (1)
  • tests/unit/pipeline/__init__.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/unit/core/system/test_validator.py
  • tests/script/test_compare_evaluations.py
  • tests/unit/core/api/test_client.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/pipeline/evaluation/__init__.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**

📄 CodeRabbit inference engine (AGENTS.md)

Place unit tests under tests/unit/ mirroring the source structure

Files:

  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/core/script/__init__.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/system/__init__.py
  • tests/unit/core/api/__init__.py
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/core/system/test_loader.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/core/script/__init__.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/system/__init__.py
  • tests/unit/core/api/__init__.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest

Applied to files:

  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/system/test_loader.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/system/test_loader.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/core/metrics/test_manager.py
🧬 Code graph analysis (4)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
  • APIResponse (80-116)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)
  • APIDataAmender (13-78)
  • amend_conversation_data (20-67)
  • get_amendment_summary (69-78)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (5)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • get_effective_threshold (50-80)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
tests/unit/core/script/test_manager.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
tests/unit/core/system/test_loader.py (2)
src/lightspeed_evaluation/core/system/loader.py (3)
  • ConfigLoader (70-125)
  • populate_metric_mappings (29-48)
  • load_system_config (79-108)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
🪛 GitHub Actions: Ruff
tests/unit/pipeline/evaluation/test_amender.py

[error] 3-3: Ruff: F401 imported but unused: pytest

tests/unit/core/system/test_loader.py

[error] 8-8: Ruff: F401 imported but unused: lightspeed_evaluation.core.system.exceptions.DataValidationError

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
🔇 Additional comments (6)
tests/unit/core/api/__init__.py (1)

1-1: LGTM!

The package initializer with a clear docstring follows Python best practices for test organization.

tests/unit/core/metrics/test_manager.py (1)

1-318: Excellent comprehensive test coverage!

This test suite thoroughly validates MetricManager behavior across all key scenarios:

  • Metric resolution with defaults, explicit lists, and empty lists
  • Threshold hierarchy with system defaults and level-specific overrides
  • Metric counting across multiple edge cases
  • Proper handling of missing metadata and unknown metrics

The tests are well-structured, follow pytest conventions, and provide confidence in the MetricManager implementation.

tests/unit/pipeline/evaluation/test_amender.py (1)

10-293: Comprehensive test coverage for APIDataAmender!

Excellent test suite covering:

  • Amendment with/without API client
  • Single and multi-turn conversations with conversation_id tracking
  • Tool calls and attachments handling
  • Error scenarios at different turn positions
  • Empty contexts handling
  • Amendment summaries with various data states

The tests properly use pytest-mock and validate all key behaviors of the amender.

tests/unit/core/system/__init__.py (1)

1-1: LGTM!

Standard package initializer for the core system unit tests.

tests/unit/core/system/test_loader.py (1)

18-373: Excellent comprehensive test coverage!

Outstanding test suite covering:

  • populate_metric_mappings: turn-level, conversation-level, and clearing behavior
  • ConfigLoader: file errors, YAML errors, empty files, invalid formats
  • Valid configurations: minimal, complete, with defaults, partial metadata
  • Proper verification of metric population into global sets
  • Appropriate use of tempfile for isolated testing

The tests thoroughly validate the loader's robustness and configuration handling.

tests/unit/core/script/__init__.py (1)

1-1: LGTM!

Standard package initializer with appropriate docstring.

@bsatapat-jpg force-pushed the dev branch 2 times, most recently from 0f7791b to 5cf16c0 on November 8, 2025, 15:18

@asamal4 asamal4 left a comment


Thank you!! Now we have a baseline.
LGTM


@tisnik tisnik left a comment


LGTM

