
LCORE-748: Added unit test coverage for the evaluation framework #95

Merged
tisnik merged 1 commit into lightspeed-core:main from bsatapat-jpg:dev
Nov 11, 2025

Conversation

@bsatapat-jpg (Collaborator) commented Nov 8, 2025

Issue Link: https://issues.redhat.com/browse/LCORE-748

Summary by CodeRabbit

  • Tests
    • Removed centralized pytest configuration and global fixtures.
    • Updated test path resolution and import setup.
    • Removed an obsolete trivial test.
    • Added extensive new unit test suites covering API client, metrics management, script execution, system configuration loading/validation, data validation, evaluation pipeline logic, error handling, and amendment flows.
    • Introduced test package initializers to reorganize test structure.

@coderabbitai bot (Contributor) commented Nov 8, 2025

Warning

Rate limit exceeded

@bsatapat-jpg has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 9 minutes and 34 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7791b and b3f2e77.

📒 Files selected for processing (16)
  • tests/conftest.py (0 hunks)
  • tests/script/test_compare_evaluations.py (2 hunks)
  • tests/test_evaluation.py (0 hunks)
  • tests/unit/core/api/__init__.py (1 hunks)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/metrics/test_manager.py (1 hunks)
  • tests/unit/core/script/__init__.py (1 hunks)
  • tests/unit/core/script/test_manager.py (1 hunks)
  • tests/unit/core/system/__init__.py (1 hunks)
  • tests/unit/core/system/test_loader.py (1 hunks)
  • tests/unit/core/system/test_validator.py (1 hunks)
  • tests/unit/pipeline/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_amender.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)

Walkthrough

Removed the centralized pytest fixture module and a trivial test, adjusted a script test import path, and added many unit test modules and package initializers covering core API, metrics, scripting, system config/validation, and pipeline evaluation components.

Changes

  • Removed fixtures & trivial test (tests/conftest.py, tests/test_evaluation.py): Deleted global pytest configuration, fixtures, mocks, temp-file/env handling, custom markers and collection hooks, and a trivial test.
  • Updated script test import (tests/script/test_compare_evaluations.py): Adjusted sys.path and module resolution to import script.compare_evaluations from the project root.
  • Test package initializers (tests/unit/core/api/__init__.py, tests/unit/core/script/__init__.py, tests/unit/core/system/__init__.py, tests/unit/pipeline/__init__.py, tests/unit/pipeline/evaluation/__init__.py): Added __init__.py files with module docstrings to organize tests into packages.
  • Core API tests (tests/unit/core/api/test_client.py): Added comprehensive unit tests for APIClient (init, queries, conversation handling, attachments, streaming, HTTP/timeouts/errors, tool-call formatting).
  • Core metrics tests (tests/unit/core/metrics/test_manager.py): Added tests for MetricManager: resolution with defaults/None/empty lists, threshold precedence, counting, and edge cases.
  • Core script tests (tests/unit/core/script/test_manager.py): Added tests for ScriptExecutionManager: execution success/failure, path/permission checks, stdout/stderr, relative paths, timeouts, and ScriptExecutionError cases.
  • Core system tests (tests/unit/core/system/test_loader.py, tests/unit/core/system/test_validator.py): Added tests for ConfigLoader (YAML parsing, defaults, metrics population) and DataValidator (pydantic error formatting, metric validation, file loading, accumulated errors).
  • Pipeline evaluation tests (tests/unit/pipeline/evaluation/test_amender.py, tests/unit/pipeline/evaluation/test_errors.py, tests/unit/pipeline/evaluation/test_evaluator.py): Added tests for APIDataAmender (amend flows, tool_calls, attachments, error handling), EvaluationErrorHandler (marking errors, summaries), and MetricsEvaluator (initialization, metric evaluation, status logic, exception handling).

Sequence Diagram(s)

sequenceDiagram
    participant Pytest as Pytest (collector)
    participant Tests as Test files
    participant Conftest as removed `tests/conftest.py`
    participant LocalFixtures as local per-test fixtures

    note right of Pytest: Previous flow (with conftest)
    Pytest->>Conftest: load session fixtures / hooks
    Conftest-->>Pytest: provide fixtures, env setup, markers

    note right of Pytest: New flow (conftest removed)
    Pytest->>Tests: collect test modules
    Tests->>LocalFixtures: use in-file or per-package fixtures / mocks
    LocalFixtures-->>Tests: provide isolated setup/teardown

    rect rgba(0,128,128,0.08)
      note over Tests,LocalFixtures: Tests now rely on local fixtures and explicit setup\ninstead of centralized conftest-provided state
    end
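To make the new flow concrete, here is a minimal sketch of a local, per-module fixture doing the work a shared conftest.py fixture used to do; the fixture name and values below are hypothetical, not taken from this repository:

import pytest


@pytest.fixture
def eval_settings():
    """Local stand-in for state a shared conftest.py fixture used to provide."""
    # Hypothetical values, for illustration only
    return {"judge_model": "fake-model", "turn_threshold": 0.8}


def test_turn_threshold_default(eval_settings):
    assert eval_settings["turn_threshold"] == 0.8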

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing extra attention:
    • Removal of tests/conftest.py — ensure no tests implicitly depended on globally provided fixtures/markers.
    • tests/unit/core/system/test_loader.py — many YAML parsing/edge-case scenarios and global metric mapping interactions.
    • tests/unit/pipeline/evaluation/test_evaluator.py — dense evaluator logic with many mocked metrics and error paths.
    • tests/unit/core/api/test_client.py — streaming/HTTP mocking and parsing of streaming responses.

Possibly related PRs

  • PR #84: Overlaps tests/conftest.py changes; directly related to test fixture refactor/removal.
  • PR #62: Related to ScriptExecutionManager and script-related tests introduced here.
  • PR #55: Related to metric handling and MetricManager behaviors exercised by new metric tests.

Suggested reviewers

  • VladimirKadlec
  • tisnik
  • asamal4

Poem

🐰
I nudged the conftest from its nest,
New tests sprung up — a busy quest,
API, scripts, metrics in a row,
I hop and sniff where failures go,
Carrots for green CI success! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: adding unit test coverage for the evaluation framework. It is specific, clear, and directly reflects the changeset's primary objective.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 99.07%, which is sufficient; the required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
tests/script/test_compare_evaluations.py (1)

181-183: Consider cleaning up sys.path modification after the fixture.

The fixture modifies sys.path without cleanup. While pytest's test isolation typically prevents issues, explicitly removing the added path after each test would improve test hygiene and prevent potential sys.path pollution.

Consider refactoring to use a yield fixture with cleanup:

     @pytest.fixture
     def comparison_instance(self):
         """Create an EvaluationComparison instance for testing."""
-        # Import here to avoid module loading issues
         import sys
 
-        # Add project root to path (tests/script/ -> tests/ -> project_root/)
-        sys.path.append(str(Path(__file__).parent.parent.parent))
+        project_root = str(Path(__file__).parent.parent.parent)
+        sys.path.insert(0, project_root)
+        
         from script.compare_evaluations import EvaluationComparison
-
-        return EvaluationComparison(alpha=0.05)
+        
+        yield EvaluationComparison(alpha=0.05)
+        
+        # Cleanup
+        if project_root in sys.path:
+            sys.path.remove(project_root)
tests/unit/core/script/test_manager.py (1)

154-172: Move the os import to the top of the file.

The os module is imported inline at line 164, but it's better to import it at the module level alongside other imports for consistency and clarity.

Apply this diff:

 """Unit tests for core script manager module."""

 import tempfile
+import os
 from pathlib import Path

 import pytest

And remove the inline import:

             # Use relative path
-            import os
             original_cwd = os.getcwd()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64f8892 and 156a541.

📒 Files selected for processing (16)
  • tests/conftest.py (0 hunks)
  • tests/script/test_compare_evaluations.py (2 hunks)
  • tests/test_evaluation.py (0 hunks)
  • tests/unit/core/api/__init__.py (1 hunks)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/metrics/test_manager.py (1 hunks)
  • tests/unit/core/script/__init__.py (1 hunks)
  • tests/unit/core/script/test_manager.py (1 hunks)
  • tests/unit/core/system/__init__.py (1 hunks)
  • tests/unit/core/system/test_loader.py (1 hunks)
  • tests/unit/core/system/test_validator.py (1 hunks)
  • tests/unit/pipeline/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_amender.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/conftest.py
  • tests/test_evaluation.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**

📄 CodeRabbit inference engine (AGENTS.md)

Place unit tests under tests/unit/ mirroring the source structure

Files:

  • tests/unit/core/script/__init__.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/pipeline/evaluation/__init__.py
  • tests/unit/pipeline/__init__.py
  • tests/unit/core/system/__init__.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/api/__init__.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/unit/core/script/__init__.py
  • tests/script/test_compare_evaluations.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/pipeline/evaluation/__init__.py
  • tests/unit/pipeline/__init__.py
  • tests/unit/core/system/__init__.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/api/__init__.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/script/test_compare_evaluations.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/api/test_client.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • tests/script/test_compare_evaluations.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
  • tests/unit/core/system/test_validator.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/core/metrics/test_manager.py
🧬 Code graph analysis (8)
tests/unit/core/script/test_manager.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
tests/unit/pipeline/evaluation/test_evaluator.py (5)
src/lightspeed_evaluation/core/models/data.py (5)
  • EvaluationData (264-311)
  • EvaluationRequest (370-423)
  • TurnData (35-261)
  • for_turn (398-412)
  • for_conversation (415-423)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (70-125)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
  • MetricsEvaluator (25-167)
  • evaluate_metric (60-140)
  • _determine_status (159-163)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricManager (17-136)
  • MetricLevel (10-14)
  • get_effective_threshold (50-80)
tests/unit/core/api/test_client.py (4)
src/lightspeed_evaluation/core/models/system.py (1)
  • APIConfig (117-156)
src/lightspeed_evaluation/core/models/api.py (2)
  • APIRequest (30-77)
  • APIResponse (80-116)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/core/api/client.py (3)
  • query (71-105)
  • _handle_response_errors (198-206)
  • _extract_error_message (208-229)
tests/unit/pipeline/evaluation/test_errors.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (3)
  • EvaluationErrorHandler (10-85)
  • mark_all_metrics_as_error (17-77)
  • get_error_summary (79-85)
tests/unit/core/system/test_validator.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/core/system/validator.py (4)
  • DataValidator (75-311)
  • format_pydantic_error (65-72)
  • validate_evaluation_data (139-157)
  • load_evaluation_data (85-137)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
  • APIResponse (80-116)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)
  • APIDataAmender (13-78)
  • amend_conversation_data (20-67)
  • get_amendment_summary (69-78)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
tests/unit/core/system/test_loader.py (3)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • DataValidationError (16-17)
src/lightspeed_evaluation/core/system/loader.py (3)
  • ConfigLoader (70-125)
  • populate_metric_mappings (29-48)
  • load_system_config (79-108)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (5)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • get_effective_threshold (50-80)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
🪛 GitHub Actions: Ruff
tests/unit/pipeline/evaluation/test_evaluator.py

[error] 37-37: F401: 'lightspeed_evaluation.core.metrics.manager.MetricLevel' imported but unused. Remove unused import or run 'ruff --fix' to fix.

tests/unit/core/api/test_client.py

[error] 5-5: F401: 'lightspeed_evaluation.core.models.APIRequest' imported but unused. Remove unused import or run 'ruff --fix' to fix.

tests/unit/pipeline/evaluation/test_errors.py

[error] 3-3: F401: 'pytest' imported but unused. Remove unused import or run 'ruff --fix' to fix.

tests/unit/pipeline/evaluation/test_amender.py

[error] 3-3: F401: 'pytest' imported but unused. Remove unused import or run 'ruff --fix' to fix.

tests/unit/core/system/test_loader.py

[error] 8-8: F401: 'lightspeed_evaluation.core.system.exceptions.DataValidationError' imported but unused. Remove unused import or run 'ruff --fix' to fix.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (12)
tests/script/test_compare_evaluations.py (1)

16-17: LGTM!

The path calculation correctly navigates from tests/script/ to the project root and then locates the target script. The comment clearly documents the path relationship.

tests/unit/core/system/test_validator.py (1)

1-422: LGTM! Comprehensive test coverage for the validator module.

The test suite provides excellent coverage:

  • Validation of Pydantic errors
  • Valid and invalid data scenarios
  • Metric availability checks
  • API-enabled behavior
  • File loading edge cases
  • Error accumulation

All tests follow pytest conventions and use the mocker fixture correctly.

tests/unit/core/script/test_manager.py (1)

1-195: Excellent test coverage for script execution scenarios.

The tests comprehensively cover success, failure, error handling, and edge cases including timeout behavior.

tests/unit/core/metrics/test_manager.py (1)

1-329: LGTM! Comprehensive test coverage for MetricManager.

The test suite thoroughly validates:

  • Metric resolution with defaults
  • Threshold hierarchy (level-specific vs system defaults)
  • Metric counting across various scenarios
  • Edge cases with empty metadata

All tests follow pytest best practices.
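As an illustration of the threshold hierarchy these tests exercise, here is a minimal sketch; effective_threshold below is a hypothetical helper, not MetricManager's actual API:

from typing import Optional


def effective_threshold(
    metric_threshold: Optional[float],
    level_default: Optional[float],
    system_default: float,
) -> float:
    """Most specific value wins: per-metric, then per-level, then system-wide default."""
    if metric_threshold is not None:
        return metric_threshold
    if level_default is not None:
        return level_default
    return system_default


assert effective_threshold(0.5, 0.9, 0.7) == 0.5    # explicit metric threshold wins
assert effective_threshold(None, 0.9, 0.7) == 0.9   # falls back to the level default
assert effective_threshold(None, None, 0.7) == 0.7  # falls back to the system default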

tests/unit/core/api/test_client.py (1)

10-281: Excellent test coverage for APIClient.

The test suite comprehensively covers:

  • Configuration validation
  • Standard and streaming query paths
  • Error handling (HTTP, timeout, validation)
  • Tool call formatting
  • Internal error extraction

All tests use pytest-mock correctly.
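For readers unfamiliar with the pattern, here is a minimal sketch of mocking an HTTP call with pytest-mock's mocker; WeatherClient below is a hypothetical example, not the repository's APIClient:

import requests


class WeatherClient:
    """Hypothetical client used only to illustrate the mocking pattern."""

    def fetch(self, city: str) -> dict:
        response = requests.get(f"https://api.example.invalid/weather/{city}", timeout=5)
        response.raise_for_status()
        return response.json()


class TestWeatherClient:
    def test_fetch_returns_parsed_json(self, mocker):
        # Patch requests.get via pytest-mock's mocker fixture (not unittest.mock)
        fake_response = mocker.Mock()
        fake_response.json.return_value = {"temp_c": 21}
        fake_response.raise_for_status.return_value = None
        mocker.patch("requests.get", return_value=fake_response)

        assert WeatherClient().fetch("Brno") == {"temp_c": 21}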

tests/unit/core/system/test_loader.py (1)

18-372: LGTM! Comprehensive test coverage for system loader.

The test suite thoroughly validates:

  • Metric mapping population and clearing
  • File loading error handling
  • YAML validation
  • Configuration sections and defaults
  • Metrics metadata handling

All tests follow pytest best practices with proper cleanup.

tests/unit/pipeline/__init__.py (1)

1-2: LGTM!

This package initializer correctly establishes the namespace for pipeline unit tests with an appropriate docstring.

tests/unit/pipeline/evaluation/__init__.py (1)

1-2: LGTM!

This package initializer correctly establishes the namespace for pipeline evaluation unit tests with an appropriate docstring.

tests/unit/pipeline/evaluation/test_errors.py (1)

9-197: Comprehensive test coverage for EvaluationErrorHandler.

The test suite thoroughly covers error handling scenarios including turn-level metrics, conversation-level metrics, mixed scenarios, empty metrics, error summaries, and accumulation across multiple conversations. All tests follow pytest conventions and properly construct test data.

tests/unit/pipeline/evaluation/test_amender.py (1)

10-309: Excellent test coverage for APIDataAmender.

The test suite comprehensively covers the amendment flow including scenarios with/without API clients, single and multi-turn conversations, tool calls, attachments, error handling at different stages, empty contexts, and amendment summaries. The tests properly use pytest-mock's mocker fixture and validate both success and failure paths.

tests/unit/pipeline/evaluation/test_evaluator.py (2)

15-60: Well-structured test fixtures.

The fixtures properly mock ConfigLoader, MetricManager, and ScriptExecutionManager with appropriate configurations and behaviors for testing the evaluator.


62-420: Comprehensive test coverage for MetricsEvaluator.

The test suite thoroughly exercises the evaluator's functionality including initialization validation, turn-level and conversation-level metric evaluation, pass/fail scenarios, unsupported framework handling, None score handling, exception handling, script metric gating when API is disabled, and status determination logic with various threshold values. All tests properly use mocking to isolate the component under test.
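To illustrate the kind of status rule these tests cover, a hedged sketch follows; determine_status is hypothetical, and the only repository-confirmed detail is that some metrics fall back to a simple score > 0 check when no threshold is configured:

from typing import Optional


def determine_status(score: Optional[float], threshold: Optional[float]) -> str:
    """Hypothetical rule: a missing score is an error; otherwise compare to the threshold,
    falling back to score > 0 when no threshold is configured."""
    if score is None:
        return "ERROR"
    if threshold is None:
        return "PASS" if score > 0 else "FAIL"
    return "PASS" if score >= threshold else "FAIL"


assert determine_status(None, 0.8) == "ERROR"
assert determine_status(0.85, 0.8) == "PASS"
assert determine_status(0.2, None) == "PASS"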


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/unit/pipeline/evaluation/test_amender.py (1)

3-3: Remove unused import to fix pipeline failure.

The pytest import is unused and causes a pipeline failure. Since this was already flagged in a previous review, please remove it.

Apply this diff:

-import pytest
-
 from lightspeed_evaluation.core.models import APIResponse, EvaluationData, TurnData
tests/unit/core/system/test_loader.py (1)

8-8: Remove unused import to fix pipeline failure.

The DataValidationError import is unused and causes a pipeline failure. Since this was already flagged in a previous review, please remove it.

Apply this diff:

 import pytest
 
-from lightspeed_evaluation.core.system.exceptions import DataValidationError
 from lightspeed_evaluation.core.system.loader import (
     ConfigLoader,
     populate_metric_mappings,
🧹 Nitpick comments (2)
tests/unit/core/script/test_manager.py (2)

152-152: Move import to module level.

The os module should be imported at the top of the file rather than inside the function for better readability and convention compliance.

Apply this diff to move the import to the top of the file with other imports:

 import tempfile
 from pathlib import Path
+import os
 
 import pytest

And remove the import from inside the function:

             # Use relative path
-            import os
-
             original_cwd = os.getcwd()

15-181: Consider using pytest fixtures for test script management.

Multiple tests repeat the pattern of creating temporary scripts, making them executable, and cleaning them up. Pytest fixtures could reduce this duplication and improve test maintainability.

Example fixture:

@pytest.fixture
def executable_script():
    """Create temporary scripts on demand and remove them after the test."""
    created: list[Path] = []

    def _make_script(content: str, executable: bool = True) -> Path:
        # Write the script body to a temporary file
        with tempfile.NamedTemporaryFile(mode="w", suffix=".sh", delete=False) as f:
            f.write(content)
        script_path = Path(f.name)
        script_path.chmod(0o755 if executable else 0o644)
        created.append(script_path)
        return script_path

    # Yield the factory itself so tests get a Path back from each call
    yield _make_script

    # Cleanup after the test
    for script_path in created:
        script_path.unlink(missing_ok=True)

This would simplify tests like test_run_script_success to:

def test_run_script_success(self, executable_script):
    script_path = executable_script("#!/bin/bash\nexit 0\n")
    manager = ScriptExecutionManager()
    assert manager.run_script(script_path) is True
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 156a541 and 47de9b4.

📒 Files selected for processing (16)
  • tests/conftest.py (0 hunks)
  • tests/script/test_compare_evaluations.py (2 hunks)
  • tests/test_evaluation.py (0 hunks)
  • tests/unit/core/api/__init__.py (1 hunks)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/metrics/test_manager.py (1 hunks)
  • tests/unit/core/script/__init__.py (1 hunks)
  • tests/unit/core/script/test_manager.py (1 hunks)
  • tests/unit/core/system/__init__.py (1 hunks)
  • tests/unit/core/system/test_loader.py (1 hunks)
  • tests/unit/core/system/test_validator.py (1 hunks)
  • tests/unit/pipeline/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/__init__.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_amender.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tests/test_evaluation.py
  • tests/conftest.py
✅ Files skipped from review due to trivial changes (1)
  • tests/unit/pipeline/__init__.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/unit/core/system/test_validator.py
  • tests/script/test_compare_evaluations.py
  • tests/unit/core/api/test_client.py
  • tests/unit/pipeline/evaluation/test_errors.py
  • tests/unit/pipeline/evaluation/__init__.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**

📄 CodeRabbit inference engine (AGENTS.md)

Place unit tests under tests/unit/ mirroring the source structure

Files:

  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/core/script/__init__.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/system/__init__.py
  • tests/unit/core/api/__init__.py
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/core/system/test_loader.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/unit/pipeline/evaluation/test_amender.py
  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/script/test_manager.py
  • tests/unit/core/script/__init__.py
  • tests/unit/core/system/test_loader.py
  • tests/unit/core/system/__init__.py
  • tests/unit/core/api/__init__.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest

Applied to files:

  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/system/test_loader.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/metrics/test_manager.py
  • tests/unit/core/system/test_loader.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/core/metrics/test_manager.py
🧬 Code graph analysis (4)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
  • APIResponse (80-116)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)
  • APIDataAmender (13-78)
  • amend_conversation_data (20-67)
  • get_amendment_summary (69-78)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (5)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • get_effective_threshold (50-80)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
tests/unit/core/script/test_manager.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
tests/unit/core/system/test_loader.py (2)
src/lightspeed_evaluation/core/system/loader.py (3)
  • ConfigLoader (70-125)
  • populate_metric_mappings (29-48)
  • load_system_config (79-108)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
🪛 GitHub Actions: Ruff
tests/unit/pipeline/evaluation/test_amender.py

[error] 3-3: Ruff: F401 imported but unused: pytest

tests/unit/core/system/test_loader.py

[error] 8-8: Ruff: F401 imported but unused: lightspeed_evaluation.core.system.exceptions.DataValidationError

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
🔇 Additional comments (6)
tests/unit/core/api/__init__.py (1)

1-1: LGTM!

The package initializer with a clear docstring follows Python best practices for test organization.

tests/unit/core/metrics/test_manager.py (1)

1-318: Excellent comprehensive test coverage!

This test suite thoroughly validates MetricManager behavior across all key scenarios:

  • Metric resolution with defaults, explicit lists, and empty lists
  • Threshold hierarchy with system defaults and level-specific overrides
  • Metric counting across multiple edge cases
  • Proper handling of missing metadata and unknown metrics

The tests are well-structured, follow pytest conventions, and provide confidence in the MetricManager implementation.

tests/unit/pipeline/evaluation/test_amender.py (1)

10-293: Comprehensive test coverage for APIDataAmender!

Excellent test suite covering:

  • Amendment with/without API client
  • Single and multi-turn conversations with conversation_id tracking
  • Tool calls and attachments handling
  • Error scenarios at different turn positions
  • Empty contexts handling
  • Amendment summaries with various data states

The tests properly use pytest-mock and validate all key behaviors of the amender.

tests/unit/core/system/__init__.py (1)

1-1: LGTM!

Standard package initializer for the core system unit tests.

tests/unit/core/system/test_loader.py (1)

18-373: Excellent comprehensive test coverage!

Outstanding test suite covering:

  • populate_metric_mappings: turn-level, conversation-level, and clearing behavior
  • ConfigLoader: file errors, YAML errors, empty files, invalid formats
  • Valid configurations: minimal, complete, with defaults, partial metadata
  • Proper verification of metric population into global sets
  • Appropriate use of tempfile for isolated testing

The tests thoroughly validate the loader's robustness and configuration handling.

tests/unit/core/script/__init__.py (1)

1-1: LGTM!

Standard package initializer with appropriate docstring.

@bsatapat-jpg force-pushed the dev branch 2 times, most recently from 0f7791b to 5cf16c0 on November 8, 2025, 15:18

@asamal4 asamal4 left a comment


Thank you!! Now we have a baseline.
LGTM


@tisnik tisnik left a comment


LGTM

