LCORE-748: Added unit test cases coverage for the evaluation framework #95

tisnik merged 1 commit into lightspeed-core:main from
Conversation
Walkthrough

Removed the centralized pytest fixture module and a trivial test, adjusted a script test import path, and added many unit test modules and package initializers covering core API, metrics, scripting, system config/validation, and pipeline evaluation components.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Pytest as Pytest (collector)
participant Tests as Test files
participant Conftest as removed `tests/conftest.py`
participant LocalFixtures as local per-test fixtures
note right of Pytest: Previous flow (with conftest)
Pytest->>Conftest: load session fixtures / hooks
Conftest-->>Pytest: provide fixtures, env setup, markers
note right of Pytest: New flow (conftest removed)
Pytest->>Tests: collect test modules
Tests->>LocalFixtures: use in-file or per-package fixtures / mocks
LocalFixtures-->>Tests: provide isolated setup/teardown
rect rgba(0,128,128,0.08)
note over Tests,LocalFixtures: Tests now rely on local fixtures and explicit setup\ninstead of centralized conftest-provided state
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 5
🧹 Nitpick comments (2)
tests/script/test_compare_evaluations.py (1)
181-183: Consider cleaning up the `sys.path` modification after the fixture.

The fixture modifies `sys.path` without cleanup. While pytest's test isolation typically prevents issues, explicitly removing the added path after each test would improve test hygiene and prevent potential `sys.path` pollution. Consider refactoring to use a yield fixture with cleanup:

```diff
 @pytest.fixture
 def comparison_instance(self):
     """Create an EvaluationComparison instance for testing."""
-    # Import here to avoid module loading issues
     import sys
-    # Add project root to path (tests/script/ -> tests/ -> project_root/)
-    sys.path.append(str(Path(__file__).parent.parent.parent))
+    project_root = str(Path(__file__).parent.parent.parent)
+    sys.path.insert(0, project_root)
+
     from script.compare_evaluations import EvaluationComparison
-
-    return EvaluationComparison(alpha=0.05)
+
+    yield EvaluationComparison(alpha=0.05)
+
+    # Cleanup
+    if project_root in sys.path:
+        sys.path.remove(project_root)
```
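For reference, the full fixture after applying this change might look like the sketch below (a minimal illustration; the `EvaluationComparison` import path and `alpha` value are taken from the diff above and may differ in the actual test module):

```python
# Minimal sketch of the suggested yield fixture with sys.path cleanup.
import sys
from pathlib import Path

import pytest


@pytest.fixture
def comparison_instance():
    """Create an EvaluationComparison instance, cleaning up sys.path afterwards."""
    project_root = str(Path(__file__).parent.parent.parent)
    sys.path.insert(0, project_root)

    from script.compare_evaluations import EvaluationComparison

    yield EvaluationComparison(alpha=0.05)

    # Teardown: drop the path entry added for this test
    if project_root in sys.path:
        sys.path.remove(project_root)
```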
tests/unit/core/script/test_manager.py (1)

154-172: Move the `os` import to the top of the file.

The `os` module is imported inline at line 164, but it's better to import it at the module level alongside other imports for consistency and clarity. Apply this diff:

```diff
 """Unit tests for core script manager module."""

 import tempfile
+import os
 from pathlib import Path

 import pytest
```

And remove the inline import:

```diff
         # Use relative path
-        import os
         original_cwd = os.getcwd()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- tests/conftest.py (0 hunks)
- tests/script/test_compare_evaluations.py (2 hunks)
- tests/test_evaluation.py (0 hunks)
- tests/unit/core/api/__init__.py (1 hunks)
- tests/unit/core/api/test_client.py (1 hunks)
- tests/unit/core/metrics/test_manager.py (1 hunks)
- tests/unit/core/script/__init__.py (1 hunks)
- tests/unit/core/script/test_manager.py (1 hunks)
- tests/unit/core/system/__init__.py (1 hunks)
- tests/unit/core/system/test_loader.py (1 hunks)
- tests/unit/core/system/test_validator.py (1 hunks)
- tests/unit/pipeline/__init__.py (1 hunks)
- tests/unit/pipeline/evaluation/__init__.py (1 hunks)
- tests/unit/pipeline/evaluation/test_amender.py (1 hunks)
- tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
- tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
💤 Files with no reviewable changes (2)
- tests/conftest.py
- tests/test_evaluation.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**
📄 CodeRabbit inference engine (AGENTS.md)
Place unit tests under tests/unit/ mirroring the source structure
Files:
- tests/unit/core/script/__init__.py
- tests/unit/core/script/test_manager.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/pipeline/evaluation/__init__.py
- tests/unit/pipeline/__init__.py
- tests/unit/core/system/__init__.py
- tests/unit/core/api/test_client.py
- tests/unit/core/api/__init__.py
- tests/unit/pipeline/evaluation/test_errors.py
- tests/unit/core/system/test_validator.py
- tests/unit/pipeline/evaluation/test_amender.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/metrics/test_manager.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest
Files:
- tests/unit/core/script/__init__.py
- tests/script/test_compare_evaluations.py
- tests/unit/core/script/test_manager.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/pipeline/evaluation/__init__.py
- tests/unit/pipeline/__init__.py
- tests/unit/core/system/__init__.py
- tests/unit/core/api/test_client.py
- tests/unit/core/api/__init__.py
- tests/unit/pipeline/evaluation/test_errors.py
- tests/unit/core/system/test_validator.py
- tests/unit/pipeline/evaluation/test_amender.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/metrics/test_manager.py
tests/**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Name test files as test_*.py
Files:
- tests/script/test_compare_evaluations.py
- tests/unit/core/script/test_manager.py
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/core/api/test_client.py
- tests/unit/pipeline/evaluation/test_errors.py
- tests/unit/core/system/test_validator.py
- tests/unit/pipeline/evaluation/test_amender.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/metrics/test_manager.py
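To illustrate the "use pytest-mock's `mocker`, not unittest.mock" guideline listed in the tests/**/*.py instructions above, here is a minimal, self-contained sketch; the `LLMClient` class and `answer` function are inline stand-ins invented for demonstration, not code from this repository:

```python
# Minimal illustration of patching a dependency with pytest-mock's mocker
# fixture instead of unittest.mock. All names below are hypothetical.
import pytest


class LLMClient:
    """Hypothetical dependency that would normally call a real LLM."""

    def complete(self, prompt: str) -> str:
        raise RuntimeError("should be mocked in tests")


def answer(client: LLMClient, question: str) -> str:
    """Hypothetical function under test."""
    return client.complete(question).strip()


def test_answer_uses_mocked_llm(mocker):
    client = LLMClient()
    # Patch the LLM call via the mocker fixture so no real model is invoked
    mocker.patch.object(client, "complete", return_value=" mocked answer ")

    assert answer(client, "What is Lightspeed?") == "mocked answer"
```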
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
tests/script/test_compare_evaluations.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
Applied to files:
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/core/api/test_client.py
- tests/unit/core/system/test_validator.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- tests/unit/pipeline/evaluation/test_evaluator.py
- tests/unit/core/system/test_validator.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
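A small sketch of the default-threshold rule this learning describes (the helper below is illustrative only; only the threshold-is-None branch comes from the learning, and the stricter configured check is an assumption):

```python
# Illustrative sketch of the described default for custom:tool_eval.
from typing import Optional


def tool_eval_passes(score: float, threshold: Optional[float] = None) -> bool:
    """Return True if a tool-eval score passes under the described rule."""
    if threshold is None:
        # Default: any positive score passes, so partial correctness is acceptable
        return score > 0
    # Stricter, explicitly configured check (assumed here to be a simple cutoff)
    return score >= threshold


print(tool_eval_passes(0.3))                 # True  - no threshold, positive score
print(tool_eval_passes(0.3, threshold=0.5))  # False - explicit cutoff not met
```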
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
tests/unit/core/metrics/test_manager.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
tests/unit/core/metrics/test_manager.py
🧬 Code graph analysis (8)
tests/unit/core/script/test_manager.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
- ScriptExecutionManager (14-117)
tests/unit/pipeline/evaluation/test_evaluator.py (5)
src/lightspeed_evaluation/core/models/data.py (5)
- EvaluationData (264-311)
- EvaluationRequest (370-423)
- TurnData (35-261)
- for_turn (398-412)
- for_conversation (415-423)

src/lightspeed_evaluation/core/models/system.py (1)

- SystemConfig (267-297)

src/lightspeed_evaluation/core/system/loader.py (1)

- ConfigLoader (70-125)

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)

- MetricsEvaluator (25-167)
- evaluate_metric (60-140)
- _determine_status (159-163)

src/lightspeed_evaluation/core/metrics/manager.py (3)

- MetricManager (17-136)
- MetricLevel (10-14)
- get_effective_threshold (50-80)
tests/unit/core/api/test_client.py (4)
src/lightspeed_evaluation/core/models/system.py (1)
- APIConfig (117-156)

src/lightspeed_evaluation/core/models/api.py (2)

- APIRequest (30-77)
- APIResponse (80-116)

src/lightspeed_evaluation/core/system/exceptions.py (1)

- APIError (8-9)

src/lightspeed_evaluation/core/api/client.py (3)

- query (71-105)
- _handle_response_errors (198-206)
- _extract_error_message (208-229)
tests/unit/pipeline/evaluation/test_errors.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
- EvaluationData (264-311)
- TurnData (35-261)

src/lightspeed_evaluation/pipeline/evaluation/errors.py (3)

- EvaluationErrorHandler (10-85)
- mark_all_metrics_as_error (17-77)
- get_error_summary (79-85)
tests/unit/core/system/test_validator.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
- EvaluationData (264-311)
- TurnData (35-261)

src/lightspeed_evaluation/core/system/exceptions.py (1)

- DataValidationError (16-17)

src/lightspeed_evaluation/core/system/validator.py (4)

- DataValidator (75-311)
- format_pydantic_error (65-72)
- validate_evaluation_data (139-157)
- load_evaluation_data (85-137)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
- APIResponse (80-116)

src/lightspeed_evaluation/core/models/data.py (2)

- EvaluationData (264-311)
- TurnData (35-261)

src/lightspeed_evaluation/core/system/exceptions.py (1)

- APIError (8-9)

src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)

- APIDataAmender (13-78)
- amend_conversation_data (20-67)
- get_amendment_summary (69-78)

src/lightspeed_evaluation/core/api/client.py (1)

- query (71-105)
tests/unit/core/system/test_loader.py (3)
src/lightspeed_evaluation/core/system/exceptions.py (1)
- DataValidationError (16-17)

src/lightspeed_evaluation/core/system/loader.py (3)

- ConfigLoader (70-125)
- populate_metric_mappings (29-48)
- load_system_config (79-108)

src/lightspeed_evaluation/core/models/system.py (1)

- SystemConfig (267-297)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (5)
- MetricLevel (10-14)
- MetricManager (17-136)
- resolve_metrics (24-48)
- get_effective_threshold (50-80)
- count_metrics_for_conversation (115-136)

src/lightspeed_evaluation/core/models/data.py (2)

- EvaluationData (264-311)
- TurnData (35-261)

src/lightspeed_evaluation/core/models/system.py (1)

- SystemConfig (267-297)
🪛 GitHub Actions: Ruff
tests/unit/pipeline/evaluation/test_evaluator.py
[error] 37-37: F401: 'lightspeed_evaluation.core.metrics.manager.MetricLevel' imported but unused. Remove unused import or run 'ruff --fix' to fix.
tests/unit/core/api/test_client.py
[error] 5-5: F401: 'lightspeed_evaluation.core.models.APIRequest' imported but unused. Remove unused import or run 'ruff --fix' to fix.
tests/unit/pipeline/evaluation/test_errors.py
[error] 3-3: F401: 'pytest' imported but unused. Remove unused import or run 'ruff --fix' to fix.
tests/unit/pipeline/evaluation/test_amender.py
[error] 3-3: F401: 'pytest' imported but unused. Remove unused import or run 'ruff --fix' to fix.
tests/unit/core/system/test_loader.py
[error] 8-8: F401: 'lightspeed_evaluation.core.system.exceptions.DataValidationError' imported but unused. Remove unused import or run 'ruff --fix' to fix.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.11)
- GitHub Check: mypy
🔇 Additional comments (12)
tests/script/test_compare_evaluations.py (1)
16-17: LGTM! The path calculation correctly navigates from `tests/script/` to the project root and then locates the target script. The comment clearly documents the path relationship.

tests/unit/core/system/test_validator.py (1)

1-422: LGTM! Comprehensive test coverage for the validator module. The test suite provides excellent coverage:
- Validation of Pydantic errors
- Valid and invalid data scenarios
- Metric availability checks
- API-enabled behavior
- File loading edge cases
- Error accumulation
All tests follow pytest conventions and use the mocker fixture correctly.
tests/unit/core/script/test_manager.py (1)
1-195: Excellent test coverage for script execution scenarios. The tests comprehensively cover success, failure, error handling, and edge cases including timeout behavior.
tests/unit/core/metrics/test_manager.py (1)
1-329: LGTM! Comprehensive test coverage for MetricManager. The test suite thoroughly validates:
- Metric resolution with defaults
- Threshold hierarchy (level-specific vs system defaults)
- Metric counting across various scenarios
- Edge cases with empty metadata
All tests follow pytest best practices.
tests/unit/core/api/test_client.py (1)
10-281: Excellent test coverage for APIClient. The test suite comprehensively covers:
- Configuration validation
- Standard and streaming query paths
- Error handling (HTTP, timeout, validation)
- Tool call formatting
- Internal error extraction
All tests use pytest-mock correctly.
tests/unit/core/system/test_loader.py (1)
18-372: LGTM! Comprehensive test coverage for system loader. The test suite thoroughly validates:
- Metric mapping population and clearing
- File loading error handling
- YAML validation
- Configuration sections and defaults
- Metrics metadata handling
All tests follow pytest best practices with proper cleanup.
tests/unit/pipeline/__init__.py (1)
1-2: LGTM! This package initializer correctly establishes the namespace for pipeline unit tests with an appropriate docstring.
tests/unit/pipeline/evaluation/__init__.py (1)
1-2: LGTM! This package initializer correctly establishes the namespace for pipeline evaluation unit tests with an appropriate docstring.
tests/unit/pipeline/evaluation/test_errors.py (1)
9-197: Comprehensive test coverage for EvaluationErrorHandler. The test suite thoroughly covers error handling scenarios including turn-level metrics, conversation-level metrics, mixed scenarios, empty metrics, error summaries, and accumulation across multiple conversations. All tests follow pytest conventions and properly construct test data.
tests/unit/pipeline/evaluation/test_amender.py (1)
10-309: Excellent test coverage for APIDataAmender. The test suite comprehensively covers the amendment flow including scenarios with/without API clients, single and multi-turn conversations, tool calls, attachments, error handling at different stages, empty contexts, and amendment summaries. The tests properly use pytest-mock's `mocker` fixture and validate both success and failure paths.

tests/unit/pipeline/evaluation/test_evaluator.py (2)

15-60: Well-structured test fixtures. The fixtures properly mock ConfigLoader, MetricManager, and ScriptExecutionManager with appropriate configurations and behaviors for testing the evaluator.

62-420: Comprehensive test coverage for MetricsEvaluator. The test suite thoroughly exercises the evaluator's functionality including initialization validation, turn-level and conversation-level metric evaluation, pass/fail scenarios, unsupported framework handling, None score handling, exception handling, script metric gating when API is disabled, and status determination logic with various threshold values. All tests properly use mocking to isolate the component under test.
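As a rough illustration of the mocking pattern these comments describe (the collaborator names follow the review summary, and `get_effective_threshold` appears in the code graph above, but the fixture below is a generic sketch rather than the project's actual test code):

```python
# Generic sketch of mocking an evaluator's collaborators with pytest-mock.
import pytest


@pytest.fixture
def mocked_collaborators(mocker):
    """Create mocked stand-ins for ConfigLoader, MetricManager, and ScriptExecutionManager."""
    config_loader = mocker.Mock(name="ConfigLoader")
    metric_manager = mocker.Mock(name="MetricManager")
    metric_manager.get_effective_threshold.return_value = 0.7
    script_manager = mocker.Mock(name="ScriptExecutionManager")
    return config_loader, metric_manager, script_manager


def test_mocked_threshold_lookup(mocked_collaborators):
    _, metric_manager, _ = mocked_collaborators
    # The mocked manager returns the configured threshold, letting a test
    # focus on the evaluator's pass/fail logic rather than config parsing.
    assert metric_manager.get_effective_threshold("some:metric") == 0.7
```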
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/unit/pipeline/evaluation/test_amender.py (1)
3-3: Remove unused import to fix pipeline failure.

The `pytest` import is unused and causes a pipeline failure. Since this was already flagged in a previous review, please remove it. Apply this diff:

```diff
-import pytest
-
 from lightspeed_evaluation.core.models import APIResponse, EvaluationData, TurnData
```

tests/unit/core/system/test_loader.py (1)
8-8: Remove unused import to fix pipeline failure.

The `DataValidationError` import is unused and causes a pipeline failure. Since this was already flagged in a previous review, please remove it. Apply this diff:

```diff
 import pytest

-from lightspeed_evaluation.core.system.exceptions import DataValidationError
 from lightspeed_evaluation.core.system.loader import (
     ConfigLoader,
     populate_metric_mappings,
```
🧹 Nitpick comments (2)
tests/unit/core/script/test_manager.py (2)
152-152: Move import to module level.

The `os` module should be imported at the top of the file rather than inside the function for better readability and convention compliance. Apply this diff to move the import to the top of the file with other imports:

```diff
 import tempfile
 from pathlib import Path
+import os

 import pytest
```

And remove the import from inside the function:

```diff
         # Use relative path
-        import os
-
         original_cwd = os.getcwd()
```
15-181: Consider using pytest fixtures for test script management.

Multiple tests repeat the pattern of creating temporary scripts, making them executable, and cleaning them up. Pytest fixtures could reduce this duplication and improve test maintainability.

Example fixture:

```python
@pytest.fixture
def executable_script():
    """Create temporary executable scripts and clean them up after the test."""
    created_paths = []

    def _make_script(content: str, executable: bool = True):
        with tempfile.NamedTemporaryFile(mode="w", suffix=".sh", delete=False) as f:
            f.write(content)
        script_path = Path(f.name)
        if executable:
            script_path.chmod(0o755)
        else:
            script_path.chmod(0o644)
        created_paths.append(script_path)
        return script_path

    yield _make_script

    # Cleanup created scripts after the test
    for script_path in created_paths:
        script_path.unlink()
```

This would simplify tests like `test_run_script_success` to:

```python
def test_run_script_success(self, executable_script):
    script_path = executable_script("#!/bin/bash\nexit 0\n")
    manager = ScriptExecutionManager()
    assert manager.run_script(script_path) is True
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- tests/conftest.py (0 hunks)
- tests/script/test_compare_evaluations.py (2 hunks)
- tests/test_evaluation.py (0 hunks)
- tests/unit/core/api/__init__.py (1 hunks)
- tests/unit/core/api/test_client.py (1 hunks)
- tests/unit/core/metrics/test_manager.py (1 hunks)
- tests/unit/core/script/__init__.py (1 hunks)
- tests/unit/core/script/test_manager.py (1 hunks)
- tests/unit/core/system/__init__.py (1 hunks)
- tests/unit/core/system/test_loader.py (1 hunks)
- tests/unit/core/system/test_validator.py (1 hunks)
- tests/unit/pipeline/__init__.py (1 hunks)
- tests/unit/pipeline/evaluation/__init__.py (1 hunks)
- tests/unit/pipeline/evaluation/test_amender.py (1 hunks)
- tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
- tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
💤 Files with no reviewable changes (2)
- tests/test_evaluation.py
- tests/conftest.py
✅ Files skipped from review due to trivial changes (1)
- tests/unit/pipeline/__init__.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/unit/core/system/test_validator.py
- tests/script/test_compare_evaluations.py
- tests/unit/core/api/test_client.py
- tests/unit/pipeline/evaluation/test_errors.py
- tests/unit/pipeline/evaluation/__init__.py
- tests/unit/pipeline/evaluation/test_evaluator.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**
📄 CodeRabbit inference engine (AGENTS.md)
Place unit tests under tests/unit/ mirroring the source structure
Files:
- tests/unit/pipeline/evaluation/test_amender.py
- tests/unit/core/metrics/test_manager.py
- tests/unit/core/script/test_manager.py
- tests/unit/core/script/__init__.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/system/__init__.py
- tests/unit/core/api/__init__.py
tests/**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Name test files as test_*.py
Files:
- tests/unit/pipeline/evaluation/test_amender.py
- tests/unit/core/metrics/test_manager.py
- tests/unit/core/script/test_manager.py
- tests/unit/core/system/test_loader.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest
Files:
- tests/unit/pipeline/evaluation/test_amender.py
- tests/unit/core/metrics/test_manager.py
- tests/unit/core/script/test_manager.py
- tests/unit/core/script/__init__.py
- tests/unit/core/system/test_loader.py
- tests/unit/core/system/__init__.py
- tests/unit/core/api/__init__.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
Applied to files:
- tests/unit/core/metrics/test_manager.py
- tests/unit/core/system/test_loader.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- tests/unit/core/metrics/test_manager.py
- tests/unit/core/system/test_loader.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
tests/unit/core/metrics/test_manager.py
🧬 Code graph analysis (4)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
- APIResponse (80-116)

src/lightspeed_evaluation/core/models/data.py (2)

- EvaluationData (264-311)
- TurnData (35-261)

src/lightspeed_evaluation/core/system/exceptions.py (1)

- APIError (8-9)

src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)

- APIDataAmender (13-78)
- amend_conversation_data (20-67)
- get_amendment_summary (69-78)

src/lightspeed_evaluation/core/api/client.py (1)

- query (71-105)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (5)
- MetricLevel (10-14)
- MetricManager (17-136)
- resolve_metrics (24-48)
- get_effective_threshold (50-80)
- count_metrics_for_conversation (115-136)

src/lightspeed_evaluation/core/models/data.py (2)

- EvaluationData (264-311)
- TurnData (35-261)

src/lightspeed_evaluation/core/models/system.py (1)

- SystemConfig (267-297)
tests/unit/core/script/test_manager.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
- ScriptExecutionManager (14-117)
tests/unit/core/system/test_loader.py (2)
src/lightspeed_evaluation/core/system/loader.py (3)
- ConfigLoader (70-125)
- populate_metric_mappings (29-48)
- load_system_config (79-108)

src/lightspeed_evaluation/core/models/system.py (1)

- SystemConfig (267-297)
🪛 GitHub Actions: Ruff
tests/unit/pipeline/evaluation/test_amender.py
[error] 3-3: Ruff: F401 imported but unused: pytest
tests/unit/core/system/test_loader.py
[error] 8-8: Ruff: F401 imported but unused: lightspeed_evaluation.core.system.exceptions.DataValidationError
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: mypy
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
🔇 Additional comments (6)
tests/unit/core/api/__init__.py (1)
1-1: LGTM! The package initializer with a clear docstring follows Python best practices for test organization.
tests/unit/core/metrics/test_manager.py (1)
1-318: Excellent comprehensive test coverage! This test suite thoroughly validates MetricManager behavior across all key scenarios:
- Metric resolution with defaults, explicit lists, and empty lists
- Threshold hierarchy with system defaults and level-specific overrides
- Metric counting across multiple edge cases
- Proper handling of missing metadata and unknown metrics
The tests are well-structured, follow pytest conventions, and provide confidence in the MetricManager implementation.
tests/unit/pipeline/evaluation/test_amender.py (1)
10-293: Comprehensive test coverage for APIDataAmender! Excellent test suite covering:
- Amendment with/without API client
- Single and multi-turn conversations with conversation_id tracking
- Tool calls and attachments handling
- Error scenarios at different turn positions
- Empty contexts handling
- Amendment summaries with various data states
The tests properly use pytest-mock and validate all key behaviors of the amender.
tests/unit/core/system/__init__.py (1)
1-1: LGTM! Standard package initializer for the core system unit tests.
tests/unit/core/system/test_loader.py (1)
18-373: Excellent comprehensive test coverage! Outstanding test suite covering:

- populate_metric_mappings: turn-level, conversation-level, and clearing behavior
- ConfigLoader: file errors, YAML errors, empty files, invalid formats
- Valid configurations: minimal, complete, with defaults, partial metadata
- Proper verification of metric population into global sets
- Appropriate use of tempfile for isolated testing
The tests thoroughly validate the loader's robustness and configuration handling.
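As an aside, a minimal version of the temporary-file pattern described above might look like the sketch below; `load_yaml_config` is a hypothetical helper used only for illustration, while the real tests exercise `ConfigLoader`, whose API may differ:

```python
# Minimal sketch of testing YAML config loading against a temporary file,
# using pytest's built-in tmp_path fixture for isolation and cleanup.
from pathlib import Path

import yaml


def load_yaml_config(path: Path) -> dict:
    """Load a YAML file into a dict (illustrative helper)."""
    with path.open(encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def test_load_minimal_config(tmp_path):
    config_file = tmp_path / "system.yaml"
    config_file.write_text("llm:\n  provider: openai\n", encoding="utf-8")

    config = load_yaml_config(config_file)

    assert config == {"llm": {"provider": "openai"}}
```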
tests/unit/core/script/__init__.py (1)
1-1: LGTM! Standard package initializer with appropriate docstring.
0f7791b to 5cf16c0 (Compare)
asamal4 left a comment
Thank You !! Now we have a baseline.
LGTM
Issue Link: https://issues.redhat.com/browse/LCORE-748