LEADS-26: Increased Unit test cases coverage #109
tisnik merged 2 commits into lightspeed-core:main
Conversation
Walkthrough
Adds a large suite of new and expanded unit tests across core components (API client & streaming parser, evaluation pipeline & processor, runner/CLI, LLM managers, models, output/statistics, script/system utilities). All changes are test-only; no production public APIs were modified.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30–40 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (5)
tests/unit/core/api/test_client.py (1)
67-79: Consider a more specific assertion for the API key header.
The current assertion `call_count >= 1` only verifies that `headers.update` was called at least once but doesn't confirm the Authorization header was set. For stronger verification, consider checking the actual call arguments.

```diff
     APIClient(basic_api_config)

-    # Verify headers were updated (should include Authorization header)
-    assert mock_client.headers.update.call_count >= 1
+    # Verify Authorization header was set
+    calls = mock_client.headers.update.call_args_list
+    assert any("Authorization" in str(call) for call in calls)
```

tests/unit/pipeline/evaluation/test_pipeline.py (4)
20-24: Consider using pytest's `tmp_path` fixture for cross-platform compatibility.
Hardcoded `/tmp/test_output` is Unix-specific. Using `tmp_path` would ensure tests work on Windows as well.

```diff
 @pytest.fixture
-def mock_config_loader(mocker):
+def mock_config_loader(mocker, tmp_path):
     """Create a mock config loader with system config."""
     loader = mocker.Mock(spec=ConfigLoader)
     config = SystemConfig()
     config.api.enabled = False
-    config.output.output_dir = "/tmp/test_output"
+    config.output.output_dir = str(tmp_path / "test_output")
     config.output.base_filename = "test"
     config.core.max_threads = 2
```
50-76: Extract the repeated mocking setup into a reusable fixture to eliminate duplication.
The same set of `mocker.patch()` calls is repeated across all 12 test methods, creating a significant maintenance burden. Consider extracting this into an `autouse` fixture or a helper fixture that each test can use.

```python
@pytest.fixture
def mock_pipeline_dependencies(mocker):
    """Mock all EvaluationPipeline dependencies."""
    mocks = {
        "validator": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.DataValidator"
        ),
        "metric_manager": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.MetricManager"
        ),
        "api_client": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.APIClient"
        ),
        "api_amender": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.APIDataAmender"
        ),
        "error_handler": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.EvaluationErrorHandler"
        ),
        "script_manager": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.ScriptExecutionManager"
        ),
        "metrics_evaluator": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.MetricsEvaluator"
        ),
        "processor": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.ConversationProcessor"
        ),
    }
    return mocks
```

Then simplify each test:

```python
def test_initialization_success(self, mock_config_loader, mock_pipeline_dependencies):
    """Test successful pipeline initialization."""
    pipeline = EvaluationPipeline(mock_config_loader)

    assert pipeline.config_loader == mock_config_loader
    assert pipeline.system_config is not None
    assert pipeline.output_dir == "/tmp/test_output"
```
113-116: Consider verifying APIClient constructor arguments.
The test confirms the client is created but doesn't verify it receives the expected configuration. This would catch regressions if the pipeline changes how it initializes the client.

```diff
     pipeline = EvaluationPipeline(mock_config_loader)

     assert pipeline.api_client is not None
-    mock_api_client.assert_called_once()
+    mock_api_client.assert_called_once_with(
+        api_base="http://test.com",
+        endpoint_type="test",
+        # ... other expected args
+    )
```
331-341: Consider verifying the exception was actually handled (e.g., via logging).
The test verifies no exception propagates, but doesn't confirm the exception path was actually taken and logged. This could be a silent pass if `save_evaluation_data` is never called.

```diff
     mock_save.side_effect = Exception("Save error")
     pipeline = EvaluationPipeline(mock_config_loader)

     # Should not raise, just log warning
     results = pipeline.run_evaluation(sample_evaluation_data, "/tmp/original.yaml")

     assert results is not None
+    mock_save.assert_called_once()  # Verify the save was attempted
```

Optionally, you could also capture logs to verify the warning was logged:

```python
def test_save_amended_data_handles_exception(
    self, mock_config_loader, sample_evaluation_data, mocker, caplog
):
    # ... setup ...
    with caplog.at_level(logging.WARNING):
        results = pipeline.run_evaluation(sample_evaluation_data, "/tmp/original.yaml")
    assert "Save error" in caplog.text or any(
        "save" in r.message.lower() for r in caplog.records
    )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tests/unit/core/api/test_client.py (1 hunks)
- tests/unit/core/api/test_streaming_parser.py (1 hunks)
- tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
- tests/unit/runner/__init__.py (1 hunks)
- tests/unit/runner/test_evaluation.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use the naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/runner/__init__.py
- tests/unit/pipeline/evaluation/test_pipeline.py
- tests/unit/runner/test_evaluation.py
- tests/unit/core/api/test_client.py
- tests/unit/pipeline/evaluation/test_processor.py
🧠 Learnings (4)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
tests/unit/core/api/test_streaming_parser.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
tests/unit/runner/test_evaluation.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
🧬 Code graph analysis (4)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
_parse_sse_line(63-72)
tests/unit/pipeline/evaluation/test_pipeline.py (5)
src/lightspeed_evaluation/core/models/system.py (1)
- SystemConfig (267-297)
src/lightspeed_evaluation/core/system/loader.py (1)
- ConfigLoader (70-125)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
- EvaluationPipeline (32-202)
src/lightspeed_evaluation/core/api/client.py (1)
- query (71-105)
src/lightspeed_evaluation/core/system/validator.py (1)
- validate_evaluation_data (143-161)
tests/unit/runner/test_evaluation.py (5)
tests/unit/pipeline/evaluation/test_pipeline.py (2)
- test_run_evaluation_success (176-222)
- mock_config_loader (16-27)
tests/unit/core/metrics/test_manager.py (1)
- system_config (14-50)
src/lightspeed_evaluation/core/system/loader.py (1)
- load_system_config (79-108)
src/lightspeed_evaluation/core/system/validator.py (1)
- load_evaluation_data (89-141)
tests/unit/pipeline/evaluation/test_processor.py (1)
- mock_config_loader (22-28)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/api/client.py (1)
- query (71-105)
src/lightspeed_evaluation/core/models/system.py (1)
- APIConfig (117-156)
src/lightspeed_evaluation/core/system/exceptions.py (1)
- APIError (8-9)
🪛 GitHub Actions: Ruff
tests/unit/runner/test_evaluation.py
[error] 3-3: F401 'pytest' imported but unused. Use or remove the import. This can be fixed by removing the import or using it in the test.
tests/unit/pipeline/evaluation/test_processor.py
[error] 210-210: F841 Local variable 'results' is assigned to but never used. This can be fixed by removing the assignment or using the variable.
🔇 Additional comments (16)
tests/unit/runner/__init__.py (1)
1-1: LGTM!Minor docstring update to reflect plural modules - no concerns.
tests/unit/core/api/test_streaming_parser.py (5)
13-17: LGTM! Simple fixture correctly uses the `mocker` fixture per coding guidelines.
20-129: Comprehensive test coverage for streaming response parsing.Good edge case coverage including error events, missing fields, and malformed input. The tests properly verify both success and failure paths.
131-172: LGTM! Tests correctly verify the default values for a missing `event` (empty string) and missing `data` (empty dict), matching the implementation in streaming_parser.py.
175-220: LGTM! Good coverage of tool call parsing edge cases, including validation that both `tool_name` and `arguments` are required, while an empty arguments dict is acceptable.
223-253: LGTM!Tests verify the expected output structure for tool sequence formatting.
tests/unit/runner/test_evaluation.py (2)
239-302: LGTM!Good coverage of CLI entry point behavior, including default arguments, custom arguments, and error exit codes.
33-34: I was unable to access the repository directly due to a clone operation failure, so this assessment is based on the available information.
Verify the mock path by checking the import statement in src/lightspeed_evaluation/runner/evaluation.py.
From the retrieved learning, DataValidator is confirmed to be located in `lightspeed_evaluation.core.system`, so the mock path `"lightspeed_evaluation.core.system.DataValidator"` is structurally plausible. However, whether it actually intercepts the call depends on the import pattern used in `run_evaluation`:
- If the code uses `import lightspeed_evaluation.core.system` (or `from lightspeed_evaluation.core import system`) and calls `system.DataValidator(...)`, patching `"lightspeed_evaluation.core.system.DataValidator"` will intercept correctly.
- If the code uses `from lightspeed_evaluation.core.system import DataValidator`, the name is bound in the runner's own namespace, so the patch target may need to be `"lightspeed_evaluation.runner.evaluation.DataValidator"` instead.
Verify the actual import statement in src/lightspeed_evaluation/runner/evaluation.py to confirm the mock will intercept correctly: check whether DataValidator is imported at module level or accessed via the full module path.
tests/unit/pipeline/evaluation/test_processor.py (2)
21-65: LGTM! Well-structured fixtures with proper use of the `spec` parameter for type safety on mocks.
68-426: Comprehensive test coverage for ConversationProcessor.Good coverage of:
- Initialization and wiring
- API-enabled and API-disabled paths
- Script execution lifecycle (setup/cleanup)
- Error handling and cascade failures
- Both turn-level and conversation-level metrics
The tests align with the learned behavior that metric resolution happens upstream in `process_conversation()`.
tests/unit/core/api/test_client.py (2)
10-21: LGTM!Good fixture setup with clear naming. The streaming endpoint configuration aligns with the PR's focus on streaming behavior.
24-202: LGTM!Good test coverage for APIClient including:
- Both streaming and query endpoint types
- Cache initialization and key generation
- Header handling (Content-Type, Authorization)
- Client lifecycle (initialization, close)
- Request preparation with various parameters
Tests correctly use the `mocker` fixture per coding guidelines.
tests/unit/pipeline/evaluation/test_pipeline.py (4)
78-84: LGTM! Good test for the early validation guard. No mocking is needed since the `ValueError` is raised before any components are instantiated.
176-222: LGTM!The test correctly verifies the happy path for evaluation runs. The assertions cover the essential behavior (results returned with expected status).
343-385: LGTM! Good test coverage for the cleanup path. Properly mocks the `litellm.cache` and `asyncio.run` dependencies, and verifies that `api_client.close()` is called.
418-440: LGTM! Clean test for verifying that the `output_dir` override parameter works as expected.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/unit/core/api/test_client.py (1)
67-80: Consider more specific header verification.
The test verifies that `headers.update` was called but doesn't specifically validate the Authorization header content. While the current assertion confirms setup succeeds with an API key present, you could strengthen this test by inspecting the actual call arguments.
Optional enhancement:

```diff
     APIClient(basic_api_config)

-    # Verify headers were updated (should include Authorization header)
-    assert mock_client.headers.update.call_count >= 1
+    # Verify Authorization header was set
+    calls = mock_client.headers.update.call_args_list
+    assert any(
+        "Authorization" in str(call) and "Bearer test_secret_key" in str(call)
+        for call in calls
+    )
```

tests/unit/pipeline/evaluation/test_pipeline.py (1)
50-440: Consider a fixture for the repeated component mocking.
Many tests repeat the same 7-8 component patches (DataValidator, MetricManager, APIDataAmender, etc.). While the current approach is explicit and clear, you could reduce repetition with a fixture.
Optional enhancement:

```python
@pytest.fixture
def mock_pipeline_components(mocker):
    """Mock all pipeline components."""
    return {
        'DataValidator': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.DataValidator"),
        'MetricManager': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.MetricManager"),
        'APIDataAmender': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.APIDataAmender"),
        'EvaluationErrorHandler': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.EvaluationErrorHandler"),
        'ScriptExecutionManager': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.ScriptExecutionManager"),
        'MetricsEvaluator': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.MetricsEvaluator"),
        'ConversationProcessor': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.ConversationProcessor"),
    }


def test_initialization_success(self, mock_config_loader, mock_pipeline_components):
    """Test successful pipeline initialization."""
    pipeline = EvaluationPipeline(mock_config_loader)
    # ... assertions ...
```

However, the current explicit approach has its merits for test clarity and is perfectly acceptable.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tests/unit/core/api/test_client.py (1 hunks)
- tests/unit/core/api/test_streaming_parser.py (1 hunks)
- tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
- tests/unit/runner/__init__.py (1 hunks)
- tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unit/runner/test_evaluation.py
- tests/unit/pipeline/evaluation/test_processor.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use the naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
- tests/unit/core/api/test_client.py
- tests/unit/runner/__init__.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/pipeline/evaluation/test_pipeline.py
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
tests/unit/core/api/test_streaming_parser.py
🧬 Code graph analysis (3)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/api/client.py (3)
- APIClient (22-282)
- query (71-105)
- _prepare_request (107-122)
src/lightspeed_evaluation/core/models/system.py (1)
- APIConfig (117-156)
src/lightspeed_evaluation/core/system/exceptions.py (1)
- APIError (8-9)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
_parse_sse_line(63-72)
tests/unit/pipeline/evaluation/test_pipeline.py (2)
src/lightspeed_evaluation/core/models/system.py (1)
- SystemConfig (267-297)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)
- EvaluationPipeline (32-202)
- validate_data (114-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: mypy
- GitHub Check: tests (3.12)
🔇 Additional comments (17)
tests/unit/runner/__init__.py (1)
1-1: LGTM!The plural form accurately reflects that this package contains multiple test modules.
tests/unit/core/api/test_streaming_parser.py (5)
13-17: LGTM!Clean fixture for mocking streaming response objects.
20-129: Excellent comprehensive coverage of streaming response parsing!The test suite thoroughly validates:
- Complete streaming flows with conversation IDs and responses
- Tool call handling (single and multiple)
- Error conditions (missing final response, missing conversation ID, error events)
- Edge cases (empty lines, non-data lines)
This provides strong confidence in the streaming parser's behavior.
131-173: Solid validation of SSE line parsing behavior.
These tests directly exercise the private `_parse_sse_line` function. While testing private functions is sometimes avoided in favor of testing through public APIs, this level of granularity is appropriate here given the critical nature of the parsing logic and the specific contract (returning `Optional[tuple[str, dict]]`). The tests properly validate default values for missing fields, which is important for resilient parsing.
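For readers skimming the suite, the contract being pinned down looks roughly like this (a sketch only: the exact SSE payload shape and prefix handling are assumptions, not copied from the tests):

```python
from lightspeed_evaluation.core.api.streaming_parser import _parse_sse_line

# Assumed line shape: 'data: <json>' where the JSON may omit "event" or "data".
event, data = _parse_sse_line('data: {"data": {"token": "hi"}}')
assert event == ""   # missing "event" is assumed to fall back to an empty string

event, data = _parse_sse_line('data: {"event": "token"}')
assert data == {}    # missing "data" is assumed to fall back to an empty dict
```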
175-221: Thorough validation of tool call parsing logic. The tests properly validate:
- Required fields (both `tool_name` and `arguments` must be present)
- Valid edge case (an empty arguments dict is acceptable)
- Defensive behavior (returns None for invalid structures)
This ensures robust handling of tool call data from the streaming API.
223-253: LGTM! Clean validation of tool sequence formatting. The tests confirm the expected structure where each tool call is wrapped in its own sequence (list), resulting in the `list[list[dict]]` format.
tests/unit/core/api/test_client.py (3)
10-21: LGTM! The renamed fixture `basic_api_config` clearly indicates its purpose as a baseline streaming configuration. The defaults (cache disabled, streaming endpoint) are appropriate for isolated unit tests.
27-66: Solid initialization test coverage.The tests properly validate:
- Streaming endpoint configuration
- Cache creation when enabled
- Cache absence when disabled
- Valid endpoint type acceptance
The mocking strategy correctly isolates the client from external dependencies.
81-202: Excellent comprehensive coverage of APIClient behavior.The test suite thoroughly validates:
- Error handling (uninitialized client)
- Request preparation variants (basic, with conversation_id, with attachments)
- Resource cleanup (client closing)
- Cache key consistency (critical for cache correctness)
- Header initialization
- Both streaming and standard endpoint types
This provides strong confidence in the client's reliability across different configurations.
tests/unit/pipeline/evaluation/test_pipeline.py (8)
15-44: LGTM! Clean, well-structured fixtures that provide baseline test data:
- `mock_config_loader` sets up a realistic SystemConfig with the API disabled by default
- `sample_evaluation_data` provides minimal valid evaluation data for testing
50-84: Solid initialization validation.The tests properly validate:
- Successful pipeline initialization with all components
- Defensive behavior when SystemConfig is missing
The extensive mocking is appropriate for unit testing the pipeline orchestrator, which coordinates many components.
86-142: LGTM!The tests properly validate conditional API client creation:
- Client is initialized when API is enabled
- Client is None when API is disabled
This ensures the pipeline respects the configuration.
144-174: LGTM!The test properly validates that the pipeline delegates data validation to the DataValidator component and correctly propagates the result.
176-253: Comprehensive evaluation flow testing.The tests validate:
- Successful evaluation flow with proper result propagation
- Early failure when data validation fails (with appropriate error message)
This ensures the pipeline's control flow is correct for both success and failure scenarios.
255-341: Solid validation of data persistence behavior.The tests properly validate:
- Amended data is saved when API is enabled
- Save failures are handled gracefully (logged but don't fail the evaluation)
This ensures data persistence is reliable without making the pipeline brittle.
343-416: LGTM!The tests properly validate resource cleanup:
- API client is closed when present
- No errors occur when API client is absent
This ensures proper resource management in both configurations.
418-440: LGTM!The test validates that the output directory can be overridden at pipeline construction, confirming the configuration is flexible as intended.
Increased test cases coverage 56% -> 70%
Actionable comments posted: 3
🧹 Nitpick comments (10)
tests/unit/core/models/test_api_additional.py (4)
14-35: RAGChunk tests are solid; consider tightening the extra-field assertion.
These tests nicely cover creation, optional `score`, and the forbidden extra-field behavior. If you want stricter guarantees, you could optionally assert the validation error content (e.g., using `match="extra fields not permitted"` or similar) to ensure the failure reason doesn't regress, but the current checks are already functionally useful.
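A minimal sketch of that tighter assertion (the field names and the exact Pydantic error wording here are assumptions and may need adjusting to the real model):

```python
import pytest
from pydantic import ValidationError

from lightspeed_evaluation.core.models.api import RAGChunk


def test_rag_chunk_rejects_extra_fields_with_reason():
    # match= pins the failure reason, not just the exception type; the pattern
    # below is a guess at the Pydantic wording and may need tweaking.
    with pytest.raises(ValidationError, match="[Ee]xtra"):
        RAGChunk(content="chunk text", extra_field="nope")  # field names illustrative
```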
37-58: Good coverage of AttachmentData defaults and overrides; optional extra-field test.
The tests correctly verify both default values and a custom `(attachment_type, content_type)` combination. For parity with `RAGChunk`, you might optionally add a small test that passes an unexpected field to confirm that extra fields are also rejected here, but that is not strictly necessary for this PR's goals.
60-107: APIRequest tests cover the main paths; consider also exercising `create` with an invalid query.
These tests sensibly cover:
- Simple `create` usage,
- Full parameter population,
- Attachment string inputs being wrapped into attachment objects,
- Direct model validation on an empty `query`.
To fully exercise the public helper, you could optionally add a test that calls `APIRequest.create(query="")` and asserts validation failure as well, ensuring the factory path can't bypass the `min_length` constraint. Current tests are still valid and helpful as-is.
109-166: APIResponse tests are comprehensive; a couple of minor edge cases could be added later.
The suite nicely checks:
- Basic construction with contexts,
- Default empty contexts,
- Tool call wiring,
- The `from_raw_response` happy path (including rag chunk extraction),
- Required `conversation_id` enforcement.
If you decide to extend coverage further, you could add:
- A `from_raw_response` case with no `rag_chunks` to assert `contexts` stays empty.
- (Optionally) a case where `response` contains leading/trailing whitespace to lock in the `.strip()` behavior.
Not required for this PR, but it would future-proof behavior around those edges.
tests/unit/core/output/test_generator.py (3)
11-37: Consider consolidating duplicate fixtures across test files.
The `sample_results` fixture is defined identically (or nearly so) in both test_generator.py and test_statistics.py. Consider moving shared fixtures to a conftest.py file at tests/unit/core/output/conftest.py to promote reusability and maintain consistency across test modules.
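A rough sketch of what that shared conftest could look like (the payload below is a placeholder; in practice you would move the body of the existing `sample_results` fixture here verbatim):

```python
# tests/unit/core/output/conftest.py
import pytest


@pytest.fixture
def sample_results():
    """Shared evaluation results for generator and statistics tests."""
    # Placeholder entries so this sketch stands alone; the real fixture body
    # from test_generator.py / test_statistics.py would go here instead.
    return [
        {"result": "PASS", "metric": "metric_a", "conversation": "conv_1"},
        {"result": "FAIL", "metric": "metric_a", "conversation": "conv_1"},
        {"result": "ERROR", "metric": "metric_b", "conversation": "conv_2"},
    ]
```

Both test modules would then pick the fixture up automatically without importing it.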
102-118: Consider strengthening the assertions.
Lines 117-118 use OR conditions that make the test lenient. The test could pass even if the JSON structure doesn't match expectations. Consider asserting the exact expected structure based on the actual implementation of `_generate_json_summary`. For example:

```diff
-    assert "summary_stats" in data or "results" in data
-    assert data.get("total_evaluations") == 2 or len(data.get("results", [])) == 2
+    # Assert the actual expected structure
+    assert "summary_stats" in data
+    assert "results" in data
+    assert data["total_evaluations"] == 2
```
120-132: Consider strengthening the assertion.
Line 132 uses an OR condition that makes the test lenient. Consider asserting the specific expected content based on the actual implementation of `_generate_text_summary`.
tests/unit/core/system/test_setup.py (1)
66-96: Clarify the "KeyError" vs "TypeError" scenario and consider slightly stronger assertions.
`config_data = {"environment": None}` and `{"environment": "invalid_type"}` both currently trigger a `TypeError` during the `**config_data.get("environment", {})` merge, so the "handles_key_error" test name/docstring are a bit misleading and the KeyError path is not actually exercised. You could either (a) rename the first test to reflect a generic invalid environment config, or (b) introduce a small custom mapping whose `.get()` raises `KeyError` if you really want coverage for that branch. Also, for the fallback-case tests, asserting all expected default keys (not just one) would make the behavior under failure more thoroughly specified.
tests/unit/core/api/test_client.py (2)
67-80: Optionally assert the actual API key header rather than just "some" header update.
`test_setup_client_with_api_key` currently only checks that `mock_client.headers.update` was called at least once, which would still pass if unrelated headers were updated. If feasible, you could strengthen this by asserting that one of the calls contains the expected API-key-bearing header, for example by inspecting `mock_client.headers.update.call_args_list` for the configured key/value.
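Roughly, the tightened test body could look like this (the `httpx.Client` patch target mirrors what the suite already uses; the exact header format is an assumption):

```python
from lightspeed_evaluation.core.api.client import APIClient


def test_setup_client_with_api_key(basic_api_config, mocker):
    """Sketch: assert the configured key actually lands in a header update."""
    mock_client = mocker.patch("httpx.Client").return_value

    APIClient(basic_api_config)

    # Inspect the recorded header updates instead of only counting calls.
    calls = mock_client.headers.update.call_args_list
    assert any(
        "Authorization" in str(call) and "test_secret_key" in str(call)
        for call in calls
    )
```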
168-185: Tighten the Content-Type assertion to require both the header name and value.
The `test_client_initialization_sets_content_type_header` assertion uses `or`, so it passes if either "Content-Type" or "application/json" appears in a call, which could give a false positive (e.g., an unrelated `Accept: application/json` header). You can make this more precise by requiring both:

```diff
-    calls = mock_client.headers.update.call_args_list
-    assert any(
-        "Content-Type" in str(call) or "application/json" in str(call)
-        for call in calls
-    )
+    calls = mock_client.headers.update.call_args_list
+    assert any(
+        "Content-Type" in str(call) and "application/json" in str(call)
+        for call in calls
+    )
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- tests/unit/core/api/test_client.py (1 hunks)
- tests/unit/core/api/test_streaming_parser.py (1 hunks)
- tests/unit/core/llm/test_deepeval_manager.py (1 hunks)
- tests/unit/core/llm/test_llm_manager.py (1 hunks)
- tests/unit/core/models/test_api_additional.py (1 hunks)
- tests/unit/core/models/test_data_additional.py (1 hunks)
- tests/unit/core/models/test_system_additional.py (1 hunks)
- tests/unit/core/output/test_final_coverage.py (1 hunks)
- tests/unit/core/output/test_generator.py (1 hunks)
- tests/unit/core/output/test_statistics.py (1 hunks)
- tests/unit/core/script/test_manager_additional.py (1 hunks)
- tests/unit/core/system/test_env_validator.py (1 hunks)
- tests/unit/core/system/test_lazy_import.py (1 hunks)
- tests/unit/core/system/test_setup.py (1 hunks)
- tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
- tests/unit/runner/__init__.py (1 hunks)
- tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
- tests/unit/pipeline/evaluation/test_pipeline.py
- tests/unit/runner/test_evaluation.py
- tests/unit/core/models/test_data_additional.py
- tests/unit/core/llm/test_deepeval_manager.py
- tests/unit/core/output/test_final_coverage.py
- tests/unit/core/system/test_lazy_import.py
- tests/unit/core/llm/test_llm_manager.py
- tests/unit/core/system/test_env_validator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/runner/__init__.py
- tests/unit/pipeline/evaluation/test_processor.py
- tests/unit/core/models/test_system_additional.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use the naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
- tests/unit/core/system/test_setup.py
- tests/unit/core/api/test_client.py
- tests/unit/core/output/test_generator.py
- tests/unit/core/output/test_statistics.py
- tests/unit/core/script/test_manager_additional.py
- tests/unit/core/models/test_api_additional.py
🧬 Code graph analysis (4)
tests/unit/core/system/test_setup.py (2)
src/lightspeed_evaluation/core/models/system.py (1)
- LoggingConfig (204-223)
src/lightspeed_evaluation/core/system/setup.py (2)
- setup_environment_variables (10-28)
- setup_logging (31-93)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/output/statistics.py (3)
- bootstrap_intervals (12-34)
- calculate_basic_stats (37-63)
- calculate_detailed_stats (66-86)
tests/unit/core/output/test_generator.py (1)
sample_results(12-37)
tests/unit/core/script/test_manager_additional.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
ScriptExecutionManager(14-117)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (6)
- RAGChunk (8-17)
- AttachmentData (20-27)
- APIRequest (30-77)
- APIResponse (80-116)
- create (50-77)
- from_raw_response (95-116)
🔇 Additional comments (35)
tests/unit/core/script/test_manager_additional.py (1)
14-85: LGTM! Error handling and Path object tests are well-structured.The timeout, subprocess error, unexpected error, Path object, and file validation tests correctly use pytest-mock for mocking and appropriately verify exception raising and basic functionality.
tests/unit/core/output/test_generator.py (13)
40-53: LGTM! The fixture correctly uses the `mocker` fixture from pytest-mock as per coding guidelines.
59-65: LGTM! The test correctly validates initialization and uses pytest's `tmp_path` fixture appropriately.
67-75: LGTM! The test correctly validates statistics calculation with the expected counts from the fixture.
77-83: LGTM! The test correctly validates the edge case of empty results.
85-100: LGTM! The test correctly validates CSV generation, including file creation and content verification.
134-138: LGTM! The test correctly validates the `get_output_directory` method.
140-163: LGTM! The test correctly mocks datetime using pytest-mock and validates file creation for all output formats.
165-183: LGTM! The test correctly validates the edge case of generating reports with empty results.
185-199: LGTM! The test correctly validates selective CSV-only report generation.
201-214: LGTM! The test correctly validates selective JSON-only report generation.
216-229: LGTM! The test correctly validates selective TXT-only report generation.
231-253: LGTM! The test correctly validates CSV generation with all available columns configured.
255-269: LGTM! The test correctly validates default report generation behavior when system config is not provided.
tests/unit/core/output/test_statistics.py (17)
14-54: LGTM! The fixture is well-designed with diverse scenarios covering multiple conversations, metrics, and result types (PASS, FAIL, ERROR).
60-70: LGTM! The test correctly validates basic bootstrap interval calculation. Using `bootstrap_steps=100` (vs the production default of 10000) is acceptable for unit test performance.
72-77: LGTM! The test correctly validates error handling for invalid confidence values.
79-84: LGTM! The test correctly validates error handling for negative confidence values.
86-94: LGTM! The test correctly validates custom confidence level usage.
96-102: LGTM! The test correctly validates custom bootstrap steps usage.
108-118: LGTM! The test correctly validates basic statistics calculation with the expected counts and rates from the fixture.
120-130: LGTM! The test correctly validates the edge case of empty results with all counts and rates at zero.
132-151: LGTM! The test correctly validates the all-pass scenario using programmatic test data generation.
153-172: LGTM! The test correctly validates the all-fail scenario using programmatic test data generation.
174-192: LGTM! The test correctly validates the all-error scenario using programmatic test data generation.
198-207: LGTM! The test correctly validates the structure of detailed statistics, including both metric and conversation breakdowns.
209-214: LGTM! The test correctly validates the edge case of empty results.
216-226: LGTM! The test correctly validates metric-level breakdown with expected counts.
228-238: LGTM! The test correctly validates conversation-level breakdown with expected counts.
240-248: LGTM! The test correctly validates that percentage rates are calculated and included in detailed statistics.
250-266: LGTM! The test correctly validates the single metric scenario.
tests/unit/core/system/test_setup.py (2)
15-65: Environment defaults/custom/override tests match the implementation well.
The three tests for default envs, custom additions, and overriding `LITELLM_LOG` closely mirror `setup_environment_variables` behavior and use `mocker.patch.dict` correctly to isolate `os.environ`. No issues here.
101-236: Logging configuration tests are comprehensive and aligned with `setup_logging`.
The logging tests cover basic/DEBUG/ERROR/WARNING source levels, package-level overrides, invalid override handling via a printed warning, and default noisy-package configuration, all consistent with the provided `LoggingConfig` and `setup_logging` implementation. Using `logging.getLogger(...).level` assertions is appropriate given that `logging.basicConfig(..., force=True)` resets state each time.
tests/unit/core/api/test_client.py (2)
10-58: Streaming initialization and cache tests are structured well.
The `basic_api_config` fixture plus the streaming and cache-enabled initialization tests give good coverage of `APIClient` construction, including endpoint type, timeout, and cache wiring via the `Cache` patch. Use of `mocker.patch` for `httpx.Client` and `Cache` avoids real I/O and keeps tests fast.
81-167: Request preparation, cache key, lifecycle, and standard-endpoint tests look solid.
The tests around `_prepare_request` (including conversation IDs and attachments), cache-key generation consistency, enforcing that `query` fails when the client is uninitialized, verifying that `close()` delegates to the underlying HTTP client, and confirming non-streaming (`endpoint_type="query"`) initialization all exercise meaningful behavior without over-specifying implementation details. Accessing private helpers here is reasonable for thorough unit coverage.
Also applies to: 187-202
VladimirKadlec
left a comment
Thanks for working on this!
Just be careful not to remove the current tests. Many tests were removed for some reason, and the new tests are somewhat trivial compared with the current ones.
I checked just my test_processor.py, but many other files are affected.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/unit/pipeline/evaluation/test_processor.py (1)
436-525: Consider consolidating duplicate fixtures.
The `_pr`-suffixed fixtures (lines 436-525) largely duplicate the earlier fixtures (lines 27-55). While `config_loader_pr` has more detailed configuration, the other fixtures are identical. Consider:
- Using parametrization if different configurations are needed
- Reusing the original fixtures where possible
- Clearly documenting why separate fixtures are necessary if they must remain
Apply this pattern to consolidate:
```diff
-@pytest.fixture
-def mock_metrics_evaluator_pr(mocker):
-    """Create a mock metrics evaluator."""
-    evaluator = mocker.Mock(spec=MetricsEvaluator)
-    ...
-    return evaluator
-
-# Similar for other _pr fixtures
+# Reuse the existing processor_components fixture and only create config_loader_pr if truly needed
```

tests/unit/core/output/test_generator.py (1)
394-410: Simplify `test_filename_timestamp_format` by dropping the unused datetime patch.
This test calls `_generate_csv_report` with a preformatted "test_20240101_120000" base filename, so the patched `datetime` object is never used. You can simplify the test by removing the datetime patch block and its associated comment without losing coverage, since other tests already verify timestamp wiring in `generate_reports`. For example:

```diff
 def test_filename_timestamp_format(self, tmp_path, mocker):
     """Test that generated filenames include proper timestamps."""
     results = []
-    mocker.patch("builtins.print")
-
-    handler = OutputHandler(output_dir=str(tmp_path), base_filename="test")
-
-    # Mock datetime to get predictable timestamps
-    mock_datetime = mocker.patch(
-        "lightspeed_evaluation.core.output.generator.datetime"
-    )
-    mock_datetime.now.return_value.strftime.return_value = "20240101_120000"
-
-    csv_file = handler._generate_csv_report(results, "test_20240101_120000")
+    mocker.patch("builtins.print")
+    handler = OutputHandler(output_dir=str(tmp_path), base_filename="test")
+    csv_file = handler._generate_csv_report(results, "test_20240101_120000")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- tests/unit/core/api/test_client.py (2 hunks)
- tests/unit/core/output/test_generator.py (1 hunks)
- tests/unit/core/output/test_statistics.py (5 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (13 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use the naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
- tests/unit/core/output/test_generator.py
- tests/unit/core/output/test_statistics.py
- tests/unit/pipeline/evaluation/test_processor.py
- tests/unit/core/api/test_client.py
🧠 Learnings (7)
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to tests/**/*.py : Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
🧬 Code graph analysis (3)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (8)
- OutputHandler (23-295)
- _calculate_stats (70-79)
- _generate_csv_report (133-169)
- _generate_json_summary (171-206)
- _generate_text_summary (208-291)
- get_output_directory (293-295)
- generate_reports (40-68)
- _generate_individual_reports (81-103)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/output/statistics.py (3)
- bootstrap_intervals (12-34)
- calculate_basic_stats (37-63)
- calculate_detailed_stats (66-86)
tests/unit/core/output/test_generator.py (1)
sample_results(12-37)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
- APIConfig (117-156)
src/lightspeed_evaluation/core/api/client.py (3)
- APIClient (22-282)
- query (71-105)
- _prepare_request (107-122)
src/lightspeed_evaluation/core/system/exceptions.py (1)
APIError(8-9)
🪛 GitHub Actions: Ruff
tests/unit/core/output/test_generator.py
[error] 293-293: ruff check failed: F841 Local variable 'mock_print' is assigned to but never used. Remove assignment to unused variable 'mock_print'.
[error] 309-309: ruff check failed: F841 Local variable 'mock_print' is assigned to but never used. Remove assignment to unused variable 'mock_print'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: mypy
🔇 Additional comments (4)
tests/unit/pipeline/evaluation/test_processor.py (1)
1-1: Address reviewer feedback about removed tests.
Based on VladimirKadlec's comment, the original tests should not have been removed. The current structure with two test classes (`TestConversationProcessor` and `TestConversationProcessorEvaluateTurn`) may be an attempt to preserve both old and new tests, but please confirm with the reviewer that this addresses their concern about the removed tests.
tests/unit/core/api/test_client.py (1)
322-500: LGTM! Well-structured streaming configuration tests.
The new `TestAPIClientConfiguration` class provides comprehensive coverage for streaming endpoint initialization, caching, API key handling, request preparation, and client lifecycle management. The tests correctly use pytest-mock and follow the project's testing conventions.
tests/unit/core/output/test_generator.py (1)
11-270: OutputHandler test coverage and pytest-mock usage look solid.
These tests exercise initialization, stats calculation, CSV/JSON/TXT generation, enabled_outputs behavior, csv_columns configuration, and timestamped filenames, both with and without a system_config. Using the `mocker` fixture everywhere (no `unittest.mock`) and relying on `tmp_path` for file IO all align with the repo's testing guidelines. I don't see functional issues here.
tests/unit/core/output/test_statistics.py (1)
14-537: Statistics tests comprehensively cover bootstrap and PASS/FAIL/ERROR breakdowns.
The new suites around `bootstrap_intervals`, `calculate_basic_stats`, and `calculate_detailed_stats` look consistent with the implementation: you validate happy paths, invalid confidence handling, empty/all-pass/all-fail/all-error cases, mixed distributions, metric and conversation breakdowns, and `score_statistics` (including no-score scenarios). The `sample_results` fixture and explicit "PASS", "FAIL", and "ERROR" usages align with the established semantics for evaluation outcomes. Tests use pytest fixtures only (no `unittest.mock`), matching the repository's testing guidelines.
Based on learnings, the distinction between FAIL and ERROR in the assertions matches the framework's intended behavior.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/unit/core/output/test_generator.py (1)
288-314: Fix unused `mock_print` variables to satisfy Ruff F841.
Lines 293 and 309 assign `mock_print` but never use it. Since these tests only need to suppress output, patch `builtins.print` without binding the result:

```diff
 def test_output_handler_initialization_custom(self, tmp_path, mocker):
     """Test OutputHandler initialization with custom parameters."""
     system_config = mocker.Mock()
     system_config.llm.provider = "openai"
-    mock_print = mocker.patch("builtins.print")
+    mocker.patch("builtins.print")

     handler = OutputHandler(
         output_dir=str(tmp_path),
         base_filename="custom_eval",
         system_config=system_config,
     )
```

```diff
 def test_output_handler_creates_directory(self, tmp_path, mocker):
     """Test that OutputHandler creates output directory if it doesn't exist."""
     output_path = tmp_path / "new_output_dir"
-    mock_print = mocker.patch("builtins.print")
+    mocker.patch("builtins.print")

     handler = OutputHandler(output_dir=str(output_path))
```
🧹 Nitpick comments (11)
tests/unit/core/system/test_setup.py (3)
66-96: Consider tightening how warnings are asserted in the env var error tests.
The edge-case coverage for bad `config_data["environment"]` values is good, but the assertion:

```python
captured = capsys.readouterr()
assert "Warning" in captured.out or "fallback" in captured.out
```

is fairly loose and tied to both:
- The exact output stream (stdout vs stderr/logging), and
- Specific wording in the message.
If the implementation uses `logging.warning` or changes the message text slightly, these tests may fail even though behavior is correct. Consider either:
- Asserting only on side-effects (i.e., that defaults are set) and dropping the output assertion, or
- If the implementation uses logging, switching to `caplog` and asserting on a structured warning record instead of free-form text (see the sketch below).
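A minimal sketch of the `caplog` variant, assuming `setup_environment_variables` accepts the raw `config_data` mapping as the existing tests suggest and that it logs (rather than prints) its warning:

```python
import logging

from lightspeed_evaluation.core.system.setup import setup_environment_variables


def test_invalid_environment_config_warns(caplog, mocker):
    """Sketch: assert on a structured warning record instead of captured stdout."""
    mocker.patch.dict("os.environ", {}, clear=True)

    with caplog.at_level(logging.WARNING):
        setup_environment_variables({"environment": None})

    # Asserting on the record level is less brittle than matching message text.
    assert any(record.levelno == logging.WARNING for record in caplog.records)
```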
101-165: Logging setup tests are comprehensive; only minor potential brittleness around `.level`. The tests cover:
- Basic logger creation and name (
"lightspeed_evaluation"),- Source/package level interaction,
- Package overrides for
httpxandurllib3,- Default noisy package behavior for
matplotlib.This is solid coverage. One minor consideration: asserting
logger.level == logging.Xandpkg_logger.level == logging.Xassumes those loggers are always explicitly set to that level, not left asNOTSETwith behavior controlled via propagation. Ifsetup_loggingever changes to rely more on root configuration,getEffectiveLevel()could be a more future-proof assertion.Not a required change, just something to keep in mind.
183-236: Good coverage of logging variants and default packages; verify the default package list stays in sync. These tests nicely exercise:
- Different source/package levels (ERROR, WARNING),
- Custom format handling (ensuring no crash),
- Behavior when an invalid override level is provided,
- That a set of default noisy packages (
"httpx","requests","LiteLLM","ragas") are all configured.One small maintainability point: the hard-coded
packageslist intest_setup_logging_applies_to_all_default_packagesneeds to stay in sync with whateversetup_loggingdefines as its default noisy packages. If that list changes in the implementation, the test will need updating. A brief comment in the test pointing back to the definition insetup_logging(or centralizing the list in a shared constant) could help future readers understand the coupling.Overall, these tests look good.
tests/unit/core/llm/test_deepeval_manager.py (1)
22-32: Consider verifying LiteLLMModel call arguments.
The test verifies that `LiteLLMModel` is called once but doesn't verify the arguments passed. To ensure all parameters are correctly propagated from `llm_params` to the LiteLLM constructor, consider adding assertions for the call arguments.
Apply this diff to enhance the test:

```diff
 def test_initialization(self, llm_params, mocker):
     """Test manager initialization."""
     mock_model = mocker.patch(
         "lightspeed_evaluation.core.llm.deepeval.LiteLLMModel"
     )

     manager = DeepEvalLLMManager("gpt-4", llm_params)

     assert manager.model_name == "gpt-4"
     assert manager.llm_params == llm_params
-    mock_model.assert_called_once()
+    mock_model.assert_called_once_with(
+        model="gpt-4",
+        temperature=0.5,
+        max_tokens=1024,
+        timeout=120,
+        num_retries=5,
+    )
```

tests/unit/core/script/test_manager_additional.py (1)
14-58: Consider patching subprocess.run where it's used, not where it's defined.
The tests patch `"subprocess.run"` directly (lines 22, 37, 52). Following Python mocking best practices, you should patch where the object is looked up (in the manager module) rather than where it's defined. This makes the tests more explicit and robust.
Apply this pattern to lines 22, 37, 52:
```diff
-    mock_run = mocker.patch("subprocess.run")
+    mock_run = mocker.patch("lightspeed_evaluation.core.script.manager.subprocess.run")
```
60-74: Consider patching subprocess.run where it's used.
Same recommendation as above: patch `"lightspeed_evaluation.core.script.manager.subprocess.run"` instead of `"subprocess.run"` on line 69 for best practices.
87-109: Consider moving the logging import to module level.
The `import logging` statement on line 89 is inside the test function. While functional, it's more idiomatic to place imports at the module level. Additionally, line 103 has the same subprocess.run patching consideration mentioned in earlier comments.
Apply this pattern:
```diff
+import logging
+
 import subprocess

 import pytest
```

Then remove the inline import on line 89.
111-133: Consider moving the logging import to module level.
Same as above: the `import logging` on line 113 should be at module level. Also, line 127 has the patching consideration.
134-155: Consider moving the logging import to module level.
Same as above: the `import logging` on line 136 should be at module level. Also, line 150 has the patching consideration.
tests/unit/runner/test_evaluation.py (1)
237-300: CLI `main` tests are robust; just keep patch targets in sync with imports.
The `main` tests nicely validate default/custom argument parsing and non-zero exit on failure by patching `run_evaluation`. One thing to keep an eye on: across this file you patch objects like `ConfigLoader`, `DataValidator`, and `EvaluationPipeline` at specific module paths. If the import style in `lightspeed_evaluation.runner.evaluation` changes (e.g., from package-level imports to submodule imports), these patch targets may need updating to continue intercepting the correct call-sites.
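As a rule of thumb, the patch target should point at the namespace where the runner looks the name up. A pattern-only sketch (module paths are illustrative, not verified against the runner's actual imports):

```python
def test_patch_target_tracks_import_style(mocker):
    """Pattern only: pick the patch target based on how the runner imports the name."""
    # If runner/evaluation.py does `from lightspeed_evaluation.core.system import DataValidator`,
    # the name lives in the runner's own namespace, so patch it there:
    mocker.patch("lightspeed_evaluation.runner.evaluation.DataValidator")

    # If it instead imports the module and calls `system.DataValidator(...)`,
    # patching the defining module is what intercepts the call:
    mocker.patch("lightspeed_evaluation.core.system.DataValidator")
```

tests/unit/pipeline/evaluation/test_processor.py (1)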
842-867: Note about `TurnData._invalid_metrics` being shared across instances.
The `is_metric_invalid` / `add_invalid_metric` tests validate behavior on a single `TurnData` instance and look correct. One thing to be aware of in the underlying model: `TurnData` currently defines `_invalid_metrics: set[str] = set()` as a class-level attribute, so the invalid-metric set is effectively shared across all `TurnData` instances, not per-turn. If you ever start constructing multiple `TurnData` objects in the same process and rely on invalid metrics being isolated per turn, this could cause surprising cross-contamination and might be worth revisiting in a follow-up change.
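A minimal illustration of the pitfall, using a plain stand-in class rather than the real `TurnData` model:

```python
class TurnDataStandIn:
    # Class-level mutable default: every instance shares this one set, mirroring
    # `_invalid_metrics: set[str] = set()` on TurnData.
    _invalid_metrics: set[str] = set()

    def add_invalid_metric(self, metric: str) -> None:
        self._invalid_metrics.add(metric)

    def is_metric_invalid(self, metric: str) -> bool:
        return metric in self._invalid_metrics


turn_a = TurnDataStandIn()
turn_b = TurnDataStandIn()
turn_a.add_invalid_metric("answer_relevancy")

# turn_b never marked anything invalid, yet it sees turn_a's entry:
assert turn_b.is_metric_invalid("answer_relevancy")
```

If per-turn isolation ever matters, giving each instance its own set (e.g., a `default_factory=set` private attribute, if `TurnData` is a Pydantic model) would avoid the shared state.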
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- tests/unit/core/api/test_client.py (2 hunks)
- tests/unit/core/api/test_streaming_parser.py (1 hunks)
- tests/unit/core/llm/test_deepeval_manager.py (1 hunks)
- tests/unit/core/llm/test_llm_manager.py (1 hunks)
- tests/unit/core/models/test_api_additional.py (1 hunks)
- tests/unit/core/models/test_data_additional.py (1 hunks)
- tests/unit/core/models/test_system_additional.py (1 hunks)
- tests/unit/core/output/test_final_coverage.py (1 hunks)
- tests/unit/core/output/test_generator.py (1 hunks)
- tests/unit/core/output/test_statistics.py (5 hunks)
- tests/unit/core/script/test_manager_additional.py (1 hunks)
- tests/unit/core/system/test_env_validator.py (1 hunks)
- tests/unit/core/system/test_lazy_import.py (1 hunks)
- tests/unit/core/system/test_setup.py (1 hunks)
- tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (7 hunks)
- tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/unit/core/system/test_env_validator.py
- tests/unit/core/models/test_api_additional.py
- tests/unit/core/models/test_data_additional.py
- tests/unit/core/llm/test_llm_manager.py
- tests/unit/core/models/test_system_additional.py
- tests/unit/core/output/test_final_coverage.py
- tests/unit/core/system/test_lazy_import.py
- tests/unit/pipeline/evaluation/test_pipeline.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use the naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
- tests/unit/pipeline/evaluation/test_processor.py
- tests/unit/core/output/test_statistics.py
- tests/unit/core/api/test_client.py
- tests/unit/core/llm/test_deepeval_manager.py
- tests/unit/core/output/test_generator.py
- tests/unit/runner/test_evaluation.py
- tests/unit/core/script/test_manager_additional.py
- tests/unit/core/system/test_setup.py
- tests/unit/core/api/test_streaming_parser.py
🧠 Learnings (10)
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to tests/**/*.py : Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
tests/unit/core/output/test_generator.py
tests/unit/runner/test_evaluation.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework
Applied to files:
tests/unit/runner/test_evaluation.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead
Applied to files:
tests/unit/runner/test_evaluation.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
tests/unit/core/api/test_streaming_parser.py
🧬 Code graph analysis (7)
tests/unit/pipeline/evaluation/test_processor.py (15)
src/lightspeed_evaluation/core/metrics/manager.py (4)
MetricManager(17-166)resolve_metrics(24-48)MetricLevel(10-14)count_metrics_for_conversation(145-166)src/lightspeed_evaluation/core/models/data.py (4)
EvaluationData(309-367)EvaluationRequest(426-479)EvaluationResult(370-409)TurnData(35-306)src/lightspeed_evaluation/core/models/system.py (1)
SystemConfig(271-301)src/lightspeed_evaluation/core/script/manager.py (1)
ScriptExecutionManager(14-117)src/lightspeed_evaluation/core/system/loader.py (1)
ConfigLoader(70-125)src/lightspeed_evaluation/pipeline/evaluation/amender.py (2)
APIDataAmender(13-80)amend_single_turn(20-69)src/lightspeed_evaluation/pipeline/evaluation/errors.py (4)
EvaluationErrorHandler(10-201)mark_all_metrics_as_error(17-77)mark_turn_metrics_as_error(79-125)mark_cascade_failure(127-193)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
MetricsEvaluator(25-172)src/lightspeed_evaluation/pipeline/evaluation/processor.py (8)
ConversationProcessor(37-299)ProcessorComponents(27-34)process_conversation(46-179)_evaluate_turn(181-214)_evaluate_conversation(216-243)_run_setup_script(245-267)_run_cleanup_script(269-295)get_metrics_summary(297-299)tests/unit/pipeline/evaluation/test_pipeline.py (1)
mock_config_loader(16-27)tests/unit/core/metrics/test_manager.py (1)
system_config(14-50)src/lightspeed_evaluation/core/api/client.py (1)
query(71-105)tests/unit/core/llm/test_deepeval_manager.py (1)
test_initialization(22-32)tests/unit/core/output/test_generator.py (1)
test_initialization(59-65)src/lightspeed_evaluation/core/llm/custom.py (1)
call(18-75)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/output/statistics.py (3)
bootstrap_intervals(12-34)calculate_basic_stats(37-63)calculate_detailed_stats(66-86)tests/unit/core/output/test_generator.py (1)
sample_results(12-37)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
APIConfig(117-156)src/lightspeed_evaluation/core/api/client.py (3)
APIClient(22-282)query(71-105)_prepare_request(107-122)src/lightspeed_evaluation/core/system/exceptions.py (1)
APIError(8-9)
tests/unit/core/llm/test_deepeval_manager.py (1)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
DeepEvalLLMManager(8-42)
tests/unit/runner/test_evaluation.py (5)
tests/unit/pipeline/evaluation/test_pipeline.py (2)
test_run_evaluation_success(176-222)mock_config_loader(16-27)tests/unit/core/metrics/test_manager.py (1)
system_config(14-50)src/lightspeed_evaluation/core/system/loader.py (1)
load_system_config(79-108)src/lightspeed_evaluation/core/system/validator.py (1)
load_evaluation_data(92-144)tests/unit/pipeline/evaluation/test_processor.py (1)
mock_config_loader(28-34)
tests/unit/core/script/test_manager_additional.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
ScriptExecutionManager(14-117)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
_parse_sse_line(63-72)
🪛 GitHub Actions: Ruff
tests/unit/pipeline/evaluation/test_processor.py
[error] 553-553: F811 Redefinition of unused processor_components from line 38. Remove definition: processor_components.
tests/unit/core/output/test_generator.py
[error] 293-293: F841 Local variable mock_print is assigned to but never used. Remove assignment to unused variable mock_print.
[error] 309-309: F841 Local variable mock_print is assigned to but never used. Remove assignment to unused variable mock_print.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: mypy
- GitHub Check: Pylinter
🔇 Additional comments (25)
tests/unit/core/system/test_setup.py (1)
15-65: Env var setup tests are clear and align with pytest/pytest-mock guidelines
- Good use of `mocker.patch.dict(os.environ, {}, clear=True)` to fully isolate environment per test.
- Scenarios for defaults, custom vars, and overrides are well-separated and readable.
- Naming and structure follow the requested `Test*` / `test_*` conventions.

No changes needed here.
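For reference, a minimal sketch of the isolation pattern the bullets describe; the environment variable name is invented for illustration:

```python
import os


def test_reads_custom_endpoint(mocker):
    # Replace the whole environment for this test only; clear=True drops
    # anything inherited from the developer's shell or the CI runner.
    mocker.patch.dict(os.environ, {"API_ENDPOINT": "http://localhost:8080"}, clear=True)

    assert os.environ["API_ENDPOINT"] == "http://localhost:8080"
    assert "HOME" not in os.environ  # inherited variables are gone too
```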
tests/unit/core/output/test_statistics.py (4)
57-102: LGTM! The bootstrap interval tests appropriately cover:
- Basic functionality with ordering validation
- Invalid confidence values (both > 100 and negative)
- Custom configuration parameters
Using reduced bootstrap steps (50-100) for faster test execution is a reasonable trade-off for unit tests.
105-273: LGTM! Comprehensive test coverage for `calculate_basic_stats`:
- Empty results handling
- Homogeneous results (all PASS, all FAIL, all ERROR)
- Mixed results with rate calculations
- Single result edge case
The assertions correctly validate both counts and percentage rates.
276-537: LGTM! Thorough test coverage for `calculate_detailed_stats`:
- Metric and conversation breakdowns
- Rate calculations at the detailed level
- Score statistics (mean, median, std, min, max)
- Edge cases including no-score ERROR results
Good use of `pytest.approx` for floating-point comparisons at line 509.
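A tiny sketch of the `pytest.approx` pattern; the stats dict here is a hypothetical stand-in, not the module's real output shape:

```python
import pytest


def test_pass_rate_uses_approx():
    stats = {"pass_rate": 2 / 3 * 100}  # hypothetical stats dict for illustration

    # approx avoids brittle exact-equality checks on floating-point percentages
    assert stats["pass_rate"] == pytest.approx(66.6667, rel=1e-4)
```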
14-54: Well-structured fixture with diverse test data. The `sample_results` fixture provides good variety:
- Multiple conversations (conv1, conv2)
- Multiple metrics (metric1, metric2)
- All result types (PASS, FAIL, ERROR)
- Both valid scores and None for ERROR case
tests/unit/core/output/test_generator.py (3)
11-53: LGTM! Well-designed fixtures:

- `sample_results` provides a minimal but representative test dataset with PASS/FAIL cases
- `mock_system_config` properly uses the `mocker` fixture from pytest-mock, as per coding guidelines
56-269: LGTM! Comprehensive test coverage for `OutputHandler`:
- Initialization with various parameters
- Statistics calculation (with results and empty)
- Report generation for all formats (CSV, JSON, TXT)
- Proper datetime mocking for deterministic timestamps
- Default behavior when system config is absent
316-410: LGTM! Good additional test coverage:
- CSV generation with specific result data and content verification
- Column configuration respects system config
- Timestamp formatting in filenames
The local `import csv as csv_module` (lines 350, 386) works fine within test functions.

tests/unit/core/llm/test_deepeval_manager.py (7)
1-6: LGTM! The imports and module docstring are clean and follow pytest conventions.

8-16: LGTM! The fixture provides comprehensive test parameters covering all configurable LLM options.

34-45: LGTM! The test correctly verifies that the default temperature value (0.0) is applied when not provided in the parameters.

47-58: LGTM! The test correctly verifies that the default num_retries value (3) is applied when not provided in the parameters.

60-71: LGTM! The test correctly verifies that `get_llm()` returns the `LiteLLMModel` instance created during initialization.

73-84: LGTM! The test thoroughly verifies that `get_model_info()` returns all expected configuration parameters with correct values.

86-94: LGTM! The test correctly verifies the initialization message is printed to stdout using pytest's `capsys` fixture.
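A short sketch of the `capsys` pattern; the printed text is a stand-in, the real test constructs `DeepEvalLLMManager` and checks its actual startup message:

```python
def test_manager_prints_init_message(capsys):
    # Stand-in for the code under test; the real test builds the manager,
    # which prints an initialization message to stdout.
    print("DeepEval LLM Manager initialized with model: gpt-4o-mini")

    captured = capsys.readouterr()
    assert "initialized" in captured.out
```

tests/unit/core/script/test_manager_additional.py (2)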
1-9: LGTM! Module imports and docstring are well-structured and follow the coding guidelines.

76-85: LGTM! Good test verifying that directories are rejected with the appropriate error message.
tests/unit/core/api/test_streaming_parser.py (4)
13-128: Comprehensive coverage for `parse_streaming_response` looks solid

The tests exercise the key control-flow paths (happy path, tool calls, missing final response/ID, error events, empty/non‑data lines, multiple tool calls) and align with the expected SSE format (`data: {..}`). No issues from a correctness or style perspective.
131-173: `_parse_sse_line` tests match implementation semantics

The cases for valid JSON, invalid JSON → `None`, and missing `event`/`data` defaulting to `""`/`{}` match the current implementation in `streaming_parser._parse_sse_line`. This is a good level of granularity for a private helper.
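As an illustration of the test shape being praised here, a self-contained parametrized sketch; the parser below is a local stand-in with the behavior described above (invalid JSON → None, missing fields default), not the project's private helper:

```python
import json

import pytest


def parse_sse_line(line: str):
    """Local stand-in for the private helper, only to illustrate the test shape."""
    payload = line.removeprefix("data: ")
    try:
        parsed = json.loads(payload)
    except json.JSONDecodeError:
        return None
    return {"event": parsed.get("event", ""), "data": parsed.get("data", {})}


@pytest.mark.parametrize(
    "line, expected",
    [
        ('data: {"event": "token", "data": {"t": "hi"}}', {"event": "token", "data": {"t": "hi"}}),
        ("data: not-json", None),          # invalid JSON -> None
        ("data: {}", {"event": "", "data": {}}),  # missing fields -> defaults
    ],
)
def test_parse_sse_line_cases(line, expected):
    assert parse_sse_line(line) == expected
```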
175-221: Tool‑call parsing edge cases are well covered

The `_parse_tool_call` tests cover the expected structure (valid dict), missing keys, empty arguments, and non‑dict tokens. This matches how the helper is intended to be used and should make future refactors safer.
223-253: Tool sequence formatting tests align with downstream expectations
`_format_tool_sequences` is validated for empty, single, and multiple tool calls, asserting the “list of lists” structure that the rest of the pipeline expects. This is clear and idiomatic pytest.
6-236: Runner `run_evaluation` tests give good behavioral coverage

The tests cover success, output‑dir override, config load failure, evaluation‑data failure, error statistics reporting, and ensuring the pipeline is closed on exceptions. Mock wiring through `ConfigLoader`, `DataValidator`, `EvaluationPipeline`, `OutputHandler`, and `calculate_basic_stats` is clear, and everything stays in pytest + `mocker`. No blocking issues here.

tests/unit/pipeline/evaluation/test_processor.py (2)
27-291: ConversationProcessor high‑level tests are thorough and consistent

The new fixtures (`mock_config_loader`, `sample_conv_data`) and the `TestConversationProcessor` methods cover:
- Initialization wiring,
- No‑metrics short‑circuit,
- Turn vs conversation metrics,
- Setup/cleanup script behavior (success, failure, always‑cleanup),
- API amendment behavior and cascade‑failure on API error,
- Metrics summary forwarding.
They align well with `ConversationProcessor.process_conversation`, `_run_setup_script`, `_run_cleanup_script`, and `get_metrics_summary` logic in `processor.py`, and comply with the pytest + `mocker` guidelines.
375-427: `_evaluate_turn` unit tests correctly pin invalid‑metric and ordering behavior

The tests around `_evaluate_turn` (both the simpler one around line ~375 and the detailed `TestConversationProcessorEvaluateTurn` class) accurately reflect the implementation:

- Invalid metrics produce `ERROR` results, log the expected message, and do not call `metrics_evaluator.evaluate_metric`.
- Mixed valid/invalid lists preserve metric order in the results and only evaluate valid metrics.
- Empty metric lists short‑circuit cleanly.
- The generated `EvaluationRequest` instances are asserted in detail (ids, index, identifiers).
- Handling of `evaluate_metric` returning `None` is exercised.

This is exactly the kind of focused coverage that will keep refactors safe.
Also applies to: 579-791
tests/unit/core/api/test_client.py (1)
22-500: Additional APIClient configuration/streaming tests are well‑aligned

The new `basic_api_config` fixture and `TestAPIClientConfiguration` class nicely extend coverage into:
- Streaming endpoint initialization,
- Cache construction and key stability,
- Endpoint‑type validation,
- API key header wiring,
- Guarding against uninitialized clients,
- Request preparation (basic, with conversation_id, with attachments),
- Client close behavior,
- Content‑Type header setup,
- Standard vs streaming endpoint types.
All of this stays within pytest + `mocker` and matches the documented behavior of `APIClient`. No changes needed here.
Now test case coverage increased:
Actionable comments posted: 0
🧹 Nitpick comments (7)
tests/unit/core/llm/test_llm_manager.py (1)
142-142: Remove unused parameter `basic_llm_config`.

The `basic_llm_config` fixture parameter is declared but never used in this test. The test creates its own config instead. Apply this diff to clean up the signature:

- def test_initialization_generic_provider(self, basic_llm_config, mocker, capsys):
+ def test_initialization_generic_provider(self, mocker, capsys):

tests/unit/core/models/test_system_additional.py (5)
48-54: Consider more specific assertions.

The assertions `config.provider is not None` and `config.model is not None` only verify non-None values. Consider asserting against the actual expected default values (e.g., from `DEFAULT_EMBEDDING_PROVIDER` and `DEFAULT_EMBEDDING_MODEL` constants) for more robust validation.

70-76: Consider more specific assertions.

The `isinstance` checks for boolean fields and `timeout > 0` provide minimal validation. Consider asserting against expected default values for more thorough testing.

116-119: Assertion doesn't verify the actual value.

Line 119 asserts `len(config.csv_columns) >= 1` after setting `csv_columns=["result"]`, which only confirms non-empty. Consider asserting the exact value: `assert config.csv_columns == ["result"]` for more precise validation.

125-131: Consider more specific assertions.

Similar to other default value tests, the generic assertions (`isinstance`, `> 0`, length checks) provide minimal validation. Consider checking against expected default values from the model constants.

155-161: Consider more specific assertions.

The `is not None` checks and `isinstance` assertion provide minimal validation. Consider checking against expected default values from the model constants for more thorough testing.

tests/unit/core/script/test_manager_additional.py (1)
14-58: Consider patching subprocess.run at the import location.

The tests patch `"subprocess.run"` globally. While this works, best practice is to patch where the function is imported and used. Apply this pattern for more precise mocking:

- mock_run = mocker.patch("subprocess.run")
+ mock_run = mocker.patch("lightspeed_evaluation.core.script.manager.subprocess.run")

This makes the mock more targeted and reduces the risk of affecting other tests or code that uses subprocess.
Note: This applies to all tests in this file that mock subprocess.run (lines 22, 37, 52, 69, 103, 127, 150).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- tests/unit/core/api/test_client.py (2 hunks)
- tests/unit/core/api/test_streaming_parser.py (1 hunks)
- tests/unit/core/llm/test_deepeval_manager.py (1 hunks)
- tests/unit/core/llm/test_llm_manager.py (1 hunks)
- tests/unit/core/models/test_api_additional.py (1 hunks)
- tests/unit/core/models/test_data_additional.py (1 hunks)
- tests/unit/core/models/test_system_additional.py (1 hunks)
- tests/unit/core/output/test_final_coverage.py (1 hunks)
- tests/unit/core/output/test_generator.py (1 hunks)
- tests/unit/core/output/test_statistics.py (5 hunks)
- tests/unit/core/script/test_manager_additional.py (1 hunks)
- tests/unit/core/system/test_env_validator.py (1 hunks)
- tests/unit/core/system/test_lazy_import.py (1 hunks)
- tests/unit/core/system/test_setup.py (1 hunks)
- tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (9 hunks)
- tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/unit/core/models/test_data_additional.py
- tests/unit/core/models/test_api_additional.py
- tests/unit/core/system/test_env_validator.py
- tests/unit/core/system/test_setup.py
- tests/unit/runner/test_evaluation.py
- tests/unit/core/llm/test_deepeval_manager.py
- tests/unit/core/output/test_final_coverage.py
- tests/unit/core/system/test_lazy_import.py
- tests/unit/pipeline/evaluation/test_pipeline.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/core/output/test_statistics.py
tests/unit/core/output/test_generator.py
tests/unit/core/models/test_system_additional.py
tests/unit/core/script/test_manager_additional.py
tests/unit/pipeline/evaluation/test_processor.py
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/api/test_client.py
tests/unit/core/llm/test_llm_manager.py
🧠 Learnings (10)
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
tests/unit/core/models/test_system_additional.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/core/models/test_system_additional.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to tests/**/*.py : Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
tests/unit/core/api/test_streaming_parser.py
🧬 Code graph analysis (7)
tests/unit/core/output/test_statistics.py (1)
src/lightspeed_evaluation/core/output/statistics.py (3)
bootstrap_intervals(12-34)calculate_basic_stats(37-63)calculate_detailed_stats(66-86)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (8)
OutputHandler(23-295)_calculate_stats(70-79)_generate_csv_report(133-169)_generate_json_summary(171-206)_generate_text_summary(208-291)get_output_directory(293-295)generate_reports(40-68)_generate_individual_reports(81-103)
tests/unit/core/models/test_system_additional.py (1)
src/lightspeed_evaluation/core/models/system.py (6)
LLMConfig(36-75)EmbeddingConfig(78-114)APIConfig(117-156)OutputConfig(159-201)VisualizationConfig(226-252)LoggingConfig(204-223)
tests/unit/core/script/test_manager_additional.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
ScriptExecutionManager(14-117)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
_parse_sse_line(63-72)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
APIConfig(117-156)src/lightspeed_evaluation/core/api/client.py (2)
query(71-105)_prepare_request(107-122)src/lightspeed_evaluation/core/system/exceptions.py (1)
APIError(8-9)
tests/unit/core/llm/test_llm_manager.py (2)
src/lightspeed_evaluation/core/models/system.py (2)
LLMConfig(36-75)SystemConfig(271-301)src/lightspeed_evaluation/core/llm/manager.py (4)
LLMManager(10-117)get_llm_params(95-103)get_config(105-107)from_llm_config(115-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
- GitHub Check: Pylinter
- GitHub Check: mypy
🔇 Additional comments (21)
tests/unit/core/llm/test_llm_manager.py (4)
1-7: LGTM! The imports are clean and properly organized. All necessary modules are imported for the test suite.

9-19: LGTM! The fixture provides a clean, reusable baseline configuration for tests, reducing code duplication.

25-140: LGTM! Comprehensive provider coverage with consistent test patterns. The Azure tests appropriately cover both deployment scenarios, and all assertions correctly verify provider-specific model name formats.

160-245: LGTM! Excellent coverage of the manager's API methods and behavior. The tests verify parameter propagation, config retrieval, class method constructors, and initialization output. All assertions are appropriate and thorough.
tests/unit/core/models/test_system_additional.py (3)
16-43: LGTM! Boundary validation tests are correct.

These tests properly validate the Pydantic field constraints defined in the LLMConfig model, covering edge cases for temperature, max_tokens, timeout, and num_retries.

133-144: Excellent test of tuple-to-list conversion.

The test correctly verifies that Pydantic converts the tuple `(12, 8)` to a list `[12, 8]`, and the comment on line 144 provides helpful documentation of this behavior.

175-181: LGTM! Boolean toggle test is clear and correct.

The test properly verifies that the `show_timestamps` configuration can be set to both `True` and `False`, using two separate config instances for clarity.

tests/unit/core/api/test_streaming_parser.py (1)

1-253: Excellent test coverage for streaming parser!

The test suite comprehensively covers streaming response parsing, SSE line parsing, tool call parsing, and tool sequence formatting. Tests validate both successful parsing and error handling paths, including edge cases like missing fields, invalid JSON, and empty data.
tests/unit/core/api/test_client.py (2)
22-33: LGTM! The `basic_api_config` fixture provides appropriate test configuration for streaming endpoint tests.

322-500: I encountered a technical issue accessing the repository to verify the duplication claim. Since I cannot inspect the codebase directly, I cannot confirm whether `TestAPIClientConfiguration` is actually duplicated in the file or provide definitive evidence either way.

Unable to verify duplication claim for TestAPIClientConfiguration due to repository access limitations.

The review comment asks to verify if the `TestAPIClientConfiguration` class appears twice in `tests/unit/core/api/test_client.py`. While the AI summary suggests duplication exists, I cannot access the repository to confirm this through code inspection. Manual verification is required to:

- Search the full file for all `class TestAPIClientConfiguration` definitions
- Confirm whether multiple definitions exist
- If duplicates are found, remove the shadowed class definition(s)
tests/unit/pipeline/evaluation/test_processor.py (3)
27-72: Well-structured test fixtures.

The fixtures provide comprehensive mocking for configuration, processor components, and sample data. The default behavior configuration in `processor_components` is appropriate for test isolation.

74-481: Comprehensive test coverage for ConversationProcessor.

The test class thoroughly validates processor behavior across multiple scenarios including metric processing, script execution, API amendment, and error handling. Tests properly use mocks and fixtures to isolate behavior.

576-867: Excellent coverage for _evaluate_turn method.

The tests thoroughly validate metric evaluation behavior including handling of valid metrics, invalid metrics, empty metrics, and mixed scenarios. Tests properly verify both results and side effects like logging and evaluator calls.
tests/unit/core/script/test_manager_additional.py (1)
60-155: Good coverage for script execution edge cases. The tests validate Path object handling, error conditions, and logging behavior. The tests appropriately verify return values and exception raising for various scenarios.
tests/unit/core/output/test_statistics.py (4)
14-54: Excellent test fixture design!

The `sample_results` fixture provides a well-structured set of test data covering multiple scenarios (PASS/FAIL/ERROR, multiple conversations, multiple metrics, score variations including None). This fixture enables comprehensive testing across multiple test methods.

57-103: Comprehensive bootstrap interval testing!

The test coverage for `bootstrap_intervals` is excellent:
- Basic functionality validation (line ordering, bounds checking)
- Invalid confidence values (both over 100 and negative)
- Custom confidence levels and bootstrap steps
This thoroughly validates the function's behavior and error handling.
105-273: Excellent coverage of basic statistics scenarios!

The `TestCalculateBasicStats` class covers all important edge cases:
- Normal mixed results
- Empty results
- All pass/fail/error scenarios
- Single result edge case
The assertions correctly verify counts and percentage calculations (e.g., 2/4 = 50% pass rate). This comprehensive coverage ensures the statistics function handles all scenarios correctly.
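A plain illustration of the 2/4 = 50% arithmetic those assertions check; the list-of-strings input is a stand-in, not the module's real data model:

```python
def test_pass_rate_two_of_four():
    results = ["PASS", "PASS", "FAIL", "ERROR"]

    # two passes out of four results -> 50% pass rate
    pass_rate = results.count("PASS") / len(results) * 100
    assert pass_rate == 50.0
```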
276-537: Thorough detailed statistics validation!

The `TestCalculateDetailedStats` class provides excellent coverage:
- Structure validation (by_metric, by_conversation)
- Metric and conversation breakdowns
- Rate calculations
- Score statistics including edge cases (no scores)
- Single and multiple metric/conversation scenarios
The test assertions correctly validate the expected breakdowns and statistical calculations. Great work on edge case coverage!
tests/unit/core/output/test_generator.py (3)
11-53: Well-designed fixtures for output handler testing!

The fixtures provide appropriate test data:

- `sample_results`: includes query/response fields relevant to output generation testing
- `mock_system_config`: configures output formats, CSV columns, and graph settings

These fixtures enable comprehensive testing of the OutputHandler's report generation capabilities.
56-270: Excellent OutputHandler test coverage!

The `TestOutputHandler` class thoroughly validates:
- Initialization and configuration
- Statistics calculation (with data and empty)
- Individual report format generation (CSV, JSON, TXT)
- End-to-end report generation with mocked timestamps
- Partial format generation
- Column configuration
- Default behavior without config
The use of mocked datetime for predictable timestamps (lines 144-150, 169-175) is particularly good practice for testing filename generation.
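A minimal sketch of that timestamp-freezing pattern; the patch target module path and the "%Y%m%d_%H%M%S" filename format are assumptions here, not confirmed details of the generator:

```python
from datetime import datetime


def test_reports_use_frozen_timestamp(mocker):
    # Freeze time where the generator module looks it up, so filenames
    # produced during the test are fully predictable.
    mock_dt = mocker.patch("lightspeed_evaluation.core.output.generator.datetime")
    mock_dt.now.return_value = datetime(2025, 1, 2, 3, 4, 5)

    # handler.generate_reports(...) would now embed this fixed timestamp;
    # the expected suffix derived from the frozen clock:
    assert mock_dt.now().strftime("%Y%m%d_%H%M%S") == "20250102_030405"
```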
272-411: Comprehensive initialization and configuration testing!

The `TestOutputHandlerInitialization` class provides thorough validation of:
- Default vs. custom initialization
- Directory creation
- CSV content and column configuration
- Filename timestamp formatting
Note: The mock_print usage is correctly handled throughout - line 277 assigns and uses it (line 286), while lines 293, 309, 341, 375, 398 just patch without assignment.
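A short sketch of the distinction being made (bind the patched `print` only when the test asserts on it; otherwise patch without assignment to avoid Ruff F841); the printed text is illustrative:

```python
def test_print_is_asserted_when_bound(mocker):
    # Binding is only needed when the test inspects the calls.
    mock_print = mocker.patch("builtins.print")

    print("report written")  # the code under test would print here

    mock_print.assert_called_once_with("report written")
```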
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/unit/core/llm/test_llm_manager.py (1)
156-158: Consider strengthening the assertion for generic provider warning.

The OR condition allows the test to pass if either "generic" or "warning" appears in the output. If the production code is expected to print both, consider using AND to ensure the warning is meaningful. However, if only one is required, the current assertion is acceptable.

- assert "generic" in captured.out.lower() or "warning" in captured.out.lower()
+ # If both should appear, use AND; otherwise keep current behavior
+ assert "generic" in captured.out.lower()

tests/unit/pipeline/evaluation/test_processor.py (3)
77-291: ConversationProcessor end‑to‑end tests look solid; any extra call‑level assertions are optional

The `TestConversationProcessor` methods collectively exercise the key flows: init wiring, no‑metrics short‑circuit, turn vs conversation metrics, setup/cleanup behavior (success + failure paths), API amendment, cascade failure, and metrics summary. The mocking pattern around `MetricManager`, `ScriptExecutionManager`, `APIDataAmender`, `EvaluationErrorHandler`, and `MetricsEvaluator` is consistent and readable.

If you ever want to tighten these further, you could (optionally) add expectations on `metrics_evaluator.evaluate_metric` call counts / call args in a couple of the process‑conversation tests to pin behavior more precisely, but it's not required for correctness.
429-481: Consider asserting the warning log in `test_run_cleanup_script_logs_warning_on_failure`

The test currently only verifies that `_run_cleanup_script` doesn't raise when `run_script` returns `False`, while the docstring says the failure "is logged as warning". To fully exercise that contract, you could use `caplog` to assert that a WARNING‑level message was emitted, e.g. by adding `caplog` to the test signature and wrapping the call in `caplog.at_level(logging.WARNING)` before checking `caplog.text` for the expected substring.

This would strengthen the test without changing behavior.
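A generic illustration of that `caplog` pattern; the real test would call `processor._run_cleanup_script(...)` inside the with-block, and the exact warning text is an assumption:

```python
import logging


def test_cleanup_failure_is_logged_as_warning(caplog):
    with caplog.at_level(logging.WARNING):
        # Stand-in for the cleanup call that logs on failure.
        logging.getLogger("lightspeed_evaluation").warning("Cleanup script failed")

    assert "cleanup script failed" in caplog.text.lower()
```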
553-574: PR‑specific processor fixtures work, but the naming could be clearer
`processor_components_pr` and the `processor` fixture nicely reuse the lower fixtures to focus the `_evaluate_turn` tests, and they avoid the previous `processor_components` redefinition issue.

Minor nit: the `_pr` suffix is a bit opaque in the test context; consider a more descriptive name like `processor_components_eval_turn` (and `processor_eval_turn` if you want to be explicit) to make the fixture's purpose clearer to future readers.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- tests/unit/core/api/test_client.py (2 hunks)
- tests/unit/core/api/test_streaming_parser.py (1 hunks)
- tests/unit/core/llm/test_deepeval_manager.py (1 hunks)
- tests/unit/core/llm/test_llm_manager.py (1 hunks)
- tests/unit/core/models/test_api_additional.py (1 hunks)
- tests/unit/core/models/test_data_additional.py (1 hunks)
- tests/unit/core/models/test_system_additional.py (1 hunks)
- tests/unit/core/output/test_final_coverage.py (1 hunks)
- tests/unit/core/output/test_generator.py (1 hunks)
- tests/unit/core/output/test_statistics.py (6 hunks)
- tests/unit/core/script/test_manager_additional.py (1 hunks)
- tests/unit/core/system/test_env_validator.py (1 hunks)
- tests/unit/core/system/test_lazy_import.py (1 hunks)
- tests/unit/core/system/test_setup.py (1 hunks)
- tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (9 hunks)
- tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/unit/core/models/test_data_additional.py
- tests/unit/core/output/test_final_coverage.py
- tests/unit/runner/test_evaluation.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/system/test_env_validator.py
- tests/unit/core/system/test_setup.py
- tests/unit/core/models/test_api_additional.py
- tests/unit/core/system/test_lazy_import.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/pipeline/evaluation/test_processor.py
tests/unit/pipeline/evaluation/test_pipeline.py
tests/unit/core/output/test_generator.py
tests/unit/core/llm/test_deepeval_manager.py
tests/unit/core/output/test_statistics.py
tests/unit/core/models/test_system_additional.py
tests/unit/core/llm/test_llm_manager.py
tests/unit/core/script/test_manager_additional.py
tests/unit/core/api/test_client.py
🧠 Learnings (9)
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to tests/**/*.py : Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
tests/unit/core/output/test_statistics.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
tests/unit/core/models/test_system_additional.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/core/models/test_system_additional.py
🧬 Code graph analysis (6)
tests/unit/pipeline/evaluation/test_pipeline.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
SystemConfig(271-301)src/lightspeed_evaluation/core/system/loader.py (1)
ConfigLoader(70-125)src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)
EvaluationPipeline(32-205)validate_data(117-119)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (8)
OutputHandler(23-295)_calculate_stats(70-79)_generate_csv_report(133-169)_generate_json_summary(171-206)_generate_text_summary(208-291)get_output_directory(293-295)generate_reports(40-68)_generate_individual_reports(81-103)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/output/statistics.py (3)
bootstrap_intervals(12-34)calculate_basic_stats(37-63)calculate_detailed_stats(66-86)tests/unit/core/output/test_generator.py (1)
sample_results(12-37)
tests/unit/core/models/test_system_additional.py (1)
src/lightspeed_evaluation/core/models/system.py (5)
LLMConfig(36-75)EmbeddingConfig(78-114)APIConfig(117-156)OutputConfig(159-201)LoggingConfig(204-223)
tests/unit/core/script/test_manager_additional.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
ScriptExecutionManager(14-117)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
APIConfig(117-156)src/lightspeed_evaluation/core/api/client.py (3)
APIClient(22-282)query(71-105)_prepare_request(107-122)src/lightspeed_evaluation/core/system/exceptions.py (1)
APIError(8-9)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
🔇 Additional comments (12)
tests/unit/core/models/test_system_additional.py (1)
1-181: LGTM! Well-structured unit tests for system configuration models. The tests properly validate Pydantic model configurations including boundary conditions (temperature min/max, timeout validation), default values, and custom configurations. Good coverage of LLMConfig, EmbeddingConfig, APIConfig, OutputConfig, VisualizationConfig, and LoggingConfig.
tests/unit/core/llm/test_llm_manager.py (1)
1-155: LGTM! Comprehensive LLM Manager tests with good provider coverage. The tests properly validate initialization across multiple providers (OpenAI, Azure, WatsonX, Anthropic, Gemini, Vertex, Ollama, hosted_vllm) with correct use of pytest-mock. Good coverage of `get_model_name`, `get_llm_params`, `get_config`, and factory methods.

Also applies to: 159-245
tests/unit/core/output/test_statistics.py (1)
1-181: LGTM! Comprehensive statistics tests with retained bootstrap coverage. The bootstrap interval tests (valid_confidence, edge_cases, confidence_levels) are properly retained as previously requested. Good coverage of `calculate_basic_stats` and `calculate_detailed_stats` with various scenarios including empty results, all-pass/fail/error, and mixed results.

tests/unit/core/api/test_client.py (1)
1-500: LGTM! Thorough API client tests covering streaming, error handling, and configuration. The tests properly validate APIClient behavior including streaming endpoint handling, error scenarios (HTTP errors, timeouts, missing fields), tool call formatting, cache key generation, and client lifecycle management. Good use of pytest-mock for mocking httpx.Client and related dependencies.
tests/unit/core/llm/test_deepeval_manager.py (1)
1-94: LGTM! Well-structured DeepEval LLM Manager tests. The tests properly validate `DeepEvalLLMManager` initialization with default and custom parameters, verify `LiteLLMModel` instantiation, and check `get_llm` and `get_model_info` methods. Good use of pytest-mock for isolating the LiteLLMModel dependency.
1-155: LGTM! Script execution manager tests with good error handling coverage.The tests properly validate timeout handling, subprocess errors, unexpected errors, and path object acceptance. The logging tests verify basic behavior as intended, with assertions confirmed acceptable per prior review discussion.
tests/unit/core/output/test_generator.py (1)
1-411: LGTM! Comprehensive output generator tests with proper fixture usage.The tests thoroughly validate
OutputHandlerinitialization, report generation (CSV, JSON, TXT), and configuration handling. The previously flagged unusedmock_printvariables have been properly addressed by removing the assignment where the return value isn't needed.tests/unit/pipeline/evaluation/test_pipeline.py (3)
86-116: Verify endpoint_type value in test configuration.The test sets
endpoint_type="test"which may not be a supported value according toSUPPORTED_ENDPOINT_TYPES. Based on theAPIConfigmodel, valid types are typically "streaming" or "query". Since the APIClient is mocked, this won't cause a runtime error, but using a valid value would make the test configuration more realistic.def test_create_api_client_when_enabled(self, mock_config_loader, mocker): """Test API client creation when enabled.""" mock_config_loader.system_config.api.enabled = True mock_config_loader.system_config.api.api_base = "http://test.com" - mock_config_loader.system_config.api.endpoint_type = "test" + mock_config_loader.system_config.api.endpoint_type = "streaming"
343-386: Same endpoint_type issue in close test.Similarly,
endpoint_type="test"at line 347 should use a valid value like "streaming" or "query" for consistency.def test_close_with_api_client(self, mock_config_loader, mocker): """Test close method with API client.""" mock_config_loader.system_config.api.enabled = True mock_config_loader.system_config.api.api_base = "http://test.com" - mock_config_loader.system_config.api.endpoint_type = "test" + mock_config_loader.system_config.api.endpoint_type = "streaming"
1-85: LGTM! Comprehensive evaluation pipeline tests with good component isolation.The tests properly validate
EvaluationPipelineinitialization, data validation, evaluation runs, amended data saving, close behavior, and output directory override. Good use of pytest-mock for isolating the many pipeline components.Also applies to: 117-342, 387-440
tests/unit/pipeline/evaluation/test_processor.py (2)
27-71: Core fixtures are well‑structured and follow pytest/pytest‑mock guidelines
mock_config_loader,processor_components, andsample_conv_datagive a clear, minimal wiring ofSystemConfig,ProcessorComponents, and a single‑turnEvaluationData, and they consistently usemockerinstead ofunittest.mock, which matches the repo’s testing guidelines. No changes needed here.As per coding guidelines, using pytest fixtures + pytest‑mock is the preferred pattern.
605-717: Invalid / mixed metric tests for_evaluate_turnaccurately capture behaviorThe updated docstrings, logging setup via
caplog, and result assertions for invalid, all‑invalid, and mixed metric lists align well with how_evaluate_turnis expected to behave: producingERRORresults for invalid metrics, preserving metric order, calling the evaluator only for valid metrics, and logging clear error messages.These tests give good guardrails for future refactors of
_evaluate_turn; no changes needed.
VladimirKadlec
left a comment
There was a problem hiding this comment.
LGTM, nice work 💪
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.