
LEADS-26: Increased Unit test cases coverage #109

Merged
tisnik merged 2 commits into lightspeed-core:main from bsatapat-jpg:dev
Nov 30, 2025

Conversation


@bsatapat-jpg bsatapat-jpg commented Nov 26, 2025

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude
  • Generated by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Tests
    • Added extensive unit tests across API client streaming, streaming response parsing, evaluation pipeline & runner, LLM managers, data/system models, output generation & statistics, script execution manager, environment validation, lazy imports, and setup utilities — improving coverage for success paths, errors, edge cases, formatting, and cleanup.



coderabbitai bot commented Nov 26, 2025

Walkthrough

Adds a large suite of new and expanded unit tests across core components (API client & streaming parser, evaluation pipeline & processor, runner/CLI, LLM managers, models, output/statistics, script/system utilities). All changes are test-only; no production public APIs were modified.

Changes

  • API & Streaming (tests/unit/core/api/test_client.py, tests/unit/core/api/test_streaming_parser.py)
    New fixtures and extensive tests for API client streaming configuration, request preparation, cache behavior, headers, error handling, streaming query flows, SSE parsing, tool_call extraction, and edge cases.
  • Evaluation Pipeline & Processor (tests/unit/pipeline/evaluation/test_pipeline.py, tests/unit/pipeline/evaluation/test_processor.py)
    Comprehensive tests for EvaluationPipeline and ConversationProcessor covering initialization, config-driven flows, data validation, API amendment, metric evaluation, script setup/cleanup, error cascades, save/close behavior, and many mocked component interactions.
  • Runner / CLI (tests/unit/runner/test_evaluation.py)
    Tests for run_evaluation success and failure paths, error handling (FileNotFoundError/ValueError), result reporting, pipeline closure, and CLI main() behavior including exit codes and output-dir override.
  • LLM Managers (tests/unit/core/llm/test_deepeval_manager.py, tests/unit/core/llm/test_llm_manager.py)
    Tests for DeepEvalLLMManager and LLMManager: initialization, default parameters (temperature/retries), get_llm/get_model_info, provider-specific model naming, env validation mocking, and informational print output.
  • Models (API / Data / System) (tests/unit/core/models/test_api_additional.py, tests/unit/core/models/test_data_additional.py, tests/unit/core/models/test_system_additional.py)
    Validation tests for RAGChunk, AttachmentData, APIRequest/APIResponse, TurnData, EvaluationData, EvaluationResult, and various System/LLM/Embedding/Output config validations (Pydantic validation and defaults).
  • Output & Statistics (tests/unit/core/output/test_final_coverage.py, tests/unit/core/output/test_generator.py, tests/unit/core/output/test_statistics.py)
    New and expanded tests for statistics computation, detailed stats, bootstrap intervals, OutputHandler report generation (CSV/JSON/TXT), file naming/timestamping, CSV column configurations, and edge cases.
  • Script Execution & System Utilities (tests/unit/core/script/test_manager_additional.py, tests/unit/core/system/test_env_validator.py, tests/unit/core/system/test_lazy_import.py, tests/unit/core/system/test_setup.py)
    Tests for ScriptExecutionManager error/log handling, provider environment validators, lazy import helper behavior, and logging/environment setup edge cases.
  • Misc / Manifests (manifest_file, requirements.txt)
    Test-context references; no production code changes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30–40 minutes

  • Files/areas to pay extra attention:
    • API streaming tests: request preparation, cache-key determinism, and mocked SSE parsing semantics.
    • Evaluation pipeline & processor tests: extensive mocking and many control-flow assertions (amend/save/close/error paths).
    • Output/CSV generation tests: filesystem interactions, timestamped filenames, and CSV column configuration checks.
    • LLM manager tests: provider-specific model naming and env-validation mocking.

Possibly related PRs

Suggested reviewers

  • asamal4
  • VladimirKadlec
  • tisnik

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly describes the main change: increased unit test coverage. It accurately reflects the changeset, which adds extensive new unit tests across multiple modules.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 98.16%, which is sufficient; the required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
tests/unit/core/api/test_client.py (1)

67-79: Consider a more specific assertion for API key header.

The current assertion call_count >= 1 only verifies that headers.update was called at least once but doesn't confirm the Authorization header was set. For stronger verification, consider checking the actual call arguments.

         APIClient(basic_api_config)

         # Verify headers were updated (should include Authorization header)
-        assert mock_client.headers.update.call_count >= 1
+        # Verify Authorization header was set
+        calls = mock_client.headers.update.call_args_list
+        assert any("Authorization" in str(call) for call in calls)
tests/unit/pipeline/evaluation/test_pipeline.py (4)

20-24: Consider using pytest's tmp_path fixture for cross-platform compatibility.

Hardcoded /tmp/test_output is Unix-specific. Using tmp_path would ensure tests work on Windows as well.

 @pytest.fixture
-def mock_config_loader(mocker):
+def mock_config_loader(mocker, tmp_path):
     """Create a mock config loader with system config."""
     loader = mocker.Mock(spec=ConfigLoader)

     config = SystemConfig()
     config.api.enabled = False
-    config.output.output_dir = "/tmp/test_output"
+    config.output.output_dir = str(tmp_path / "test_output")
     config.output.base_filename = "test"
     config.core.max_threads = 2

50-76: Extract repeated mocking setup into a reusable fixture to eliminate duplication.

The same set of mocker.patch() calls is repeated across all 12 test methods, creating significant maintenance burden. Consider extracting this into an autouse fixture or a helper fixture that each test can use.

@pytest.fixture
def mock_pipeline_dependencies(mocker):
    """Mock all EvaluationPipeline dependencies."""
    mocks = {
        "validator": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.DataValidator"
        ),
        "metric_manager": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.MetricManager"
        ),
        "api_client": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.APIClient"
        ),
        "api_amender": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.APIDataAmender"
        ),
        "error_handler": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.EvaluationErrorHandler"
        ),
        "script_manager": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.ScriptExecutionManager"
        ),
        "metrics_evaluator": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.MetricsEvaluator"
        ),
        "processor": mocker.patch(
            "lightspeed_evaluation.pipeline.evaluation.pipeline.ConversationProcessor"
        ),
    }
    return mocks

Then simplify each test:

def test_initialization_success(self, mock_config_loader, mock_pipeline_dependencies):
    """Test successful pipeline initialization."""
    pipeline = EvaluationPipeline(mock_config_loader)

    assert pipeline.config_loader == mock_config_loader
    assert pipeline.system_config is not None
    assert pipeline.output_dir == "/tmp/test_output"

113-116: Consider verifying APIClient constructor arguments.

The test confirms the client is created but doesn't verify it receives the expected configuration. This would catch regressions if the pipeline changes how it initializes the client.

         pipeline = EvaluationPipeline(mock_config_loader)

         assert pipeline.api_client is not None
-        mock_api_client.assert_called_once()
+        mock_api_client.assert_called_once_with(
+            api_base="http://test.com",
+            endpoint_type="test",
+            # ... other expected args
+        )

331-341: Consider verifying the exception was actually handled (e.g., via logging).

The test verifies no exception propagates, but doesn't confirm the exception path was actually taken and logged. This could be a silent pass if save_evaluation_data is never called.

         mock_save.side_effect = Exception("Save error")

         pipeline = EvaluationPipeline(mock_config_loader)
         # Should not raise, just log warning
         results = pipeline.run_evaluation(sample_evaluation_data, "/tmp/original.yaml")

         assert results is not None
+        mock_save.assert_called_once()  # Verify the save was attempted

Optionally, you could also capture logs to verify warning was logged:

def test_save_amended_data_handles_exception(
    self, mock_config_loader, sample_evaluation_data, mocker, caplog
):
    # ... setup ...
    with caplog.at_level(logging.WARNING):
        results = pipeline.run_evaluation(sample_evaluation_data, "/tmp/original.yaml")
    
    assert "Save error" in caplog.text or any("save" in r.message.lower() for r in caplog.records)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 399eb69 and 016132a.

📒 Files selected for processing (6)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/api/test_streaming_parser.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
  • tests/unit/runner/__init__.py (1 hunks)
  • tests/unit/runner/test_evaluation.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/runner/__init__.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/core/api/test_client.py
  • tests/unit/pipeline/evaluation/test_processor.py
🧠 Learnings (4)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • tests/unit/core/api/test_streaming_parser.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • tests/unit/runner/test_evaluation.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
🧬 Code graph analysis (4)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • _parse_sse_line (63-72)
tests/unit/pipeline/evaluation/test_pipeline.py (5)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (70-125)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
  • EvaluationPipeline (32-202)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
src/lightspeed_evaluation/core/system/validator.py (1)
  • validate_evaluation_data (143-161)
tests/unit/runner/test_evaluation.py (5)
tests/unit/pipeline/evaluation/test_pipeline.py (2)
  • test_run_evaluation_success (176-222)
  • mock_config_loader (16-27)
tests/unit/core/metrics/test_manager.py (1)
  • system_config (14-50)
src/lightspeed_evaluation/core/system/loader.py (1)
  • load_system_config (79-108)
src/lightspeed_evaluation/core/system/validator.py (1)
  • load_evaluation_data (89-141)
tests/unit/pipeline/evaluation/test_processor.py (1)
  • mock_config_loader (22-28)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
src/lightspeed_evaluation/core/models/system.py (1)
  • APIConfig (117-156)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
🪛 GitHub Actions: Ruff
tests/unit/runner/test_evaluation.py

[error] 3-3: F401 'pytest' imported but unused. Use or remove the import. This can be fixed by removing the import or using it in the test.

tests/unit/pipeline/evaluation/test_processor.py

[error] 210-210: F841 Local variable 'results' is assigned to but never used. This can be fixed by removing the assignment or using the variable.

🔇 Additional comments (16)
tests/unit/runner/__init__.py (1)

1-1: LGTM!

Minor docstring update to reflect plural modules - no concerns.

tests/unit/core/api/test_streaming_parser.py (5)

13-17: LGTM!

Simple fixture correctly uses the mocker fixture per coding guidelines.


20-129: Comprehensive test coverage for streaming response parsing.

Good edge case coverage including error events, missing fields, and malformed input. The tests properly verify both success and failure paths.


131-172: LGTM!

Tests correctly verify default values for missing event (empty string) and missing data (empty dict), matching the implementation in streaming_parser.py.


175-220: LGTM!

Good coverage of tool call parsing edge cases, including validation that both tool_name and arguments are required, but empty arguments dict is acceptable.


223-253: LGTM!

Tests verify the expected output structure for tool sequence formatting.

tests/unit/runner/test_evaluation.py (2)

239-302: LGTM!

Good coverage of CLI entry point behavior, including default arguments, custom arguments, and error exit codes.


33-34: I was unable to access the repository directly due to a clone operation failure. However, I can provide an assessment based on the available information:

Verify the mock path by checking the import statement in src/lightspeed_evaluation/runner/evaluation.py.

From the retrieved learning, DataValidator is confirmed to be located in lightspeed_evaluation.core.system. The mock path "lightspeed_evaluation.core.system.DataValidator" appears structurally correct for the module location.

However, the effectiveness of this mock path depends on the specific import pattern used in run_evaluation:

  • If the code uses: from lightspeed_evaluation.core.system import DataValidator — the mock path is correct
  • If the code uses: import lightspeed_evaluation.core.system — the mock path may need adjustment

You should verify the actual import statement in src/lightspeed_evaluation/runner/evaluation.py to confirm the mock will intercept correctly. Check whether DataValidator is imported at module level or if it's accessed via the full module path.
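
For illustration, a minimal sketch of how the two import patterns map to patch targets (the dotted runner module path and the validate_evaluation_data call are assumptions based on the file locations discussed above):

def test_run_evaluation_mocks_validator(mocker):
    """Sketch only: choose the patch target based on how runner/evaluation.py binds the name."""
    # Case A: runner/evaluation.py uses `from lightspeed_evaluation.core.system import DataValidator`.
    # The name is bound inside the runner module, so patch it where it is looked up:
    mock_validator = mocker.patch("lightspeed_evaluation.runner.evaluation.DataValidator")

    # Case B: runner/evaluation.py accesses it via the module path, e.g.
    # `import lightspeed_evaluation.core.system as system` and `system.DataValidator(...)`.
    # Then patching the defining module is the effective target instead:
    # mock_validator = mocker.patch("lightspeed_evaluation.core.system.DataValidator")

    mock_validator.return_value.validate_evaluation_data.return_value = True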

tests/unit/pipeline/evaluation/test_processor.py (2)

21-65: LGTM!

Well-structured fixtures with proper use of spec parameter for type safety on mocks.


68-426: Comprehensive test coverage for ConversationProcessor.

Good coverage of:

  • Initialization and wiring
  • API-enabled and API-disabled paths
  • Script execution lifecycle (setup/cleanup)
  • Error handling and cascade failures
  • Both turn-level and conversation-level metrics

The tests align with the learned behavior that metric resolution happens upstream in process_conversation().

tests/unit/core/api/test_client.py (2)

10-21: LGTM!

Good fixture setup with clear naming. The streaming endpoint configuration aligns with the PR's focus on streaming behavior.


24-202: LGTM!

Good test coverage for APIClient including:

  • Both streaming and query endpoint types
  • Cache initialization and key generation
  • Header handling (Content-Type, Authorization)
  • Client lifecycle (initialization, close)
  • Request preparation with various parameters

Tests correctly use the mocker fixture per coding guidelines.

tests/unit/pipeline/evaluation/test_pipeline.py (4)

78-84: LGTM!

Good test for the early validation guard. No mocking needed since the ValueError is raised before any components are instantiated.


176-222: LGTM!

The test correctly verifies the happy path for evaluation runs. The assertions cover the essential behavior (results returned with expected status).


343-385: LGTM!

Good test coverage for the cleanup path. Properly mocks litellm.cache and asyncio.run dependencies, and verifies api_client.close() is called.


418-440: LGTM!

Clean test for verifying the output_dir override parameter works as expected.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/unit/core/api/test_client.py (1)

67-80: Consider more specific header verification.

The test verifies that headers.update was called but doesn't specifically validate the Authorization header content. While the current assertion confirms setup succeeds with an API key present, you could strengthen this test by inspecting the actual call arguments.

Optional enhancement:

     APIClient(basic_api_config)
 
-    # Verify headers were updated (should include Authorization header)
-    assert mock_client.headers.update.call_count >= 1
+    # Verify Authorization header was set
+    calls = mock_client.headers.update.call_args_list
+    assert any(
+        "Authorization" in str(call) and "Bearer test_secret_key" in str(call)
+        for call in calls
+    )
tests/unit/pipeline/evaluation/test_pipeline.py (1)

50-440: Consider fixture for repeated component mocking.

Many tests repeat the same 7-8 component patches (DataValidator, MetricManager, APIDataAmender, etc.). While the current approach is explicit and clear, you could reduce repetition with a fixture.

Optional enhancement:

@pytest.fixture
def mock_pipeline_components(mocker):
    """Mock all pipeline components."""
    return {
        'DataValidator': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.DataValidator"),
        'MetricManager': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.MetricManager"),
        'APIDataAmender': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.APIDataAmender"),
        'EvaluationErrorHandler': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.EvaluationErrorHandler"),
        'ScriptExecutionManager': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.ScriptExecutionManager"),
        'MetricsEvaluator': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.MetricsEvaluator"),
        'ConversationProcessor': mocker.patch("lightspeed_evaluation.pipeline.evaluation.pipeline.ConversationProcessor"),
    }

def test_initialization_success(self, mock_config_loader, mock_pipeline_components):
    """Test successful pipeline initialization."""
    pipeline = EvaluationPipeline(mock_config_loader)
    # ... assertions ...

However, the current explicit approach has its merits for test clarity and is perfectly acceptable.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 016132a and 5fa6286.

📒 Files selected for processing (6)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/api/test_streaming_parser.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
  • tests/unit/runner/__init__.py (1 hunks)
  • tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/unit/runner/test_evaluation.py
  • tests/unit/pipeline/evaluation/test_processor.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/api/test_client.py
  • tests/unit/runner/__init__.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • tests/unit/core/api/test_streaming_parser.py
🧬 Code graph analysis (3)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/api/client.py (3)
  • APIClient (22-282)
  • query (71-105)
  • _prepare_request (107-122)
src/lightspeed_evaluation/core/models/system.py (1)
  • APIConfig (117-156)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • _parse_sse_line (63-72)
tests/unit/pipeline/evaluation/test_pipeline.py (2)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)
  • EvaluationPipeline (32-202)
  • validate_data (114-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)
🔇 Additional comments (17)
tests/unit/runner/__init__.py (1)

1-1: LGTM!

The plural form accurately reflects that this package contains multiple test modules.

tests/unit/core/api/test_streaming_parser.py (5)

13-17: LGTM!

Clean fixture for mocking streaming response objects.


20-129: Excellent comprehensive coverage of streaming response parsing!

The test suite thoroughly validates:

  • Complete streaming flows with conversation IDs and responses
  • Tool call handling (single and multiple)
  • Error conditions (missing final response, missing conversation ID, error events)
  • Edge cases (empty lines, non-data lines)

This provides strong confidence in the streaming parser's behavior.


131-173: Solid validation of SSE line parsing behavior.

These tests directly exercise the private _parse_sse_line function. While testing private functions is sometimes avoided in favor of testing through public APIs, this level of granularity is appropriate here given the critical nature of parsing logic and the specific contract (returning Optional[tuple[str, dict]]).

The tests properly validate default values for missing fields, which is important for resilient parsing.


175-221: Thorough validation of tool call parsing logic.

The tests properly validate:

  • Required fields (both tool_name and arguments must be present)
  • Valid edge case (empty arguments dict is acceptable)
  • Defensive behavior (returns None for invalid structures)

This ensures robust handling of tool call data from the streaming API.


223-253: LGTM!

Clean validation of tool sequence formatting. The tests confirm the expected structure where each tool call is wrapped in its own sequence (list), resulting in the list[list[dict]] format.

tests/unit/core/api/test_client.py (3)

10-21: LGTM!

The renamed fixture basic_api_config clearly indicates its purpose as a baseline streaming configuration. The defaults (cache disabled, streaming endpoint) are appropriate for isolated unit tests.


27-66: Solid initialization test coverage.

The tests properly validate:

  • Streaming endpoint configuration
  • Cache creation when enabled
  • Cache absence when disabled
  • Valid endpoint type acceptance

The mocking strategy correctly isolates the client from external dependencies.


81-202: Excellent comprehensive coverage of APIClient behavior.

The test suite thoroughly validates:

  • Error handling (uninitialized client)
  • Request preparation variants (basic, with conversation_id, with attachments)
  • Resource cleanup (client closing)
  • Cache key consistency (critical for cache correctness)
  • Header initialization
  • Both streaming and standard endpoint types

This provides strong confidence in the client's reliability across different configurations.

tests/unit/pipeline/evaluation/test_pipeline.py (8)

15-44: LGTM!

Clean, well-structured fixtures that provide baseline test data:

  • mock_config_loader sets up a realistic SystemConfig with API disabled by default
  • sample_evaluation_data provides minimal valid evaluation data for testing

50-84: Solid initialization validation.

The tests properly validate:

  • Successful pipeline initialization with all components
  • Defensive behavior when SystemConfig is missing

The extensive mocking is appropriate for unit testing the pipeline orchestrator, which coordinates many components.


86-142: LGTM!

The tests properly validate conditional API client creation:

  • Client is initialized when API is enabled
  • Client is None when API is disabled

This ensures the pipeline respects the configuration.


144-174: LGTM!

The test properly validates that the pipeline delegates data validation to the DataValidator component and correctly propagates the result.


176-253: Comprehensive evaluation flow testing.

The tests validate:

  • Successful evaluation flow with proper result propagation
  • Early failure when data validation fails (with appropriate error message)

This ensures the pipeline's control flow is correct for both success and failure scenarios.


255-341: Solid validation of data persistence behavior.

The tests properly validate:

  • Amended data is saved when API is enabled
  • Save failures are handled gracefully (logged but don't fail the evaluation)

This ensures data persistence is reliable without making the pipeline brittle.


343-416: LGTM!

The tests properly validate resource cleanup:

  • API client is closed when present
  • No errors occur when API client is absent

This ensures proper resource management in both configurations.


418-440: LGTM!

The test validates that the output directory can be overridden at pipeline construction, confirming the configuration is flexible as intended.

@bsatapat-jpg
Collaborator Author

Increased test case coverage: 56% -> 70%
Total tests: 387 -> 486


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (10)
tests/unit/core/models/test_api_additional.py (4)

14-35: RAGChunk tests are solid; consider tightening the extra-field assertion

These tests nicely cover creation, optional score, and the forbidden extra-field behavior. If you want stricter guarantees, you could optionally assert the validation error content (e.g., using match="extra fields not permitted" or similar) to ensure the failure reason doesn’t regress, but the current checks are already functionally useful.
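
For example, a minimal sketch of the stricter assertion (the RAGChunk field names below are placeholders, and the loose pattern tolerates differences in Pydantic's exact "extra fields" wording):

import pytest
from pydantic import ValidationError

from lightspeed_evaluation.core.models.api import RAGChunk


def test_rag_chunk_rejects_extra_fields_with_reason():
    # Field names are illustrative; reuse whatever the existing creation test passes.
    with pytest.raises(ValidationError, match="(?i)extra"):
        RAGChunk(content="chunk text", source="doc.md", unexpected_field="nope")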


37-58: Good coverage of AttachmentData defaults and overrides; optional extra-field test

The tests correctly verify both default values and a custom (attachment_type, content_type) combination. For parity with RAGChunk, you might optionally add a small test that passes an unexpected field to confirm that extra fields are also rejected here, but not strictly necessary for this PR’s goals.


60-107: APIRequest tests cover main paths; consider also exercising create with invalid query

These tests sensibly cover:

  • Simple create usage,
  • Full parameter population,
  • Attachment string inputs being wrapped into attachment objects,
  • Direct model validation on empty query.

To fully exercise the public helper, you could optionally add a test that calls APIRequest.create(query="") and asserts validation failure as well, ensuring the factory path can’t bypass the min_length constraint. Current tests are still valid and helpful as-is.
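
If you add that factory-path check, a sketch could look like this (assuming create forwards query into the validated model, as the tests above imply):

import pytest
from pydantic import ValidationError

from lightspeed_evaluation.core.models.api import APIRequest


def test_create_rejects_empty_query():
    # Ensures the factory path cannot bypass the min_length constraint on query.
    with pytest.raises(ValidationError):
        APIRequest.create(query="")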


109-166: APIResponse tests are comprehensive; a couple of minor edge cases could be added later

The suite nicely checks:

  • Basic construction with contexts,
  • Default empty contexts,
  • Tool call wiring,
  • from_raw_response happy path (including rag chunk extraction),
  • Required conversation_id enforcement.

If you decide to extend coverage further, you could add:

  • A from_raw_response case with no rag_chunks to assert contexts stays empty.
  • (Optionally) a case where response contains leading/trailing whitespace to lock in the .strip() behavior.

Not required for this PR, but would future-proof behavior around those edges.
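
A minimal sketch covering both bullets (the raw payload keys are assumptions inferred from the happy-path test; adjust them to the real from_raw_response contract):

from lightspeed_evaluation.core.models.api import APIResponse


def test_from_raw_response_without_rag_chunks():
    # Assumed minimal payload shape: response text plus conversation_id, no rag_chunks key.
    raw = {"response": "  final answer  ", "conversation_id": "conv-1"}

    api_response = APIResponse.from_raw_response(raw)

    assert api_response.contexts == []  # no chunks, so contexts stays empty
    assert api_response.response == "final answer"  # locks in the assumed .strip() behavior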

tests/unit/core/output/test_generator.py (3)

11-37: Consider consolidating duplicate fixtures across test files.

The sample_results fixture is defined identically (or nearly so) in both test_generator.py and test_statistics.py. Consider moving shared fixtures to a conftest.py file at tests/unit/core/output/conftest.py to promote reusability and maintain consistency across test modules.
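
One possible shape for that shared conftest (a sketch; move the existing fixture body over verbatim rather than the placeholder shown here). pytest discovers conftest.py fixtures automatically, so both modules keep using the sample_results parameter name unchanged:

# tests/unit/core/output/conftest.py
import pytest


@pytest.fixture
def sample_results():
    """Shared sample results for the output generator and statistics tests
    (paste the body currently duplicated in test_generator.py and test_statistics.py)."""
    results = []  # placeholder for the existing construction logic
    return results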


102-118: Consider strengthening the assertions.

Lines 117-118 use OR conditions that make the test lenient. The test could pass even if the JSON structure doesn't match expectations. Consider asserting the exact expected structure based on the actual implementation of _generate_json_summary.

For example:

-    assert "summary_stats" in data or "results" in data
-    assert data.get("total_evaluations") == 2 or len(data.get("results", [])) == 2
+    # Assert the actual expected structure
+    assert "summary_stats" in data
+    assert "results" in data
+    assert data["total_evaluations"] == 2

120-132: Consider strengthening the assertion.

Line 132 uses an OR condition that makes the test lenient. Consider asserting the specific expected content based on the actual implementation of _generate_text_summary.

tests/unit/core/system/test_setup.py (1)

66-96: Clarify “KeyError” vs “TypeError” scenario and consider slightly stronger assertions

config_data = {"environment": None} and {"environment": "invalid_type"} both currently trigger a TypeError during the **config_data.get("environment", {}) merge, so the “handles_key_error” test name/docstring are a bit misleading and the KeyError path is not actually exercised. You could either (a) rename the first test to reflect a generic invalid environment config, or (b) introduce a small custom mapping whose .get() raises KeyError if you really want coverage for that branch. Also, for the fallback-case tests, asserting all expected default keys (not just one) would make the behavior under failure more thoroughly specified.
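
A sketch of option (b), assuming setup_environment_variables takes the config mapping directly (as the existing tests do) and that LITELLM_LOG is among the fallback defaults:

import os

from lightspeed_evaluation.core.system.setup import setup_environment_variables


class RaisingMapping(dict):
    """Mapping whose .get() always raises KeyError, forcing the KeyError branch."""

    def get(self, key, default=None):
        raise KeyError(key)


def test_setup_environment_variables_handles_key_error(mocker):
    mocker.patch.dict(os.environ, {}, clear=True)

    setup_environment_variables(RaisingMapping())

    # The fallback path should still apply the defaults, not just survive the error.
    assert "LITELLM_LOG" in os.environ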

tests/unit/core/api/test_client.py (2)

67-80: Optionally assert the actual API key header rather than just “some” header update

test_setup_client_with_api_key currently only checks that mock_client.headers.update was called at least once, which would still pass if unrelated headers were updated. If feasible, you could strengthen this by asserting that one of the calls contains the expected API-key-bearing header, for example by inspecting mock_client.headers.update.call_args_list for the configured key/value.


168-185: Tighten Content-Type assertion to require both header name and value

The test_client_initialization_sets_content_type_header assertion uses or, so it passes if either "Content-Type" or "application/json" appears in a call, which could give a false positive (e.g., an unrelated Accept: application/json header). You can make this more precise by requiring both:

-        calls = mock_client.headers.update.call_args_list
-        assert any(
-            "Content-Type" in str(call) or "application/json" in str(call)
-            for call in calls
-        )
+        calls = mock_client.headers.update.call_args_list
+        assert any(
+            "Content-Type" in str(call) and "application/json" in str(call)
+            for call in calls
+        )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fa6286 and e940489.

📒 Files selected for processing (18)
  • tests/unit/core/api/test_client.py (1 hunks)
  • tests/unit/core/api/test_streaming_parser.py (1 hunks)
  • tests/unit/core/llm/test_deepeval_manager.py (1 hunks)
  • tests/unit/core/llm/test_llm_manager.py (1 hunks)
  • tests/unit/core/models/test_api_additional.py (1 hunks)
  • tests/unit/core/models/test_data_additional.py (1 hunks)
  • tests/unit/core/models/test_system_additional.py (1 hunks)
  • tests/unit/core/output/test_final_coverage.py (1 hunks)
  • tests/unit/core/output/test_generator.py (1 hunks)
  • tests/unit/core/output/test_statistics.py (1 hunks)
  • tests/unit/core/script/test_manager_additional.py (1 hunks)
  • tests/unit/core/system/test_env_validator.py (1 hunks)
  • tests/unit/core/system/test_lazy_import.py (1 hunks)
  • tests/unit/core/system/test_setup.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
  • tests/unit/runner/__init__.py (1 hunks)
  • tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/core/models/test_data_additional.py
  • tests/unit/core/llm/test_deepeval_manager.py
  • tests/unit/core/output/test_final_coverage.py
  • tests/unit/core/system/test_lazy_import.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/system/test_env_validator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/runner/__init__.py
  • tests/unit/pipeline/evaluation/test_processor.py
  • tests/unit/core/models/test_system_additional.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/system/test_setup.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/script/test_manager_additional.py
  • tests/unit/core/models/test_api_additional.py
🧬 Code graph analysis (4)
tests/unit/core/system/test_setup.py (2)
src/lightspeed_evaluation/core/models/system.py (1)
  • LoggingConfig (204-223)
src/lightspeed_evaluation/core/system/setup.py (2)
  • setup_environment_variables (10-28)
  • setup_logging (31-93)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/output/statistics.py (3)
  • bootstrap_intervals (12-34)
  • calculate_basic_stats (37-63)
  • calculate_detailed_stats (66-86)
tests/unit/core/output/test_generator.py (1)
  • sample_results (12-37)
tests/unit/core/script/test_manager_additional.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (6)
  • RAGChunk (8-17)
  • AttachmentData (20-27)
  • APIRequest (30-77)
  • APIResponse (80-116)
  • create (50-77)
  • from_raw_response (95-116)
🔇 Additional comments (35)
tests/unit/core/script/test_manager_additional.py (1)

14-85: LGTM! Error handling and Path object tests are well-structured.

The timeout, subprocess error, unexpected error, Path object, and file validation tests correctly use pytest-mock for mocking and appropriately verify exception raising and basic functionality.

tests/unit/core/output/test_generator.py (13)

40-53: LGTM!

The fixture correctly uses the mocker fixture from pytest-mock as per coding guidelines.


59-65: LGTM!

The test correctly validates initialization and uses pytest's tmp_path fixture appropriately.


67-75: LGTM!

The test correctly validates statistics calculation with the expected counts from the fixture.


77-83: LGTM!

The test correctly validates the edge case of empty results.


85-100: LGTM!

The test correctly validates CSV generation including file creation and content verification.


134-138: LGTM!

The test correctly validates the get_output_directory method.


140-163: LGTM!

The test correctly mocks datetime using pytest-mock and validates file creation for all output formats.


165-183: LGTM!

The test correctly validates the edge case of generating reports with empty results.


185-199: LGTM!

The test correctly validates selective CSV-only report generation.


201-214: LGTM!

The test correctly validates selective JSON-only report generation.


216-229: LGTM!

The test correctly validates selective TXT-only report generation.


231-253: LGTM!

The test correctly validates CSV generation with all available columns configured.


255-269: LGTM!

The test correctly validates default report generation behavior when system config is not provided.

tests/unit/core/output/test_statistics.py (17)

14-54: LGTM!

The fixture is well-designed with diverse scenarios covering multiple conversations, metrics, and result types (PASS, FAIL, ERROR).


60-70: LGTM!

The test correctly validates basic bootstrap interval calculation. Using bootstrap_steps=100 (vs production default of 10000) is acceptable for unit test performance.


72-77: LGTM!

The test correctly validates error handling for invalid confidence values.


79-84: LGTM!

The test correctly validates error handling for negative confidence values.


86-94: LGTM!

The test correctly validates custom confidence level usage.


96-102: LGTM!

The test correctly validates custom bootstrap steps usage.


108-118: LGTM!

The test correctly validates basic statistics calculation with the expected counts and rates from the fixture.


120-130: LGTM!

The test correctly validates the edge case of empty results with all counts and rates at zero.


132-151: LGTM!

The test correctly validates the all-pass scenario using programmatic test data generation.


153-172: LGTM!

The test correctly validates the all-fail scenario using programmatic test data generation.


174-192: LGTM!

The test correctly validates the all-error scenario using programmatic test data generation.


198-207: LGTM!

The test correctly validates the structure of detailed statistics including both metric and conversation breakdowns.


209-214: LGTM!

The test correctly validates the edge case of empty results.


216-226: LGTM!

The test correctly validates metric-level breakdown with expected counts.


228-238: LGTM!

The test correctly validates conversation-level breakdown with expected counts.


240-248: LGTM!

The test correctly validates that percentage rates are calculated and included in detailed statistics.


250-266: LGTM!

The test correctly validates the single metric scenario.

tests/unit/core/system/test_setup.py (2)

15-65: Environment defaults/custom/override tests match implementation well

The three tests for default envs, custom additions, and overriding LITELLM_LOG closely mirror setup_environment_variables behavior and use mocker.patch.dict correctly to isolate os.environ. No issues here.


101-236: Logging configuration tests are comprehensive and aligned with setup_logging

The logging tests cover basic/DEBUG/ERROR/WARNING source levels, package-level overrides, invalid override handling via printed warning, and default noisy-package configuration, all consistent with the provided LoggingConfig and setup_logging implementation. Using logging.getLogger(...).level assertions is appropriate given logging.basicConfig(..., force=True) resets state each time.

tests/unit/core/api/test_client.py (2)

10-58: Streaming initialization and cache tests are structured well

The basic_api_config fixture plus the streaming and cache-enabled initialization tests give good coverage of APIClient construction, including endpoint type, timeout, and cache wiring via the Cache patch. Use of mocker.patch for httpx.Client and Cache avoids real I/O and keeps tests fast.


81-167: Request preparation, cache key, lifecycle, and standard-endpoint tests look solid

The tests around _prepare_request (including conversation IDs and attachments), cache-key generation consistency, enforcing that query fails when the client is uninitialized, verifying close() delegates to the underlying HTTP client, and confirming non-streaming (endpoint_type="query") initialization all exercise meaningful behavior without over-specifying implementation details. Accessing private helpers here is reasonable for thorough unit coverage.

Also applies to: 187-202


@VladimirKadlec VladimirKadlec left a comment


Thanks for working on this!

Just be careful not to remove the current tests. Many tests were removed for some reason, and the new tests are somewhat trivial in comparison with the current ones.

I checked just my test_processor.py but there are many other files affected.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/unit/pipeline/evaluation/test_processor.py (1)

436-525: Consider consolidating duplicate fixtures.

The _pr suffixed fixtures (lines 436-525) largely duplicate the earlier fixtures (lines 27-55). While the config_loader_pr has more detailed configuration, the other fixtures are identical. Consider:

  • Using parametrization if different configurations are needed
  • Reusing the original fixtures where possible
  • Clearly documenting why separate fixtures are necessary if they must remain

Apply this pattern to consolidate:

-@pytest.fixture
-def mock_metrics_evaluator_pr(mocker):
-    """Create a mock metrics evaluator."""
-    evaluator = mocker.Mock(spec=MetricsEvaluator)
-    ...
-    return evaluator
-
-# Similar for other _pr fixtures
+# Reuse the existing processor_components fixture and only create config_loader_pr if truly needed
tests/unit/core/output/test_generator.py (1)

394-410: Simplify test_filename_timestamp_format by dropping the unused datetime patch

This test calls _generate_csv_report with a preformatted "test_20240101_120000" base filename, so the patched datetime object is never used. You can simplify the test by removing the datetime patch block and its associated comment without losing coverage, since other tests already verify timestamp wiring in generate_reports.

For example:

     def test_filename_timestamp_format(self, tmp_path, mocker):
         """Test that generated filenames include proper timestamps."""
         results = []
 
-        mocker.patch("builtins.print")
-
-        handler = OutputHandler(output_dir=str(tmp_path), base_filename="test")
-
-        # Mock datetime to get predictable timestamps
-        mock_datetime = mocker.patch(
-            "lightspeed_evaluation.core.output.generator.datetime"
-        )
-        mock_datetime.now.return_value.strftime.return_value = "20240101_120000"
-
-        csv_file = handler._generate_csv_report(results, "test_20240101_120000")
+        mocker.patch("builtins.print")
+        handler = OutputHandler(output_dir=str(tmp_path), base_filename="test")
+        csv_file = handler._generate_csv_report(results, "test_20240101_120000")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e940489 and b94c83a.

📒 Files selected for processing (4)
  • tests/unit/core/api/test_client.py (2 hunks)
  • tests/unit/core/output/test_generator.py (1 hunks)
  • tests/unit/core/output/test_statistics.py (5 hunks)
  • tests/unit/pipeline/evaluation/test_processor.py (13 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/pipeline/evaluation/test_processor.py
  • tests/unit/core/api/test_client.py
🧠 Learnings (7)
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to tests/**/*.py : Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
🧬 Code graph analysis (3)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (8)
  • OutputHandler (23-295)
  • _calculate_stats (70-79)
  • _generate_csv_report (133-169)
  • _generate_json_summary (171-206)
  • _generate_text_summary (208-291)
  • get_output_directory (293-295)
  • generate_reports (40-68)
  • _generate_individual_reports (81-103)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/output/statistics.py (3)
  • bootstrap_intervals (12-34)
  • calculate_basic_stats (37-63)
  • calculate_detailed_stats (66-86)
tests/unit/core/output/test_generator.py (1)
  • sample_results (12-37)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
  • APIConfig (117-156)
src/lightspeed_evaluation/core/api/client.py (3)
  • APIClient (22-282)
  • query (71-105)
  • _prepare_request (107-122)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
🪛 GitHub Actions: Ruff
tests/unit/core/output/test_generator.py

[error] 293-293: ruff check failed: F841 Local variable 'mock_print' is assigned to but never used. Remove assignment to unused variable 'mock_print'.


[error] 309-309: ruff check failed: F841 Local variable 'mock_print' is assigned to but never used. Remove assignment to unused variable 'mock_print'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (4)
tests/unit/pipeline/evaluation/test_processor.py (1)

1-1: Address reviewer feedback about removed tests.

Based on VladimirKadlec's comment, the original tests should not have been removed. The current structure with two test classes (TestConversationProcessor and TestConversationProcessorEvaluateTurn) may be an attempt to preserve both old and new tests, but please confirm with the reviewer that this addresses their concern about the removed tests.

tests/unit/core/api/test_client.py (1)

322-500: LGTM! Well-structured streaming configuration tests.

The new TestAPIClientConfiguration class provides comprehensive coverage for streaming endpoint initialization, caching, API key handling, request preparation, and client lifecycle management. The tests correctly use pytest-mock and follow the project's testing conventions.

tests/unit/core/output/test_generator.py (1)

11-270: OutputHandler test coverage and pytest-mock usage look solid

These tests exercise initialization, stats calculation, CSV/JSON/TXT generation, enabled_outputs behavior, csv_columns configuration, and timestamped filenames, both with and without a system_config. Using the mocker fixture everywhere (no unittest.mock) and relying on tmp_path for file IO all align with the repo’s testing guidelines. I don’t see functional issues here.

tests/unit/core/output/test_statistics.py (1)

14-537: Statistics tests comprehensively cover bootstrap and PASS/FAIL/ERROR breakdowns

The new suites around bootstrap_intervals, calculate_basic_stats, and calculate_detailed_stats look consistent with the implementation: you validate happy paths, invalid confidence handling, empty/all-pass/all-fail/all-error cases, mixed distributions, metric and conversation breakdowns, and score_statistics (including no-score scenarios). The sample_results fixture and explicit "PASS", "FAIL", and "ERROR" usages align with the established semantics for evaluation outcomes. Tests use pytest fixtures only (no unittest.mock), matching the repository’s testing guidelines.

Based on learnings, the distinction between FAIL and ERROR in the assertions matches the framework’s intended behavior.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/unit/core/output/test_generator.py (1)

288-314: Fix unused mock_print variables to satisfy Ruff F841.

Lines 293 and 309 assign mock_print but never use it. Since these tests only need to suppress output, patch print without binding the result:

     def test_output_handler_initialization_custom(self, tmp_path, mocker):
         """Test OutputHandler initialization with custom parameters."""
         system_config = mocker.Mock()
         system_config.llm.provider = "openai"

-        mock_print = mocker.patch("builtins.print")
+        mocker.patch("builtins.print")

         handler = OutputHandler(
             output_dir=str(tmp_path),
             base_filename="custom_eval",
             system_config=system_config,
         )
     def test_output_handler_creates_directory(self, tmp_path, mocker):
         """Test that OutputHandler creates output directory if it doesn't exist."""
         output_path = tmp_path / "new_output_dir"

-        mock_print = mocker.patch("builtins.print")
+        mocker.patch("builtins.print")

         handler = OutputHandler(output_dir=str(output_path))
🧹 Nitpick comments (11)
tests/unit/core/system/test_setup.py (3)

66-96: Consider tightening how warnings are asserted in env var error tests

The edge-case coverage for bad config_data["environment"] values is good, but the assertion:

captured = capsys.readouterr()
assert "Warning" in captured.out or "fallback" in captured.out

is fairly loose and tied to both:

  • The exact output stream (stdout vs stderr/logging), and
  • Specific wording in the message.

If the implementation uses logging.warning or changes the message text slightly, these tests may fail even though behavior is correct. Consider either:

  • Asserting only on side-effects (i.e., that defaults are set) and dropping the output assertion, or
  • If the implementation uses logging, switching to caplog and asserting on a structured warning record instead of free-form text.
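For illustration, a minimal caplog-based variant might look like the sketch below; the imported function name setup_environment and the "environment" key handling are assumptions about the implementation under test, so adjust them to match the real module.

import logging
import os

# Hypothetical import; the function under test in test_setup.py may have a
# different name or signature, adjust to match the implementation.
from lightspeed_evaluation.core.system.setup import setup_environment


def test_invalid_environment_value_warns_and_uses_defaults(caplog, mocker):
    """caplog-based variant: assert on the warning record, not stdout text."""
    mocker.patch.dict(os.environ, {}, clear=True)

    with caplog.at_level(logging.WARNING):
        setup_environment({"environment": "not-a-dict"})

    assert any(record.levelno == logging.WARNING for record in caplog.records)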

101-165: Logging setup tests are comprehensive; only minor potential brittleness around .level

The tests cover:

  • Basic logger creation and name ("lightspeed_evaluation"),
  • Source/package level interaction,
  • Package overrides for httpx and urllib3,
  • Default noisy package behavior for matplotlib.

This is solid coverage. One minor consideration: asserting logger.level == logging.X and pkg_logger.level == logging.X assumes those loggers are always explicitly set to that level, not left as NOTSET with behavior controlled via propagation. If setup_logging ever changes to rely more on root configuration, getEffectiveLevel() could be a more future-proof assertion.
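For example, an effective-level assertion would tolerate that refactor; the DEBUG level used here is illustrative, and the snippet assumes it runs inside an existing test after setup_logging has been called.

import logging

# Inside an existing test, after setup_logging(...) has run:
logger = logging.getLogger("lightspeed_evaluation")

# Current style: breaks if the logger is left at NOTSET and inherits from root.
assert logger.level == logging.DEBUG

# Effective-level style: follows propagation, so it survives that refactor.
assert logger.getEffectiveLevel() == logging.DEBUG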

Not a required change, just something to keep in mind.


183-236: Good coverage of logging variants and default packages; verify default package list stays in sync

These tests nicely exercise:

  • Different source/package levels (ERROR, WARNING),
  • Custom format handling (ensuring no crash),
  • Behavior when an invalid override level is provided,
  • That a set of default noisy packages ("httpx", "requests", "LiteLLM", "ragas") are all configured.

One small maintainability point: the hard-coded packages list in test_setup_logging_applies_to_all_default_packages needs to stay in sync with whatever setup_logging defines as its default noisy packages. If that list changes in the implementation, the test will need updating. A brief comment in the test pointing back to the definition in setup_logging (or centralizing the list in a shared constant) could help future readers understand the coupling.
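A rough sketch of the shared-constant idea, assuming a hypothetical DEFAULT_NOISY_PACKAGES constant exported next to setup_logging and imported by the test:

import logging

# Hypothetical shared constant; ideally defined next to setup_logging and imported
# by the test so the implementation and the test can never drift apart.
DEFAULT_NOISY_PACKAGES = ["httpx", "requests", "LiteLLM", "ragas"]


def test_setup_logging_applies_to_all_default_packages():
    # setup_logging(...) would be invoked here, as in the existing test.
    for package in DEFAULT_NOISY_PACKAGES:
        pkg_logger = logging.getLogger(package)
        # Threshold is illustrative; the real default level lives in setup_logging.
        assert pkg_logger.getEffectiveLevel() >= logging.WARNING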

Overall, these tests look good.

tests/unit/core/llm/test_deepeval_manager.py (1)

22-32: Consider verifying LiteLLMModel call arguments.

The test verifies that LiteLLMModel is called once but doesn't verify the arguments passed. To ensure all parameters are correctly propagated from llm_params to the LiteLLM constructor, consider adding assertions for the call arguments.

Apply this diff to enhance the test:

     def test_initialization(self, llm_params, mocker):
         """Test manager initialization."""
         mock_model = mocker.patch(
             "lightspeed_evaluation.core.llm.deepeval.LiteLLMModel"
         )

         manager = DeepEvalLLMManager("gpt-4", llm_params)

         assert manager.model_name == "gpt-4"
         assert manager.llm_params == llm_params
-        mock_model.assert_called_once()
+        mock_model.assert_called_once_with(
+            model="gpt-4",
+            temperature=0.5,
+            max_tokens=1024,
+            timeout=120,
+            num_retries=5,
+        )
tests/unit/core/script/test_manager_additional.py (5)

14-58: Consider patching subprocess.run where it's used, not where it's defined.

The tests patch "subprocess.run" directly (lines 22, 37, 52). Following Python mocking best practices, you should patch where the object is looked up (in the manager module) rather than where it's defined.

This makes the tests more explicit and robust.

Apply this pattern to lines 22, 37, 52:

-        mock_run = mocker.patch("subprocess.run")
+        mock_run = mocker.patch("lightspeed_evaluation.core.script.manager.subprocess.run")

60-74: Consider patching subprocess.run where it's used.

Same recommendation as above: patch "lightspeed_evaluation.core.script.manager.subprocess.run" instead of "subprocess.run" on line 69 for best practices.


87-109: Consider moving the logging import to module level.

The import logging statement on line 89 is inside the test function. While functional, it's more idiomatic to place imports at the module level.

Additionally, line 103 has the same subprocess.run patching consideration mentioned in earlier comments.

Apply this pattern:

+import logging
+
 import subprocess

 import pytest

Then remove the inline import on line 89.


111-133: Consider moving the logging import to module level.

Same as above: the import logging on line 113 should be at module level. Also, line 127 has the patching consideration.


134-155: Consider moving the logging import to module level.

Same as above: the import logging on line 136 should be at module level. Also, line 150 has the patching consideration.

tests/unit/runner/test_evaluation.py (1)

237-300: CLI main tests are robust; just keep patch targets in sync with imports

The main tests nicely validate default/custom argument parsing and non‑zero exit on failure by patching run_evaluation. One thing to keep an eye on: across this file you patch objects like ConfigLoader, DataValidator, and EvaluationPipeline at specific module paths. If the import style in lightspeed_evaluation.runner.evaluation changes (e.g., from package‑level imports to submodule imports), these patch targets may need updating to continue intercepting the correct call‑sites.

tests/unit/pipeline/evaluation/test_processor.py (1)

842-867: Note about TurnData._invalid_metrics being shared across instances

The is_metric_invalid / add_invalid_metric tests validate behavior on a single TurnData instance and look correct. One thing to be aware of in the underlying model: TurnData currently defines _invalid_metrics: set[str] = set() as a class‑level attribute, so the invalid‑metric set is effectively shared across all TurnData instances, not per‑turn. If you ever start constructing multiple TurnData objects in the same process and rely on invalid metrics being isolated per turn, this could cause surprising cross‑contamination and might be worth revisiting in a follow‑up change.
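A self-contained sketch of the pitfall using a plain class (names are illustrative); if TurnData is a Pydantic model, a private attribute with default_factory=set would give the equivalent per-instance isolation.

class Turn:
    # Class-level mutable default: every instance shares this one set object.
    _invalid_metrics: set[str] = set()

    def add_invalid_metric(self, metric: str) -> None:
        self._invalid_metrics.add(metric)


first, second = Turn(), Turn()
first.add_invalid_metric("some_metric")
# The untouched second instance now also reports the metric as invalid.
assert "some_metric" in second._invalid_metrics


# Per-instance fix for a plain class; a Pydantic model would instead use
# PrivateAttr(default_factory=set) to get the same isolation.
class IsolatedTurn:
    def __init__(self) -> None:
        self._invalid_metrics: set[str] = set()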

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b94c83a and 361546e.

📒 Files selected for processing (17)
  • tests/unit/core/api/test_client.py (2 hunks)
  • tests/unit/core/api/test_streaming_parser.py (1 hunks)
  • tests/unit/core/llm/test_deepeval_manager.py (1 hunks)
  • tests/unit/core/llm/test_llm_manager.py (1 hunks)
  • tests/unit/core/models/test_api_additional.py (1 hunks)
  • tests/unit/core/models/test_data_additional.py (1 hunks)
  • tests/unit/core/models/test_system_additional.py (1 hunks)
  • tests/unit/core/output/test_final_coverage.py (1 hunks)
  • tests/unit/core/output/test_generator.py (1 hunks)
  • tests/unit/core/output/test_statistics.py (5 hunks)
  • tests/unit/core/script/test_manager_additional.py (1 hunks)
  • tests/unit/core/system/test_env_validator.py (1 hunks)
  • tests/unit/core/system/test_lazy_import.py (1 hunks)
  • tests/unit/core/system/test_setup.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_processor.py (7 hunks)
  • tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tests/unit/core/system/test_env_validator.py
  • tests/unit/core/models/test_api_additional.py
  • tests/unit/core/models/test_data_additional.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/models/test_system_additional.py
  • tests/unit/core/output/test_final_coverage.py
  • tests/unit/core/system/test_lazy_import.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/pipeline/evaluation/test_processor.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/llm/test_deepeval_manager.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/core/script/test_manager_additional.py
  • tests/unit/core/system/test_setup.py
  • tests/unit/core/api/test_streaming_parser.py
🧠 Learnings (10)
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to tests/**/*.py : Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/runner/test_evaluation.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: All new evaluation features should be added to `src/lightspeed_evaluation/` core framework

Applied to files:

  • tests/unit/runner/test_evaluation.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Do not add features to the legacy `lsc_agent_eval/` directory; use `src/lightspeed_evaluation/` instead

Applied to files:

  • tests/unit/runner/test_evaluation.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • tests/unit/core/api/test_streaming_parser.py
🧬 Code graph analysis (7)
tests/unit/pipeline/evaluation/test_processor.py (15)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricManager (17-166)
  • resolve_metrics (24-48)
  • MetricLevel (10-14)
  • count_metrics_for_conversation (145-166)
src/lightspeed_evaluation/core/models/data.py (4)
  • EvaluationData (309-367)
  • EvaluationRequest (426-479)
  • EvaluationResult (370-409)
  • TurnData (35-306)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (271-301)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (70-125)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (2)
  • APIDataAmender (13-80)
  • amend_single_turn (20-69)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (4)
  • EvaluationErrorHandler (10-201)
  • mark_all_metrics_as_error (17-77)
  • mark_turn_metrics_as_error (79-125)
  • mark_cascade_failure (127-193)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (25-172)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (8)
  • ConversationProcessor (37-299)
  • ProcessorComponents (27-34)
  • process_conversation (46-179)
  • _evaluate_turn (181-214)
  • _evaluate_conversation (216-243)
  • _run_setup_script (245-267)
  • _run_cleanup_script (269-295)
  • get_metrics_summary (297-299)
tests/unit/pipeline/evaluation/test_pipeline.py (1)
  • mock_config_loader (16-27)
tests/unit/core/metrics/test_manager.py (1)
  • system_config (14-50)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
tests/unit/core/llm/test_deepeval_manager.py (1)
  • test_initialization (22-32)
tests/unit/core/output/test_generator.py (1)
  • test_initialization (59-65)
src/lightspeed_evaluation/core/llm/custom.py (1)
  • call (18-75)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/output/statistics.py (3)
  • bootstrap_intervals (12-34)
  • calculate_basic_stats (37-63)
  • calculate_detailed_stats (66-86)
tests/unit/core/output/test_generator.py (1)
  • sample_results (12-37)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
  • APIConfig (117-156)
src/lightspeed_evaluation/core/api/client.py (3)
  • APIClient (22-282)
  • query (71-105)
  • _prepare_request (107-122)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
tests/unit/core/llm/test_deepeval_manager.py (1)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
tests/unit/runner/test_evaluation.py (5)
tests/unit/pipeline/evaluation/test_pipeline.py (2)
  • test_run_evaluation_success (176-222)
  • mock_config_loader (16-27)
tests/unit/core/metrics/test_manager.py (1)
  • system_config (14-50)
src/lightspeed_evaluation/core/system/loader.py (1)
  • load_system_config (79-108)
src/lightspeed_evaluation/core/system/validator.py (1)
  • load_evaluation_data (92-144)
tests/unit/pipeline/evaluation/test_processor.py (1)
  • mock_config_loader (28-34)
tests/unit/core/script/test_manager_additional.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • _parse_sse_line (63-72)
🪛 GitHub Actions: Ruff
tests/unit/pipeline/evaluation/test_processor.py

[error] 553-553: F811 Redefinition of unused processor_components from line 38. Remove definition: processor_components.

tests/unit/core/output/test_generator.py

[error] 293-293: F841 Local variable mock_print is assigned to but never used. Remove assignment to unused variable mock_print.


[error] 309-309: F841 Local variable mock_print is assigned to but never used. Remove assignment to unused variable mock_print.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
  • GitHub Check: Pylinter
🔇 Additional comments (25)
tests/unit/core/system/test_setup.py (1)

15-65: Env var setup tests are clear and align with pytest/pytest-mock guidelines

  • Good use of mocker.patch.dict(os.environ, {}, clear=True) to fully isolate environment per test.
  • Scenarios for defaults, custom vars, and overrides are well-separated and readable.
  • Naming and structure follow the requested Test*/test_* conventions.

No changes needed here.

tests/unit/core/output/test_statistics.py (4)

57-102: LGTM!

The bootstrap interval tests appropriately cover:

  • Basic functionality with ordering validation
  • Invalid confidence values (both > 100 and negative)
  • Custom configuration parameters

Using reduced bootstrap steps (50-100) for faster test execution is a reasonable trade-off for unit tests.


105-273: LGTM!

Comprehensive test coverage for calculate_basic_stats:

  • Empty results handling
  • Homogeneous results (all PASS, all FAIL, all ERROR)
  • Mixed results with rate calculations
  • Single result edge case

The assertions correctly validate both counts and percentage rates.


276-537: LGTM!

Thorough test coverage for calculate_detailed_stats:

  • Metric and conversation breakdowns
  • Rate calculations at the detailed level
  • Score statistics (mean, median, std, min, max)
  • Edge cases including no-score ERROR results

Good use of pytest.approx for floating-point comparisons at line 509.


14-54: Well-structured fixture with diverse test data.

The sample_results fixture provides good variety:

  • Multiple conversations (conv1, conv2)
  • Multiple metrics (metric1, metric2)
  • All result types (PASS, FAIL, ERROR)
  • Both valid scores and None for ERROR case
tests/unit/core/output/test_generator.py (3)

11-53: LGTM!

Well-designed fixtures:

  • sample_results provides a minimal but representative test dataset with PASS/FAIL cases
  • mock_system_config properly uses the mocker fixture from pytest-mock as per coding guidelines

56-269: LGTM!

Comprehensive test coverage for OutputHandler:

  • Initialization with various parameters
  • Statistics calculation (with results and empty)
  • Report generation for all formats (CSV, JSON, TXT)
  • Proper datetime mocking for deterministic timestamps
  • Default behavior when system config is absent

316-410: LGTM!

Good additional test coverage:

  • CSV generation with specific result data and content verification
  • Column configuration respects system config
  • Timestamp formatting in filenames

The local import csv as csv_module (lines 350, 386) works fine within test functions.

tests/unit/core/llm/test_deepeval_manager.py (7)

1-6: LGTM!

The imports and module docstring are clean and follow pytest conventions.


8-16: LGTM!

The fixture provides comprehensive test parameters covering all configurable LLM options.


34-45: LGTM!

The test correctly verifies that the default temperature value (0.0) is applied when not provided in the parameters.


47-58: LGTM!

The test correctly verifies that the default num_retries value (3) is applied when not provided in the parameters.


60-71: LGTM!

The test correctly verifies that get_llm() returns the LiteLLMModel instance created during initialization.


73-84: LGTM!

The test thoroughly verifies that get_model_info() returns all expected configuration parameters with correct values.


86-94: LGTM!

The test correctly verifies the initialization message is printed to stdout using pytest's capsys fixture.

tests/unit/core/script/test_manager_additional.py (2)

1-9: LGTM!

Module imports and docstring are well-structured and follow the coding guidelines.


76-85: LGTM!

Good test verifying that directories are rejected with the appropriate error message.

tests/unit/core/api/test_streaming_parser.py (4)

13-128: Comprehensive coverage for parse_streaming_response looks solid

The tests exercise the key control-flow paths (happy path, tool calls, missing final response/ID, error events, empty/non‑data lines, multiple tool calls) and align with the expected SSE format (data: {..}). No issues from a correctness or style perspective.


131-173: _parse_sse_line tests match implementation semantics

The cases for valid JSON, invalid JSON → None, and missing event / data defaulting to "" / {} match the current implementation in streaming_parser._parse_sse_line. This is a good level of granularity for a private helper.


175-221: Tool‑call parsing edge cases are well covered

The _parse_tool_call tests cover the expected structure (valid dict), missing keys, empty arguments, and non‑dict tokens. This matches how the helper is intended to be used and should make future refactors safer.


223-253: Tool sequence formatting tests align with downstream expectations

_format_tool_sequences is validated for empty, single, and multiple tool calls, asserting the “list of lists” structure that the rest of the pipeline expects. This is clear and idiomatic pytest.

tests/unit/runner/test_evaluation.py (1)

6-236: Runner run_evaluation tests give good behavioral coverage

The tests cover success, output‑dir override, config load failure, evaluation‑data failure, error statistics reporting, and ensuring the pipeline is closed on exceptions. Mock wiring through ConfigLoader, DataValidator, EvaluationPipeline, OutputHandler, and calculate_basic_stats is clear, and everything stays in pytest + mocker. No blocking issues here.

tests/unit/pipeline/evaluation/test_processor.py (2)

27-291: ConversationProcessor high‑level tests are thorough and consistent

The new fixtures (mock_config_loader, sample_conv_data) and the TestConversationProcessor methods cover:

  • Initialization wiring,
  • No‑metrics short‑circuit,
  • Turn vs conversation metrics,
  • Setup/cleanup script behavior (success, failure, always‑cleanup),
  • API amendment behavior and cascade‑failure on API error,
  • Metrics summary forwarding.

They align well with ConversationProcessor.process_conversation, _run_setup_script, _run_cleanup_script, and get_metrics_summary logic in processor.py, and comply with the pytest + mocker guidelines.


375-427: _evaluate_turn unit tests correctly pin invalid‑metric and ordering behavior

The tests around _evaluate_turn (both the simpler one around line ~375 and the detailed TestConversationProcessorEvaluateTurn class) accurately reflect the implementation:

  • Invalid metrics produce ERROR results, log the expected message, and do not call metrics_evaluator.evaluate_metric.
  • Mixed valid/invalid lists preserve metric order in the results and only evaluate valid metrics.
  • Empty metric lists short‑circuit cleanly.
  • The generated EvaluationRequest instances are asserted in detail (ids, index, identifiers).
  • Handling of evaluate_metric returning None is exercised.

This is exactly the kind of focused coverage that will keep refactors safe.

Also applies to: 579-791

tests/unit/core/api/test_client.py (1)

22-500: Additional APIClient configuration/streaming tests are well‑aligned

The new basic_api_config fixture and TestAPIClientConfiguration class nicely extend coverage into:

  • Streaming endpoint initialization,
  • Cache construction and key stability,
  • Endpoint‑type validation,
  • API key header wiring,
  • Guarding against uninitialized clients,
  • Request preparation (basic, with conversation_id, with attachments),
  • Client close behavior,
  • Content‑Type header setup,
  • Standard vs streaming endpoint types.

All of this stays within pytest + mocker and matches the documented behavior of APIClient. No changes needed here.

@bsatapat-jpg
Collaborator Author

Test case coverage has now increased:
56% -> 76%

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (7)
tests/unit/core/llm/test_llm_manager.py (1)

142-142: Remove unused parameter basic_llm_config.

The basic_llm_config fixture parameter is declared but never used in this test. The test creates its own config instead.

Apply this diff to clean up the signature:

-    def test_initialization_generic_provider(self, basic_llm_config, mocker, capsys):
+    def test_initialization_generic_provider(self, mocker, capsys):
tests/unit/core/models/test_system_additional.py (5)

48-54: Consider more specific assertions.

The assertions config.provider is not None and config.model is not None only verify non-None values. Consider asserting against the actual expected default values (e.g., from DEFAULT_EMBEDDING_PROVIDER and DEFAULT_EMBEDDING_MODEL constants) for more robust validation.
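For example, a sketch that imports the defaults instead of asserting non-None; the constant names and their import location are hypothetical.

# Constant names and import locations are hypothetical; import the defaults from
# wherever the implementation actually defines them.
from lightspeed_evaluation.core.models.system import (
    DEFAULT_EMBEDDING_MODEL,
    DEFAULT_EMBEDDING_PROVIDER,
    EmbeddingConfig,
)

config = EmbeddingConfig()
assert config.provider == DEFAULT_EMBEDDING_PROVIDER
assert config.model == DEFAULT_EMBEDDING_MODEL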


70-76: Consider more specific assertions.

The isinstance checks for boolean fields and timeout > 0 provide minimal validation. Consider asserting against expected default values for more thorough testing.


116-119: Assertion doesn't verify the actual value.

Line 119 asserts len(config.csv_columns) >= 1 after setting csv_columns=["result"], which only confirms non-empty. Consider asserting the exact value: assert config.csv_columns == ["result"] for more precise validation.


125-131: Consider more specific assertions.

Similar to other default value tests, the generic assertions (isinstance, > 0, length checks) provide minimal validation. Consider checking against expected default values from the model constants.


155-161: Consider more specific assertions.

The is not None checks and isinstance assertion provide minimal validation. Consider checking against expected default values from the model constants for more thorough testing.

tests/unit/core/script/test_manager_additional.py (1)

14-58: Consider patching subprocess.run at the import location.

The tests patch "subprocess.run" globally. While this works, best practice is to patch where the function is imported and used.

Apply this pattern for more precise mocking:

-        mock_run = mocker.patch("subprocess.run")
+        mock_run = mocker.patch("lightspeed_evaluation.core.script.manager.subprocess.run")

This makes the mock more targeted and reduces the risk of affecting other tests or code that uses subprocess.

Note: This applies to all tests in this file that mock subprocess.run (lines 22, 37, 52, 69, 103, 127, 150).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 361546e and 60e5caa.

📒 Files selected for processing (17)
  • tests/unit/core/api/test_client.py (2 hunks)
  • tests/unit/core/api/test_streaming_parser.py (1 hunks)
  • tests/unit/core/llm/test_deepeval_manager.py (1 hunks)
  • tests/unit/core/llm/test_llm_manager.py (1 hunks)
  • tests/unit/core/models/test_api_additional.py (1 hunks)
  • tests/unit/core/models/test_data_additional.py (1 hunks)
  • tests/unit/core/models/test_system_additional.py (1 hunks)
  • tests/unit/core/output/test_final_coverage.py (1 hunks)
  • tests/unit/core/output/test_generator.py (1 hunks)
  • tests/unit/core/output/test_statistics.py (5 hunks)
  • tests/unit/core/script/test_manager_additional.py (1 hunks)
  • tests/unit/core/system/test_env_validator.py (1 hunks)
  • tests/unit/core/system/test_lazy_import.py (1 hunks)
  • tests/unit/core/system/test_setup.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_processor.py (9 hunks)
  • tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • tests/unit/core/models/test_data_additional.py
  • tests/unit/core/models/test_api_additional.py
  • tests/unit/core/system/test_env_validator.py
  • tests/unit/core/system/test_setup.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/core/llm/test_deepeval_manager.py
  • tests/unit/core/output/test_final_coverage.py
  • tests/unit/core/system/test_lazy_import.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/models/test_system_additional.py
  • tests/unit/core/script/test_manager_additional.py
  • tests/unit/pipeline/evaluation/test_processor.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/api/test_client.py
  • tests/unit/core/llm/test_llm_manager.py
🧠 Learnings (10)
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • tests/unit/core/models/test_system_additional.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/models/test_system_additional.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to tests/**/*.py : Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • tests/unit/core/api/test_streaming_parser.py
🧬 Code graph analysis (7)
tests/unit/core/output/test_statistics.py (1)
src/lightspeed_evaluation/core/output/statistics.py (3)
  • bootstrap_intervals (12-34)
  • calculate_basic_stats (37-63)
  • calculate_detailed_stats (66-86)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (8)
  • OutputHandler (23-295)
  • _calculate_stats (70-79)
  • _generate_csv_report (133-169)
  • _generate_json_summary (171-206)
  • _generate_text_summary (208-291)
  • get_output_directory (293-295)
  • generate_reports (40-68)
  • _generate_individual_reports (81-103)
tests/unit/core/models/test_system_additional.py (1)
src/lightspeed_evaluation/core/models/system.py (6)
  • LLMConfig (36-75)
  • EmbeddingConfig (78-114)
  • APIConfig (117-156)
  • OutputConfig (159-201)
  • VisualizationConfig (226-252)
  • LoggingConfig (204-223)
tests/unit/core/script/test_manager_additional.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • _parse_sse_line (63-72)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
  • APIConfig (117-156)
src/lightspeed_evaluation/core/api/client.py (2)
  • query (71-105)
  • _prepare_request (107-122)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
tests/unit/core/llm/test_llm_manager.py (2)
src/lightspeed_evaluation/core/models/system.py (2)
  • LLMConfig (36-75)
  • SystemConfig (271-301)
src/lightspeed_evaluation/core/llm/manager.py (4)
  • LLMManager (10-117)
  • get_llm_params (95-103)
  • get_config (105-107)
  • from_llm_config (115-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
  • GitHub Check: Pylinter
  • GitHub Check: mypy
🔇 Additional comments (21)
tests/unit/core/llm/test_llm_manager.py (4)

1-7: LGTM!

The imports are clean and properly organized. All necessary modules are imported for the test suite.


9-19: LGTM!

The fixture provides a clean, reusable baseline configuration for tests, reducing code duplication.


25-140: LGTM!

Comprehensive provider coverage with consistent test patterns. The Azure tests appropriately cover both deployment scenarios, and all assertions correctly verify provider-specific model name formats.


160-245: LGTM!

Excellent coverage of the manager's API methods and behavior. The tests verify parameter propagation, config retrieval, class method constructors, and initialization output. All assertions are appropriate and thorough.

tests/unit/core/models/test_system_additional.py (3)

16-43: LGTM! Boundary validation tests are correct.

These tests properly validate the Pydantic field constraints defined in the LLMConfig model, covering edge cases for temperature, max_tokens, timeout, and num_retries.


133-144: Excellent test of tuple-to-list conversion.

The test correctly verifies that Pydantic converts the tuple (12, 8) to a list [12, 8], and the comment on line 144 provides helpful documentation of this behavior.


175-181: LGTM! Boolean toggle test is clear and correct.

The test properly verifies that the show_timestamps configuration can be set to both True and False, using two separate config instances for clarity.

tests/unit/core/api/test_streaming_parser.py (1)

1-253: Excellent test coverage for streaming parser!

The test suite comprehensively covers streaming response parsing, SSE line parsing, tool call parsing, and tool sequence formatting. Tests validate both successful parsing and error handling paths, including edge cases like missing fields, invalid JSON, and empty data.

tests/unit/core/api/test_client.py (2)

22-33: LGTM!

The basic_api_config fixture provides appropriate test configuration for streaming endpoint tests.


322-500: Unable to verify the duplication claim for TestAPIClientConfiguration due to repository access limitations.

The AI summary suggests that the TestAPIClientConfiguration class may appear twice in tests/unit/core/api/test_client.py, but this could not be confirmed through direct code inspection. Manual verification is required to:

  1. Search the full file for all class TestAPIClientConfiguration definitions
  2. Confirm whether multiple definitions exist
  3. If duplicates are found, remove the shadowed class definition(s)
tests/unit/pipeline/evaluation/test_processor.py (3)

27-72: Well-structured test fixtures.

The fixtures provide comprehensive mocking for configuration, processor components, and sample data. The default behavior configuration in processor_components is appropriate for test isolation.


74-481: Comprehensive test coverage for ConversationProcessor.

The test class thoroughly validates processor behavior across multiple scenarios including metric processing, script execution, API amendment, and error handling. Tests properly use mocks and fixtures to isolate behavior.


576-867: Excellent coverage for _evaluate_turn method.

The tests thoroughly validate metric evaluation behavior including handling of valid metrics, invalid metrics, empty metrics, and mixed scenarios. Tests properly verify both results and side effects like logging and evaluator calls.

tests/unit/core/script/test_manager_additional.py (1)

60-155: Good coverage for script execution edge cases.

The tests validate Path object handling, error conditions, and logging behavior. The tests appropriately verify return values and exception raising for various scenarios.

tests/unit/core/output/test_statistics.py (4)

14-54: Excellent test fixture design!

The sample_results fixture provides a well-structured set of test data covering multiple scenarios (PASS/FAIL/ERROR, multiple conversations, multiple metrics, score variations including None). This fixture enables comprehensive testing across multiple test methods.


57-103: Comprehensive bootstrap interval testing!

The test coverage for bootstrap_intervals is excellent:

  • Basic functionality validation (line ordering, bounds checking)
  • Invalid confidence values (both over 100 and negative)
  • Custom confidence levels and bootstrap steps

This thoroughly validates the function's behavior and error handling.


105-273: Excellent coverage of basic statistics scenarios!

The TestCalculateBasicStats class covers all important edge cases:

  • Normal mixed results
  • Empty results
  • All pass/fail/error scenarios
  • Single result edge case

The assertions correctly verify counts and percentage calculations (e.g., 2/4 = 50% pass rate). This comprehensive coverage ensures the statistics function handles all scenarios correctly.


276-537: Thorough detailed statistics validation!

The TestCalculateDetailedStats class provides excellent coverage:

  • Structure validation (by_metric, by_conversation)
  • Metric and conversation breakdowns
  • Rate calculations
  • Score statistics including edge cases (no scores)
  • Single and multiple metric/conversation scenarios

The test assertions correctly validate the expected breakdowns and statistical calculations. Great work on edge case coverage!

tests/unit/core/output/test_generator.py (3)

11-53: Well-designed fixtures for output handler testing!

The fixtures provide appropriate test data:

  • sample_results: includes query/response fields relevant to output generation testing
  • mock_system_config: configures output formats, CSV columns, and graph settings

These fixtures enable comprehensive testing of the OutputHandler's report generation capabilities.


56-270: Excellent OutputHandler test coverage!

The TestOutputHandler class thoroughly validates:

  • Initialization and configuration
  • Statistics calculation (with data and empty)
  • Individual report format generation (CSV, JSON, TXT)
  • End-to-end report generation with mocked timestamps
  • Partial format generation
  • Column configuration
  • Default behavior without config

The use of mocked datetime for predictable timestamps (lines 144-150, 169-175) is particularly good practice for testing filename generation.


272-411: Comprehensive initialization and configuration testing!

The TestOutputHandlerInitialization class provides thorough validation of:

  • Default vs. custom initialization
  • Directory creation
  • CSV content and column configuration
  • Filename timestamp formatting

Note: The mock_print usage is correctly handled throughout - line 277 assigns and uses it (line 286), while lines 293, 309, 341, 375, 398 just patch without assignment.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/unit/core/llm/test_llm_manager.py (1)

156-158: Consider strengthening the assertion for generic provider warning.

The OR condition allows the test to pass if either "generic" or "warning" appears in the output. If the production code is expected to print both, consider using AND to ensure the warning is meaningful. However, if only one is required, the current assertion is acceptable.

-        assert "generic" in captured.out.lower() or "warning" in captured.out.lower()
+        # If both should appear, use AND; otherwise keep current behavior
+        assert "generic" in captured.out.lower()
tests/unit/pipeline/evaluation/test_processor.py (3)

77-291: ConversationProcessor end‑to‑end tests look solid; any extra call‑level assertions are optional

The TestConversationProcessor methods collectively exercise the key flows: init wiring, no‑metrics short‑circuit, turn vs conversation metrics, setup/cleanup behavior (success + failure paths), API amendment, cascade failure, and metrics summary. The mocking pattern around MetricManager, ScriptExecutionManager, APIDataAmender, EvaluationErrorHandler, and MetricsEvaluator is consistent and readable.

If you ever want to tighten these further, you could (optionally) add expectations on metrics_evaluator.evaluate_metric call counts / call args in a couple of the process‑conversation tests to pin behavior more precisely, but it’s not required for correctness.
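For reference, such an assertion might look roughly like the sketch below; the fixture names follow the surrounding tests, and the EvaluationRequest field name is an assumption.

# After processor.process_conversation(sample_conv_data) in one of the
# turn-metric tests; attribute paths mirror the mocked fixtures and are illustrative.
evaluate_metric = processor_components.metrics_evaluator.evaluate_metric

# One resolved turn metric should translate into exactly one evaluation call.
evaluate_metric.assert_called_once()

request = evaluate_metric.call_args.args[0]
# Field names on EvaluationRequest are assumptions; adjust to the real model.
assert request.conv_data is sample_conv_data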


429-481: Consider asserting the warning log in test_run_cleanup_script_logs_warning_on_failure

The test currently only verifies that _run_cleanup_script doesn’t raise when run_script returns False, while the docstring says the failure “is logged as warning”. To fully exercise that contract, you could use caplog to assert that a WARNING‑level message was emitted, e.g. by adding caplog to the test signature and wrapping the call in caplog.at_level(logging.WARNING) before checking caplog.text for the expected substring.
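A sketch of that caplog variant, assuming the processor fixture wires a mocked ScriptExecutionManager as in the surrounding tests; the attribute path and message text are illustrative.

import logging


def test_run_cleanup_script_logs_warning_on_failure(processor, caplog):
    # Simulate the cleanup script failing: run_script returns False instead of raising.
    # The attribute path to the mocked ScriptExecutionManager is illustrative.
    processor.components.script_manager.run_script.return_value = False

    with caplog.at_level(logging.WARNING):
        processor._run_cleanup_script("cleanup.sh")  # argument shape is illustrative

    assert any(record.levelno == logging.WARNING for record in caplog.records)
    assert "cleanup" in caplog.text.lower()  # exact message text is an assumption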

This would strengthen the test without changing behavior.


553-574: PR‑specific processor fixtures work, but the naming could be clearer

processor_components_pr and the processor fixture nicely reuse the lower fixtures to focus the _evaluate_turn tests, and they avoid the previous processor_components redefinition issue.

Minor nit: the _pr suffix is a bit opaque in the test context; consider a more descriptive name like processor_components_eval_turn (and processor_eval_turn if you want to be explicit) to make the fixture’s purpose clearer to future readers.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60e5caa and f017e96.

📒 Files selected for processing (17)
  • tests/unit/core/api/test_client.py (2 hunks)
  • tests/unit/core/api/test_streaming_parser.py (1 hunks)
  • tests/unit/core/llm/test_deepeval_manager.py (1 hunks)
  • tests/unit/core/llm/test_llm_manager.py (1 hunks)
  • tests/unit/core/models/test_api_additional.py (1 hunks)
  • tests/unit/core/models/test_data_additional.py (1 hunks)
  • tests/unit/core/models/test_system_additional.py (1 hunks)
  • tests/unit/core/output/test_final_coverage.py (1 hunks)
  • tests/unit/core/output/test_generator.py (1 hunks)
  • tests/unit/core/output/test_statistics.py (6 hunks)
  • tests/unit/core/script/test_manager_additional.py (1 hunks)
  • tests/unit/core/system/test_env_validator.py (1 hunks)
  • tests/unit/core/system/test_lazy_import.py (1 hunks)
  • tests/unit/core/system/test_setup.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_pipeline.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_processor.py (9 hunks)
  • tests/unit/runner/test_evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tests/unit/core/models/test_data_additional.py
  • tests/unit/core/output/test_final_coverage.py
  • tests/unit/runner/test_evaluation.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/system/test_env_validator.py
  • tests/unit/core/system/test_setup.py
  • tests/unit/core/models/test_api_additional.py
  • tests/unit/core/system/test_lazy_import.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention test_*.py for files, test_* for functions, and Test* for classes

Files:

  • tests/unit/pipeline/evaluation/test_processor.py
  • tests/unit/pipeline/evaluation/test_pipeline.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/llm/test_deepeval_manager.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/models/test_system_additional.py
  • tests/unit/core/llm/test_llm_manager.py
  • tests/unit/core/script/test_manager_additional.py
  • tests/unit/core/api/test_client.py
🧠 Learnings (9)
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-29T05:20:28.916Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:20:28.916Z
Learning: When reviewing code and raising concerns or suggestions, automatically resolve the discussion thread once the concern has been satisfactorily addressed by the user (e.g., through confirmation, explanation, or showing existing documentation) rather than leaving it for the user to resolve manually. This improves workflow efficiency.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-11-24T16:59:21.420Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-24T16:59:21.420Z
Learning: Applies to tests/**/*.py : Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking

Applied to files:

  • tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • tests/unit/core/output/test_statistics.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • tests/unit/core/models/test_system_additional.py
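
As a rough illustration of the direct-attribute-access point above, a minimal sketch with stand-in Pydantic models follows; apart from `enabled_outputs` (named in the learning), the fields and defaults are assumptions.

```python
# Stand-in config models; only `enabled_outputs` is taken from the note above,
# everything else is assumed for illustration.
from pydantic import BaseModel, Field


class OutputConfigSketch(BaseModel):
    enabled_outputs: list[str] = Field(default_factory=lambda: ["csv", "json"])


class SystemConfigSketch(BaseModel):
    output: OutputConfigSketch = Field(default_factory=OutputConfigSketch)


def test_direct_attribute_access():
    config = SystemConfigSketch()
    # Structured Pydantic models allow attribute access rather than dict lookups.
    assert "csv" in config.output.enabled_outputs
```
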
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/models/test_system_additional.py
🧬 Code graph analysis (6)
tests/unit/pipeline/evaluation/test_pipeline.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (271-301)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (70-125)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)
  • EvaluationPipeline (32-205)
  • validate_data (117-119)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (8)
  • OutputHandler (23-295)
  • _calculate_stats (70-79)
  • _generate_csv_report (133-169)
  • _generate_json_summary (171-206)
  • _generate_text_summary (208-291)
  • get_output_directory (293-295)
  • generate_reports (40-68)
  • _generate_individual_reports (81-103)
tests/unit/core/output/test_statistics.py (2)
src/lightspeed_evaluation/core/output/statistics.py (3)
  • bootstrap_intervals (12-34)
  • calculate_basic_stats (37-63)
  • calculate_detailed_stats (66-86)
tests/unit/core/output/test_generator.py (1)
  • sample_results (12-37)
tests/unit/core/models/test_system_additional.py (1)
src/lightspeed_evaluation/core/models/system.py (5)
  • LLMConfig (36-75)
  • EmbeddingConfig (78-114)
  • APIConfig (117-156)
  • OutputConfig (159-201)
  • LoggingConfig (204-223)
tests/unit/core/script/test_manager_additional.py (1)
src/lightspeed_evaluation/core/script/manager.py (1)
  • ScriptExecutionManager (14-117)
tests/unit/core/api/test_client.py (3)
src/lightspeed_evaluation/core/models/system.py (1)
  • APIConfig (117-156)
src/lightspeed_evaluation/core/api/client.py (3)
  • APIClient (22-282)
  • query (71-105)
  • _prepare_request (107-122)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.12)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
🔇 Additional comments (12)
tests/unit/core/models/test_system_additional.py (1)

1-181: LGTM! Well-structured unit tests for system configuration models.

The tests properly validate Pydantic model configurations including boundary conditions (temperature min/max, timeout validation), default values, and custom configurations. Good coverage of LLMConfig, EmbeddingConfig, APIConfig, OutputConfig, VisualizationConfig, and LoggingConfig.
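
To illustrate the boundary-condition pattern these tests use, a hedged sketch follows; `LLMConfigSketch` and its 0.0-1.0 temperature bound are assumptions, not the real LLMConfig in core/models/system.py.

```python
# Hypothetical boundary-condition test; the real LLMConfig (core/models/system.py)
# may use different field names and bounds.
import pytest
from pydantic import BaseModel, Field, ValidationError


class LLMConfigSketch(BaseModel):
    model: str
    temperature: float = Field(default=0.0, ge=0.0, le=1.0)  # assumed range


def test_temperature_accepts_boundary_values():
    assert LLMConfigSketch(model="gpt-4o-mini", temperature=0.0).temperature == 0.0
    assert LLMConfigSketch(model="gpt-4o-mini", temperature=1.0).temperature == 1.0


def test_temperature_rejects_out_of_range():
    with pytest.raises(ValidationError):
        LLMConfigSketch(model="gpt-4o-mini", temperature=1.5)
```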

tests/unit/core/llm/test_llm_manager.py (1)

1-155: LGTM! Comprehensive LLM Manager tests with good provider coverage.

The tests properly validate initialization across multiple providers (OpenAI, Azure, WatsonX, Anthropic, Gemini, Vertex, Ollama, hosted_vllm) with correct use of pytest-mock. Good coverage of get_model_name, get_llm_params, get_config, and factory methods.

Also applies to: 159-245
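
As a sketch of the provider-naming coverage described above, assuming LiteLLM-style `provider/model` prefixes (the real manager's rules may differ), with `build_model_name` as a stand-in:

```python
# `build_model_name` is a stand-in for the manager's naming logic, assumed to
# follow LiteLLM-style "<provider>/<model>" prefixes for non-OpenAI providers.
import pytest


def build_model_name(provider: str, model: str) -> str:
    return model if provider == "openai" else f"{provider}/{model}"


@pytest.mark.parametrize(
    "provider,model,expected",
    [
        ("openai", "gpt-4o-mini", "gpt-4o-mini"),
        ("ollama", "llama3", "ollama/llama3"),
        ("watsonx", "granite-13b", "watsonx/granite-13b"),
        ("hosted_vllm", "mistral", "hosted_vllm/mistral"),
    ],
)
def test_provider_model_naming(provider, model, expected):
    assert build_model_name(provider, model) == expected
```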

tests/unit/core/output/test_statistics.py (1)

1-181: LGTM! Comprehensive statistics tests with retained bootstrap coverage.

The bootstrap interval tests (valid_confidence, edge_cases, confidence_levels) are properly retained as previously requested. Good coverage of calculate_basic_stats and calculate_detailed_stats with various scenarios including empty results, all-pass/fail/error, and mixed results.
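
For readers unfamiliar with the bootstrap intervals being exercised, here is a self-contained sketch of the idea; the real implementation in core/output/statistics.py may differ in signature and method.

```python
# Percentile-bootstrap sketch; not the project's bootstrap_intervals implementation.
import random


def bootstrap_interval(scores, confidence=0.95, n_resamples=1000, seed=42):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lower = means[int(((1 - confidence) / 2) * n_resamples)]
    upper = means[int((1 - (1 - confidence) / 2) * n_resamples) - 1]
    return lower, upper


def test_interval_contains_sample_mean():
    scores = [0.7, 0.75, 0.8, 0.85, 0.9]
    low, high = bootstrap_interval(scores)
    assert low <= sum(scores) / len(scores) <= high
```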

tests/unit/core/api/test_client.py (1)

1-500: LGTM! Thorough API client tests covering streaming, error handling, and configuration.

The tests properly validate APIClient behavior including streaming endpoint handling, error scenarios (HTTP errors, timeouts, missing fields), tool call formatting, cache key generation, and client lifecycle management. Good use of pytest-mock for mocking httpx.Client and related dependencies.
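
The httpx mocking pattern these tests rely on can be sketched as follows; `QueryClient` is a simplified stand-in, not the project's APIClient, and the endpoint path and payload keys are assumptions.

```python
# Simplified stand-in client; the real APIClient (core/api/client.py) differs.
import httpx


class QueryClient:
    def __init__(self, api_base: str):
        self.http = httpx.Client(base_url=api_base)

    def query(self, text: str) -> str:
        resp = self.http.post("/v1/query", json={"query": text})  # assumed endpoint
        resp.raise_for_status()
        return resp.json()["response"]  # assumed response key


def test_query_returns_response(mocker):
    fake_resp = mocker.Mock()
    fake_resp.json.return_value = {"response": "hello"}
    mocker.patch.object(httpx.Client, "post", return_value=fake_resp)

    client = QueryClient("http://test.example")
    assert client.query("hi") == "hello"
```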

tests/unit/core/llm/test_deepeval_manager.py (1)

1-94: LGTM! Well-structured DeepEval LLM Manager tests.

The tests properly validate DeepEvalLLMManager initialization with default and custom parameters, verify LiteLLMModel instantiation, and check get_llm and get_model_info methods. Good use of pytest-mock for isolating the LiteLLMModel dependency.
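
A minimal sketch of that isolation pattern (patching the model dependency with pytest-mock) follows; `ManagerSketch` and the locally defined `LiteLLMModel` class are stand-ins, not the real classes or import paths.

```python
# Stand-ins for the real DeepEvalLLMManager and its LiteLLMModel dependency.
class LiteLLMModel:
    def __init__(self, model: str, temperature: float = 0.0):
        self.model = model
        self.temperature = temperature


class ManagerSketch:
    def __init__(self, model: str, temperature: float = 0.0):
        self.llm = LiteLLMModel(model=model, temperature=temperature)

    def get_llm(self):
        return self.llm


def test_manager_uses_patched_model(mocker):
    fake_model = mocker.Mock()
    patched = mocker.patch(f"{__name__}.LiteLLMModel", return_value=fake_model)

    manager = ManagerSketch(model="gpt-4o-mini", temperature=0.2)

    patched.assert_called_once_with(model="gpt-4o-mini", temperature=0.2)
    assert manager.get_llm() is fake_model
```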

tests/unit/core/script/test_manager_additional.py (1)

1-155: LGTM! Script execution manager tests with good error handling coverage.

The tests properly validate timeout handling, subprocess errors, unexpected errors, and acceptance of Path objects. The logging tests verify basic behavior as intended; their assertions were confirmed acceptable in a prior review discussion.
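
The timeout path in particular follows a common subprocess pattern; a hedged sketch, with `run_script` standing in for the real ScriptExecutionManager:

```python
# `run_script` is a stand-in, not the project's ScriptExecutionManager.
import subprocess

import pytest


def run_script(path: str, timeout: int = 30) -> bool:
    try:
        completed = subprocess.run(
            ["/bin/sh", path], capture_output=True, timeout=timeout, check=False
        )
        return completed.returncode == 0
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"Script timed out after {timeout}s") from exc


def test_run_script_timeout(mocker):
    mocker.patch(
        "subprocess.run",
        side_effect=subprocess.TimeoutExpired(cmd="script.sh", timeout=30),
    )
    with pytest.raises(RuntimeError, match="timed out"):
        run_script("script.sh")
```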

tests/unit/core/output/test_generator.py (1)

1-411: LGTM! Comprehensive output generator tests with proper fixture usage.

The tests thoroughly validate OutputHandler initialization, report generation (CSV, JSON, TXT), and configuration handling. The previously flagged unused mock_print variables have been properly addressed by removing the assignment where the return value isn't needed.
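
To show the tmp_path-based pattern such report tests typically follow, a stand-in sketch (not the real OutputHandler interface):

```python
# Stand-in report writer; the real OutputHandler (core/output/generator.py)
# exposes a richer generate_reports interface.
import csv
import json
from pathlib import Path


def write_reports(results: list[dict], output_dir: Path) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / "results.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["metric", "result"])
        writer.writeheader()
        writer.writerows(results)
    (output_dir / "summary.json").write_text(
        json.dumps({"total": len(results)}), encoding="utf-8"
    )


def test_write_reports(tmp_path):
    write_reports([{"metric": "faithfulness", "result": "PASS"}], tmp_path / "out")
    assert (tmp_path / "out" / "results.csv").exists()
    assert json.loads((tmp_path / "out" / "summary.json").read_text())["total"] == 1
```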

tests/unit/pipeline/evaluation/test_pipeline.py (3)

86-116: Verify endpoint_type value in test configuration.

The test sets endpoint_type="test", which may not be a supported value according to SUPPORTED_ENDPOINT_TYPES. Based on the APIConfig model, valid types are typically "streaming" or "query". Since the APIClient is mocked, this won't cause a runtime error, but using a valid value would make the test configuration more realistic.

     def test_create_api_client_when_enabled(self, mock_config_loader, mocker):
         """Test API client creation when enabled."""
         mock_config_loader.system_config.api.enabled = True
         mock_config_loader.system_config.api.api_base = "http://test.com"
-        mock_config_loader.system_config.api.endpoint_type = "test"
+        mock_config_loader.system_config.api.endpoint_type = "streaming"

343-386: Same endpoint_type issue in close test.

Similarly, endpoint_type="test" at line 347 should use a valid value like "streaming" or "query" for consistency.

     def test_close_with_api_client(self, mock_config_loader, mocker):
         """Test close method with API client."""
         mock_config_loader.system_config.api.enabled = True
         mock_config_loader.system_config.api.api_base = "http://test.com"
-        mock_config_loader.system_config.api.endpoint_type = "test"
+        mock_config_loader.system_config.api.endpoint_type = "streaming"

1-85: LGTM! Comprehensive evaluation pipeline tests with good component isolation.

The tests properly validate EvaluationPipeline initialization, data validation, evaluation runs, amended data saving, close behavior, and output directory override. Good use of pytest-mock for isolating the many pipeline components.

Also applies to: 117-342, 387-440
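
The close() behavior these tests isolate can be sketched with a stand-in (not the real EvaluationPipeline):

```python
# PipelineSketch stands in for EvaluationPipeline's close() handling only.
class PipelineSketch:
    def __init__(self, api_client=None):
        self.api_client = api_client

    def close(self) -> None:
        if self.api_client is not None:
            self.api_client.close()


def test_close_closes_api_client(mocker):
    api_client = mocker.Mock()
    PipelineSketch(api_client=api_client).close()
    api_client.close.assert_called_once()


def test_close_without_api_client():
    PipelineSketch().close()  # should not raise
```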

tests/unit/pipeline/evaluation/test_processor.py (2)

27-71: Core fixtures are well‑structured and follow pytest/pytest‑mock guidelines

mock_config_loader, processor_components, and sample_conv_data give a clear, minimal wiring of SystemConfig, ProcessorComponents, and a single‑turn EvaluationData, and they consistently use mocker instead of unittest.mock, which matches the repo’s testing guidelines. No changes needed here.

As per coding guidelines, using pytest fixtures + pytest‑mock is the preferred pattern.
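
The fixture style referred to here looks roughly like the following sketch; the attribute names set on the mocked loader are assumptions.

```python
# Fixture-wiring sketch in the pytest + pytest-mock style; attribute names are assumed.
import pytest


@pytest.fixture
def mock_config_loader(mocker):
    loader = mocker.Mock()
    loader.system_config.api.enabled = False  # nested attributes auto-create child mocks
    return loader


def test_uses_mocked_loader(mock_config_loader):
    assert mock_config_loader.system_config.api.enabled is False
```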


605-717: Invalid / mixed metric tests for _evaluate_turn accurately capture behavior

The updated docstrings, logging setup via caplog, and result assertions for invalid, all‑invalid, and mixed metric lists align well with how _evaluate_turn is expected to behave: producing ERROR results for invalid metrics, preserving metric order, calling the evaluator only for valid metrics, and logging clear error messages.

These tests give good guardrails for future refactors of _evaluate_turn; no changes needed.
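
A compact sketch of the invalid-metric behavior these tests pin down (stand-in `evaluate_turn`, assumed metric names):

```python
# `evaluate_turn` is a stand-in for the processor's _evaluate_turn.
import logging

logger = logging.getLogger(__name__)
KNOWN_METRICS = {"answer_relevancy", "faithfulness"}  # assumed metric names


def evaluate_turn(metrics):
    results = []
    for metric in metrics:
        if metric not in KNOWN_METRICS:
            logger.error("Unknown metric: %s", metric)
            results.append("ERROR")
        else:
            results.append("PASS")  # the real code would invoke the evaluator here
    return results


def test_invalid_metric_marked_error_and_logged(caplog):
    with caplog.at_level(logging.ERROR):
        results = evaluate_turn(["faithfulness", "bogus_metric"])
    assert results == ["PASS", "ERROR"]  # order preserved
    assert "Unknown metric: bogus_metric" in caplog.text
```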

Contributor

@VladimirKadlec VladimirKadlec left a comment

LGTM, nice work 💪

Contributor

@tisnik tisnik left a comment

LGTM

@tisnik tisnik merged commit 7665def into lightspeed-core:main Nov 30, 2025
15 checks passed
@bsatapat-jpg bsatapat-jpg deleted the dev branch December 1, 2025 05:02