
LCORE-900: Added the parallel execution for multi-modal evaluation in…#92

Merged
tisnik merged 1 commit into lightspeed-core:main from bsatapat-jpg:dev on Nov 6, 2025

Conversation

@bsatapat-jpg
Collaborator

@bsatapat-jpg bsatapat-jpg commented Nov 4, 2025

The implementation should ensure:

  • Independent execution of model evaluations without data conflicts
  • Resource optimization to avoid overloading compute nodes
  • Synchronization of results for statistical significance checks and comparison graph generation

Board: https://issues.redhat.com/browse/LCORE-900
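
To ground the resource-optimization point above: the script defaults to leaving one CPU free and warns when workers × per-worker threads outgrow the host (see the review excerpts further down). A minimal sketch of that heuristic; the function name resolve_worker_count and the per_worker_threads default are illustrative, not taken from the PR:

```python
import logging
import multiprocessing

logger = logging.getLogger(__name__)


def resolve_worker_count(requested: int | None, per_worker_threads: int = 50) -> int:
    """Pick a worker count that leaves headroom on the host (illustrative only)."""
    cpu_count = multiprocessing.cpu_count()
    if requested is None:
        # Default: leave one CPU free for system responsiveness.
        workers = max(1, cpu_count - 1)
    else:
        workers = max(1, int(requested))

    total_threads = workers * per_worker_threads
    if total_threads > cpu_count:
        logger.warning(
            "High resource usage detected: %d workers × %d threads = %d concurrent "
            "threads on %d CPUs. Consider reducing max_workers or core.max_threads.",
            workers, per_worker_threads, total_threads, cpu_count,
        )
    return workers
```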

Summary by CodeRabbit

  • New Features

    • Parallel multi-provider evaluations with configurable worker count, per-model isolation, real-time progress, and consolidated result summaries; sequential fallback when workers=1.
  • Configuration

    • New max_workers setting and --max-workers CLI option with safe bounds, resource warnings, and guidance for tuning.
  • Documentation

    • Expanded parallel execution, resource management, progress tracking, error handling, and quick-start examples.
  • Tests

    • Added tests for max_workers handling, resource warnings, execution modes, and path/filesystem safety.

@coderabbitai
Contributor

coderabbitai bot commented Nov 4, 2025

Walkthrough

Adds configurable parallel execution for provider-model evaluations via a new max_workers setting (CLI/config), implements a ProcessPoolExecutor worker _run_evaluation_worker with per-worker isolated temp system configs, updates docs/tests for parallel flow, and enables gpt-4-turbo in config/multi_eval_config.yaml.
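
The worker body itself is not quoted in this thread, so the following is only a rough sketch of the isolation pattern the walkthrough describes: copy the system config into a per-worker temp directory, run the evaluation against that copy, and always clean up. The name run_evaluation_worker_sketch and the return fields are placeholders for the real _run_evaluation_worker:

```python
import os
import shutil
import tempfile
from typing import Any


def run_evaluation_worker_sketch(
    provider_id: str, model: str, system_config_path: str, output_base: str
) -> dict[str, Any]:
    """Illustrative worker body: isolate per-worker state, run, clean up."""
    # Each worker gets its own temp directory so parallel runs never share
    # a mutable system config or output path.
    temp_dir = tempfile.mkdtemp(prefix=f"eval_{provider_id}_{model}_".replace("/", "_"))
    temp_config = os.path.join(temp_dir, "system.yaml")
    shutil.copy(system_config_path, temp_config)
    try:
        # ... patch provider/model into temp_config and invoke the evaluation
        # engine here; the real worker returns structured result metadata and
        # additionally sanitizes provider/model names for path safety.
        return {
            "provider": provider_id,
            "model": model,
            "status": "success",
            "output_dir": os.path.join(output_base, provider_id, model),
        }
    finally:
        # Per-worker cleanup keeps repeated runs from leaking temp configs.
        shutil.rmtree(temp_dir, ignore_errors=True)
```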

Changes

  • Configuration (config/multi_eval_config.yaml): Adds gpt-4-turbo to providers.openai.models and sets max_workers: 4 under settings.
  • Documentation (docs/multi_provider_evaluation.md): Rewrites the document to cover parallel execution, max_workers (CLI/config precedence), per-worker output isolation, progress tracking, result synchronization, error handling, and resource tuning guidance.
  • Core Parallel Execution Logic (script/run_multi_provider_eval.py): Adds _run_evaluation_worker and per-worker temp system config helpers; orchestrates parallel runs with ProcessPoolExecutor via _run_parallel_evaluations; exposes max_workers in the MultiProviderEvaluationRunner constructor and the --max-workers CLI option; supports sequential fallback when max_workers == 1; includes path-safety checks, structured result metadata, logging, duration tracking, and cleanup.
  • Tests (tests/script/test_run_multi_provider_eval.py): Adds tests for max_workers parsing/validation/clamping, resource-warning scenarios (thread count checks), sequential execution forcing in tests, and path sanitization/error handling coverage.
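
As a rough illustration of the precedence and clamping behaviour listed above (CLI > config > heuristic, minimum of 1 worker), here is a self-contained sketch; parse_max_workers is a hypothetical stand-in for the argument handling in script/run_multi_provider_eval.py:

```python
import argparse
import multiprocessing


def parse_max_workers(
    argv: list[str] | None = None, config_value: int | str | None = None
) -> int:
    """Resolve max_workers with CLI > config > heuristic precedence (sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-workers",
        type=int,
        default=None,
        help="Number of parallel evaluation workers (1 forces sequential mode).",
    )
    args = parser.parse_args(argv)

    if args.max_workers is not None:       # 1. CLI flag wins
        workers = args.max_workers
    elif config_value is not None:         # 2. then the YAML setting
        workers = int(config_value)        #    (coerce, since YAML may yield a str)
    else:                                  # 3. fall back to a CPU-based default
        workers = max(1, multiprocessing.cpu_count() - 1)

    return max(1, workers)                 # clamp to a safe lower bound


if __name__ == "__main__":
    print(parse_max_workers(["--max-workers", "4"]))  # -> 4
```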

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant CLI as CLI / main()
    participant Runner as MultiProviderEvaluationRunner
    participant Pool as ProcessPoolExecutor
    participant Worker as _run_evaluation_worker
    participant Eval as Evaluation Engine

    CLI->>Runner: instantiate(..., max_workers)
    CLI->>Runner: run_evaluations()
    Runner->>Runner: determine max_workers (CLI > config > heuristic)

    alt Parallel (max_workers > 1)
        Runner->>Pool: create pool(max_workers)
        Runner->>Runner: build per-task configs (provider, model, paths)
        loop per task
            Pool->>Worker: submit(task_config)
            Worker->>Worker: create isolated temp system config
            Worker->>Eval: run evaluation (provider/model)
            Eval-->>Worker: return result dict
            Worker->>Worker: cleanup temp config
            Worker-->>Pool: return result metadata
        end
        Pool-->>Runner: collect results and failures
    else Sequential (max_workers == 1)
        Runner->>Runner: iterate tasks
        Runner->>Eval: run evaluation
        Eval-->>Runner: return result
    end

    Runner->>Runner: aggregate results, log durations/errors
    Runner-->>CLI: return list[dict] results
```
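
A compact sketch of the dispatch logic in the diagram, with the sequential fallback for max_workers == 1; run_all_sketch and run_one are illustrative names, and the real runner also aggregates durations and structured metadata:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_all_sketch(tasks: list[dict], max_workers: int, run_one) -> list[dict]:
    """Dispatch evaluation tasks in parallel, or sequentially when max_workers == 1."""
    if max_workers == 1:
        # Sequential fallback mirrors the `else` branch of the diagram.
        return [run_one(**task) for task in tasks]

    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # run_one must be a module-level function and each task dict should
        # contain only picklable primitives (strings/paths), never shared objects.
        futures = {executor.submit(run_one, **task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # collect the failure, keep going
                results.append({**task, "status": "failed", "error": str(exc)})
    return results
```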

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to:
    • script/run_multi_provider_eval.py: worker lifecycle, temp config creation/cleanup, exception and failure propagation.
    • ProcessPoolExecutor usage: data serialization, avoiding shared mutable state, and safe logging per worker.
    • Path traversal / filesystem safety checks and per-worker output isolation.
    • Tests in tests/script/test_run_multi_provider_eval.py validating max_workers and resource-warning logic.

Possibly related PRs

Suggested reviewers

  • VladimirKadlec
  • asamal4
  • tisnik
  • lpiwowar

Poem

🐰 A rabbit in the heap,
Forks the work and bounds to keep,
Temp configs hop, then go —
Workers race, results flow,
Carrots saved in tidy rows 🥕

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately captures the main change: implementing parallel execution for multi-modal model evaluation, which is the primary focus across all modified files.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
config/multi_eval_config.yaml (1)

11-11: Update the stale inline note.

Now that gpt-4-turbo is part of the active model list, the trailing comment claiming it is “Commented out for faster testing” is misleading. Please either drop the note or rephrase it to reflect the current intent.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a181e5 and 10e8b90.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
config/**/*.{yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Use YAML for configuration files under config/

Files:

  • config/multi_eval_config.yaml
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/script/test_run_multi_provider_eval.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/script/test_run_multi_provider_eval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • tests/script/test_run_multi_provider_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
🪛 GitHub Actions: Black
tests/script/test_run_multi_provider_eval.py

[error] 1-1: Black formatting check failed. The file would be reformatted by Black.

script/run_multi_provider_eval.py

[error] 1-1: Black formatting check failed. The file would be reformatted by Black.

🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)
🔇 Additional comments (2)
tests/script/test_run_multi_provider_eval.py (1)

438-445: The sandbox environment lacks both the uv package manager and the black formatter tool needed to verify or fix the formatting issues. I cannot automatically validate whether the code at lines 438-445 complies with Black's formatting rules.

The original review comment is valid and actionable. You'll need to run the Black formatter locally on your machine:

uv run black script/run_multi_provider_eval.py tests/script/test_run_multi_provider_eval.py

Or if you don't have uv installed, you can use:

black script/run_multi_provider_eval.py tests/script/test_run_multi_provider_eval.py
script/run_multi_provider_eval.py (1)

1-10: Run Black locally to resolve formatting issues.

The GitHub Actions check reports that script/run_multi_provider_eval.py and tests/script/test_run_multi_provider_eval.py would be reformatted by Black. Run Black locally and commit the resulting formatting updates to keep the pipeline green.

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 2 times, most recently from 8652aae to 5c68320 on November 4, 2025 at 08:53
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
script/run_multi_provider_eval.py (1)

232-243: Critical: Unresolved type coercion issue from previous review.

The past review correctly identified that max_workers loaded from YAML arrives as a string (e.g., "4"), causing TypeError: '>' not supported between instances of 'str' and 'int' at line 242. This issue remains unaddressed—the code still doesn't coerce max_workers to int before using it.

Apply the suggested fix from the previous review:

```diff
         if max_workers is not None:
             self.max_workers = max_workers
         elif "max_workers" in self.settings:
             self.max_workers = self.settings["max_workers"]
         else:
             # Default: leave one CPU free for system responsiveness
             cpu_count = multiprocessing.cpu_count()
             self.max_workers = max(1, cpu_count - 1)

+        if not isinstance(self.max_workers, int):
+            try:
+                self.max_workers = int(self.max_workers)
+            except (TypeError, ValueError) as exc:
+                raise ValueError(
+                    f"max_workers must be an integer, got {self.max_workers!r}"
+                ) from exc
+
         # Ensure at least 1 worker
         self.max_workers = max(1, self.max_workers)
```
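
A hypothetical pytest sketch of the behaviour the fix above should produce (coerce YAML strings to int, clamp to at least 1, reject non-numeric values); it exercises the suggested logic directly rather than the real constructor, whose full signature is not shown here:

```python
import pytest


@pytest.mark.parametrize("raw, expected", [("4", 4), (4, 4), (0, 1)])
def test_yaml_max_workers_is_coerced_and_clamped(raw, expected):
    # Mirrors the suggested fix: coerce to int, then clamp to at least 1.
    coerced = raw if isinstance(raw, int) else int(raw)
    assert max(1, coerced) == expected


def test_non_numeric_max_workers_raises_value_error():
    # The fix wraps this failure in a ValueError with a clearer message.
    with pytest.raises(ValueError):
        int("four")
```
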
🧹 Nitpick comments (1)
script/run_multi_provider_eval.py (1)

40-173: Consider refactoring to reduce code duplication.

The _run_evaluation_worker function duplicates substantial logic from _run_single_evaluation (lines 403-492), including sanitization, path creation, temp config handling, evaluation execution, and cleanup. This duplication makes maintenance harder and increases the risk of inconsistencies.

Consider extracting shared logic into a common helper function that both methods can call. For example:

```python
def _execute_evaluation_task(
    self,
    provider_name: str,
    provider_id: str,
    model: str,
    logger_instance: logging.Logger
) -> dict[str, Any]:
    """Common evaluation logic shared by worker and sequential paths."""
    # Shared sanitization, path creation, evaluation, cleanup logic
    ...
```

This would centralize the logic and make it easier to maintain consistency between sequential and parallel execution paths.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10e8b90 and 8652aae.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/script/test_run_multi_provider_eval.py
  • config/multi_eval_config.yaml
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8652aae and 5c68320.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
config/**/*.{yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Use YAML for configuration files under config/

Files:

  • config/multi_eval_config.yaml
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/script/test_run_multi_provider_eval.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/script/test_run_multi_provider_eval.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

Contributor

@VladimirKadlec VladimirKadlec left a comment


Added some nit comments; please test as we discussed on Slack.
LGTM

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
script/run_multi_provider_eval.py (1)

65-70: Restore --verbose behavior in worker processes.

logging.basicConfig(... level=logging.INFO, force=True) inside the worker overrides the log level selected by the CLI. As soon as we enter parallel mode, every worker downgrades to INFO, so --verbose no longer produces DEBUG logs where you need them most. Please bubble the parent’s effective log level into the worker before reconfiguring logging.

```diff
-def _run_evaluation_worker(
+def _run_evaluation_worker(
     provider_name: str,
     provider_id: str,
     model: str,
     system_config_path: str,
     eval_data_path: str,
-    output_base: str,
+    output_base: str,
+    log_level: int = logging.INFO,
 ) -> dict[str, Any]:
@@
-    logging.basicConfig(
-        level=logging.INFO,
+    logging.basicConfig(
+        level=log_level,
         format=f"%(asctime)s - [Worker-{os.getpid()}] - %(levelname)s - %(message)s",
         datefmt="%Y-%m-%d %H:%M:%S",
         force=True,  # Override existing configuration
     )
@@
-        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
+        effective_level = logging.getLogger().getEffectiveLevel()
+        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
@@
-                future = executor.submit(
+                future = executor.submit(
                     _run_evaluation_worker,
                     config["provider_name"],
                     config["provider_id"],
                     config["model"],
                     str(self.system_config_path),
                     str(self.eval_data_path),
                     str(self.output_base),
+                    effective_level,
                 )
```

Also applies to: 573-586
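
A standalone demo of the pattern being suggested: capture the parent's effective log level and re-apply it inside each worker's basicConfig so --verbose still yields DEBUG output in parallel mode. This is not the repo's code, just an isolated example:

```python
import logging
import os
from concurrent.futures import ProcessPoolExecutor


def _worker(log_level: int) -> int:
    # Reconfigure logging inside the child with the level the parent selected,
    # so --verbose (DEBUG) survives the jump into a separate worker process.
    logging.basicConfig(
        level=log_level,
        format=f"%(asctime)s - [Worker-{os.getpid()}] - %(levelname)s - %(message)s",
        force=True,
    )
    logging.getLogger(__name__).debug("debug logging is alive in the worker")
    return os.getpid()


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)  # pretend the CLI passed --verbose
    effective = logging.getLogger().getEffectiveLevel()
    with ProcessPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(_worker, [effective, effective])))
```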

🧹 Nitpick comments (1)
docs/multi_provider_evaluation.md (1)

377-382: Specify the fenced block language for lint compliance.

The warning snippet is missing a language hint, tripping MD040. Please mark it as plain text.

-```
+```text
 WARNING - High resource usage detected: 4 workers × 50 threads = 200 concurrent threads 
 on 8 CPUs. Consider reducing max_workers or core.max_threads in system.yaml to avoid 
 resource contention.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c68320 and 920ea4f.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • config/multi_eval_config.yaml
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/script/test_run_multi_provider_eval.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/script/test_run_multi_provider_eval.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/**/*.py : Use structured logging with appropriate levels

Applied to files:

  • script/run_multi_provider_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • script/run_multi_provider_eval.py
🧬 Code graph analysis (1)
tests/script/test_run_multi_provider_eval.py (1)
script/run_multi_provider_eval.py (1)
  • MultiProviderEvaluationRunner (176-1189)
🪛 GitHub Actions: Tests
tests/script/test_run_multi_provider_eval.py

[warning] 264-264: High resource usage detected: 2 workers × 10 threads = 20 concurrent threads on 4 CPUs. Consider reducing max_workers or core.max_threads to avoid resource contention.

script/run_multi_provider_eval.py

[warning] 264-264: High resource usage detected: 2 workers × 10 threads = 20 concurrent threads on 4 CPUs. Consider reducing max_workers or core.max_threads in system.yaml to avoid resource contention.

🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

🪛 markdownlint-cli2 (0.18.1)
docs/multi_provider_evaluation.md

322-322: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


377-377: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/multi_provider_evaluation.md (1)

377-381: Add language hint to warning block.

Markdownlint (MD040) flags the warning snippet because the fenced code block is missing a language tag; adding one keeps lint quiet and retains consistent monospace rendering. Annotating it as text (or similar) should be enough.

-```
+```text
 WARNING - High resource usage detected: 4 workers × 50 threads = 200 concurrent threads 
 on 8 CPUs. Consider reducing max_workers or core.max_threads in system.yaml to avoid 
 resource contention.
-```
+```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 920ea4f and 71bcef1.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • config/multi_eval_config.yaml
  • tests/script/test_run_multi_provider_eval.py
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • docs/multi_provider_evaluation.md
  • script/run_multi_provider_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/**/*.py : Use structured logging with appropriate levels

Applied to files:

  • script/run_multi_provider_eval.py
🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

🪛 markdownlint-cli2 (0.18.1)
docs/multi_provider_evaluation.md

322-322: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


377-377: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests (3.11)

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit f8ba383 into lightspeed-core:main Nov 6, 2025
15 checks passed
@bsatapat-jpg bsatapat-jpg deleted the dev branch November 6, 2025 11:52