
LCORE-900: Added the parallel execution for multi-modal evaluation in…#92

Merged
tisnik merged 1 commit into lightspeed-core:main from bsatapat-jpg:dev on Nov 6, 2025

Conversation

@bsatapat-jpg
Collaborator

@bsatapat-jpg bsatapat-jpg commented Nov 4, 2025

The implementation should ensure:

  • Independent execution of model evaluations without data conflicts
  • Resource optimization to avoid overloading compute nodes
  • Synchronization of results for statistical significance checks and comparison graph generation

Board: https://issues.redhat.com/browse/LCORE-900
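
To ground the resource-optimization point above: the script defaults to leaving one CPU free and warns when workers × per-worker threads outgrow the host (see the review excerpts further down). A minimal sketch of that heuristic; the function name resolve_worker_count and the per_worker_threads default are illustrative, not taken from the PR:

```python
import logging
import multiprocessing

logger = logging.getLogger(__name__)


def resolve_worker_count(requested: int | None, per_worker_threads: int = 50) -> int:
    """Pick a worker count that leaves headroom on the host (illustrative only)."""
    cpu_count = multiprocessing.cpu_count()
    if requested is None:
        # Default: leave one CPU free for system responsiveness.
        workers = max(1, cpu_count - 1)
    else:
        workers = max(1, int(requested))

    total_threads = workers * per_worker_threads
    if total_threads > cpu_count:
        logger.warning(
            "High resource usage detected: %d workers × %d threads = %d concurrent "
            "threads on %d CPUs. Consider reducing max_workers or core.max_threads.",
            workers, per_worker_threads, total_threads, cpu_count,
        )
    return workers
```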

Summary by CodeRabbit

  • New Features

    • Parallel multi-provider evaluations with configurable worker count, per-model isolation, real-time progress, and consolidated result summaries; sequential fallback when workers=1.
  • Configuration

    • New max_workers setting and --max-workers CLI option with safe bounds, resource warnings, and guidance for tuning.
  • Documentation

    • Expanded parallel execution, resource management, progress tracking, error handling, and quick-start examples.
  • Tests

    • Added tests for max_workers handling, resource warnings, execution modes, and path/filesystem safety.

@coderabbitai
Contributor

coderabbitai bot commented Nov 4, 2025

Walkthrough

Adds configurable parallel execution for provider-model evaluations via a new max_workers setting (CLI/config), implements a ProcessPoolExecutor worker _run_evaluation_worker with per-worker isolated temp system configs, updates docs/tests for parallel flow, and enables gpt-4-turbo in config/multi_eval_config.yaml.
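
The worker body itself is not quoted in this thread, so the following is only a rough sketch of the isolation pattern the walkthrough describes: copy the system config into a per-worker temp directory, run the evaluation against that copy, and always clean up. The name run_evaluation_worker_sketch and the return fields are placeholders for the real _run_evaluation_worker:

```python
import os
import shutil
import tempfile
from typing import Any


def run_evaluation_worker_sketch(
    provider_id: str, model: str, system_config_path: str, output_base: str
) -> dict[str, Any]:
    """Illustrative worker body: isolate per-worker state, run, clean up."""
    # Each worker gets its own temp directory so parallel runs never share
    # a mutable system config or output path.
    temp_dir = tempfile.mkdtemp(prefix=f"eval_{provider_id}_{model}_".replace("/", "_"))
    temp_config = os.path.join(temp_dir, "system.yaml")
    shutil.copy(system_config_path, temp_config)
    try:
        # ... patch provider/model into temp_config and invoke the evaluation
        # engine here; the real worker returns structured result metadata and
        # additionally sanitizes provider/model names for path safety.
        return {
            "provider": provider_id,
            "model": model,
            "status": "success",
            "output_dir": os.path.join(output_base, provider_id, model),
        }
    finally:
        # Per-worker cleanup keeps repeated runs from leaking temp configs.
        shutil.rmtree(temp_dir, ignore_errors=True)
```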

Changes

  • Configuration (config/multi_eval_config.yaml): Adds gpt-4-turbo to providers.openai.models and sets max_workers: 4 under settings.
  • Documentation (docs/multi_provider_evaluation.md): Rewrites the document to cover parallel execution, max_workers (CLI/config precedence), per-worker output isolation, progress tracking, result synchronization, error handling, and resource tuning guidance.
  • Core Parallel Execution Logic (script/run_multi_provider_eval.py): Adds _run_evaluation_worker and per-worker temp system config helpers; orchestrates parallel runs with ProcessPoolExecutor via _run_parallel_evaluations; exposes max_workers in the MultiProviderEvaluationRunner constructor and the --max-workers CLI option; supports sequential fallback when max_workers == 1; includes path-safety checks, structured result metadata, logging, duration tracking, and cleanup.
  • Tests (tests/script/test_run_multi_provider_eval.py): Adds tests for max_workers parsing/validation/clamping, resource-warning scenarios (thread count checks), sequential execution forcing in tests, and path sanitization/error handling coverage.
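
As a rough illustration of the precedence and clamping behaviour listed above (CLI > config > heuristic, minimum of 1 worker), here is a self-contained sketch; parse_max_workers is a hypothetical stand-in for the argument handling in script/run_multi_provider_eval.py:

```python
import argparse
import multiprocessing


def parse_max_workers(
    argv: list[str] | None = None, config_value: int | str | None = None
) -> int:
    """Resolve max_workers with CLI > config > heuristic precedence (sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-workers",
        type=int,
        default=None,
        help="Number of parallel evaluation workers (1 forces sequential mode).",
    )
    args = parser.parse_args(argv)

    if args.max_workers is not None:       # 1. CLI flag wins
        workers = args.max_workers
    elif config_value is not None:         # 2. then the YAML setting
        workers = int(config_value)        #    (coerce, since YAML may yield a str)
    else:                                  # 3. fall back to a CPU-based default
        workers = max(1, multiprocessing.cpu_count() - 1)

    return max(1, workers)                 # clamp to a safe lower bound


if __name__ == "__main__":
    print(parse_max_workers(["--max-workers", "4"]))  # -> 4
```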

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant CLI as CLI / main()
    participant Runner as MultiProviderEvaluationRunner
    participant Pool as ProcessPoolExecutor
    participant Worker as _run_evaluation_worker
    participant Eval as Evaluation Engine

    CLI->>Runner: instantiate(..., max_workers)
    CLI->>Runner: run_evaluations()
    Runner->>Runner: determine max_workers (CLI > config > heuristic)

    alt Parallel (max_workers > 1)
        Runner->>Pool: create pool(max_workers)
        Runner->>Runner: build per-task configs (provider, model, paths)
        loop per task
            Pool->>Worker: submit(task_config)
            Worker->>Worker: create isolated temp system config
            Worker->>Eval: run evaluation (provider/model)
            Eval-->>Worker: return result dict
            Worker->>Worker: cleanup temp config
            Worker-->>Pool: return result metadata
        end
        Pool-->>Runner: collect results and failures
    else Sequential (max_workers == 1)
        Runner->>Runner: iterate tasks
        Runner->>Eval: run evaluation
        Eval-->>Runner: return result
    end

    Runner->>Runner: aggregate results, log durations/errors
    Runner-->>CLI: return list[dict] results
```
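
A compact sketch of the dispatch logic in the diagram, with the sequential fallback for max_workers == 1; run_all_sketch and run_one are illustrative names, and the real runner also aggregates durations and structured metadata:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_all_sketch(tasks: list[dict], max_workers: int, run_one) -> list[dict]:
    """Dispatch evaluation tasks in parallel, or sequentially when max_workers == 1."""
    if max_workers == 1:
        # Sequential fallback mirrors the `else` branch of the diagram.
        return [run_one(**task) for task in tasks]

    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # run_one must be a module-level function and each task dict should
        # contain only picklable primitives (strings/paths), never shared objects.
        futures = {executor.submit(run_one, **task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # collect the failure, keep going
                results.append({**task, "status": "failed", "error": str(exc)})
    return results
```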

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay extra attention to:
    • script/run_multi_provider_eval.py: worker lifecycle, temp config creation/cleanup, exception and failure propagation.
    • ProcessPoolExecutor usage: data serialization, avoiding shared mutable state, and safe logging per worker.
    • Path traversal / filesystem safety checks and per-worker output isolation.
    • Tests in tests/script/test_run_multi_provider_eval.py validating max_workers and resource-warning logic.

Possibly related PRs

Suggested reviewers

  • VladimirKadlec
  • asamal4
  • tisnik
  • lpiwowar

Poem

🐰 A rabbit in the heap,
Forks the work and bounds to keep,
Temp configs hop, then go —
Workers race, results flow,
Carrots saved in tidy rows 🥕

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately captures the main change: implementing parallel execution for multi-modal model evaluation, which is the primary focus across all modified files.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
config/multi_eval_config.yaml (1)

11-11: Update the stale inline note.

Now that gpt-4-turbo is part of the active model list, the trailing comment claiming it is “Commented out for faster testing” is misleading. Please either drop the note or rephrase it to reflect the current intent.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a181e5 and 10e8b90.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
config/**/*.{yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Use YAML for configuration files under config/

Files:

  • config/multi_eval_config.yaml
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/script/test_run_multi_provider_eval.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/script/test_run_multi_provider_eval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • tests/script/test_run_multi_provider_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
🪛 GitHub Actions: Black
tests/script/test_run_multi_provider_eval.py

[error] 1-1: Black formatting check failed. The file would be reformatted by Black.

script/run_multi_provider_eval.py

[error] 1-1: Black formatting check failed. The file would be reformatted by Black.

🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: mypy
  • GitHub Check: tests (3.12)
🔇 Additional comments (2)
tests/script/test_run_multi_provider_eval.py (1)

438-445: The sandbox environment lacks both the uv package manager and the black formatter tool needed to verify or fix the formatting issues. I cannot automatically validate whether the code at lines 438-445 complies with Black's formatting rules.

The original review comment is valid and actionable. You'll need to run the Black formatter locally on your machine:

uv run black script/run_multi_provider_eval.py tests/script/test_run_multi_provider_eval.py

Or if you don't have uv installed, you can use:

black script/run_multi_provider_eval.py tests/script/test_run_multi_provider_eval.py
script/run_multi_provider_eval.py (1)

1-10: Run Black locally to resolve formatting issues.

The GitHub Actions check reports that script/run_multi_provider_eval.py and tests/script/test_run_multi_provider_eval.py would be reformatted by Black. Run Black locally and commit the resulting formatting updates to keep the pipeline green.

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 2 times, most recently from 8652aae to 5c68320 on November 4, 2025 at 08:53
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
script/run_multi_provider_eval.py (1)

232-243: Critical: Unresolved type coercion issue from previous review.

The past review correctly identified that max_workers loaded from YAML arrives as a string (e.g., "4"), causing TypeError: '>' not supported between instances of 'str' and 'int' at line 242. This issue remains unaddressed—the code still doesn't coerce max_workers to int before using it.

Apply the suggested fix from the previous review:

```diff
         if max_workers is not None:
             self.max_workers = max_workers
         elif "max_workers" in self.settings:
             self.max_workers = self.settings["max_workers"]
         else:
             # Default: leave one CPU free for system responsiveness
             cpu_count = multiprocessing.cpu_count()
             self.max_workers = max(1, cpu_count - 1)

+        if not isinstance(self.max_workers, int):
+            try:
+                self.max_workers = int(self.max_workers)
+            except (TypeError, ValueError) as exc:
+                raise ValueError(
+                    f"max_workers must be an integer, got {self.max_workers!r}"
+                ) from exc
+
         # Ensure at least 1 worker
         self.max_workers = max(1, self.max_workers)
```
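
A hypothetical pytest sketch of the behaviour the fix above should produce (coerce YAML strings to int, clamp to at least 1, reject non-numeric values); it exercises the suggested logic directly rather than the real constructor, whose full signature is not shown here:

```python
import pytest


@pytest.mark.parametrize("raw, expected", [("4", 4), (4, 4), (0, 1)])
def test_yaml_max_workers_is_coerced_and_clamped(raw, expected):
    # Mirrors the suggested fix: coerce to int, then clamp to at least 1.
    coerced = raw if isinstance(raw, int) else int(raw)
    assert max(1, coerced) == expected


def test_non_numeric_max_workers_raises_value_error():
    # The fix wraps this failure in a ValueError with a clearer message.
    with pytest.raises(ValueError):
        int("four")
```
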
🧹 Nitpick comments (1)
script/run_multi_provider_eval.py (1)

40-173: Consider refactoring to reduce code duplication.

The _run_evaluation_worker function duplicates substantial logic from _run_single_evaluation (lines 403-492), including sanitization, path creation, temp config handling, evaluation execution, and cleanup. This duplication makes maintenance harder and increases the risk of inconsistencies.

Consider extracting shared logic into a common helper function that both methods can call. For example:

```python
def _execute_evaluation_task(
    self,
    provider_name: str,
    provider_id: str,
    model: str,
    logger_instance: logging.Logger
) -> dict[str, Any]:
    """Common evaluation logic shared by worker and sequential paths."""
    # Shared sanitization, path creation, evaluation, cleanup logic
    ...
```

This would centralize the logic and make it easier to maintain consistency between sequential and parallel execution paths.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10e8b90 and 8652aae.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/script/test_run_multi_provider_eval.py
  • config/multi_eval_config.yaml
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8652aae and 5c68320.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
config/**/*.{yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Use YAML for configuration files under config/

Files:

  • config/multi_eval_config.yaml
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/script/test_run_multi_provider_eval.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/script/test_run_multi_provider_eval.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

Contributor

@VladimirKadlec VladimirKadlec left a comment


Added some nit comments; please test as we discussed on Slack.
LGTM

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
script/run_multi_provider_eval.py (1)

65-70: Restore --verbose behavior in worker processes.

logging.basicConfig(... level=logging.INFO, force=True) inside the worker overrides the log level selected by the CLI. As soon as we enter parallel mode, every worker downgrades to INFO, so --verbose no longer produces DEBUG logs where you need them most. Please bubble the parent’s effective log level into the worker before reconfiguring logging.

```diff
-def _run_evaluation_worker(
+def _run_evaluation_worker(
     provider_name: str,
     provider_id: str,
     model: str,
     system_config_path: str,
     eval_data_path: str,
-    output_base: str,
+    output_base: str,
+    log_level: int = logging.INFO,
 ) -> dict[str, Any]:
@@
-    logging.basicConfig(
-        level=logging.INFO,
+    logging.basicConfig(
+        level=log_level,
         format=f"%(asctime)s - [Worker-{os.getpid()}] - %(levelname)s - %(message)s",
         datefmt="%Y-%m-%d %H:%M:%S",
         force=True,  # Override existing configuration
     )
@@
-        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
+        effective_level = logging.getLogger().getEffectiveLevel()
+        with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
@@
-                future = executor.submit(
+                future = executor.submit(
                     _run_evaluation_worker,
                     config["provider_name"],
                     config["provider_id"],
                     config["model"],
                     str(self.system_config_path),
                     str(self.eval_data_path),
                     str(self.output_base),
+                    effective_level,
                 )
```

Also applies to: 573-586
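
A standalone demo of the pattern being suggested: capture the parent's effective log level and re-apply it inside each worker's basicConfig so --verbose still yields DEBUG output in parallel mode. This is not the repo's code, just an isolated example:

```python
import logging
import os
from concurrent.futures import ProcessPoolExecutor


def _worker(log_level: int) -> int:
    # Reconfigure logging inside the child with the level the parent selected,
    # so --verbose (DEBUG) survives the jump into a separate worker process.
    logging.basicConfig(
        level=log_level,
        format=f"%(asctime)s - [Worker-{os.getpid()}] - %(levelname)s - %(message)s",
        force=True,
    )
    logging.getLogger(__name__).debug("debug logging is alive in the worker")
    return os.getpid()


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)  # pretend the CLI passed --verbose
    effective = logging.getLogger().getEffectiveLevel()
    with ProcessPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(_worker, [effective, effective])))
```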

🧹 Nitpick comments (1)
docs/multi_provider_evaluation.md (1)

377-382: Specify the fenced block language for lint compliance.

The warning snippet is missing a language hint, tripping MD040. Please mark it as plain text.

-```
+```text
 WARNING - High resource usage detected: 4 workers × 50 threads = 200 concurrent threads 
 on 8 CPUs. Consider reducing max_workers or core.max_threads in system.yaml to avoid 
 resource contention.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c68320 and 920ea4f.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • config/multi_eval_config.yaml
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/script/test_run_multi_provider_eval.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/script/test_run_multi_provider_eval.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/**/*.py : Use structured logging with appropriate levels

Applied to files:

  • script/run_multi_provider_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • script/run_multi_provider_eval.py
🧬 Code graph analysis (1)
tests/script/test_run_multi_provider_eval.py (1)
script/run_multi_provider_eval.py (1)
  • MultiProviderEvaluationRunner (176-1189)
🪛 GitHub Actions: Tests
tests/script/test_run_multi_provider_eval.py

[warning] 264-264: High resource usage detected: 2 workers × 10 threads = 20 concurrent threads on 4 CPUs. Consider reducing max_workers or core.max_threads to avoid resource contention.

script/run_multi_provider_eval.py

[warning] 264-264: High resource usage detected: 2 workers × 10 threads = 20 concurrent threads on 4 CPUs. Consider reducing max_workers or core.max_threads in system.yaml to avoid resource contention.

🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

🪛 markdownlint-cli2 (0.18.1)
docs/multi_provider_evaluation.md

322-322: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


377-377: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/multi_provider_evaluation.md (1)

377-381: Add language hint to warning block.

Markdownlint (MD040) flags the warning snippet because the fenced code block is missing a language tag; adding one keeps lint quiet and retains consistent monospace rendering. Annotating it as text (or similar) should be enough.

-```
+```text
 WARNING - High resource usage detected: 4 workers × 50 threads = 200 concurrent threads 
 on 8 CPUs. Consider reducing max_workers or core.max_threads in system.yaml to avoid 
 resource contention.
-```
+```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 920ea4f and 71bcef1.

📒 Files selected for processing (4)
  • config/multi_eval_config.yaml (2 hunks)
  • docs/multi_provider_evaluation.md (5 hunks)
  • script/run_multi_provider_eval.py (9 hunks)
  • tests/script/test_run_multi_provider_eval.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • config/multi_eval_config.yaml
  • tests/script/test_run_multi_provider_eval.py
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • docs/multi_provider_evaluation.md
  • script/run_multi_provider_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-08T11:09:18.416Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:16-20
Timestamp: 2025-09-08T11:09:18.416Z
Learning: The maintainer asamal4 prefers sample configurations in config/system.yaml to show realistic usage patterns (e.g., API enabled: true) rather than conservative defaults, as teams are expected to customize the configuration for their specific environments.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • docs/multi_provider_evaluation.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/**/*.py : Use structured logging with appropriate levels

Applied to files:

  • script/run_multi_provider_eval.py
🪛 LanguageTool
docs/multi_provider_evaluation.md

[style] ~13-~13: ‘with success’ might be wordy. Consider a shorter alternative.
Context: ...ut sanitization - Comprehensive summary with success/failure statistics and statistical comp...

(EN_WORDINESS_PREMIUM_WITH_SUCCESS)

🪛 markdownlint-cli2 (0.18.1)
docs/multi_provider_evaluation.md

322-322: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


377-377: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests (3.11)

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit f8ba383 into lightspeed-core:main Nov 6, 2025
15 checks passed
@bsatapat-jpg bsatapat-jpg deleted the dev branch November 6, 2025 11:52